State-of-the-Art Observability Stack for Kubernetes

This deployment provides a comprehensive, production-ready observability solution built on the Grafana LGTM stack (Loki, Grafana, Tempo, and Prometheus standing in for Mimir), with unified collection through Grafana Alloy.

Architecture Overview

Core Components

  1. Grafana (v11.4.0) - Unified visualization platform

    • Pre-configured datasources for Prometheus, Loki, and Tempo
    • Automatic correlation between logs, metrics, and traces
    • Modern UI with TraceQL editor support
  2. Prometheus (v2.54.1) - Metrics collection and storage

    • 7-day retention
    • Comprehensive Kubernetes service discovery
    • Scrapes: API server, nodes, cadvisor, pods, services
  3. Grafana Loki (v3.2.1) - Log aggregation

    • 7-day retention with compaction
    • TSDB index for efficient queries
    • Automatic correlation with traces
  4. Grafana Tempo (v2.6.1) - Distributed tracing

    • 7-day retention
    • Multiple protocol support: OTLP, Jaeger, Zipkin
    • Metrics generation from traces
    • Automatic correlation with logs and metrics
  5. Grafana Alloy (v1.5.1) - Unified observability agent

    • Replaces Promtail, Vector, Fluent Bit
    • Collects logs from all pods
    • OTLP receiver for traces
    • Runs as DaemonSet on all nodes
  6. kube-state-metrics (v2.13.0) - Kubernetes object metrics

    • Deployment, Pod, Service, Node metrics
    • Essential for cluster monitoring
  7. node-exporter (v1.8.2) - Node-level system metrics

    • CPU, memory, disk, network metrics
    • Runs on all nodes via DaemonSet

Key Features

  • Unified Observability: Logs, metrics, and traces in one platform
  • Automatic Correlation: Click from logs to traces to metrics seamlessly
  • 7-Day Retention: Optimized for single-node cluster
  • Local SSD Storage: Fast, persistent storage on hetzner-2 node
  • OTLP Support: Modern OpenTelemetry protocol support
  • TLS Enabled: Secure access via NGINX Ingress with Let's Encrypt
  • Low Resource Footprint: Optimized for single-node deployment

Storage Layout

All data is stored on the local SSD at /mnt/local-ssd/:

/mnt/local-ssd/
├── prometheus/    (50Gi)  - Metrics data
├── loki/          (100Gi) - Log data
├── tempo/         (50Gi)  - Trace data
└── grafana/       (10Gi)  - Dashboards and settings

Deployment Instructions

Prerequisites

  1. Kubernetes cluster with NGINX Ingress Controller
  2. cert-manager installed with Let's Encrypt issuer
  3. DNS record: grafana.betelgeusebytes.io → your cluster IP
  4. Node labeled: kubernetes.io/hostname=hetzner-2
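
The label, ingress controller, cert-manager, and DNS record can be checked up front; the commands below assume the controller and cert-manager live in the default ingress-nginx and cert-manager namespaces, so adjust if yours differ:

kubectl get nodes -l kubernetes.io/hostname=hetzner-2
kubectl get pods -n ingress-nginx
kubectl get pods -n cert-manager
dig +short grafana.betelgeusebytes.io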

Step 0: Remove Existing Monitoring (If Applicable)

If you have an existing monitoring stack (Prometheus, Grafana, Loki, Fluent Bit, etc.), remove it first to avoid conflicts:

./remove-old-monitoring.sh

This interactive script will help you safely remove:

  • Existing Prometheus/Grafana/Loki/Tempo deployments
  • Helm releases for monitoring components
  • Fluent Bit, Vector, or other log collectors
  • Related ConfigMaps, PVCs, and RBAC resources
  • Prometheus Operator CRDs (if applicable)

Note: The main deployment script (deploy.sh) will also prompt you to run cleanup if needed.

Step 1: Prepare Storage Directories

SSH into the hetzner-2 node and create directories:

sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana
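
A quick listing with numeric IDs confirms the ownership matches what each component expects:

ls -ln /mnt/local-ssd/
# Expected owners: prometheus 65534, loki 10001, tempo 0 (root), grafana 472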

Step 2: Deploy the Stack

chmod +x deploy.sh
./deploy.sh

Or deploy manually:

kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml

Step 3: Verify Deployment

kubectl get pods -n observability
kubectl get pv
kubectl get pvc -n observability

All pods should be in Running state:

  • grafana-0
  • loki-0
  • prometheus-0
  • tempo-0
  • alloy-xxxxx (one per node)
  • kube-state-metrics-xxxxx
  • node-exporter-xxxxx (one per node)
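
To block until everything is ready instead of polling manually, kubectl wait can be used (the 5-minute timeout is an arbitrary choice):

kubectl wait --for=condition=Ready pods --all -n observability --timeout=300s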

Step 4: Access Grafana

  1. Open: https://grafana.betelgeusebytes.io
  2. Log in with the default credentials:
    • Username: admin
    • Password: admin
  3. IMPORTANT: Change the password on first login!

Using the Stack

Exploring Logs (Loki)

  1. In Grafana, go to Explore
  2. Select Loki datasource
  3. Example queries:
    {namespace="observability"}
    {namespace="observability", app="prometheus"}
    {namespace="default"} |= "error"
    {pod="my-app-xxx"} | json | level="error"
    

Exploring Metrics (Prometheus)

  1. In Grafana, go to Explore
  2. Select Prometheus datasource
  3. Example queries:
    up
    node_memory_MemAvailable_bytes
    rate(container_cpu_usage_seconds_total[5m])
    kube_pod_status_phase{namespace="observability"}
    

Exploring Traces (Tempo)

  1. In Grafana, go to Explore
  2. Select Tempo datasource
  3. Search by:
    • Service name
    • Duration
    • Tags
  4. Click on a trace to see detailed span timeline

Correlations

The stack automatically correlates:

  • Logs → Traces: Click traceID in logs to view trace
  • Traces → Logs: Click on trace to see related logs
  • Traces → Metrics: Tempo generates metrics from traces

Instrumenting Your Applications

For Logs

Logs are automatically collected from all pods by Alloy. Emit structured JSON logs:

{"level":"info","message":"Request processed","duration_ms":42}

For Traces

Send traces to Tempo using OTLP:

# Python with OpenTelemetry (requires opentelemetry-sdk and opentelemetry-exporter-otlp)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://tempo.observability.svc.cluster.local:4317")
    )
)
trace.set_tracer_provider(provider)
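
To verify spans are reaching Tempo, its search API on the HTTP port can be queried from inside the cluster (the limit parameter is just an example):

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl 'http://tempo.observability.svc.cluster.local:3200/api/search?limit=5'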

For Metrics

Expose metrics in Prometheus format and add annotations to your pod:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
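
Before expecting Prometheus to pick the target up, it is worth confirming the application really serves metrics on the annotated port and path; the pod name and port 8080 below are placeholders matching the annotations above:

kubectl port-forward pod/<my-app-pod> 8080:8080
curl http://localhost:8080/metrics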

Monitoring Endpoints

Internal service endpoints:

  • Prometheus: http://prometheus.observability.svc.cluster.local:9090
  • Loki: http://loki.observability.svc.cluster.local:3100
  • Tempo:
    • HTTP: http://tempo.observability.svc.cluster.local:3200
    • OTLP gRPC: tempo.observability.svc.cluster.local:4317
    • OTLP HTTP: tempo.observability.svc.cluster.local:4318
  • Grafana: http://grafana.observability.svc.cluster.local:3000
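
None of these are exposed outside the cluster by default; for ad-hoc access from a workstation, kubectl port-forward against the services works, for example:

kubectl port-forward -n observability svc/prometheus 9090:9090
kubectl port-forward -n observability svc/loki 3100:3100
kubectl port-forward -n observability svc/tempo 3200:3200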

Troubleshooting

Check Pod Status

kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability

View Logs

kubectl logs -n observability -l app=grafana
kubectl logs -n observability -l app=prometheus
kubectl logs -n observability -l app=loki
kubectl logs -n observability -l app=tempo
kubectl logs -n observability -l app=alloy

Check Storage

kubectl get pv
kubectl get pvc -n observability

Test Connectivity

# From inside cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
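
The same approach works for the other components; Loki and Tempo expose /ready and Grafana exposes /api/health:

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://grafana.observability.svc.cluster.local:3000/api/health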

Common Issues

Pods stuck in Pending

  • Check if storage directories exist on hetzner-2
  • Verify PV/PVC bindings: kubectl describe pvc -n observability

Loki won't start

  • Check permissions on /mnt/local-ssd/loki (should be 10001:10001)
  • View logs: kubectl logs -n observability loki-0

No logs appearing

  • Check Alloy pods are running: kubectl get pods -n observability -l app=alloy
  • View Alloy logs: kubectl logs -n observability -l app=alloy

Grafana can't reach datasources

  • Verify services: kubectl get svc -n observability
  • Check datasource URLs in Grafana UI

Updating Configuration

Update Prometheus Scrape Config

kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability

Update Loki Retention

kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability

Update Alloy Collection Rules

kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
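
After any of these restarts, confirm the rollout completed before moving on:

kubectl rollout status statefulset/prometheus -n observability
kubectl rollout status daemonset/alloy -n observability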

Resource Usage

Expected resource consumption:

Component                   CPU Request   CPU Limit   Memory Request   Memory Limit
Prometheus                  500m          2000m       2Gi              4Gi
Loki                        500m          2000m       1Gi              2Gi
Tempo                       500m          2000m       1Gi              2Gi
Grafana                     250m          1000m       512Mi            1Gi
Alloy (per node)            100m          500m        256Mi            512Mi
kube-state-metrics          100m          200m        128Mi            256Mi
node-exporter (per node)    100m          200m        128Mi            256Mi

Total requests (single node): ~2.1 CPU cores, ~5Gi memory; limits add up to ~7.9 CPU cores and ~10Gi memory.
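
Actual consumption can be compared against these figures with kubectl top, assuming metrics-server is installed in the cluster:

kubectl top pods -n observability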

Security Considerations

  1. Change default Grafana password immediately after deployment
  2. Consider adding authentication for internal services if exposed
  3. Review and restrict RBAC permissions as needed
  4. Enable audit logging in Loki for sensitive namespaces
  5. Consider adding NetworkPolicies to restrict traffic
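
As a starting point for item 5, a minimal NetworkPolicy sketch is shown below. It assumes the NGINX Ingress Controller runs in a namespace named ingress-nginx and only restricts ingress traffic; adjust namespaces, labels, and ports to match your cluster before applying:

kubectl apply -n observability -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-observability-ingress
spec:
  podSelector: {}            # every pod in the observability namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # traffic between stack components in this namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed ingress namespace
    - ports:                 # OTLP trace ingest from application namespaces
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318
EOF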

Documentation

This deployment includes comprehensive documentation and helper files:

  • README.md: Complete deployment and configuration guide (this file)
  • MONITORING-GUIDE.md: URLs, access, and how to monitor new applications
  • DEPLOYMENT-CHECKLIST.md: Step-by-step deployment checklist
  • QUICKREF.md: Quick reference for daily operations
  • demo-app.yaml: Example fully instrumented application
  • deploy.sh: Automated deployment script
  • status.sh: Health check script
  • cleanup.sh: Complete stack removal
  • remove-old-monitoring.sh: Remove existing monitoring before deployment
  • 21-optional-ingresses.yaml: Optional external access to Prometheus/Loki/Tempo

Future Enhancements

  • Add Alertmanager for alerting
  • Configure Grafana SMTP for email notifications
  • Add custom dashboards for your applications
  • Implement Grafana RBAC for team access
  • Consider Mimir for long-term metrics storage
  • Add backup/restore procedures

Support

For issues or questions:

  1. Check pod logs first
  2. Review Grafana datasource configuration
  3. Verify network connectivity between components
  4. Check storage and resource availability

Version Information

  • Grafana: 11.4.0
  • Prometheus: 2.54.1
  • Loki: 3.2.1
  • Tempo: 2.6.1
  • Alloy: 1.5.1
  • kube-state-metrics: 2.13.0
  • node-exporter: 1.8.2

Last updated: January 2025