State-of-the-Art Observability Stack for Kubernetes

This deployment provides a comprehensive, production-ready observability solution built on the Grafana LGTM stack (Loki, Grafana, Tempo, and Prometheus standing in for Mimir), with unified collection through Grafana Alloy.

Architecture Overview

Core Components

  1. Grafana (v11.4.0) - Unified visualization platform

    • Pre-configured datasources for Prometheus, Loki, and Tempo
    • Automatic correlation between logs, metrics, and traces
    • Modern UI with TraceQL editor support
  2. Prometheus (v2.54.1) - Metrics collection and storage

    • 7-day retention
    • Comprehensive Kubernetes service discovery
    • Scrapes: API server, nodes, cadvisor, pods, services
  3. Grafana Loki (v3.2.1) - Log aggregation

    • 7-day retention with compaction
    • TSDB index for efficient queries
    • Automatic correlation with traces
  4. Grafana Tempo (v2.6.1) - Distributed tracing

    • 7-day retention
    • Multiple protocol support: OTLP, Jaeger, Zipkin
    • Metrics generation from traces
    • Automatic correlation with logs and metrics
  5. Grafana Alloy (v1.5.1) - Unified observability agent

    • Replaces Promtail, Vector, Fluent Bit
    • Collects logs from all pods
    • OTLP receiver for traces
    • Runs as DaemonSet on all nodes
  6. kube-state-metrics (v2.13.0) - Kubernetes object metrics

    • Deployment, Pod, Service, Node metrics
    • Essential for cluster monitoring
  7. node-exporter (v1.8.2) - Node-level system metrics

    • CPU, memory, disk, network metrics
    • Runs on all nodes via DaemonSet

Key Features

  • Unified Observability: Logs, metrics, and traces in one platform
  • Automatic Correlation: Click from logs to traces to metrics seamlessly
  • 7-Day Retention: Optimized for single-node cluster
  • Local SSD Storage: Fast, persistent storage on hetzner-2 node
  • OTLP Support: Modern OpenTelemetry protocol support
  • TLS Enabled: Secure access via NGINX Ingress with Let's Encrypt
  • Low Resource Footprint: Optimized for single-node deployment

Storage Layout

All data is stored on the local SSD at /mnt/local-ssd/:

/mnt/local-ssd/
├── prometheus/    (50Gi)  - Metrics data
├── loki/          (100Gi) - Log data
├── tempo/         (50Gi)  - Trace data
└── grafana/       (10Gi)  - Dashboards and settings

Deployment Instructions

Prerequisites

  1. Kubernetes cluster with NGINX Ingress Controller
  2. cert-manager installed with Let's Encrypt issuer
  3. DNS record: grafana.betelgeusebytes.io → your cluster IP
  4. Node labeled: kubernetes.io/hostname=hetzner-2
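
The label, ingress controller, cert-manager, and DNS record can be checked up front; the commands below assume the controller and cert-manager live in the default ingress-nginx and cert-manager namespaces, so adjust if yours differ:

kubectl get nodes -l kubernetes.io/hostname=hetzner-2
kubectl get pods -n ingress-nginx
kubectl get pods -n cert-manager
dig +short grafana.betelgeusebytes.io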

Step 0: Remove Existing Monitoring (If Applicable)

If you have an existing monitoring stack (Prometheus, Grafana, Loki, Fluent Bit, etc.), remove it first to avoid conflicts:

./remove-old-monitoring.sh

This interactive script will help you safely remove:

  • Existing Prometheus/Grafana/Loki/Tempo deployments
  • Helm releases for monitoring components
  • Fluent Bit, Vector, or other log collectors
  • Related ConfigMaps, PVCs, and RBAC resources
  • Prometheus Operator CRDs (if applicable)

Note: The main deployment script (deploy.sh) will also prompt you to run cleanup if needed.

Step 1: Prepare Storage Directories

SSH into the hetzner-2 node and create directories:

sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana
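
A quick listing with numeric IDs confirms the ownership matches what each component expects:

ls -ln /mnt/local-ssd/
# Expected owners: prometheus 65534, loki 10001, tempo 0 (root), grafana 472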

Step 2: Deploy the Stack

chmod +x deploy.sh
./deploy.sh

Or deploy manually:

kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml

Step 3: Verify Deployment

kubectl get pods -n observability
kubectl get pv
kubectl get pvc -n observability

All pods should be in Running state:

  • grafana-0
  • loki-0
  • prometheus-0
  • tempo-0
  • alloy-xxxxx (one per node)
  • kube-state-metrics-xxxxx
  • node-exporter-xxxxx (one per node)
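
To block until everything is ready instead of polling manually, kubectl wait can be used (the 5-minute timeout is an arbitrary choice):

kubectl wait --for=condition=Ready pods --all -n observability --timeout=300s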

Step 4: Access Grafana

  1. Open: https://grafana.betelgeusebytes.io
  2. Log in with the default credentials:
    • Username: admin
    • Password: admin
  3. IMPORTANT: Change the password on first login!

Using the Stack

Exploring Logs (Loki)

  1. In Grafana, go to Explore
  2. Select Loki datasource
  3. Example queries:
    {namespace="observability"}
    {namespace="observability", app="prometheus"}
    {namespace="default"} |= "error"
    {pod="my-app-xxx"} | json | level="error"
    

Exploring Metrics (Prometheus)

  1. In Grafana, go to Explore
  2. Select Prometheus datasource
  3. Example queries:
    up
    node_memory_MemAvailable_bytes
    rate(container_cpu_usage_seconds_total[5m])
    kube_pod_status_phase{namespace="observability"}
    

Exploring Traces (Tempo)

  1. In Grafana, go to Explore
  2. Select Tempo datasource
  3. Search by:
    • Service name
    • Duration
    • Tags
  4. Click on a trace to see detailed span timeline

Correlations

The stack automatically correlates:

  • Logs → Traces: Click traceID in logs to view trace
  • Traces → Logs: Click on trace to see related logs
  • Traces → Metrics: Tempo generates metrics from traces

Instrumenting Your Applications

For Logs

Logs are automatically collected from all pods by Alloy. Emit structured JSON logs:

{"level":"info","message":"Request processed","duration_ms":42}

For Traces

Send traces to Tempo using OTLP:

# Python with OpenTelemetry (requires opentelemetry-sdk and opentelemetry-exporter-otlp)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://tempo.observability.svc.cluster.local:4317")
    )
)
trace.set_tracer_provider(provider)
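
To verify spans are reaching Tempo, its search API on the HTTP port can be queried from inside the cluster (the limit parameter is just an example):

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl 'http://tempo.observability.svc.cluster.local:3200/api/search?limit=5'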

For Metrics

Expose metrics in Prometheus format and add annotations to your pod:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
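
Before expecting Prometheus to pick the target up, it is worth confirming the application really serves metrics on the annotated port and path; the pod name and port 8080 below are placeholders matching the annotations above:

kubectl port-forward pod/<my-app-pod> 8080:8080
curl http://localhost:8080/metrics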

Monitoring Endpoints

Internal service endpoints:

  • Prometheus: http://prometheus.observability.svc.cluster.local:9090
  • Loki: http://loki.observability.svc.cluster.local:3100
  • Tempo:
    • HTTP: http://tempo.observability.svc.cluster.local:3200
    • OTLP gRPC: tempo.observability.svc.cluster.local:4317
    • OTLP HTTP: tempo.observability.svc.cluster.local:4318
  • Grafana: http://grafana.observability.svc.cluster.local:3000
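
None of these are exposed outside the cluster by default; for ad-hoc access from a workstation, kubectl port-forward against the services works, for example:

kubectl port-forward -n observability svc/prometheus 9090:9090
kubectl port-forward -n observability svc/loki 3100:3100
kubectl port-forward -n observability svc/tempo 3200:3200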

Troubleshooting

Check Pod Status

kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability

View Logs

kubectl logs -n observability -l app=grafana
kubectl logs -n observability -l app=prometheus
kubectl logs -n observability -l app=loki
kubectl logs -n observability -l app=tempo
kubectl logs -n observability -l app=alloy

Check Storage

kubectl get pv
kubectl get pvc -n observability

Test Connectivity

# From inside cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
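
The same approach works for the other components; Loki and Tempo expose /ready and Grafana exposes /api/health:

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready

kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://grafana.observability.svc.cluster.local:3000/api/health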

Common Issues

Pods stuck in Pending

  • Check if storage directories exist on hetzner-2
  • Verify PV/PVC bindings: kubectl describe pvc -n observability

Loki won't start

  • Check permissions on /mnt/local-ssd/loki (should be 10001:10001)
  • View logs: kubectl logs -n observability loki-0

No logs appearing

  • Check Alloy pods are running: kubectl get pods -n observability -l app=alloy
  • View Alloy logs: kubectl logs -n observability -l app=alloy

Grafana can't reach datasources

  • Verify services: kubectl get svc -n observability
  • Check datasource URLs in Grafana UI

Updating Configuration

Update Prometheus Scrape Config

kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability

Update Loki Retention

kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability

Update Alloy Collection Rules

kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
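
After any of these restarts, confirm the rollout completed before moving on:

kubectl rollout status statefulset/prometheus -n observability
kubectl rollout status daemonset/alloy -n observability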

Resource Usage

Expected resource consumption:

Component                   CPU Request   CPU Limit   Memory Request   Memory Limit
Prometheus                  500m          2000m       2Gi              4Gi
Loki                        500m          2000m       1Gi              2Gi
Tempo                       500m          2000m       1Gi              2Gi
Grafana                     250m          1000m       512Mi            1Gi
Alloy (per node)            100m          500m        256Mi            512Mi
kube-state-metrics          100m          200m        128Mi            256Mi
node-exporter (per node)    100m          200m        128Mi            256Mi

Total requests (single node): ~2.1 CPU cores, ~5Gi memory; limits add up to ~7.9 CPU cores and ~10Gi memory.
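
Actual consumption can be compared against these figures with kubectl top, assuming metrics-server is installed in the cluster:

kubectl top pods -n observability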

Security Considerations

  1. Change default Grafana password immediately after deployment
  2. Consider adding authentication for internal services if exposed
  3. Review and restrict RBAC permissions as needed
  4. Enable audit logging in Loki for sensitive namespaces
  5. Consider adding NetworkPolicies to restrict traffic
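
As a starting point for item 5, a minimal NetworkPolicy sketch is shown below. It assumes the NGINX Ingress Controller runs in a namespace named ingress-nginx and only restricts ingress traffic; adjust namespaces, labels, and ports to match your cluster before applying:

kubectl apply -n observability -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-observability-ingress
spec:
  podSelector: {}            # every pod in the observability namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # traffic between stack components in this namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed ingress namespace
    - ports:                 # OTLP trace ingest from application namespaces
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318
EOF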

Documentation

This deployment includes comprehensive documentation and helper files:

  • README.md: Complete deployment and configuration guide (this file)
  • MONITORING-GUIDE.md: URLs, access, and how to monitor new applications
  • DEPLOYMENT-CHECKLIST.md: Step-by-step deployment checklist
  • QUICKREF.md: Quick reference for daily operations
  • demo-app.yaml: Example fully instrumented application
  • deploy.sh: Automated deployment script
  • status.sh: Health check script
  • cleanup.sh: Complete stack removal
  • remove-old-monitoring.sh: Remove existing monitoring before deployment
  • 21-optional-ingresses.yaml: Optional external access to Prometheus/Loki/Tempo

Future Enhancements

  • Add Alertmanager for alerting
  • Configure Grafana SMTP for email notifications
  • Add custom dashboards for your applications
  • Implement Grafana RBAC for team access
  • Consider Mimir for long-term metrics storage
  • Add backup/restore procedures

Support

For issues or questions:

  1. Check pod logs first
  2. Review Grafana datasource configuration
  3. Verify network connectivity between components
  4. Check storage and resource availability

Version Information

  • Grafana: 11.4.0
  • Prometheus: 2.54.1
  • Loki: 3.2.1
  • Tempo: 2.6.1
  • Alloy: 1.5.1
  • kube-state-metrics: 2.13.0
  • node-exporter: 1.8.2

Last updated: January 2025