State-of-the-Art Observability Stack for Kubernetes
This deployment provides a comprehensive, production-ready observability solution built on the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir; this deployment uses Prometheus in place of Mimir), with unified collection through Grafana Alloy.
Architecture Overview
Core Components
- Grafana (v11.4.0) - Unified visualization platform
  - Pre-configured datasources for Prometheus, Loki, and Tempo
  - Automatic correlation between logs, metrics, and traces
  - Modern UI with TraceQL editor support
- Prometheus (v2.54.1) - Metrics collection and storage
  - 7-day retention
  - Comprehensive Kubernetes service discovery
  - Scrapes: API server, nodes, cAdvisor, pods, services
- Grafana Loki (v3.2.1) - Log aggregation
  - 7-day retention with compaction
  - TSDB index for efficient queries
  - Automatic correlation with traces
- Grafana Tempo (v2.6.1) - Distributed tracing
  - 7-day retention
  - Multiple protocol support: OTLP, Jaeger, Zipkin
  - Metrics generation from traces
  - Automatic correlation with logs and metrics
- Grafana Alloy (v1.5.1) - Unified observability agent
  - Replaces Promtail, Vector, and Fluent Bit
  - Collects logs from all pods
  - OTLP receiver for traces
  - Runs as a DaemonSet on all nodes
- kube-state-metrics (v2.13.0) - Kubernetes object metrics
  - Deployment, Pod, Service, and Node metrics
  - Essential for cluster monitoring
- node-exporter (v1.8.2) - Node-level system metrics
  - CPU, memory, disk, and network metrics
  - Runs on all nodes via a DaemonSet
Key Features
- Unified Observability: Logs, metrics, and traces in one platform
- Automatic Correlation: Click from logs to traces to metrics seamlessly
- 7-Day Retention: Optimized for single-node cluster
- Local SSD Storage: Fast, persistent storage on hetzner-2 node
- OTLP Support: Modern OpenTelemetry protocol support
- TLS Enabled: Secure access via NGINX Ingress with Let's Encrypt
- Low Resource Footprint: Optimized for single-node deployment
Storage Layout
All data stored on local SSD at /mnt/local-ssd/:
/mnt/local-ssd/
├── prometheus/ (50Gi) - Metrics data
├── loki/ (100Gi) - Log data
├── tempo/ (50Gi) - Trace data
└── grafana/ (10Gi) - Dashboards and settings
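The actual PersistentVolume definitions live in 01-persistent-volumes.yaml; assuming they use Kubernetes local volumes pinned to the node via nodeAffinity (the standard pattern for local SSD storage), each one is shaped roughly like this sketch for Prometheus:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv          # illustrative name; see 01-persistent-volumes.yaml
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # keep data if the claim is deleted
  storageClassName: local-storage
  local:
    path: /mnt/local-ssd/prometheus
  nodeAffinity:                # local volumes must be pinned to their node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - hetzner-2
```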
Deployment Instructions
Prerequisites
- Kubernetes cluster with NGINX Ingress Controller
- cert-manager installed with Let's Encrypt issuer
- DNS record: grafana.betelgeusebytes.io → your cluster IP
- Node labeled: kubernetes.io/hostname=hetzner-2
Step 0: Remove Existing Monitoring (If Applicable)
If you have an existing monitoring stack (Prometheus, Grafana, Loki, Fluent Bit, etc.), remove it first to avoid conflicts:
./remove-old-monitoring.sh
This interactive script will help you safely remove:
- Existing Prometheus/Grafana/Loki/Tempo deployments
- Helm releases for monitoring components
- Fluent Bit, Vector, or other log collectors
- Related ConfigMaps, PVCs, and RBAC resources
- Prometheus Operator CRDs (if applicable)
Note: The main deployment script (deploy.sh) will also prompt you to run cleanup if needed.
Step 1: Prepare Storage Directories
SSH into the hetzner-2 node and create directories:
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana
Step 2: Deploy the Stack
chmod +x deploy.sh
./deploy.sh
Or deploy manually:
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
Step 3: Verify Deployment
kubectl get pods -n observability
kubectl get pv
kubectl get pvc -n observability
All pods should be in Running state:
- grafana-0
- loki-0
- prometheus-0
- tempo-0
- alloy-xxxxx (one per node)
- kube-state-metrics-xxxxx
- node-exporter-xxxxx (one per node)
Step 4: Access Grafana
- Open: https://grafana.betelgeusebytes.io
- Login with the default credentials:
  - Username: admin
  - Password: admin
- IMPORTANT: Change the password on first login!
Using the Stack
Exploring Logs (Loki)
- In Grafana, go to Explore
- Select Loki datasource
- Example queries:
{namespace="observability"}
{namespace="observability", app="prometheus"}
{namespace="default"} |= "error"
{pod="my-app-xxx"} | json | level="error"
Exploring Metrics (Prometheus)
- In Grafana, go to Explore
- Select Prometheus datasource
- Example queries:
up
node_memory_MemAvailable_bytes
rate(container_cpu_usage_seconds_total[5m])
kube_pod_status_phase{namespace="observability"}
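The same queries can be issued against Prometheus's HTTP API (GET /api/v1/query) from any pod in the cluster. A minimal sketch building such a request URL with only the standard library, using the in-cluster service address listed under Monitoring Endpoints:

```python
from urllib.parse import urlencode

# In-cluster Prometheus service (see Monitoring Endpoints).
PROM = "http://prometheus.observability.svc.cluster.local:9090"

def prom_instant_query_url(promql: str) -> str:
    """Build a URL for an instant query against Prometheus's HTTP API."""
    return f"{PROM}/api/v1/query?{urlencode({'query': promql})}"

url = prom_instant_query_url('rate(container_cpu_usage_seconds_total[5m])')
```

Fetch the URL with curl or any HTTP client from inside the cluster; the response is JSON with a `data.result` array of samples.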
Exploring Traces (Tempo)
- In Grafana, go to Explore
- Select Tempo datasource
- Search by:
- Service name
- Duration
- Tags
- Click on a trace to see detailed span timeline
Correlations
The stack automatically correlates:
- Logs → Traces: Click traceID in logs to view trace
- Traces → Logs: Click on trace to see related logs
- Traces → Metrics: Tempo generates metrics from traces
Instrumenting Your Applications
For Logs
Logs are automatically collected from all pods by Alloy. Emit structured JSON logs:
{"level":"info","message":"Request processed","duration_ms":42}
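A minimal sketch of emitting such a line from Python with only the standard library. The field names here (ts, level, message, trace_id) are illustrative, not required by the stack; including a trace_id field is what lets Grafana's log-to-trace correlation link the entry to Tempo:

```python
import json
import sys
import time

def log(level: str, message: str, **fields):
    """Emit one structured JSON log line to stdout, where Alloy collects it."""
    record = {"ts": time.time(), "level": level, "message": message, **fields}
    sys.stdout.write(json.dumps(record) + "\n")

log("info", "Request processed", duration_ms=42, trace_id="abc123")
```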
For Traces
Send traces to Tempo using OTLP:
# Python with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://tempo.observability.svc.cluster.local:4317")
    )
)
trace.set_tracer_provider(provider)
For Metrics
Expose metrics in Prometheus format and add annotations to your pod:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
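With those annotations in place, the pod only needs to serve plain-text metrics in the Prometheus exposition format on the declared port and path. A dependency-free sketch of what a /metrics payload looks like (a real application would normally use a Prometheus client library instead; the metric names are illustrative):

```python
def render_metrics(request_count: int, mem_bytes: float) -> str:
    """Render two metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total HTTP requests handled.",
        "# TYPE app_requests_total counter",
        f"app_requests_total {request_count}",
        "# HELP app_memory_bytes Resident memory in bytes.",
        "# TYPE app_memory_bytes gauge",
        f"app_memory_bytes {mem_bytes}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(1027, 52_428_800.0))
```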
Monitoring Endpoints
Internal service endpoints:
- Prometheus: http://prometheus.observability.svc.cluster.local:9090
- Loki: http://loki.observability.svc.cluster.local:3100
- Tempo:
  - HTTP: http://tempo.observability.svc.cluster.local:3200
  - OTLP gRPC: tempo.observability.svc.cluster.local:4317
  - OTLP HTTP: tempo.observability.svc.cluster.local:4318
- Grafana: http://grafana.observability.svc.cluster.local:3000
Troubleshooting
Check Pod Status
kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability
View Logs
kubectl logs -n observability -l app=grafana
kubectl logs -n observability -l app=prometheus
kubectl logs -n observability -l app=loki
kubectl logs -n observability -l app=tempo
kubectl logs -n observability -l app=alloy
Check Storage
kubectl get pv
kubectl get pvc -n observability
Test Connectivity
# From inside cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
Common Issues
Pods stuck in Pending
- Check if storage directories exist on hetzner-2
- Verify PV/PVC bindings:
kubectl describe pvc -n observability
Loki won't start
- Check permissions on /mnt/local-ssd/loki (should be owned by 10001:10001)
- View logs: kubectl logs -n observability loki-0
No logs appearing
- Check that the Alloy pods are running: kubectl get pods -n observability -l app=alloy
- View Alloy logs: kubectl logs -n observability -l app=alloy
Grafana can't reach datasources
- Verify services: kubectl get svc -n observability
- Check the datasource URLs in the Grafana UI
Updating Configuration
Update Prometheus Scrape Config
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability
Update Loki Retention
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability
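Retention in Loki 3.x is enforced by the compactor together with a per-tenant limit. Assuming 04-loki-config.yaml follows the standard single-binary layout, the 7-day setting corresponds to fields like these (names per upstream Loki documentation; verify against your actual ConfigMap before editing):

```yaml
compactor:
  retention_enabled: true      # compactor deletes expired chunks, not just compacts
limits_config:
  retention_period: 168h       # 7 days
```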
Update Alloy Collection Rules
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
Resource Usage
Expected resource consumption:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Prometheus | 500m | 2000m | 2Gi | 4Gi |
| Loki | 500m | 2000m | 1Gi | 2Gi |
| Tempo | 500m | 2000m | 1Gi | 2Gi |
| Grafana | 250m | 1000m | 512Mi | 1Gi |
| Alloy (per node) | 100m | 500m | 256Mi | 512Mi |
| kube-state-metrics | 100m | 200m | 128Mi | 256Mi |
| node-exporter (per node) | 100m | 200m | 128Mi | 256Mi |
Total (single node): requests ~2.1 CPU cores / ~5Gi memory; limits ~7.9 CPU cores / ~10Gi memory
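The totals can be checked directly against the table; a quick sketch summing the requests and limits (Alloy and node-exporter run once per node, so each contributes a single instance on a single-node cluster):

```python
# Per-component values from the table above:
# name: (cpu_req_m, cpu_lim_m, mem_req_Mi, mem_lim_Mi)
components = {
    "prometheus":         (500, 2000, 2048, 4096),
    "loki":               (500, 2000, 1024, 2048),
    "tempo":              (500, 2000, 1024, 2048),
    "grafana":            (250, 1000,  512, 1024),
    "alloy":              (100,  500,  256,  512),
    "kube-state-metrics": (100,  200,  128,  256),
    "node-exporter":      (100,  200,  128,  256),
}

cpu_req = sum(v[0] for v in components.values())   # millicores
cpu_lim = sum(v[1] for v in components.values())
mem_req = sum(v[2] for v in components.values())   # MiB
mem_lim = sum(v[3] for v in components.values())

print(f"requests: {cpu_req / 1000:.2f} cores, {mem_req / 1024:.1f}Gi")
print(f"limits:   {cpu_lim / 1000:.2f} cores, {mem_lim / 1024:.1f}Gi")
```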
Security Considerations
- Change default Grafana password immediately after deployment
- Consider adding authentication for internal services if exposed
- Review and restrict RBAC permissions as needed
- Enable audit logging in Loki for sensitive namespaces
- Consider adding NetworkPolicies to restrict traffic
Documentation
This deployment includes comprehensive guides:
- README.md: Complete deployment and configuration guide (this file)
- MONITORING-GUIDE.md: URLs, access, and how to monitor new applications
- DEPLOYMENT-CHECKLIST.md: Step-by-step deployment checklist
- QUICKREF.md: Quick reference for daily operations
- demo-app.yaml: Example fully instrumented application
- deploy.sh: Automated deployment script
- status.sh: Health check script
- cleanup.sh: Complete stack removal
- remove-old-monitoring.sh: Remove existing monitoring before deployment
- 21-optional-ingresses.yaml: Optional external access to Prometheus/Loki/Tempo
Future Enhancements
- Add Alertmanager for alerting
- Configure Grafana SMTP for email notifications
- Add custom dashboards for your applications
- Implement Grafana RBAC for team access
- Consider Mimir for long-term metrics storage
- Add backup/restore procedures
Support
For issues or questions:
- Check pod logs first
- Review Grafana datasource configuration
- Verify network connectivity between components
- Check storage and resource availability
Version Information
- Grafana: 11.4.0
- Prometheus: 2.54.1
- Loki: 3.2.1
- Tempo: 2.6.1
- Alloy: 1.5.1
- kube-state-metrics: 2.13.0
- node-exporter: 1.8.2
Last updated: January 2025