# Observability Stack Quick Reference

## Before You Start

### Remove Old Monitoring Stack

If you have existing monitoring components, remove them first:

```bash
./remove-old-monitoring.sh
```

This will safely remove:

- Prometheus, Grafana, Loki, Tempo deployments
- Fluent Bit, Vector, or other log collectors
- Helm releases
- ConfigMaps, PVCs, RBAC resources
- Prometheus Operator CRDs

## Quick Access

- **Grafana UI**: https://grafana.betelgeusebytes.io
- **Default Login**: admin / admin (change immediately!)

## Essential Commands

### Check Status

```bash
# Quick status check
./status.sh

# View all pods
kubectl get pods -n observability -o wide

# Check specific component
kubectl get pods -n observability -l app=prometheus
kubectl get pods -n observability -l app=loki
kubectl get pods -n observability -l app=tempo
kubectl get pods -n observability -l app=grafana

# Check storage
kubectl get pv
kubectl get pvc -n observability
```

### View Logs

```bash
# Grafana
kubectl logs -n observability -l app=grafana -f

# Prometheus
kubectl logs -n observability -l app=prometheus -f

# Loki
kubectl logs -n observability -l app=loki -f

# Tempo
kubectl logs -n observability -l app=tempo -f

# Alloy (log collector)
kubectl logs -n observability -l app=alloy -f
```

### Restart Components

```bash
# Restart Prometheus
kubectl rollout restart statefulset/prometheus -n observability

# Restart Loki
kubectl rollout restart statefulset/loki -n observability

# Restart Tempo
kubectl rollout restart statefulset/tempo -n observability

# Restart Grafana
kubectl rollout restart statefulset/grafana -n observability

# Restart Alloy
kubectl rollout restart daemonset/alloy -n observability
```

### Update Configurations

```bash
# Edit Prometheus config
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability

# Edit Loki config
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability

# Edit Tempo config
kubectl edit configmap tempo-config -n observability
kubectl rollout restart statefulset/tempo -n observability

# Edit Alloy config
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability

# Edit Grafana datasources
kubectl edit configmap grafana-datasources -n observability
kubectl rollout restart statefulset/grafana -n observability
```

## Common LogQL Queries (Loki)

### Basic Queries

```logql
# All logs from observability namespace
{namespace="observability"}

# Logs from specific app
{namespace="observability", app="prometheus"}

# Filter by log level
{namespace="default"} |= "error"
{namespace="default"} | json | level="error"

# Exclude certain logs
{namespace="default"} != "health check"

# Multiple filters
{namespace="default"} |= "error" != "ignore"
```

### Advanced Queries

```logql
# Rate of errors
rate({namespace="default"} |= "error" [5m])

# Count logs by level
sum by (level) (count_over_time({namespace="default"} | json [5m]))

# Top 10 error messages
topk(10, count by (message) (
  {namespace="default"} | json | level="error"
))
```

## Common PromQL Queries (Prometheus)

### Cluster Health

```promql
# All targets up/down
up

# Pods by phase
kube_pod_status_phase{namespace="observability"}

# Node memory available
node_memory_MemAvailable_bytes

# Node CPU usage
rate(node_cpu_seconds_total{mode="user"}[5m])
```

### Container Metrics

```promql
# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by container
container_memory_usage_bytes

# Network traffic
rate(container_network_transmit_bytes_total[5m])
rate(container_network_receive_bytes_total[5m])
```

### Application Metrics

```promql
# HTTP request rate
rate(http_requests_total[5m])

# Request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m])
```

## Trace Search (Tempo)

In Grafana Explore with the Tempo datasource:

- **Search by service**: Select from the dropdown
- **Search by duration**: "> 1s", "< 100ms"
- **Search by tag**: `http.status_code=500`
- **TraceQL**: `{span.http.method="POST" && span.http.status_code>=400}`

## Correlations

### From Logs to Traces

1. View logs in Loki
2. Click on a log line with a trace ID
3. Click the "Tempo" link
4. The trace opens in Tempo

### From Traces to Logs

1. View the trace in Tempo
2. Click on a span
3. Click "Logs for this span"
4. Related logs appear

### From Traces to Metrics

1. View the trace in Tempo
2. The service graph shows metrics
3. Click a service to see its metrics
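The logs-to-traces jump works only if the Loki datasource extracts a trace ID from each log line via a derived field. A minimal provisioning sketch is below; it assumes logs are JSON with a `trace_id` field and that the Tempo datasource has UID `tempo` — both are assumptions, so match them to the actual `grafana-datasources` ConfigMap in this cluster:

```yaml
# Hypothetical excerpt of the Loki entry in grafana-datasources.
# The matcher regex and the Tempo datasource UID are assumptions.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki.observability.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'  # pulls the trace ID out of the log line
          datasourceUid: tempo                # UID of the Tempo datasource
          url: '$${__value.raw}'              # captured ID becomes the Tempo query; "$$" escapes provisioning variable expansion
```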
## Demo Application

Deploy the demo app to test the stack:

```bash
kubectl apply -f demo-app.yaml

# Wait for it to start
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s

# Test it
kubectl port-forward -n observability svc/demo-app 8080:8080

# In another terminal
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/item/0
curl http://localhost:8080/slow
curl http://localhost:8080/error
```

Now view in Grafana:

- **Logs**: Search `{app="demo-app"}` in Loki
- **Traces**: Search for the "demo-app" service in Tempo
- **Metrics**: Query `flask_http_request_total` in Prometheus

## Storage Management

### Check Disk Usage

```bash
# On the hetzner-2 node
df -h /mnt/local-ssd/

# Detailed usage
du -sh /mnt/local-ssd/*
```

### Cleanup Old Data

Data is automatically deleted after 7 days. To manually adjust retention:

**Prometheus** (in 03-prometheus-config.yaml):

```yaml
args:
  - '--storage.tsdb.retention.time=7d'
```

**Loki** (in 04-loki-config.yaml):

```yaml
limits_config:
  retention_period: 168h  # 7 days
```

**Tempo** (in 05-tempo-config.yaml):

```yaml
compactor:
  compaction:
    block_retention: 168h  # 7 days
```

## Troubleshooting

### No Logs Appearing

```bash
# Check Alloy is running
kubectl get pods -n observability -l app=alloy

# Check Alloy logs
kubectl logs -n observability -l app=alloy

# Check Loki
kubectl logs -n observability -l app=loki

# Test Loki endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
```

### No Traces Appearing

```bash
# Check Tempo is running
kubectl get pods -n observability -l app=tempo

# Check Tempo logs
kubectl logs -n observability -l app=tempo

# Test Tempo endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready

# Verify your app sends to the correct endpoint
# Should be: tempo.observability.svc.cluster.local:4317 (gRPC)
#        or: tempo.observability.svc.cluster.local:4318 (HTTP)
```
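If the endpoint is the problem, the simplest fix for an app instrumented with an OpenTelemetry SDK is to set the standard OTLP environment variables on its Deployment instead of hard-coding the address. A minimal sketch, assuming a hypothetical `my-app` container (name and image are placeholders):

```yaml
# Hypothetical container spec excerpt; adjust names to your Deployment.
spec:
  containers:
    - name: my-app
      image: my-app:latest
      env:
        - name: OTEL_SERVICE_NAME
          value: "my-app"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://tempo.observability.svc.cluster.local:4318"  # HTTP; use :4317 for gRPC
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "http/protobuf"  # or "grpc" when pointing at :4317
```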
### Grafana Can't Connect to Datasources

```bash
# Check all services are running
kubectl get svc -n observability

# Test from the Grafana pod
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://loki.observability.svc.cluster.local:3100/ready
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://tempo.observability.svc.cluster.local:3200/ready
```

### High Resource Usage

```bash
# Check resource usage
kubectl top pods -n observability
kubectl top nodes

# Scale down if needed (for testing)
kubectl scale statefulset/prometheus -n observability --replicas=0
kubectl scale statefulset/loki -n observability --replicas=0
```

## Backup and Restore

### Backup Grafana Dashboards

```bash
# Export all dashboards via API
kubectl port-forward -n observability svc/grafana 3000:3000

# In another terminal (replace <API_TOKEN> with a Grafana service account token)
curl -H "Authorization: Bearer <API_TOKEN>" \
  "http://localhost:3000/api/search?type=dash-db" | jq
```

### Backup Configurations

```bash
# Backup all ConfigMaps
kubectl get configmap -n observability -o yaml > configmaps-backup.yaml

# Backup a specific config
kubectl get configmap prometheus-config -n observability -o yaml > prometheus-config-backup.yaml
```

## Useful Dashboards in Grafana

After login, import these dashboard IDs:

- **315**: Kubernetes cluster monitoring
- **7249**: Kubernetes cluster
- **13639**: Loki dashboard
- **12611**: Tempo dashboard
- **3662**: Prometheus 2.0 stats
- **1860**: Node Exporter Full

Go to: Dashboards → Import → Enter ID → Load

## Performance Tuning

### For Higher Load

Increase resources in the respective YAML files:

```yaml
resources:
  requests:
    cpu: 1000m      # from 500m
    memory: 4Gi     # from 2Gi
  limits:
    cpu: 4000m      # from 2000m
    memory: 8Gi     # from 4Gi
```

### For Lower Resource Usage

- Increase the Prometheus scrape interval (scrape less frequently)
- Reduce log retention periods
- Reduce the trace sampling rate

## Security Checklist

- [ ] Change the Grafana admin password
- [ ] Review RBAC permissions
- [ ] Enable audit logging
- [ ] Consider adding NetworkPolicies (see the sketch below)
- [ ] Review ingress TLS configuration
- [ ] Back up configurations regularly
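For the NetworkPolicies item, a possible starting point is to restrict ingress to Loki's ingest port. This is only a sketch: the pod label and the "allow all namespaces" selector are assumptions and should be matched to the actual manifests (and tightened) before applying.

```yaml
# Hypothetical policy: allow in-cluster traffic to Loki's port 3100 only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingest
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: loki                 # assumes Loki pods carry this label (matches the kubectl selectors above)
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {} # any namespace in the cluster; tighten to specific namespaces as needed
      ports:
        - protocol: TCP
          port: 3100            # Loki HTTP/ingest port used elsewhere in this guide
```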
## Getting Help

1. Check component logs first
2. Review configurations
3. Test network connectivity
4. Check resource availability
5. Review Grafana datasource settings
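The commands behind these steps are spread across the sections above; the sketch below strings them together as a quick first pass (everything in it reuses commands already shown in this guide).

```bash
#!/usr/bin/env bash
# First-pass triage: logs, configs, connectivity, resources, datasources.
set -euo pipefail

# 1. Component logs (last 50 lines each)
for app in grafana prometheus loki tempo alloy; do
  echo "=== $app logs ==="
  kubectl logs -n observability -l app=$app --tail=50
done

# 2. Configurations currently in the cluster
kubectl get configmap -n observability

# 3. Network connectivity to the backend readiness endpoints
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- sh -c '
  curl -s http://prometheus.observability.svc.cluster.local:9090/-/healthy && echo
  curl -s http://loki.observability.svc.cluster.local:3100/ready && echo
  curl -s http://tempo.observability.svc.cluster.local:3200/ready && echo'

# 4. Resource availability
kubectl top pods -n observability
kubectl top nodes

# 5. Grafana datasource settings
kubectl get configmap grafana-datasources -n observability -o yaml
```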