Observability Stack Quick Reference
Before You Start
Remove Old Monitoring Stack
If you have existing monitoring components, remove them first:
./remove-old-monitoring.sh
This will safely remove:
- Prometheus, Grafana, Loki, Tempo deployments
- Fluent Bit, Vector, or other log collectors
- Helm releases
- ConfigMaps, PVCs, RBAC resources
- Prometheus Operator CRDs
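To confirm nothing was left behind, a quick check (this assumes the old stack lived in a namespace named "monitoring"; adjust the name to match your setup):
# old Helm releases, if any, should be gone
helm list --all-namespaces
# leftover workloads in the old namespace
kubectl get all -n monitoring
# Prometheus Operator CRDs should no longer be listed
kubectl get crd | grep monitoring.coreos.com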
Quick Access
- Grafana UI: https://grafana.betelgeusebytes.io
- Default Login: admin / admin (change immediately!)
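One way to change the admin password from the CLI (assumes the Grafana pod is named grafana-0, as used elsewhere in this guide):
# resets the admin password in place (newer Grafana images may use "grafana cli" instead of "grafana-cli")
kubectl exec -it -n observability grafana-0 -- \
  grafana-cli admin reset-admin-password <NEW_PASSWORD>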
Essential Commands
Check Status
# Quick status check
./status.sh
# View all pods
kubectl get pods -n observability -o wide
# Check specific component
kubectl get pods -n observability -l app=prometheus
kubectl get pods -n observability -l app=loki
kubectl get pods -n observability -l app=tempo
kubectl get pods -n observability -l app=grafana
# Check storage
kubectl get pv
kubectl get pvc -n observability
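To block until everything in the namespace reports Ready (useful right after an install or restart):
kubectl wait --for=condition=ready pod --all -n observability --timeout=300s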
View Logs
# Grafana
kubectl logs -n observability -l app=grafana -f
# Prometheus
kubectl logs -n observability -l app=prometheus -f
# Loki
kubectl logs -n observability -l app=loki -f
# Tempo
kubectl logs -n observability -l app=tempo -f
# Alloy (log collector)
kubectl logs -n observability -l app=alloy -f
Restart Components
# Restart Prometheus
kubectl rollout restart statefulset/prometheus -n observability
# Restart Loki
kubectl rollout restart statefulset/loki -n observability
# Restart Tempo
kubectl rollout restart statefulset/tempo -n observability
# Restart Grafana
kubectl rollout restart statefulset/grafana -n observability
# Restart Alloy
kubectl rollout restart daemonset/alloy -n observability
Update Configurations
# Edit Prometheus config
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability
# Edit Loki config
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability
# Edit Tempo config
kubectl edit configmap tempo-config -n observability
kubectl rollout restart statefulset/tempo -n observability
# Edit Alloy config
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
# Edit Grafana datasources
kubectl edit configmap grafana-datasources -n observability
kubectl rollout restart statefulset/grafana -n observability
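After any of the restarts above, you can watch the rollout complete before moving on, for example:
kubectl rollout status statefulset/prometheus -n observability
kubectl rollout status daemonset/alloy -n observability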
Common LogQL Queries (Loki)
Basic Queries
# All logs from observability namespace
{namespace="observability"}
# Logs from specific app
{namespace="observability", app="prometheus"}
# Filter by log level
{namespace="default"} |= "error"
{namespace="default"} | json | level="error"
# Exclude certain logs
{namespace="default"} != "health check"
# Multiple filters
{namespace="default"} |= "error" != "ignore"
Advanced Queries
# Rate of errors
rate({namespace="default"} |= "error" [5m])
# Count logs by level
sum by (level) (count_over_time({namespace="default"} | json [5m]))
# Top 10 error messages
topk(10, sum by (message) (
  count_over_time({namespace="default"} | json | level="error" [5m])
))
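These queries can also be run outside Grafana against Loki's HTTP API; a minimal sketch using a port-forward (the paths below are standard Loki API routes):
kubectl port-forward -n observability svc/loki 3100:3100
# in another terminal
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="observability"}' \
  --data-urlencode 'limit=10'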
Common PromQL Queries (Prometheus)
Cluster Health
# All targets up/down
up
# Pods by phase
kube_pod_status_phase{namespace="observability"}
# Node memory available
node_memory_MemAvailable_bytes
# Node CPU usage (user mode)
rate(node_cpu_seconds_total{mode="user"}[5m])
Container Metrics
# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])
# Memory usage by container
container_memory_usage_bytes
# Network traffic
rate(container_network_transmit_bytes_total[5m])
rate(container_network_receive_bytes_total[5m])
Application Metrics
# HTTP request rate
rate(http_requests_total[5m])
# Request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
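The same queries work against the Prometheus HTTP API, e.g. via a port-forward:
kubectl port-forward -n observability svc/prometheus 9090:9090
# in another terminal
curl -G -s "http://localhost:9090/api/v1/query" --data-urlencode 'query=up'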
Trace Search (Tempo)
In Grafana Explore with Tempo datasource:
- Search by service: Select from dropdown
- Search by duration: "> 1s", "< 100ms"
- Search by tag: http.status_code=500
- TraceQL: {span.http.method="POST" && span.http.status_code>=400}
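Traces can also be searched via Tempo's HTTP API; a sketch using tag-based search (the query parameters below are standard Tempo search options, though availability can vary by Tempo version):
kubectl port-forward -n observability svc/tempo 3200:3200
# in another terminal: search by tag and duration, then fetch a single trace by ID
curl -s "http://localhost:3200/api/search?tags=http.status_code%3D500&minDuration=1s&limit=20"
curl -s "http://localhost:3200/api/traces/<TRACE_ID>"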
Correlations
From Logs to Traces
- View logs in Loki
- Click on a log line with a trace ID
- Click the "Tempo" link
- Trace opens in Tempo
From Traces to Logs
- View trace in Tempo
- Click on a span
- Click "Logs for this span"
- Related logs appear
From Traces to Metrics
- View trace in Tempo
- Service graph shows metrics
- Click service to see metrics
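These links only appear if the datasources are wired together; a quick way to check that the Loki datasource has a trace-ID derived field and the Tempo datasource has trace-to-logs configured (assuming the grafana-datasources ConfigMap from the configuration section):
kubectl get configmap grafana-datasources -n observability -o yaml | grep -A5 derivedFields
kubectl get configmap grafana-datasources -n observability -o yaml | grep -A5 tracesToLogs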
Demo Application
Deploy the demo app to test the stack:
kubectl apply -f demo-app.yaml
# Wait for it to start
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
# Test it
kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/item/0
curl http://localhost:8080/slow
curl http://localhost:8080/error
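To generate enough traffic for the dashboards to show something, a simple loop works (keep the port-forward from above running):
for i in $(seq 1 50); do
  curl -s http://localhost:8080/items > /dev/null
  curl -s http://localhost:8080/item/0 > /dev/null
done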
Now view in Grafana:
- Logs: Search {app="demo-app"} in Loki
- Traces: Search the "demo-app" service in Tempo
- Metrics: Query flask_http_request_total in Prometheus
Storage Management
Check Disk Usage
# On hetzner-2 node
df -h /mnt/local-ssd/
# Detailed usage
du -sh /mnt/local-ssd/*
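If you don't want to SSH to the node, a throwaway debug pod can check the same path (the node name and mount path match the ones above; kubectl debug mounts the host filesystem under /host):
kubectl debug node/hetzner-2 -it --image=busybox -- df -h /host/mnt/local-ssd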
Cleanup Old Data
Data is automatically deleted after 7 days. To manually adjust retention:
Prometheus (in 03-prometheus-config.yaml):
args:
- '--storage.tsdb.retention.time=7d'
Loki (in 04-loki-config.yaml):
limits_config:
retention_period: 168h # 7 days
Tempo (in 05-tempo-config.yaml):
compactor:
compaction:
block_retention: 168h # 7 days
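After changing any of the retention settings, re-apply the manifest and restart the component, for example for Prometheus:
kubectl apply -f 03-prometheus-config.yaml
kubectl rollout restart statefulset/prometheus -n observability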
Troubleshooting
No Logs Appearing
# Check Alloy is running
kubectl get pods -n observability -l app=alloy
# Check Alloy logs
kubectl logs -n observability -l app=alloy
# Check Loki
kubectl logs -n observability -l app=loki
# Test Loki endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://loki.observability.svc.cluster.local:3100/ready
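If Loki is ready but Grafana still shows nothing, check whether any labels have been ingested at all (an empty list means Alloy is not shipping logs):
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://loki.observability.svc.cluster.local:3100/loki/api/v1/labels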
No Traces Appearing
# Check Tempo is running
kubectl get pods -n observability -l app=tempo
# Check Tempo logs
kubectl logs -n observability -l app=tempo
# Test Tempo endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://tempo.observability.svc.cluster.local:3200/ready
# Verify your app sends to correct endpoint
# Should be: tempo.observability.svc.cluster.local:4317 (gRPC)
# or: tempo.observability.svc.cluster.local:4318 (HTTP)
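To see whether Tempo is receiving spans at all, its own metrics endpoint is a quick check (the exact metric names can vary between Tempo versions):
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://tempo.observability.svc.cluster.local:3200/metrics | grep -i spans_received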
Grafana Can't Connect to Datasources
# Check all services are running
kubectl get svc -n observability
# Test from Grafana pod
kubectl exec -it -n observability grafana-0 -- \
wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
kubectl exec -it -n observability grafana-0 -- \
wget -O- http://loki.observability.svc.cluster.local:3100/ready
kubectl exec -it -n observability grafana-0 -- \
wget -O- http://tempo.observability.svc.cluster.local:3200/ready
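You can also ask Grafana itself which datasources it has provisioned (assumes the default admin credentials and jq installed; adjust if you changed them):
kubectl port-forward -n observability svc/grafana 3000:3000
# in another terminal
curl -s -u admin:admin http://localhost:3000/api/datasources | jq '.[].name'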
High Resource Usage
# Check resource usage
kubectl top pods -n observability
kubectl top nodes
# Scale down if needed (for testing)
kubectl scale statefulset/prometheus -n observability --replicas=0
kubectl scale statefulset/loki -n observability --replicas=0
Backup and Restore
Backup Grafana Dashboards
# List all dashboards via the API (the export loop below saves each one)
kubectl port-forward -n observability svc/grafana 3000:3000
# In another terminal
curl -H "Authorization: Bearer <API_KEY>" \
http://localhost:3000/api/search?type=dash-db | jq
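To actually save the dashboard JSON rather than just list it, a minimal export loop, assuming jq is installed and the API key has at least Viewer access:
for uid in $(curl -s -H "Authorization: Bearer <API_KEY>" \
    "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer <API_KEY>" \
    "http://localhost:3000/api/dashboards/uid/$uid" | jq '.dashboard' > "dashboard-$uid.json"
done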
Backup Configurations
# Backup all ConfigMaps
kubectl get configmap -n observability -o yaml > configmaps-backup.yaml
# Backup specific config
kubectl get configmap prometheus-config -n observability -o yaml > prometheus-config-backup.yaml
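Restoring is the reverse: re-apply the backup, then restart the affected components so they reload the config (see Restart Components above):
kubectl apply -f configmaps-backup.yaml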
Useful Dashboards in Grafana
After login, import these dashboard IDs:
- 315: Kubernetes cluster monitoring
- 7249: Kubernetes cluster
- 13639: Loki dashboard
- 12611: Tempo dashboard
- 3662: Prometheus 2.0 stats
- 1860: Node Exporter Full
Go to: Dashboards → Import → Enter ID → Load
Performance Tuning
For Higher Load
Increase resources in respective YAML files:
resources:
requests:
cpu: 1000m # from 500m
memory: 4Gi # from 2Gi
limits:
cpu: 4000m # from 2000m
memory: 8Gi # from 4Gi
For Lower Resource Usage
- Increase the scrape interval in the Prometheus config (scrape less often)
- Reduce log retention periods
- Reduce trace sampling rate
Security Checklist
- Change Grafana admin password
- Review RBAC permissions
- Enable audit logging
- Consider adding NetworkPolicies
- Review ingress TLS configuration
- Backup configurations regularly
Getting Help
- Check component logs first
- Review configurations
- Test network connectivity
- Check resource availability
- Review Grafana datasource settings