Observability Stack Quick Reference

Before You Start

Remove Old Monitoring Stack

If you have existing monitoring components, remove them first:

./remove-old-monitoring.sh

This will safely remove the following (a rough manual equivalent is sketched after the list):

  • Prometheus, Grafana, Loki, Tempo deployments
  • Fluent Bit, Vector, or other log collectors
  • Helm releases
  • ConfigMaps, PVCs, RBAC resources
  • Prometheus Operator CRDs
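
If you want to see roughly what such a cleanup involves, the sketch below shows a manual equivalent. The release, namespace, and CRD names are examples only; substitute whatever your old stack actually used.

# Hypothetical manual cleanup -- adjust names to your old install
helm uninstall kube-prometheus-stack -n monitoring || true
helm uninstall loki -n monitoring || true
kubectl delete namespace monitoring --ignore-not-found
kubectl delete crd servicemonitors.monitoring.coreos.com \
  prometheusrules.monitoring.coreos.com --ignore-not-found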

Quick Access

Essential Commands

Check Status

# Quick status check
./status.sh

# View all pods
kubectl get pods -n observability -o wide

# Check specific component
kubectl get pods -n observability -l app=prometheus
kubectl get pods -n observability -l app=loki
kubectl get pods -n observability -l app=tempo
kubectl get pods -n observability -l app=grafana

# Check storage
kubectl get pv
kubectl get pvc -n observability

View Logs

# Grafana
kubectl logs -n observability -l app=grafana -f

# Prometheus
kubectl logs -n observability -l app=prometheus -f

# Loki
kubectl logs -n observability -l app=loki -f

# Tempo
kubectl logs -n observability -l app=tempo -f

# Alloy (log collector)
kubectl logs -n observability -l app=alloy -f

Restart Components

# Restart Prometheus
kubectl rollout restart statefulset/prometheus -n observability

# Restart Loki
kubectl rollout restart statefulset/loki -n observability

# Restart Tempo
kubectl rollout restart statefulset/tempo -n observability

# Restart Grafana
kubectl rollout restart statefulset/grafana -n observability

# Restart Alloy
kubectl rollout restart daemonset/alloy -n observability

Update Configurations

# Edit Prometheus config
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability

# Edit Loki config
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability

# Edit Tempo config
kubectl edit configmap tempo-config -n observability
kubectl rollout restart statefulset/tempo -n observability

# Edit Alloy config
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability

# Edit Grafana datasources
kubectl edit configmap grafana-datasources -n observability
kubectl rollout restart statefulset/grafana -n observability
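
If you treat the manifests in this directory as the source of truth, you can edit the file and re-apply it instead of editing the live ConfigMap. A sketch for Prometheus, assuming its ConfigMap is defined in the numbered manifest referenced in the retention section below:

# Edit 03-prometheus-config.yaml locally, then:
kubectl apply -f 03-prometheus-config.yaml
kubectl rollout restart statefulset/prometheus -n observability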

Common LogQL Queries (Loki)

Basic Queries

# All logs from observability namespace
{namespace="observability"}

# Logs from specific app
{namespace="observability", app="prometheus"}

# Filter by log level
{namespace="default"} |= "error"
{namespace="default"} | json | level="error"

# Exclude certain logs
{namespace="default"} != "health check"

# Multiple filters
{namespace="default"} |= "error" != "ignore"

Advanced Queries

# Rate of errors
rate({namespace="default"} |= "error" [5m])

# Count logs by level
sum by (level) (count_over_time({namespace="default"} | json [5m]))

# Top 10 error messages
topk(10, sum by (message) (
  count_over_time({namespace="default"} | json | level="error" [5m])
))
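
These queries can also be run outside Grafana against Loki's HTTP API, which is handy for scripting. A minimal sketch using the Loki service and port from the troubleshooting section:

kubectl port-forward -n observability svc/loki 3100:3100

# In another terminal (defaults to roughly the last hour)
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="default"} |= "error"' \
  --data-urlencode 'limit=20' | jq '.data.result'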

Common PromQL Queries (Prometheus)

Cluster Health

# All targets up/down
up

# Pods by phase
kube_pod_status_phase{namespace="observability"}

# Node memory available
node_memory_MemAvailable_bytes

# Node CPU usage
rate(node_cpu_seconds_total{mode="user"}[5m])

Container Metrics

# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by container
container_memory_usage_bytes

# Network traffic
rate(container_network_transmit_bytes_total[5m])
rate(container_network_receive_bytes_total[5m])

Application Metrics

# HTTP request rate
rate(http_requests_total[5m])

# Request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m])
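
The same expressions work against the Prometheus HTTP API if you want raw results for scripts or quick checks:

kubectl port-forward -n observability svc/prometheus 9090:9090

# In another terminal
curl -G -s "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=rate(http_requests_total[5m])' | jq '.data.result'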

Trace Search (Tempo)

In Grafana Explore with the Tempo datasource (an API equivalent is sketched after this list):

  • Search by service: Select from dropdown
  • Search by duration: "> 1s", "< 100ms"
  • Search by tag: http.status_code=500
  • TraceQL: {span.http.method="POST" && span.http.status_code>=400}
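
The same searches can be issued against Tempo's HTTP API; a sketch assuming Tempo 2.x with TraceQL support, using port 3200 as in the readiness check later in this document:

kubectl port-forward -n observability svc/tempo 3200:3200

# In another terminal: list trace IDs matching a TraceQL query
curl -G -s "http://localhost:3200/api/search" \
  --data-urlencode 'q={span.http.status_code>=400}' | jq '.traces[].traceID'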

Correlations

From Logs to Traces

  1. View logs in Loki
  2. Click on a log line with a trace ID
  3. Click the "Tempo" link
  4. Trace opens in Tempo

From Traces to Logs

  1. View trace in Tempo
  2. Click on a span
  3. Click "Logs for this span"
  4. Related logs appear

From Traces to Metrics

  1. View trace in Tempo
  2. Service graph shows metrics
  3. Click service to see metrics
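
The logs-to-traces jump only works if the Loki datasource in Grafana defines a derived field that extracts the trace ID from log lines. Assuming the datasources are provisioned through the grafana-datasources ConfigMap edited above, a quick way to check:

# Look for a derivedFields entry in the Loki datasource definition
kubectl get configmap grafana-datasources -n observability -o yaml | \
  grep -B 2 -A 8 derivedFields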

Demo Application

Deploy the demo app to test the stack:

kubectl apply -f demo-app.yaml

# Wait for it to start
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s

# Test it
kubectl port-forward -n observability svc/demo-app 8080:8080

# In another terminal
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/item/0
curl http://localhost:8080/slow
curl http://localhost:8080/error
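
To generate a steadier stream of logs, traces, and metrics than a few one-off requests, loop over the endpoints (stop with Ctrl+C):

# Simple load generator against the port-forwarded demo app
while true; do
  curl -s http://localhost:8080/items > /dev/null
  curl -s http://localhost:8080/error > /dev/null
  sleep 1
done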

Now view in Grafana:

  • Logs: Search {app="demo-app"} in Loki
  • Traces: Search "demo-app" service in Tempo
  • Metrics: Query flask_http_request_total in Prometheus

Storage Management

Check Disk Usage

# On hetzner-2 node
df -h /mnt/local-ssd/

# Detailed usage
du -sh /mnt/local-ssd/*
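
If you don't have SSH access to the node, a node debug pod gives the same view (the host filesystem is mounted under /host):

kubectl debug node/hetzner-2 -it --image=busybox -- df -h /host/mnt/local-ssd

# The debug pod is left behind afterwards; find and delete it
kubectl get pods | grep node-debugger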

Cleanup Old Data

Data is deleted automatically after 7 days. To change the retention period, adjust the following settings and re-apply the configs:

Prometheus (in 03-prometheus-config.yaml):

args:
  - '--storage.tsdb.retention.time=7d'

Loki (in 04-loki-config.yaml):

limits_config:
  retention_period: 168h  # 7 days

Tempo (in 05-tempo-config.yaml):

compactor:
  compaction:
    block_retention: 168h  # 7 days

Troubleshooting

No Logs Appearing

# Check Alloy is running
kubectl get pods -n observability -l app=alloy

# Check Alloy logs
kubectl logs -n observability -l app=alloy

# Check Loki
kubectl logs -n observability -l app=loki

# Test Loki endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
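
# If Loki is up but still shows nothing, check whether it has ingested any
# streams at all -- an empty label list usually means Alloy is not shipping logs
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://loki.observability.svc.cluster.local:3100/loki/api/v1/labels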

No Traces Appearing

# Check Tempo is running
kubectl get pods -n observability -l app=tempo

# Check Tempo logs
kubectl logs -n observability -l app=tempo

# Test Tempo endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready

# Verify your app sends to correct endpoint
# Should be: tempo.observability.svc.cluster.local:4317 (gRPC)
#        or: tempo.observability.svc.cluster.local:4318 (HTTP)
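
To confirm the OTLP port is actually reachable from inside the cluster, a crude connectivity check with curl's telnet support works:

# The command times out by design; a "Connected to ..." line means the port is open
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v --max-time 3 telnet://tempo.observability.svc.cluster.local:4317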

Grafana Can't Connect to Datasources

# Check all services are running
kubectl get svc -n observability

# Test from Grafana pod
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy

kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://loki.observability.svc.cluster.local:3100/ready

kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://tempo.observability.svc.cluster.local:3200/ready

High Resource Usage

# Check resource usage
kubectl top pods -n observability
kubectl top nodes

# Scale down if needed (for testing)
kubectl scale statefulset/prometheus -n observability --replicas=0
kubectl scale statefulset/loki -n observability --replicas=0

Backup and Restore

Backup Grafana Dashboards

# List all dashboards via the API (the loop below exports each one)
kubectl port-forward -n observability svc/grafana 3000:3000

# In another terminal
curl -H "Authorization: Bearer <API_KEY>" \
  http://localhost:3000/api/search?type=dash-db | jq
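
To actually save the dashboards, loop over the search results and fetch each one by UID (a sketch using the same <API_KEY> placeholder):

for uid in $(curl -s -H "Authorization: Bearer <API_KEY>" \
    "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer <API_KEY>" \
    "http://localhost:3000/api/dashboards/uid/$uid" | jq '.dashboard' \
    > "dashboard-$uid.json"
done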

Backup Configurations

# Backup all ConfigMaps
kubectl get configmap -n observability -o yaml > configmaps-backup.yaml

# Backup specific config
kubectl get configmap prometheus-config -n observability -o yaml > prometheus-config-backup.yaml

Useful Dashboards in Grafana

After login, import these dashboard IDs:

  • 315: Kubernetes cluster monitoring
  • 7249: Kubernetes cluster
  • 13639: Loki dashboard
  • 12611: Tempo dashboard
  • 3662: Prometheus 2.0 stats
  • 1860: Node Exporter Full

Go to: Dashboards → Import → Enter ID → Load

Performance Tuning

For Higher Load

Increase resources in respective YAML files:

resources:
  requests:
    cpu: 1000m      # from 500m
    memory: 4Gi     # from 2Gi
  limits:
    cpu: 4000m      # from 2000m
    memory: 8Gi     # from 4Gi
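
For a quick experiment you can also change requests and limits on the live object without touching the files; note that the change is lost the next time the manifest is applied:

# Example for Prometheus only -- values are illustrative
kubectl set resources statefulset/prometheus -n observability \
  --requests=cpu=1000m,memory=4Gi --limits=cpu=4000m,memory=8Gi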

For Lower Resource Usage

  • Increase the scrape interval in the Prometheus config (scrape less often)
  • Reduce log retention periods
  • Reduce trace sampling rate

Security Checklist

  • Change the Grafana admin password (see the sketch after this list)
  • Review RBAC permissions
  • Enable audit logging
  • Consider adding NetworkPolicies
  • Review ingress TLS configuration
  • Backup configurations regularly
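
For the first item on the list, the admin password can be reset from inside the Grafana pod. A sketch; the placeholder password is yours to choose, and on newer Grafana versions the binary may be invoked as "grafana cli" instead:

kubectl exec -it -n observability grafana-0 -- \
  grafana-cli admin reset-admin-password 'REPLACE_WITH_STRONG_PASSWORD'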

Getting Help

  1. Check component logs first
  2. Review configurations
  3. Test network connectivity
  4. Check resource availability
  5. Review Grafana datasource settings