Observability Stack Deployment Checklist

Use this checklist to ensure a smooth deployment of the observability stack.

Pre-Deployment

Check for Existing Monitoring Stack

  • Check if you have existing monitoring components:
# Check for monitoring namespaces
kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"

# Check for monitoring pods in common namespaces
kubectl get pods -n monitoring 2>/dev/null || true
kubectl get pods -n prometheus 2>/dev/null || true
kubectl get pods -n grafana 2>/dev/null || true
kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"

# Check for Helm releases
helm list -A | grep -E "(prometheus|grafana|loki|tempo)"
  • If existing monitoring is found, remove it first:
./remove-old-monitoring.sh

OR run the deployment script, which will prompt you:

./deploy.sh  # Will ask if you want to clean up first
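
If you would rather remove an old stack by hand, a minimal sketch is below; the release and namespace names are examples only, so substitute whatever the checks above actually report:

# Example manual cleanup – adjust release/namespace names to your cluster
helm uninstall kube-prometheus-stack -n monitoring 2>/dev/null || true
kubectl delete namespace monitoring prometheus grafana --ignore-not-found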

Prerequisites

  • Kubernetes cluster is running
  • NGINX Ingress Controller is installed
  • cert-manager is installed with Let's Encrypt ClusterIssuer
  • DNS record grafana.betelgeusebytes.io points to cluster IP
  • Node is labeled kubernetes.io/hostname=hetzner-2
  • kubectl is configured and working

Verify Prerequisites

# Check cluster
kubectl cluster-info

# Check NGINX Ingress
kubectl get pods -n ingress-nginx

# Check cert-manager
kubectl get pods -n cert-manager

# Check node label
kubectl get nodes --show-labels | grep hetzner-2

# Check DNS (from external machine)
dig grafana.betelgeusebytes.io
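
The block above does not check the Let's Encrypt ClusterIssuer itself. A quick check is sketched below; the issuer name letsencrypt-prod is an assumption, so match it to whatever 20-grafana-ingress.yaml references:

# Confirm the ClusterIssuer exists and reports Ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod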

Deployment Steps

Step 1: Prepare Storage

  • SSH into hetzner-2 node
  • Create directories:
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
  • Set correct permissions:
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus   # Prometheus runs as nobody (UID 65534)
sudo chown -R 10001:10001 /mnt/local-ssd/loki         # Loki container user (UID 10001)
sudo chown -R root:root /mnt/local-ssd/tempo          # Tempo runs as root in this stack
sudo chown -R 472:472 /mnt/local-ssd/grafana          # Grafana container user (UID 472)
  • Verify permissions:
ls -la /mnt/local-ssd/
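
Beyond eyeballing ls -la, a quick sketch that prints owner UID:GID per directory; the expected values mirror the chown commands above:

# Expected: prometheus 65534:65534, loki 10001:10001, tempo 0:0, grafana 472:472
stat -c '%u:%g %n' /mnt/local-ssd/*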

Step 2: Review Configuration

  • Review 03-prometheus-config.yaml - verify scrape targets
  • Review 04-loki-config.yaml - verify retention (7 days)
  • Review 05-tempo-config.yaml - verify retention (7 days)
  • Review 06-alloy-config.yaml - verify endpoints
  • Review 20-grafana-ingress.yaml - verify domain name

Step 3: Deploy the Stack

  • Navigate to observability-stack directory
cd /path/to/observability-stack
  • Make scripts executable (if not already):
chmod +x *.sh
  • Run deployment script:
./deploy.sh

OR deploy manually:

kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
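
After applying, it helps to wait for the workloads to come up before moving on to verification. A sketch, assuming the StatefulSet and DaemonSet names match the pod names listed in Step 4:

kubectl -n observability rollout status statefulset/prometheus --timeout=300s
kubectl -n observability rollout status statefulset/loki --timeout=300s
kubectl -n observability rollout status statefulset/tempo --timeout=300s
kubectl -n observability rollout status statefulset/grafana --timeout=300s
kubectl -n observability rollout status daemonset/alloy --timeout=300s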

Step 4: Verify Deployment

  • Run status check:
./status.sh
  • Check all PersistentVolumes are Bound:
kubectl get pv
  • Check all PersistentVolumeClaims are Bound:
kubectl get pvc -n observability
  • Check all pods are Running:
kubectl get pods -n observability

Expected pods:

  • prometheus-0
  • loki-0
  • tempo-0
  • grafana-0
  • alloy-xxxxx (one per node)
  • kube-state-metrics-xxxxx
  • node-exporter-xxxxx (one per node)

  • Check services are created:
kubectl get svc -n observability
  • Check ingress is created:
kubectl get ingress -n observability
  • Verify TLS certificate is issued:
kubectl get certificate -n observability
kubectl describe certificate grafana-tls -n observability
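
Rather than re-running describe until the certificate is issued, you can wait for cert-manager to mark it Ready (assuming the Certificate resource is named grafana-tls as above):

kubectl wait --for=condition=Ready certificate/grafana-tls -n observability --timeout=300s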

Step 5: Test Connectivity

  • Test Prometheus endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
  • Test Loki endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
  • Test Tempo endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready
  • Test Grafana endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://grafana.observability.svc.cluster.local:3000/api/health
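
The four checks above each start their own pod; if you prefer a single throwaway pod, a hedged one-shot version:

kubectl run -it --rm test --image=curlimages/curl --restart=Never -- sh -c '
  curl -sf http://prometheus.observability.svc.cluster.local:9090/-/healthy &&
  curl -sf http://loki.observability.svc.cluster.local:3100/ready &&
  curl -sf http://tempo.observability.svc.cluster.local:3200/ready &&
  curl -sf http://grafana.observability.svc.cluster.local:3000/api/health &&
  echo "all endpoints healthy"'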

Post-Deployment Configuration

Step 6: Access Grafana

  • Open browser to: https://grafana.betelgeusebytes.io
  • Login with default credentials:
    • Username: admin
    • Password: admin
  • CRITICAL: Change admin password immediately (a CLI sketch follows this list)
  • Verify datasources are configured:
    • Go to Configuration → Data Sources
    • Should see: Prometheus (default), Loki, Tempo
    • Click "Test" on each datasource

Step 7: Verify Data Collection

  • Check Prometheus has targets:
    • In Grafana, Explore → Prometheus
    • Query: up
    • Should see multiple targets with value=1
  • Check Loki is receiving logs:
    • In Grafana, Explore → Loki
    • Query: {namespace="observability"}
    • Should see logs from observability stack
  • Check kube-state-metrics:
    • In Grafana, Explore → Prometheus
    • Query: kube_pod_status_phase
    • Should see pod status metrics
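
A few more queries worth trying in Explore as a sanity check; they only use metrics and labels already mentioned above:

# PromQL: scrape health per job
sum by (job) (up)
# PromQL: pods not in the Running phase
sum by (namespace) (kube_pod_status_phase{phase!="Running"})
# LogQL: error-looking lines from the stack itself
{namespace="observability"} |= "error"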

Step 8: Import Dashboards (Optional)

  • Import Kubernetes cluster dashboard:
    • Dashboards → Import → ID: 315
  • Import Node Exporter dashboard:
    • Dashboards → Import → ID: 1860
  • Import Loki dashboard:
    • Dashboards → Import → ID: 13639

Step 9: Test with Demo App (Optional)

  • Deploy demo application:
kubectl apply -f demo-app.yaml
  • Wait for pod to be ready:
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
  • Test the endpoints:
kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal:
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/slow
curl http://localhost:8080/error
  • Verify in Grafana:
    • Logs: {app="demo-app"}
    • Metrics: flask_http_request_total
    • Traces: Search for "demo-app" service in Tempo
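
To make sure there is enough traffic for metrics and traces to appear, a quick load loop while the port-forward from the previous step is still running:

# Generate some traffic against the demo app
for i in $(seq 1 50); do curl -s http://localhost:8080/items > /dev/null; done
curl -s http://localhost:8080/slow  > /dev/null
curl -s http://localhost:8080/error > /dev/null || true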

Monitoring and Maintenance

Daily Checks

  • Check pod status: kubectl get pods -n observability
  • Check resource usage: kubectl top pods -n observability
  • Check disk usage on hetzner-2: df -h /mnt/local-ssd/

Weekly Checks

  • Review Grafana for any alerts or anomalies
  • Verify TLS certificate is valid (see the openssl check after this list)
  • Check logs for any errors:
kubectl logs -n observability -l app=prometheus --tail=100
kubectl logs -n observability -l app=loki --tail=100
kubectl logs -n observability -l app=tempo --tail=100
kubectl logs -n observability -l app=grafana --tail=100
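
For the TLS item above, one way to confirm both the certificate object and the expiry dates on the certificate actually being served (run the openssl part from any machine with openssl installed):

kubectl get certificate -n observability
echo | openssl s_client -connect grafana.betelgeusebytes.io:443 \
  -servername grafana.betelgeusebytes.io 2>/dev/null | openssl x509 -noout -dates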

Monthly Checks

  • Review retention policies (confirm 7 days is still appropriate)
  • Check storage growth trends
  • Review and update dashboards
  • Backup Grafana dashboards and configs
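
One way to back up dashboards is the Grafana HTTP API. The sketch below assumes admin credentials and jq are available; the endpoints are standard Grafana API paths, the file naming is arbitrary:

# Export every dashboard as JSON (replace <password>; jq is assumed to be installed)
curl -s -u admin:<password> 'https://grafana.betelgeusebytes.io/api/search?type=dash-db' \
  | jq -r '.[].uid' \
  | while read -r uid; do
      curl -s -u admin:<password> "https://grafana.betelgeusebytes.io/api/dashboards/uid/${uid}" \
        > "dashboard-${uid}.json"
    done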

Troubleshooting Guide

Pod Won't Start

  1. Check events: kubectl describe pod <pod-name> -n observability
  2. Check logs: kubectl logs <pod-name> -n observability
  3. Check storage: kubectl get pv and kubectl get pvc -n observability
  4. Verify node has space: SSH to hetzner-2 and run df -h

No Logs Appearing

  1. Check Alloy pods: kubectl get pods -n observability -l app=alloy
  2. Check Alloy logs: kubectl logs -n observability -l app=alloy
  3. Check Loki is running: kubectl get pods -n observability -l app=loki
  4. Test Loki endpoint from Alloy pod
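
For step 4, a sketch; whether wget or curl is present depends on the Alloy image, so fall back to a throwaway curl pod (as in Step 5) if neither is available:

ALLOY_POD=$(kubectl get pods -n observability -l app=alloy -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$ALLOY_POD" -n observability -- \
  wget -qO- http://loki.observability.svc.cluster.local:3100/ready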

No Metrics Appearing

  1. Check Prometheus targets: Port-forward (command sketched below) and visit http://localhost:9090/targets
  2. Check service discovery: Look for "kubernetes-*" targets
  3. Verify RBAC: kubectl get clusterrolebinding prometheus
  4. Check kube-state-metrics: kubectl get pods -n observability -l app=kube-state-metrics
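
For step 1, the port-forward looks like this; the service name prometheus matches the in-cluster URL used elsewhere in this checklist:

kubectl port-forward -n observability svc/prometheus 9090:9090
# Then open http://localhost:9090/targets in a browser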

Grafana Can't Connect to Datasources

  1. Test from Grafana pod:
kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
  2. Check datasource configuration in Grafana UI
  3. Verify services exist: kubectl get svc -n observability

High Resource Usage

  1. Check actual usage: kubectl top pods -n observability
  2. Check node capacity: kubectl top nodes
  3. Consider reducing retention periods
  4. Review and adjust resource limits
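
For step 4, one way to bump limits without editing manifests by hand is kubectl set resources; the workload and values below are examples only, and the change will be overwritten the next time the YAML is re-applied:

kubectl -n observability set resources statefulset/prometheus \
  --limits=cpu=1,memory=2Gi --requests=cpu=250m,memory=1Gi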

Rollback Procedure

If something goes wrong:

  1. Remove the deployment:
./cleanup.sh
  2. Fix the issue in configuration files
  3. Redeploy:
./deploy.sh

Success Criteria

The deployment is successful when every item below can be checked off:

  • All pods are in Running state
  • All PVCs are Bound
  • Grafana is accessible at https://grafana.betelgeusebytes.io
  • All three datasources (Prometheus, Loki, Tempo) test successfully
  • Prometheus shows targets as "up"
  • Loki shows logs from observability namespace
  • TLS certificate is valid and auto-renewing
  • Admin password has been changed
  • Resource usage is within acceptable limits

Documentation References

  • README.md: Comprehensive documentation
  • QUICKREF.md: Quick reference for common operations
  • demo-app.yaml: Example instrumented application
  • deploy.sh: Automated deployment script
  • cleanup.sh: Removal script
  • status.sh: Status checking script

Next Steps After Deployment

  1. Import useful dashboards from Grafana.com
  2. Configure alerts (requires Alertmanager - not included)
  3. Instrument your applications to send logs/metrics/traces
  4. Create custom dashboards for your specific needs
  5. Set up backup procedures for Grafana dashboards
  6. Document your team's observability practices

Notes

  • Default retention: 7 days for all components
  • Default resources are optimized for single-node cluster
  • Scale up resources if monitoring high-traffic applications
  • Always backup before making configuration changes
  • Test changes in a non-production environment first

Deployment Date: _______________
Deployed By: _______________
Grafana Version: 11.4.0
Stack Version: January 2025