# Observability Stack Deployment Checklist

Use this checklist to ensure a smooth deployment of the observability stack.

## Pre-Deployment

### Check for Existing Monitoring Stack

- [ ] Check if you have existing monitoring components:
  ```bash
  # Check for monitoring namespaces
  kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"

  # Check for monitoring pods in common namespaces
  kubectl get pods -n monitoring 2>/dev/null || true
  kubectl get pods -n prometheus 2>/dev/null || true
  kubectl get pods -n grafana 2>/dev/null || true
  kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"

  # Check for Helm releases
  helm list -A | grep -E "(prometheus|grafana|loki|tempo)"
  ```
- [ ] If existing monitoring is found, remove it first:
  ```bash
  ./remove-old-monitoring.sh
  ```
  **OR** run the deployment script, which will prompt you:
  ```bash
  ./deploy.sh  # Will ask if you want to clean up first
  ```

### Prerequisites

- [ ] Kubernetes cluster is running
- [ ] NGINX Ingress Controller is installed
- [ ] cert-manager is installed with Let's Encrypt ClusterIssuer
- [ ] DNS record `grafana.betelgeusebytes.io` points to cluster IP
- [ ] Node is labeled `kubernetes.io/hostname=hetzner-2`
- [ ] kubectl is configured and working

### Verify Prerequisites

```bash
# Check cluster
kubectl cluster-info

# Check NGINX Ingress
kubectl get pods -n ingress-nginx

# Check cert-manager
kubectl get pods -n cert-manager

# Check node label
kubectl get nodes --show-labels | grep hetzner-2

# Check DNS (from external machine)
dig grafana.betelgeusebytes.io
```

## Deployment Steps

### Step 1: Prepare Storage

- [ ] SSH into hetzner-2 node
- [ ] Create directories:
  ```bash
  sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
  ```
- [ ] Set correct permissions:
  ```bash
  sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
  sudo chown -R 10001:10001 /mnt/local-ssd/loki
  sudo chown -R root:root /mnt/local-ssd/tempo
  sudo chown -R 472:472 /mnt/local-ssd/grafana
  ```
- [ ] Verify permissions:
  ```bash
  ls -la /mnt/local-ssd/
  ```

### Step 2: Review Configuration

- [ ] Review `03-prometheus-config.yaml` - verify scrape targets
- [ ] Review `04-loki-config.yaml` - verify retention (7 days)
- [ ] Review `05-tempo-config.yaml` - verify retention (7 days)
- [ ] Review `06-alloy-config.yaml` - verify endpoints
- [ ] Review `20-grafana-ingress.yaml` - verify domain name

### Step 3: Deploy the Stack

- [ ] Navigate to observability-stack directory:
  ```bash
  cd /path/to/observability-stack
  ```
- [ ] Make scripts executable (if they are not already):
  ```bash
  chmod +x *.sh
  ```
- [ ] Run deployment script:
  ```bash
  ./deploy.sh
  ```
  OR deploy manually:
  ```bash
  kubectl apply -f 00-namespace.yaml
  kubectl apply -f 01-persistent-volumes.yaml
  kubectl apply -f 02-persistent-volume-claims.yaml
  kubectl apply -f 03-prometheus-config.yaml
  kubectl apply -f 04-loki-config.yaml
  kubectl apply -f 05-tempo-config.yaml
  kubectl apply -f 06-alloy-config.yaml
  kubectl apply -f 07-grafana-datasources.yaml
  kubectl apply -f 08-rbac.yaml
  kubectl apply -f 10-prometheus.yaml
  kubectl apply -f 11-loki.yaml
  kubectl apply -f 12-tempo.yaml
  kubectl apply -f 13-grafana.yaml
  kubectl apply -f 14-alloy.yaml
  kubectl apply -f 15-kube-state-metrics.yaml
  kubectl apply -f 16-node-exporter.yaml
  kubectl apply -f 20-grafana-ingress.yaml
  ```

### Step 4: Verify Deployment

- [ ] Run status check:
  ```bash
  ./status.sh
  ```
- [ ] Check all PersistentVolumes are Bound:
  ```bash
  kubectl get pv
  ```
- [ ] Check all PersistentVolumeClaims are Bound:
  ```bash
  kubectl get pvc -n observability
  ```
- [ ] Check all pods are Running:
  ```bash
  kubectl get pods -n observability
  ```
  Expected pods:
  - [x] prometheus-0
  - [x] loki-0
  - [x] tempo-0
  - [x] grafana-0
  - [x] alloy-xxxxx (one per node)
  - [x] kube-state-metrics-xxxxx
  - [x] node-exporter-xxxxx (one per node)
- [ ] Check services are created:
  ```bash
  kubectl get svc -n observability
  ```
- [ ] Check ingress is created:
  ```bash
  kubectl get ingress -n observability
  ```
- [ ] Verify TLS certificate is issued:
  ```bash
  kubectl get certificate -n observability
  kubectl describe certificate grafana-tls -n observability
  ```

### Step 5: Test Connectivity

- [ ] Test Prometheus endpoint:
  ```bash
  kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
    curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
  ```
- [ ] Test Loki endpoint:
  ```bash
  kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
    curl http://loki.observability.svc.cluster.local:3100/ready
  ```
- [ ] Test Tempo endpoint:
  ```bash
  kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
    curl http://tempo.observability.svc.cluster.local:3200/ready
  ```
- [ ] Test Grafana endpoint:
  ```bash
  kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
    curl http://grafana.observability.svc.cluster.local:3000/api/health
  ```

## Post-Deployment Configuration

### Step 6: Access Grafana

- [ ] Open browser to: https://grafana.betelgeusebytes.io
- [ ] Login with default credentials:
  - Username: `admin`
  - Password: `admin`
- [ ] **CRITICAL**: Change admin password immediately
- [ ] Verify datasources are configured:
  - Go to Connections → Data sources
  - Should see: Prometheus (default), Loki, Tempo
  - Click "Test" on each datasource

### Step 7: Verify Data Collection

- [ ] Check Prometheus has targets:
  - In Grafana, Explore → Prometheus
  - Query: `up`
  - Should see multiple targets with value=1
- [ ] Check Loki is receiving logs:
  - In Grafana, Explore → Loki
  - Query: `{namespace="observability"}`
  - Should see logs from observability stack
- [ ] Check kube-state-metrics:
  - In Grafana, Explore → Prometheus
  - Query: `kube_pod_status_phase`
  - Should see pod status metrics

### Step 8: Import Dashboards (Optional)

- [ ] Import Kubernetes cluster dashboard:
  - Dashboards → Import → ID: 315
- [ ] Import Node Exporter dashboard:
  - Dashboards → Import → ID: 1860
- [ ] Import Loki dashboard:
  - Dashboards → Import → ID: 13639

### Step 9: Test with Demo App (Optional)

- [ ] Deploy demo application:
  ```bash
  kubectl apply -f demo-app.yaml
  ```
- [ ] Wait for pod to be ready:
  ```bash
  kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
  ```
- [ ] Test the endpoints:
  ```bash
  kubectl port-forward -n observability svc/demo-app 8080:8080

  # In another terminal:
  curl http://localhost:8080/
  curl http://localhost:8080/items
  curl http://localhost:8080/slow
  curl http://localhost:8080/error
  ```
- [ ] Verify in Grafana:
  - Logs: `{app="demo-app"}`
  - Metrics: `flask_http_request_total`
  - Traces: Search for "demo-app" service in Tempo

## Monitoring and Maintenance

### Daily Checks

- [ ] Check pod status: `kubectl get pods -n observability`
- [ ] Check resource usage: `kubectl top pods -n observability`
- [ ] Check disk usage on hetzner-2: `df -h /mnt/local-ssd/`

### Weekly Checks

- [ ] Review Grafana for any alerts or anomalies
- [ ] Verify TLS certificate is valid
- [ ] Check logs for any errors:
  ```bash
  kubectl logs -n observability -l app=prometheus --tail=100
  kubectl logs -n observability -l app=loki --tail=100
  kubectl logs -n observability -l app=tempo --tail=100
  kubectl logs -n observability -l app=grafana --tail=100
  ```

### Monthly Checks

- [ ] Review retention policies (confirm that 7 days is still appropriate)
- [ ] Check storage growth trends
- [ ] Review and update dashboards
- [ ] Backup Grafana dashboards and configs

## Troubleshooting Guide

### Pod Won't Start

1. Check events: `kubectl describe pod <pod-name> -n observability`
2. Check logs: `kubectl logs <pod-name> -n observability`
3. Check storage: `kubectl get pv` and `kubectl get pvc -n observability`
4. Verify node has space: SSH to hetzner-2 and run `df -h`

### No Logs Appearing

1. Check Alloy pods: `kubectl get pods -n observability -l app=alloy`
2. Check Alloy logs: `kubectl logs -n observability -l app=alloy`
3. Check Loki is running: `kubectl get pods -n observability -l app=loki`
4. Test Loki endpoint from Alloy pod

### No Metrics Appearing

1. Check Prometheus targets: Port-forward and visit http://localhost:9090/targets
2. Check service discovery: Look for "kubernetes-*" targets
3. Verify RBAC: `kubectl get clusterrolebinding prometheus`
4. Check kube-state-metrics: `kubectl get pods -n observability -l app=kube-state-metrics`

### Grafana Can't Connect to Datasources

1. Test from Grafana pod:
   ```bash
   kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
   ```
2. Check datasource configuration in Grafana UI
3. Verify services exist: `kubectl get svc -n observability`

### High Resource Usage

1. Check actual usage: `kubectl top pods -n observability`
2. Check node capacity: `kubectl top nodes`
3. Consider reducing retention periods
4. Review and adjust resource limits

## Rollback Procedure

If something goes wrong:

1. Remove the deployment:
   ```bash
   ./cleanup.sh
   ```
2. Fix the issue in configuration files
3. Redeploy:
   ```bash
   ./deploy.sh
   ```

## Success Criteria

All checked items below indicate successful deployment:

- [x] All pods are in Running state
- [x] All PVCs are Bound
- [x] Grafana is accessible at https://grafana.betelgeusebytes.io
- [x] All three datasources (Prometheus, Loki, Tempo) test successfully
- [x] Prometheus shows targets as "up"
- [x] Loki shows logs from observability namespace
- [x] TLS certificate is valid and auto-renewing
- [x] Admin password has been changed
- [x] Resource usage is within acceptable limits

## Documentation References

- **README.md**: Comprehensive documentation
- **QUICKREF.md**: Quick reference for common operations
- **demo-app.yaml**: Example instrumented application
- **deploy.sh**: Automated deployment script
- **cleanup.sh**: Removal script
- **status.sh**: Status checking script

## Next Steps After Deployment

1. Import useful dashboards from Grafana.com
2. Configure alerts (requires Alertmanager - not included)
3. Instrument your applications to send logs/metrics/traces
4. Create custom dashboards for your specific needs
5. Set up backup procedures for Grafana dashboards
6. Document your team's observability practices

## Notes

- Default retention: 7 days for all components
- Default resources are optimized for a single-node cluster
- Scale up resources if monitoring high-traffic applications
- Always back up before making configuration changes
- Test changes in a non-production environment first

---

**Deployment Date**: _______________

**Deployed By**: _______________

**Grafana Version**: 11.4.0

**Stack Version**: January 2025
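
The daily pod-status check and the "All pods are in Running state" criterion can be scripted. The following is a minimal sketch, not part of the stack's own scripts: the function name `not_running_pods` is hypothetical, and it assumes the default `kubectl get pods` table layout with a header row and STATUS in the third column.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the stack).
# Reads the default `kubectl get pods` table on stdin and prints the name
# of every pod whose STATUS column (third field) is not "Running".
# Assumes the header row is present, so do not pass --no-headers.
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n observability | not_running_pods
# Empty output means every listed pod reports Running.
```

Because the function only filters text, it can be tried against saved `kubectl` output before being wired into a cron job or an alerting script.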