Observability Stack Deployment Checklist
Use this checklist to ensure a smooth deployment of the observability stack.
Pre-Deployment
Check for Existing Monitoring Stack
- Check if you have existing monitoring components:
# Check for monitoring namespaces
kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"
# Check for monitoring pods in common namespaces
kubectl get pods -n monitoring 2>/dev/null || true
kubectl get pods -n prometheus 2>/dev/null || true
kubectl get pods -n grafana 2>/dev/null || true
kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"
# Check for Helm releases
helm list -A | grep -E "(prometheus|grafana|loki|tempo)"
- If existing monitoring is found, remove it first:
./remove-old-monitoring.sh
OR run the deployment script which will prompt you:
./deploy.sh # Will ask if you want to clean up first
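The individual checks above can be folded into a single yes/no test. The following is a minimal sketch, not part of the shipped scripts: the function names `filter_monitoring` and `has_existing_monitoring` are hypothetical, and the pattern simply mirrors the grep expressions used above.

```shell
# Hypothetical helper: keep only lines that mention a known monitoring component.
# The pattern mirrors the grep expressions used in the checks above.
filter_monitoring() {
  grep -E '(monitoring|prometheus|grafana|loki|tempo|fluent-bit|vector)' || true
}

# Returns 0 and prints the matches if any namespace, pod, or Helm release
# looks like part of an existing monitoring stack; returns non-zero otherwise.
has_existing_monitoring() {
  found=$(
    {
      kubectl get namespaces --no-headers 2>/dev/null
      kubectl get pods -A --no-headers 2>/dev/null
      helm list -A 2>/dev/null
    } | filter_monitoring
  )
  [ -n "$found" ] && printf '%s\n' "$found"
}

# Example use:
#   if has_existing_monitoring; then ./remove-old-monitoring.sh; fi
```

Matching by name is only a heuristic; review the printed lines before removing anything.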
Prerequisites
- Kubernetes cluster is running
- NGINX Ingress Controller is installed
- cert-manager is installed with Let's Encrypt ClusterIssuer
- DNS record grafana.betelgeusebytes.io points to cluster IP
- Node is labeled kubernetes.io/hostname=hetzner-2
- kubectl is configured and working
Verify Prerequisites
# Check cluster
kubectl cluster-info
# Check NGINX Ingress
kubectl get pods -n ingress-nginx
# Check cert-manager
kubectl get pods -n cert-manager
# Check node label
kubectl get nodes --show-labels | grep hetzner-2
# Check DNS (from external machine)
dig grafana.betelgeusebytes.io
Deployment Steps
Step 1: Prepare Storage
- SSH into hetzner-2 node
- Create directories:
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
- Set correct permissions:
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana
- Verify permissions:
ls -la /mnt/local-ssd/
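Instead of reading the `ls -la` output by eye, ownership can be compared against the values used in the chown commands above. This is a sketch under two assumptions: the node has GNU `stat` (the `-c` flag), and `root:root` corresponds to `0:0`. The names `owner_of` and `verify_owners` are hypothetical.

```shell
# Expected owner (uid:gid) per data directory, taken from the chown commands above.
expected_owners='prometheus=65534:65534
loki=10001:10001
tempo=0:0
grafana=472:472'

# Print "uid:gid" of a path (GNU stat syntax, assumed to be available on the node).
owner_of() {
  stat -c '%u:%g' "$1"
}

# Compare each directory under the base path ($1, default /mnt/local-ssd)
# with its expected owner. Returns non-zero if any is missing or mismatched.
verify_owners() {
  base=${1:-/mnt/local-ssd}
  status=0
  for entry in $expected_owners; do
    name=${entry%%=*}
    want=${entry#*=}
    got=$(owner_of "$base/$name" 2>/dev/null) || got=missing
    if [ "$got" = "$want" ]; then
      echo "OK   $name $got"
    else
      echo "FAIL $name: expected $want, got $got"
      status=1
    fi
  done
  return $status
}
```

Run `verify_owners` on hetzner-2 after the chown step; any FAIL line names the directory to fix.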
Step 2: Review Configuration
- Review 03-prometheus-config.yaml - verify scrape targets
- Review 04-loki-config.yaml - verify retention (7 days)
- Review 05-tempo-config.yaml - verify retention (7 days)
- Review 06-alloy-config.yaml - verify endpoints
- Review 20-grafana-ingress.yaml - verify domain name
Step 3: Deploy the Stack
- Navigate to observability-stack directory
cd /path/to/observability-stack
- Make scripts executable (skip if already done):
chmod +x *.sh
- Run deployment script:
./deploy.sh
OR deploy manually:
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
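Because the manifests carry numeric prefixes, the manual sequence above can also be written as a loop. This is a sketch, assuming the directory contains only this stack's numbered manifests; `apply_in_order` and the `KUBECTL` override are hypothetical names introduced for illustration.

```shell
# Apply every numbered manifest (NN-*.yaml) in a directory in ascending order.
# Shell globs expand in sorted order, so 00-... is applied before 01-..., and so on.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the ordering.
apply_in_order() {
  dir=${1:-.}
  for manifest in "$dir"/[0-9][0-9]-*.yaml; do
    [ -e "$manifest" ] || continue   # the glob matched nothing
    ${KUBECTL:-kubectl} apply -f "$manifest" || return 1
  done
}

# Example use:
#   KUBECTL=echo apply_in_order .   # preview
#   apply_in_order .                # apply for real
```

Files without a two-digit prefix, such as demo-app.yaml, are skipped by the glob, and the loop stops at the first manifest that fails to apply.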
Step 4: Verify Deployment
- Run status check:
./status.sh
- Check all PersistentVolumes are Bound:
kubectl get pv
- Check all PersistentVolumeClaims are Bound:
kubectl get pvc -n observability
- Check all pods are Running:
kubectl get pods -n observability
Expected pods:
- prometheus-0
- loki-0
- tempo-0
- grafana-0
- alloy-xxxxx (one per node)
- kube-state-metrics-xxxxx
- node-exporter-xxxxx (one per node)
- Check services are created:
kubectl get svc -n observability
- Check ingress is created:
kubectl get ingress -n observability
- Verify TLS certificate is issued:
kubectl get certificate -n observability
kubectl describe certificate grafana-tls -n observability
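The PVC and pod checks in this step can be reduced to a list of the items that are not yet in the expected state. This is a minimal sketch that relies on the default column layout of `kubectl get ... --no-headers`; the function names are hypothetical. A pod in Running state is not necessarily Ready, so treat this as a first pass only.

```shell
# Print the names of PVCs whose STATUS column is not "Bound".
# Reads `kubectl get pvc --no-headers` output on stdin (NAME is column 1, STATUS column 2).
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Print the names of pods whose STATUS column is not "Running".
# Reads `kubectl get pods --no-headers` output on stdin (NAME is column 1, STATUS column 3).
not_running_pods() {
  awk '$3 != "Running" { print $1 }'
}

# Hypothetical wrapper: returns non-zero if anything in the namespace is not ready yet.
verify_stack() {
  ns=${1:-observability}
  pvcs=$(kubectl get pvc -n "$ns" --no-headers | unbound_pvcs)
  pods=$(kubectl get pods -n "$ns" --no-headers | not_running_pods)
  [ -z "$pvcs" ] || echo "PVCs not Bound: $pvcs"
  [ -z "$pods" ] || echo "Pods not Running: $pods"
  [ -z "$pvcs$pods" ]
}
```

Calling `verify_stack` with no arguments checks the observability namespace and prints nothing when everything is Bound and Running.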
Step 5: Test Connectivity
- Test Prometheus endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
- Test Loki endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://loki.observability.svc.cluster.local:3100/ready
- Test Tempo endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://tempo.observability.svc.cluster.local:3200/ready
- Test Grafana endpoint:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://grafana.observability.svc.cluster.local:3000/api/health
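The four tests above differ only in service name, port, and path, so they can be driven from one list. This is a sketch: `HEALTH_ENDPOINTS`, `health_url`, and `check_endpoints` are hypothetical names, and the pod names `test-<service>` are chosen here so that consecutive runs do not collide.

```shell
# service:port/path for each component, copied from the four tests above.
HEALTH_ENDPOINTS='prometheus:9090/-/healthy
loki:3100/ready
tempo:3200/ready
grafana:3000/api/health'

# Build the in-cluster URL for one "service:port/path" entry.
# $2 is the namespace (default: observability).
health_url() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "${1%%:*}" "${2:-observability}" "${1#*:}"
}

# Run one throwaway curl pod per endpoint.
# curl -f exits non-zero on HTTP errors; -sS hides progress but keeps error messages.
check_endpoints() {
  for entry in $HEALTH_ENDPOINTS; do
    svc=${entry%%:*}
    echo "== $svc"
    kubectl run "test-$svc" --rm -i --image=curlimages/curl --restart=Never -- \
      curl -fsS "$(health_url "$entry")" || echo "FAILED: $svc"
  done
}
```

Run `check_endpoints` from a machine with kubectl access; each component prints its health response or a FAILED line.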
Post-Deployment Configuration
Step 6: Access Grafana
- Open browser to: https://grafana.betelgeusebytes.io
- Login with default credentials:
- Username: admin
- Password: admin
- CRITICAL: Change admin password immediately
- Verify datasources are configured:
- Go to Connections → Data sources (Configuration → Data Sources in Grafana versions before 10)
- Should see: Prometheus (default), Loki, Tempo
- Click "Test" on each datasource
Step 7: Verify Data Collection
- Check Prometheus has targets:
- In Grafana, Explore → Prometheus
- Query: up
- Should see multiple targets with value=1
- Check Loki is receiving logs:
- In Grafana, Explore → Loki
- Query: {namespace="observability"}
- Should see logs from observability stack
- Check kube-state-metrics:
- In Grafana, Explore → Prometheus
- Query: kube_pod_status_phase
- Should see pod status metrics
Step 8: Import Dashboards (Optional)
- Import Kubernetes cluster dashboard:
- Dashboards → Import → ID: 315
- Import Node Exporter dashboard:
- Dashboards → Import → ID: 1860
- Import Loki dashboard:
- Dashboards → Import → ID: 13639
Step 9: Test with Demo App (Optional)
- Deploy demo application:
kubectl apply -f demo-app.yaml
- Wait for pod to be ready:
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
- Test the endpoints:
kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal:
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/slow
curl http://localhost:8080/error
- Verify in Grafana:
- Logs: {app="demo-app"}
- Metrics: flask_http_request_total
- Traces: Search for "demo-app" service in Tempo
Monitoring and Maintenance
Daily Checks
- Check pod status: kubectl get pods -n observability
- Check resource usage: kubectl top pods -n observability
- Check disk usage on hetzner-2: df -h /mnt/local-ssd/
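The daily disk check is easier to automate with a threshold than by reading `df -h` output. This is a minimal sketch; `disk_usage_pct` and `check_disk` are hypothetical names, and the 80% default is an arbitrary example value, not a recommendation from this stack.

```shell
# Print the used-space percentage (without the % sign) of the filesystem holding a path.
# df -P prints one line per filesystem, with capacity in column 5.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage reaches a limit. $1 = path, $2 = limit in percent (example default: 80).
# Returns 0 when below the limit, 1 at or above it, 2 if usage could not be read.
check_disk() {
  path=${1:-/mnt/local-ssd}
  limit=${2:-80}
  pct=$(disk_usage_pct "$path")
  [ -n "$pct" ] || { echo "cannot read disk usage for $path"; return 2; }
  if [ "$pct" -ge "$limit" ]; then
    echo "WARNING: $path is ${pct}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${pct}% full"
}
```

On hetzner-2, `check_disk /mnt/local-ssd 80` could be run from cron so the check does not depend on someone remembering it.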
Weekly Checks
- Review Grafana for any alerts or anomalies
- Verify TLS certificate is valid
- Check logs for any errors:
kubectl logs -n observability -l app=prometheus --tail=100
kubectl logs -n observability -l app=loki --tail=100
kubectl logs -n observability -l app=tempo --tail=100
kubectl logs -n observability -l app=grafana --tail=100
Monthly Checks
- Review retention policies (confirm the 7-day default is still appropriate)
- Check storage growth trends
- Review and update dashboards
- Backup Grafana dashboards and configs
Troubleshooting Guide
Pod Won't Start
- Check events:
kubectl describe pod <pod-name> -n observability
- Check logs: kubectl logs <pod-name> -n observability
- Check storage: kubectl get pv and kubectl get pvc -n observability
- Verify node has space: SSH to hetzner-2 and run df -h
No Logs Appearing
- Check Alloy pods:
kubectl get pods -n observability -l app=alloy
- Check Alloy logs: kubectl logs -n observability -l app=alloy
- Check Loki is running: kubectl get pods -n observability -l app=loki
- Test Loki endpoint from Alloy pod
No Metrics Appearing
- Check Prometheus targets: Port-forward and visit http://localhost:9090/targets
- Check service discovery: Look for "kubernetes-*" targets
- Verify RBAC:
kubectl get clusterrolebinding prometheus
- Check kube-state-metrics: kubectl get pods -n observability -l app=kube-state-metrics
Grafana Can't Connect to Datasources
- Test from Grafana pod:
kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
- Check datasource configuration in Grafana UI
- Verify services exist:
kubectl get svc -n observability
High Resource Usage
- Check actual usage:
kubectl top pods -n observability
- Check node capacity: kubectl top nodes
- Consider reducing retention periods
- Review and adjust resource limits
Rollback Procedure
If something goes wrong:
- Remove the deployment:
./cleanup.sh
- Fix the issue in configuration files
- Redeploy:
./deploy.sh
Success Criteria
The deployment is successful when every item below is checked:
- All pods are in Running state
- All PVCs are Bound
- Grafana is accessible at https://grafana.betelgeusebytes.io
- All three datasources (Prometheus, Loki, Tempo) test successfully
- Prometheus shows targets as "up"
- Loki shows logs from observability namespace
- TLS certificate is valid and auto-renewing
- Admin password has been changed
- Resource usage is within acceptable limits
Documentation References
- README.md: Comprehensive documentation
- QUICKREF.md: Quick reference for common operations
- demo-app.yaml: Example instrumented application
- deploy.sh: Automated deployment script
- cleanup.sh: Removal script
- status.sh: Status checking script
Next Steps After Deployment
- Import useful dashboards from Grafana.com
- Configure alerts (requires Alertmanager - not included)
- Instrument your applications to send logs/metrics/traces
- Create custom dashboards for your specific needs
- Set up backup procedures for Grafana dashboards
- Document your team's observability practices
Notes
- Default retention: 7 days for all components
- Default resources are optimized for single-node cluster
- Scale up resources if monitoring high-traffic applications
- Always backup before making configuration changes
- Test changes in a non-production environment first
Deployment Date: _______________
Deployed By: _______________
Grafana Version: 11.4.0
Stack Version: January 2025