# Observability Stack Deployment Checklist
Use this checklist to ensure a smooth deployment of the observability stack.
## Pre-Deployment
### Check for Existing Monitoring Stack
- [ ] Check if you have existing monitoring components:
```bash
# Check for monitoring namespaces
kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"
# Check for monitoring pods in common namespaces
kubectl get pods -n monitoring 2>/dev/null || true
kubectl get pods -n prometheus 2>/dev/null || true
kubectl get pods -n grafana 2>/dev/null || true
kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"
# Check for Helm releases
helm list -A | grep -E "(prometheus|grafana|loki|tempo)"
```
- [ ] If existing monitoring is found, remove it first:
```bash
./remove-old-monitoring.sh
```
**OR** run the deployment script, which prompts you to clean up first:
```bash
./deploy.sh # Will ask if you want to clean up first
```
### Prerequisites
- [ ] Kubernetes cluster is running
- [ ] NGINX Ingress Controller is installed
- [ ] cert-manager is installed with Let's Encrypt ClusterIssuer
- [ ] DNS record `grafana.betelgeusebytes.io` points to the cluster's public IP address
- [ ] Node is labeled `kubernetes.io/hostname=hetzner-2`
- [ ] kubectl is configured and working
### Verify Prerequisites
```bash
# Check cluster
kubectl cluster-info
# Check NGINX Ingress
kubectl get pods -n ingress-nginx
# Check cert-manager
kubectl get pods -n cert-manager
# Check node label (should list the hetzner-2 node)
kubectl get nodes -l kubernetes.io/hostname=hetzner-2
# Check DNS (from external machine)
dig grafana.betelgeusebytes.io
```
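The commands above assume the required CLI tools are already on the PATH of the machine you run them from. A minimal sketch of a preflight helper follows; the function name `require_cmds` is hypothetical and not part of this stack's scripts.

```bash
# Hypothetical helper: report any required CLI tools missing from PATH.
# Prints each missing tool to stderr and returns non-zero if any are absent.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_cmds kubectl helm dig && kubectl cluster-info` runs the cluster check only when all three tools are installed.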
## Deployment Steps
### Step 1: Prepare Storage
- [ ] SSH into hetzner-2 node
- [ ] Create directories:
```bash
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
```
- [ ] Set correct permissions:
```bash
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana
```
- [ ] Verify permissions:
```bash
ls -la /mnt/local-ssd/
```
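Reading `ls -la` output by eye is error-prone, so the ownership check can also be scripted. The sketch below assumes GNU coreutils `stat` on the node; `check_owner` is a hypothetical helper name, and the UID:GID pairs in the usage comments are the ones set in the previous step.

```bash
# Hypothetical helper: compare a path's numeric owner:group with an expected value.
# Assumes GNU coreutils `stat` (the -c flag), as found on typical Linux nodes.
check_owner() {
  local path="$1" expected="$2" actual
  actual="$(stat -c '%u:%g' "$path")" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK   $path $actual"
  else
    echo "FAIL $path is $actual, expected $expected"
    return 1
  fi
}

# Usage on hetzner-2 (root:root is 0:0):
#   check_owner /mnt/local-ssd/prometheus 65534:65534
#   check_owner /mnt/local-ssd/loki       10001:10001
#   check_owner /mnt/local-ssd/tempo      0:0
#   check_owner /mnt/local-ssd/grafana    472:472
```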
### Step 2: Review Configuration
- [ ] Review `03-prometheus-config.yaml` - verify scrape targets
- [ ] Review `04-loki-config.yaml` - verify retention (7 days)
- [ ] Review `05-tempo-config.yaml` - verify retention (7 days)
- [ ] Review `06-alloy-config.yaml` - verify endpoints
- [ ] Review `20-grafana-ingress.yaml` - verify domain name
### Step 3: Deploy the Stack
- [ ] Navigate to observability-stack directory
```bash
cd /path/to/observability-stack
```
- [ ] Make scripts executable (skip if already done):
```bash
chmod +x *.sh
```
- [ ] Run deployment script:
```bash
./deploy.sh
```
**OR** deploy manually, applying the manifests in numeric order:
```bash
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
```
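Because the manifests are numbered, the manual sequence above can be sketched as a loop. This assumes the directory holds only the numbered manifests listed above under the `NN-name.yaml` scheme; `list_manifests` and `apply_manifests` are hypothetical helper names, and this is not necessarily how `deploy.sh` works.

```bash
# Hypothetical helper: print numbered manifests (NN-name.yaml) in lexical order.
# Unnumbered files such as demo-app.yaml are not matched by the glob.
list_manifests() {
  local dir="${1:-.}" f
  for f in "$dir"/[0-9][0-9]-*.yaml; do
    [ -e "$f" ] || continue
    printf '%s\n' "$f"
  done
}

# Apply each numbered manifest in order, stopping at the first failure.
apply_manifests() {
  local dir="${1:-.}" f
  for f in "$dir"/[0-9][0-9]-*.yaml; do
    [ -e "$f" ] || continue
    kubectl apply -f "$f" || return 1
  done
}
```

Running `list_manifests .` first gives a dry-run preview of the order before `apply_manifests .` touches the cluster.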
### Step 4: Verify Deployment
- [ ] Run status check:
```bash
./status.sh
```
- [ ] Check all PersistentVolumes are Bound:
```bash
kubectl get pv
```
- [ ] Check all PersistentVolumeClaims are Bound:
```bash
kubectl get pvc -n observability
```
- [ ] Check all pods are Running:
```bash
kubectl get pods -n observability
```
Expected pods:
- [x] prometheus-0
- [x] loki-0
- [x] tempo-0
- [x] grafana-0
- [x] alloy-xxxxx (one per node)
- [x] kube-state-metrics-xxxxx
- [x] node-exporter-xxxxx (one per node)
- [ ] Check services are created:
```bash
kubectl get svc -n observability
```
- [ ] Check ingress is created:
```bash
kubectl get ingress -n observability
```
- [ ] Verify TLS certificate is issued:
```bash
kubectl get certificate -n observability
kubectl describe certificate grafana-tls -n observability
```
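The pod check above can be reduced to a pass/fail result by filtering on the STATUS column. A minimal sketch, assuming the default single-namespace column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); `pods_not_running` is a hypothetical helper name.

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print "name status" for every pod whose STATUS column is not "Running".
# Exits non-zero if at least one such pod is found.
pods_not_running() {
  awk '$3 != "Running" { print $1, $3; bad = 1 } END { exit bad }'
}

# Usage:
#   kubectl get pods -n observability --no-headers | pods_not_running
```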
### Step 5: Test Connectivity
- [ ] Test Prometheus endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
```
- [ ] Test Loki endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://loki.observability.svc.cluster.local:3100/ready
```
- [ ] Test Tempo endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://tempo.observability.svc.cluster.local:3200/ready
```
- [ ] Test Grafana endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
curl http://grafana.observability.svc.cluster.local:3000/api/health
```
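The four checks above share one pattern, so they can be sketched as a loop. The URLs are the ones used above; `health_endpoints` and `check_endpoints` are hypothetical helper names, and the sketch assumes `kubectl run` returns the exit code of the finished container when used with `--restart=Never`.

```bash
# Health endpoints from the four checks above, one "name url" pair per line.
health_endpoints() {
  cat <<'EOF'
prometheus http://prometheus.observability.svc.cluster.local:9090/-/healthy
loki http://loki.observability.svc.cluster.local:3100/ready
tempo http://tempo.observability.svc.cluster.local:3200/ready
grafana http://grafana.observability.svc.cluster.local:3000/api/health
EOF
}

# Hypothetical wrapper: run one short-lived curl pod per endpoint.
# curl -f exits non-zero on HTTP errors; -sS hides progress but keeps error messages.
# stdin is redirected so kubectl does not consume the loop's input.
check_endpoints() {
  local name url failed=0
  while read -r name url; do
    if kubectl run "test-$name" --rm -i --restart=Never --image=curlimages/curl \
         -- curl -fsS "$url" </dev/null >/dev/null; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done <<EOF
$(health_endpoints)
EOF
  return "$failed"
}
```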
## Post-Deployment Configuration
### Step 6: Access Grafana
- [ ] Open browser to: https://grafana.betelgeusebytes.io
- [ ] Login with default credentials:
  - Username: `admin`
  - Password: `admin`
- [ ] **CRITICAL**: Change admin password immediately
- [ ] Verify datasources are configured:
  - Go to Connections → Data sources
  - Should see: Prometheus (default), Loki, Tempo
  - Click "Test" on each datasource
### Step 7: Verify Data Collection
- [ ] Check Prometheus has targets:
  - In Grafana, Explore → Prometheus
  - Query: `up`
  - Should see multiple targets with value=1
- [ ] Check Loki is receiving logs:
  - In Grafana, Explore → Loki
  - Query: `{namespace="observability"}`
  - Should see logs from observability stack
- [ ] Check kube-state-metrics:
  - In Grafana, Explore → Prometheus
  - Query: `kube_pod_status_phase`
  - Should see pod status metrics
### Step 8: Import Dashboards (Optional)
- [ ] Import Kubernetes cluster dashboard:
  - Dashboards → Import → ID: 315
- [ ] Import Node Exporter dashboard:
  - Dashboards → Import → ID: 1860
- [ ] Import Loki dashboard:
  - Dashboards → Import → ID: 13639
### Step 9: Test with Demo App (Optional)
- [ ] Deploy demo application:
```bash
kubectl apply -f demo-app.yaml
```
- [ ] Wait for pod to be ready:
```bash
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
```
- [ ] Test the endpoints:
```bash
kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal:
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/slow
curl http://localhost:8080/error
```
- [ ] Verify in Grafana:
  - Logs: `{app="demo-app"}`
  - Metrics: `flask_http_request_total`
  - Traces: Search for "demo-app" service in Tempo
## Monitoring and Maintenance
### Daily Checks
- [ ] Check pod status: `kubectl get pods -n observability`
- [ ] Check resource usage: `kubectl top pods -n observability`
- [ ] Check disk usage on hetzner-2: `df -h /mnt/local-ssd/`
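
The disk check is easier to act on with an explicit threshold. A minimal sketch to run on hetzner-2; `disk_used_pct` and `check_disk` are hypothetical helper names, and the default limit of 80% is an arbitrary example rather than a recommendation from this stack.

```bash
# Hypothetical helper: print the used-space percentage (integer, no % sign)
# of the filesystem holding the given path. `df -P` gives stable POSIX columns.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a warning and return 1 when usage meets or exceeds the limit (default 80).
check_disk() {
  local path="$1" limit="${2:-80}" pct
  pct="$(disk_used_pct "$path")"
  [ -n "$pct" ] || return 2
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN $path at ${pct}% (limit ${limit}%)"
    return 1
  fi
  echo "OK   $path at ${pct}%"
}

# Usage:
#   check_disk /mnt/local-ssd/ 80
```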
### Weekly Checks
- [ ] Review Grafana for any alerts or anomalies
- [ ] Verify TLS certificate is valid
- [ ] Check logs for any errors:
```bash
kubectl logs -n observability -l app=prometheus --tail=100
kubectl logs -n observability -l app=loki --tail=100
kubectl logs -n observability -l app=tempo --tail=100
kubectl logs -n observability -l app=grafana --tail=100
```
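For the weekly TLS check, the remaining lifetime can be computed from the expiry timestamp that cert-manager records in the Certificate's `.status.notAfter` field. A minimal sketch, assuming GNU `date` (the `-d` flag); `days_until` is a hypothetical helper name.

```bash
# Hypothetical helper: whole days between "now" and an RFC 3339 timestamp.
# Assumes GNU `date`; the optional second argument overrides "now" (epoch seconds).
days_until() {
  local expiry now
  expiry="$(date -u -d "$1" +%s)" || return 2
  now="${2:-$(date -u +%s)}"
  echo $(( (expiry - now) / 86400 ))
}

# Usage:
#   not_after="$(kubectl get certificate grafana-tls -n observability \
#     -o jsonpath='{.status.notAfter}')"
#   days_until "$not_after"
```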
### Monthly Checks
- [ ] Review retention policies (confirm 7 days is still appropriate)
- [ ] Check storage growth trends
- [ ] Review and update dashboards
- [ ] Back up Grafana dashboards and configs
## Troubleshooting Guide
### Pod Won't Start
1. Check events: `kubectl describe pod <pod-name> -n observability`
2. Check logs: `kubectl logs <pod-name> -n observability`
3. Check storage: `kubectl get pv` and `kubectl get pvc -n observability`
4. Verify node has space: SSH to hetzner-2 and run `df -h`
### No Logs Appearing
1. Check Alloy pods: `kubectl get pods -n observability -l app=alloy`
2. Check Alloy logs: `kubectl logs -n observability -l app=alloy`
3. Check Loki is running: `kubectl get pods -n observability -l app=loki`
4. Test the Loki endpoint (`http://loki.observability.svc.cluster.local:3100/ready`) from an Alloy pod
### No Metrics Appearing
1. Check Prometheus targets: run `kubectl port-forward -n observability svc/prometheus 9090:9090` and visit http://localhost:9090/targets
2. Check service discovery: Look for "kubernetes-*" targets
3. Verify RBAC: `kubectl get clusterrolebinding prometheus`
4. Check kube-state-metrics: `kubectl get pods -n observability -l app=kube-state-metrics`
### Grafana Can't Connect to Datasources
1. Test from Grafana pod:
```bash
kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
```
2. Check datasource configuration in Grafana UI
3. Verify services exist: `kubectl get svc -n observability`
### High Resource Usage
1. Check actual usage: `kubectl top pods -n observability`
2. Check node capacity: `kubectl top nodes`
3. Consider reducing retention periods
4. Review and adjust resource limits
## Rollback Procedure
If something goes wrong:
1. Remove the deployment:
```bash
./cleanup.sh
```
2. Fix the issue in configuration files
3. Redeploy:
```bash
./deploy.sh
```
## Success Criteria
A successful deployment meets all of the following criteria:
- [x] All pods are in Running state
- [x] All PVCs are Bound
- [x] Grafana is accessible at https://grafana.betelgeusebytes.io
- [x] All three datasources (Prometheus, Loki, Tempo) test successfully
- [x] Prometheus shows targets as "up"
- [x] Loki shows logs from observability namespace
- [x] TLS certificate is valid and auto-renewing
- [x] Admin password has been changed
- [x] Resource usage is within acceptable limits
## Documentation References
- **README.md**: Comprehensive documentation
- **QUICKREF.md**: Quick reference for common operations
- **demo-app.yaml**: Example instrumented application
- **deploy.sh**: Automated deployment script
- **cleanup.sh**: Removal script
- **status.sh**: Status checking script
## Next Steps After Deployment
1. Import useful dashboards from Grafana.com
2. Configure alerts (requires Alertmanager, which is not included)
3. Instrument your applications to send logs/metrics/traces
4. Create custom dashboards for your specific needs
5. Set up backup procedures for Grafana dashboards
6. Document your team's observability practices
## Notes
- Default retention: 7 days for all components
- Default resources are optimized for a single-node cluster
- Scale up resources if monitoring high-traffic applications
- Always back up before making configuration changes
- Test changes in a non-production environment first
---
**Deployment Date**: _______________
**Deployed By**: _______________
**Grafana Version**: 11.4.0
**Stack Version**: January 2025