# Observability Stack Deployment Checklist

Use this checklist to ensure a smooth deployment of the observability stack.

## Pre-Deployment

### Check for Existing Monitoring Stack
- [ ] Check if you have existing monitoring components:
```bash
# Check for monitoring namespaces
kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"

# Check for monitoring pods in common namespaces
kubectl get pods -n monitoring 2>/dev/null || true
kubectl get pods -n prometheus 2>/dev/null || true
kubectl get pods -n grafana 2>/dev/null || true
kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"

# Check for Helm releases
helm list -A | grep -E "(prometheus|grafana|loki|tempo)"
```

- [ ] If existing monitoring is found, remove it first:
```bash
./remove-old-monitoring.sh
```

**OR** run the deployment script which will prompt you:
```bash
./deploy.sh  # Will ask if you want to clean up first
```

### Prerequisites
- [ ] Kubernetes cluster is running
- [ ] NGINX Ingress Controller is installed
- [ ] cert-manager is installed with Let's Encrypt ClusterIssuer
- [ ] DNS record `grafana.betelgeusebytes.io` points to cluster IP
- [ ] Node is labeled `kubernetes.io/hostname=hetzner-2`
- [ ] kubectl is configured and working

### Verify Prerequisites
```bash
# Check cluster
kubectl cluster-info

# Check NGINX Ingress
kubectl get pods -n ingress-nginx

# Check cert-manager
kubectl get pods -n cert-manager

# Check node label
kubectl get nodes --show-labels | grep hetzner-2

# Check DNS (from external machine)
dig grafana.betelgeusebytes.io
```

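The DNS check can be turned into a pass/fail test instead of reading `dig` output by eye. The following is a minimal sketch, not part of the stack's scripts: `dns_points_to` is a hypothetical helper name, and `203.0.113.10` is a placeholder documentation address that you would replace with your cluster's public IP.

```bash
# Hypothetical helper: succeeds if the expected IP appears in a list of
# resolved addresses read from stdin (one per line, as `dig +short` prints).
dns_points_to() {
  # -F fixed string, -x whole-line match, -q quiet (exit status only)
  grep -Fxq -- "$1"
}

# Usage (assumption: replace 203.0.113.10 with your cluster's public IP):
#   dig +short grafana.betelgeusebytes.io | dns_points_to 203.0.113.10 \
#     && echo "DNS OK" || echo "DNS mismatch"
```

Matching whole lines avoids a false positive when the expected address is only a prefix of the resolved one (for example `203.0.113.1` against `203.0.113.10`).
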
## Deployment Steps

### Step 1: Prepare Storage
- [ ] SSH into hetzner-2 node
- [ ] Create directories:
```bash
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
```
- [ ] Set correct permissions:
```bash
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana
```
- [ ] Verify permissions:
```bash
ls -la /mnt/local-ssd/
```

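Ownership can also be checked mechanically rather than by scanning `ls -la` output. This is a small sketch under stated assumptions: `check_owner` is a hypothetical helper, and it relies on GNU coreutils `stat` (the `-c` flag), which is typical on Linux nodes but not on BSD or macOS.

```bash
# Hypothetical helper: compare a path's numeric owner (uid:gid) with the
# expected value. Assumes GNU coreutils `stat` (the -c flag).
check_owner() {
  path="$1"
  expected="$2"
  actual="$(stat -c '%u:%g' "$path")" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK   $path $actual"
  else
    echo "FAIL $path is $actual, expected $expected"
    return 1
  fi
}

# Usage on hetzner-2, with the IDs from the chown commands (root:root is 0:0):
#   check_owner /mnt/local-ssd/prometheus 65534:65534
#   check_owner /mnt/local-ssd/loki 10001:10001
#   check_owner /mnt/local-ssd/tempo 0:0
#   check_owner /mnt/local-ssd/grafana 472:472
```
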
### Step 2: Review Configuration
- [ ] Review `03-prometheus-config.yaml` - verify scrape targets
- [ ] Review `04-loki-config.yaml` - verify retention (7 days)
- [ ] Review `05-tempo-config.yaml` - verify retention (7 days)
- [ ] Review `06-alloy-config.yaml` - verify endpoints
- [ ] Review `20-grafana-ingress.yaml` - verify domain name

### Step 3: Deploy the Stack
- [ ] Navigate to observability-stack directory
```bash
cd /path/to/observability-stack
```
- [ ] Make scripts executable (if they are not already):
```bash
chmod +x *.sh
```
- [ ] Run deployment script:
```bash
./deploy.sh
```
OR deploy manually:
```bash
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
```

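Because the manifests carry two-digit prefixes, the manual sequence can be derived from the file names instead of typed out. The sketch below is one way to do that; `list_manifests` is a hypothetical helper, not something `deploy.sh` is known to contain, and it assumes every manifest to apply matches `NN-*.yaml` (so `demo-app.yaml` is skipped).

```bash
# Hypothetical helper: print the numbered manifests in a directory in
# ascending order. Shell globs expand in sorted order, so 00-* comes
# before 10-*, which comes before 20-*.
list_manifests() {
  dir="${1:-.}"
  for f in "$dir"/[0-9][0-9]-*.yaml; do
    # Skip the unexpanded pattern when no file matches.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage, stopping at the first manifest that fails to apply:
#   list_manifests . | while read -r f; do
#     kubectl apply -f "$f" || break
#   done
```

Stopping at the first failure keeps later resources (for example the workloads in `10-*` onward) from being applied before the configuration they depend on.
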
### Step 4: Verify Deployment
- [ ] Run status check:
```bash
./status.sh
```
- [ ] Check all PersistentVolumes are Bound:
```bash
kubectl get pv
```
- [ ] Check all PersistentVolumeClaims are Bound:
```bash
kubectl get pvc -n observability
```
- [ ] Check all pods are Running:
```bash
kubectl get pods -n observability
```
Expected pods:
- [x] prometheus-0
- [x] loki-0
- [x] tempo-0
- [x] grafana-0
- [x] alloy-xxxxx (one per node)
- [x] kube-state-metrics-xxxxx
- [x] node-exporter-xxxxx (one per node)

- [ ] Check services are created:
```bash
kubectl get svc -n observability
```
- [ ] Check ingress is created:
```bash
kubectl get ingress -n observability
```
- [ ] Verify TLS certificate is issued:
```bash
kubectl get certificate -n observability
kubectl describe certificate grafana-tls -n observability
```

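The "all pods are Running" check can be reduced to an exit status, which is convenient in scripts. This is a sketch rather than what `status.sh` does: `pods_not_running` is a hypothetical helper that parses the default `kubectl get pods --no-headers` columns (NAME, READY, STATUS, RESTARTS, AGE).

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print every pod whose STATUS column (field 3) is not Running.
# Exits non-zero if any such pod is found.
pods_not_running() {
  awk '$3 != "Running" { print $1 " is " $3; bad = 1 } END { exit bad + 0 }'
}

# Usage:
#   kubectl get pods -n observability --no-headers | pods_not_running \
#     && echo "all pods Running"
```

Note that this only inspects the STATUS column; a pod can be Running while its containers are not yet ready, so the READY column still deserves a look.
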
### Step 5: Test Connectivity
- [ ] Test Prometheus endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
```
- [ ] Test Loki endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
```
- [ ] Test Tempo endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready
```
- [ ] Test Grafana endpoint:
```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://grafana.observability.svc.cluster.local:3000/api/health
```

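The four checks differ only in the URL, so they can be driven from one list. The sketch below keeps the same image and endpoints as the commands in this step; `health_endpoints` and `check_endpoints` are hypothetical helper names, and the `test-<name>` pod names are an assumption chosen to avoid name clashes between runs.

```bash
# The four in-cluster health endpoints from this step, one "name url" pair per line.
health_endpoints() {
  cat <<'EOF'
prometheus http://prometheus.observability.svc.cluster.local:9090/-/healthy
loki http://loki.observability.svc.cluster.local:3100/ready
tempo http://tempo.observability.svc.cluster.local:3200/ready
grafana http://grafana.observability.svc.cluster.local:3000/api/health
EOF
}

# Hypothetical helper: run one short-lived curl pod per endpoint and print
# each response under a header. Requires kubectl access to the cluster.
check_endpoints() {
  health_endpoints | while read -r name url; do
    echo "== $name"
    # stdin comes from /dev/null so kubectl does not consume the loop's input
    kubectl run "test-$name" -i --rm --restart=Never \
      --image=curlimages/curl -- curl -sS --max-time 10 "$url" </dev/null
  done
}
```

Calling `check_endpoints` prints each service name followed by whatever its health endpoint returns; the `-t` flag from the manual commands is dropped because no terminal is attached inside the loop.
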
## Post-Deployment Configuration

### Step 6: Access Grafana
- [ ] Open browser to: https://grafana.betelgeusebytes.io
- [ ] Log in with default credentials:
  - Username: `admin`
  - Password: `admin`
- [ ] **CRITICAL**: Change admin password immediately
- [ ] Verify datasources are configured:
  - Go to Connections → Data sources (Configuration → Data Sources in older Grafana versions)
  - Should see: Prometheus (default), Loki, Tempo
  - Click "Test" on each datasource

### Step 7: Verify Data Collection
- [ ] Check Prometheus has targets:
  - In Grafana, Explore → Prometheus
  - Query: `up`
  - Should see multiple targets with value=1
- [ ] Check Loki is receiving logs:
  - In Grafana, Explore → Loki
  - Query: `{namespace="observability"}`
  - Should see logs from observability stack
- [ ] Check kube-state-metrics:
  - In Grafana, Explore → Prometheus
  - Query: `kube_pod_status_phase`
  - Should see pod status metrics

### Step 8: Import Dashboards (Optional)
- [ ] Import Kubernetes cluster dashboard:
  - Dashboards → Import → ID: 315
- [ ] Import Node Exporter dashboard:
  - Dashboards → Import → ID: 1860
- [ ] Import Loki dashboard:
  - Dashboards → Import → ID: 13639

### Step 9: Test with Demo App (Optional)
- [ ] Deploy demo application:
```bash
kubectl apply -f demo-app.yaml
```
- [ ] Wait for pod to be ready:
```bash
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
```
- [ ] Test the endpoints:
```bash
kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal:
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/slow
curl http://localhost:8080/error
```
- [ ] Verify in Grafana:
  - Logs: `{app="demo-app"}`
  - Metrics: `flask_http_request_total`
  - Traces: Search for "demo-app" service in Tempo

## Monitoring and Maintenance

### Daily Checks
- [ ] Check pod status: `kubectl get pods -n observability`
- [ ] Check resource usage: `kubectl top pods -n observability`
- [ ] Check disk usage on hetzner-2: `df -h /mnt/local-ssd/`

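
The daily disk check is easier to keep up if it flags a problem on its own. Below is a minimal sketch, assuming the POSIX `df -P` output format (six columns, usage in the fifth); `disk_over_threshold` is a hypothetical helper, and the 80% default is an arbitrary example rather than a recommendation from this stack.

```bash
# Hypothetical helper: read `df -P` output on stdin and print any filesystem
# whose usage is at or above the given percentage (default 80, an arbitrary
# example threshold). Exits non-zero if any is found.
disk_over_threshold() {
  awk -v limit="${1:-80}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)              # "42%" -> "42"
      if (use + 0 >= limit + 0) {
        print $6 " is at " $5
        over = 1
      }
    }
    END { exit over + 0 }
  '
}

# Usage on hetzner-2:
#   df -P /mnt/local-ssd/ | disk_over_threshold 80
```
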
### Weekly Checks
- [ ] Review Grafana for any alerts or anomalies
- [ ] Verify TLS certificate is valid
- [ ] Check logs for any errors:
```bash
kubectl logs -n observability -l app=prometheus --tail=100
kubectl logs -n observability -l app=loki --tail=100
kubectl logs -n observability -l app=tempo --tail=100
kubectl logs -n observability -l app=grafana --tail=100
```

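Reading 400 log lines by eye is slow, so a rough error count per component can help decide where to look first. This is a sketch under an assumption: it expects logfmt-style lines containing `level=error` (matched case-insensitively), which may not hold if a component is configured for JSON logging. `count_error_lines` is a hypothetical helper.

```bash
# Hypothetical helper: count log lines on stdin that carry an error level.
# Assumes logfmt-style output containing `level=error` in any letter case.
count_error_lines() {
  # grep -c prints 0 and exits 1 when nothing matches, so mask the status.
  grep -ci 'level=error' || true
}

# Usage, one count per component:
#   for app in prometheus loki tempo grafana; do
#     printf '%s: ' "$app"
#     kubectl logs -n observability -l app="$app" --tail=100 | count_error_lines
#   done
```

A non-zero count is only a pointer; the matching lines themselves still need to be read in the full `kubectl logs` output.
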
### Monthly Checks
- [ ] Review retention policies (confirm 7 days is still appropriate)
- [ ] Check storage growth trends
- [ ] Review and update dashboards
- [ ] Back up Grafana dashboards and configs

## Troubleshooting Guide

### Pod Won't Start
1. Check events: `kubectl describe pod <pod-name> -n observability`
2. Check logs: `kubectl logs <pod-name> -n observability`
3. Check storage: `kubectl get pv` and `kubectl get pvc -n observability`
4. Verify node has space: SSH to hetzner-2 and run `df -h`

### No Logs Appearing
1. Check Alloy pods: `kubectl get pods -n observability -l app=alloy`
2. Check Alloy logs: `kubectl logs -n observability -l app=alloy`
3. Check Loki is running: `kubectl get pods -n observability -l app=loki`
4. Test Loki endpoint from Alloy pod

### No Metrics Appearing
1. Check Prometheus targets: Port-forward and visit http://localhost:9090/targets
2. Check service discovery: Look for "kubernetes-*" targets
3. Verify RBAC: `kubectl get clusterrolebinding prometheus`
4. Check kube-state-metrics: `kubectl get pods -n observability -l app=kube-state-metrics`

### Grafana Can't Connect to Datasources
1. Test from Grafana pod:
```bash
kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
```
2. Check datasource configuration in Grafana UI
3. Verify services exist: `kubectl get svc -n observability`

### High Resource Usage
1. Check actual usage: `kubectl top pods -n observability`
2. Check node capacity: `kubectl top nodes`
3. Consider reducing retention periods
4. Review and adjust resource limits

## Rollback Procedure

If something goes wrong:

1. Remove the deployment:
```bash
./cleanup.sh
```

2. Fix the issue in configuration files

3. Redeploy:
```bash
./deploy.sh
```

## Success Criteria

All checked items below indicate a successful deployment:

- [x] All pods are in Running state
- [x] All PVCs are Bound
- [x] Grafana is accessible at https://grafana.betelgeusebytes.io
- [x] All three datasources (Prometheus, Loki, Tempo) test successfully
- [x] Prometheus shows targets as "up"
- [x] Loki shows logs from observability namespace
- [x] TLS certificate is valid and auto-renewing
- [x] Admin password has been changed
- [x] Resource usage is within acceptable limits

## Documentation References

- **README.md**: Comprehensive documentation
- **QUICKREF.md**: Quick reference for common operations
- **demo-app.yaml**: Example instrumented application
- **deploy.sh**: Automated deployment script
- **cleanup.sh**: Removal script
- **status.sh**: Status checking script

## Next Steps After Deployment

1. Import useful dashboards from Grafana.com
2. Configure alerts (requires Alertmanager - not included)
3. Instrument your applications to send logs/metrics/traces
4. Create custom dashboards for your specific needs
5. Set up backup procedures for Grafana dashboards
6. Document your team's observability practices

## Notes

- Default retention: 7 days for all components
- Default resources are optimized for a single-node cluster
- Scale up resources if monitoring high-traffic applications
- Always back up before making configuration changes
- Test changes in a non-production environment first

---

**Deployment Date**: _______________
**Deployed By**: _______________
**Grafana Version**: 11.4.0
**Stack Version**: January 2025