Observability Stack Deployment Checklist

Use this checklist to ensure a smooth deployment of the observability stack.

Pre-Deployment

Check for Existing Monitoring Stack

Check if you have existing monitoring components:

# Check for monitoring namespaces
kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"

# Check for monitoring pods in common namespaces
kubectl get pods -n monitoring 2>/dev/null || true
kubectl get pods -n prometheus 2>/dev/null || true
kubectl get pods -n grafana 2>/dev/null || true
kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"

# Check for Helm releases
helm list -A | grep -E "(prometheus|grafana|loki|tempo)"

If existing monitoring is found, remove it first:

./remove-old-monitoring.sh

Alternatively, run the deployment script, which prompts you to clean up first:

./deploy.sh  # Will ask if you want to clean up first

Prerequisites

Kubernetes cluster is running
NGINX Ingress Controller is installed
cert-manager is installed with Let's Encrypt ClusterIssuer
DNS record grafana.betelgeusebytes.io points to cluster IP
Node is labeled kubernetes.io/hostname=hetzner-2
kubectl is configured and working

Verify Prerequisites

# Check cluster
kubectl cluster-info

# Check NGINX Ingress
kubectl get pods -n ingress-nginx

# Check cert-manager
kubectl get pods -n cert-manager

# Check node label
kubectl get nodes -l kubernetes.io/hostname=hetzner-2

# Check DNS (from external machine)
dig grafana.betelgeusebytes.io
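
The prerequisite list mentions a Let's Encrypt ClusterIssuer and a DNS record, but the commands above only list pods and print the raw dig answer. The following is a minimal sketch of two extra checks, assuming `dig` is installed locally. The helper name `dns_points_to` and the `EXPECTED_IP` value are hypothetical placeholders, not part of the stack's scripts.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the expected IP appears as an exact line
# in `dig +short` output supplied on stdin.
dns_points_to() {
  local expected_ip="$1"
  grep -Fxq -- "$expected_ip"
}

# Example usage (from a machine with dig and cluster access):
#   EXPECTED_IP=203.0.113.10   # placeholder; substitute your cluster's public IP
#   dig +short grafana.betelgeusebytes.io | dns_points_to "$EXPECTED_IP" \
#     && echo "DNS OK" || echo "DNS does not point to $EXPECTED_IP"
#
#   kubectl get clusterissuer   # lists cert-manager ClusterIssuers and whether they are ready
```

The exact-line match avoids a false positive when one address is a prefix of another.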

Deployment Steps

Step 1: Prepare Storage

SSH into hetzner-2 node
Create directories:

sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}

Set correct permissions:

sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
sudo chown -R 10001:10001 /mnt/local-ssd/loki
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana

Verify permissions:

ls -la /mnt/local-ssd/
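
Reading `ls -la` output by eye is error-prone, so the ownership check can also be scripted. This is a sketch only: it assumes GNU `stat` (as found on typical Linux nodes) and that `root:root` corresponds to `0:0`. The function names and the `BASE_DIR` variable are illustrative, not part of the stack.

```shell
#!/usr/bin/env bash
BASE_DIR="${BASE_DIR:-/mnt/local-ssd}"

# Expected numeric owner (UID:GID) for each component, taken from the chown
# commands in this step.
expected_owner() {
  case "$1" in
    prometheus) echo "65534:65534" ;;
    loki)       echo "10001:10001" ;;
    tempo)      echo "0:0" ;;
    grafana)    echo "472:472" ;;
    *)          return 1 ;;
  esac
}

# Compare the actual owner of each directory with the expected one.
# Returns non-zero if any directory is missing or has the wrong owner.
check_storage() {
  local status=0 component dir actual expected
  for component in prometheus loki tempo grafana; do
    dir="$BASE_DIR/$component"
    expected="$(expected_owner "$component")"
    if [ ! -d "$dir" ]; then
      echo "MISSING  $dir"
      status=1
      continue
    fi
    actual="$(stat -c '%u:%g' "$dir")"
    if [ "$actual" = "$expected" ]; then
      echo "OK       $dir ($actual)"
    else
      echo "MISMATCH $dir (have $actual, want $expected)"
      status=1
    fi
  done
  return "$status"
}
```

After sourcing the file on hetzner-2, run `check_storage`; a non-zero exit status means at least one directory needs attention.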

Step 2: Review Configuration

Review 03-prometheus-config.yaml - verify scrape targets
Review 04-loki-config.yaml - verify retention (7 days)
Review 05-tempo-config.yaml - verify retention (7 days)
Review 06-alloy-config.yaml - verify endpoints
Review 20-grafana-ingress.yaml - verify domain name

Step 3: Deploy the Stack

Navigate to observability-stack directory

cd /path/to/observability-stack

Make scripts executable (skip if already done):

chmod +x *.sh

Run deployment script:

./deploy.sh

OR deploy manually:

kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
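
The manual sequence above follows the numeric file prefixes, so it can be expressed as a loop that stops at the first failure. This sketch assumes the directory holds only the numbered manifests listed above (files such as demo-app.yaml are skipped because they lack a two-digit prefix). `apply_manifests` and the `KUBECTL` override are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. KUBECTL=echo for a dry run

# Apply every NN-*.yaml manifest in the given directory in sorted order,
# stopping at the first failure.
apply_manifests() {
  local dir="${1:-.}" manifest
  for manifest in "$dir"/[0-9][0-9]-*.yaml; do
    # An unmatched glob stays literal, so check that the file really exists.
    if [ ! -e "$manifest" ]; then
      echo "No numbered manifests found in $dir" >&2
      return 1
    fi
    echo "Applying $manifest"
    if ! "$KUBECTL" apply -f "$manifest"; then
      echo "Failed on $manifest; stopping" >&2
      return 1
    fi
  done
}
```

Running `KUBECTL=echo apply_manifests .` first prints the commands without touching the cluster, which is a cheap way to confirm the order.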

Step 4: Verify Deployment

Run status check:

./status.sh

Check all PersistentVolumes are Bound:

kubectl get pv

Check all PersistentVolumeClaims are Bound:

kubectl get pvc -n observability

Check all pods are Running:

kubectl get pods -n observability

Expected pods:

prometheus-0
loki-0
tempo-0
grafana-0
alloy-xxxxx (one per node)
kube-state-metrics-xxxxx
node-exporter-xxxxx (one per node)

Check services are created:

kubectl get svc -n observability

Check ingress is created:

kubectl get ingress -n observability

Verify TLS certificate is issued:

kubectl get certificate -n observability
kubectl describe certificate grafana-tls -n observability
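
The pod and certificate checks in this step can be made scriptable rather than visual. The sketch below is an assumption-laden helper, not part of status.sh: `not_running` is a hypothetical name, and it relies on the default `kubectl get pods` column order (NAME, READY, STATUS).

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin,
# prints every pod whose STATUS column is not "Running", and returns non-zero
# if any such pod was found.
not_running() {
  awk 'BEGIN { bad = 0 }
       $3 != "Running" { print $1 " (" $3 ")"; bad = 1 }
       END { exit bad }'
}

# Example usage:
#   kubectl get pods -n observability --no-headers | not_running \
#     && echo "all pods Running"
#
# Block until cert-manager marks the certificate Ready (or time out):
#   kubectl wait --for=condition=Ready certificate/grafana-tls \
#     -n observability --timeout=300s
```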

Step 5: Test Connectivity

Test Prometheus endpoint:

kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy

Test Loki endpoint:

kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready

Test Tempo endpoint:

kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready

Test Grafana endpoint:

kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://grafana.observability.svc.cluster.local:3000/api/health
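
The four tests above differ only in the URL, so they can be driven from one loop. This is a sketch under two assumptions: that `kubectl run` surfaces the container's exit status when attached with `-i`, and that a 10-second timeout is acceptable. The function names and the `KUBECTL` override are hypothetical.

```shell
#!/usr/bin/env bash
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. KUBECTL=echo for a dry run

# In-cluster health URL for each component, as listed in this step.
health_url() {
  local base="observability.svc.cluster.local"
  case "$1" in
    prometheus) echo "http://prometheus.$base:9090/-/healthy" ;;
    loki)       echo "http://loki.$base:3100/ready" ;;
    tempo)      echo "http://tempo.$base:3200/ready" ;;
    grafana)    echo "http://grafana.$base:3000/api/health" ;;
    *)          return 1 ;;
  esac
}

# Run one throwaway curl pod per component and summarize the results.
# curl -f turns HTTP error responses into a non-zero exit status.
check_endpoints() {
  local component failed=0
  for component in prometheus loki tempo grafana; do
    if "$KUBECTL" run "test-$component" -i --rm \
         --image=curlimages/curl --restart=Never -- \
         curl -fsS --max-time 10 "$(health_url "$component")"; then
      echo "OK   $component"
    else
      echo "FAIL $component"
      failed=1
    fi
  done
  return "$failed"
}
```

Giving each pod a distinct name (`test-prometheus`, `test-loki`, and so on) avoids a name clash if an earlier test pod has not finished terminating.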

Post-Deployment Configuration

Step 6: Access Grafana

Open browser to: https://grafana.betelgeusebytes.io
Login with default credentials:
- Username: admin
- Password: admin
CRITICAL: Change the admin password immediately after the first login
Verify datasources are configured:
- Go to Connections → Data sources (Configuration → Data Sources in older Grafana versions)
- Should see: Prometheus (default), Loki, Tempo
- Click "Test" on each datasource

Step 7: Verify Data Collection

Check Prometheus has targets:
- In Grafana, Explore → Prometheus
- Query: up
- Should see multiple targets with value=1
Check Loki is receiving logs:
- In Grafana, Explore → Loki
- Query: {namespace="observability"}
- Should see logs from observability stack
Check kube-state-metrics:
- In Grafana, Explore → Prometheus
- Query: kube_pod_status_phase
- Should see pod status metrics
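
The same queries can be run without the Grafana UI by calling the Prometheus and Loki HTTP APIs directly. The sketch below assumes port-forwards to the `prometheus` and `loki` services are already open on localhost:9090 and localhost:3100. The function names and the `CURL`, `PROM_URL`, and `LOKI_URL` variables are hypothetical.

```shell
#!/usr/bin/env bash
CURL="${CURL:-curl}"
PROM_URL="${PROM_URL:-http://localhost:9090}"
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Instant query against the Prometheus HTTP API; prints the JSON response.
prom_query() {
  "$CURL" -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Range query against the Loki HTTP API (Loki's default time window applies);
# the optional second argument caps the number of returned log lines.
loki_query() {
  "$CURL" -fsS -G "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$1" --data-urlencode "limit=${2:-10}"
}

# Example usage (with port-forwards running in other terminals):
#   kubectl port-forward -n observability svc/prometheus 9090:9090
#   kubectl port-forward -n observability svc/loki 3100:3100
#   prom_query 'up'
#   prom_query 'kube_pod_status_phase'
#   loki_query '{namespace="observability"}'
```

`--data-urlencode` handles the braces and quotes in LogQL selectors, which would otherwise need manual escaping in the URL.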

Step 8: Import Dashboards (Optional)

Import Kubernetes cluster dashboard:
- Dashboards → Import → ID: 315
Import Node Exporter dashboard:
- Dashboards → Import → ID: 1860
Import Loki dashboard:
- Dashboards → Import → ID: 13639

Step 9: Test with Demo App (Optional)

Deploy demo application:

kubectl apply -f demo-app.yaml

Wait for pod to be ready:

kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s

Test the endpoints:

kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal:
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/slow
curl http://localhost:8080/error

Verify in Grafana:
- Logs: {app="demo-app"}
- Metrics: flask_http_request_total
- Traces: Search for "demo-app" service in Tempo
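
A single request per endpoint produces very little to look at in Grafana, so it can help to send several rounds of traffic. This sketch assumes the port-forward from the previous command is still running on localhost:8080. `generate_traffic` and the `CURL` override are hypothetical names, and the default of 5 rounds is arbitrary.

```shell
#!/usr/bin/env bash
CURL="${CURL:-curl}"

# Hit each demo endpoint `rounds` times, printing the HTTP status per request.
generate_traffic() {
  local base_url="${1:-http://localhost:8080}" rounds="${2:-5}" i path
  for ((i = 1; i <= rounds; i++)); do
    for path in / /items /slow /error; do
      # Discard the body; report only the status code and the path.
      # Failures are tolerated so that one bad request does not end the run.
      "$CURL" -s -o /dev/null -w "%{http_code} $path\n" "$base_url$path" || true
    done
  done
}

# Example usage:
#   generate_traffic http://localhost:8080 20
```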

Monitoring and Maintenance

Daily Checks

Check pod status: kubectl get pods -n observability
Check resource usage: kubectl top pods -n observability
Check disk usage on hetzner-2: df -h /mnt/local-ssd/
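
The daily disk check lends itself to a threshold alert instead of a manual read of `df -h`. The following is a rough sketch to run on hetzner-2. The function names are hypothetical, and the 80% default is an arbitrary example value rather than a recommendation from this stack.

```shell
#!/usr/bin/env bash

# Print the used percentage (integer, without the % sign) of the filesystem
# that holds the given path. `df -P` guarantees one line per filesystem.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 1 with a warning when usage is at or above the limit,
# 2 when usage could not be determined, and 0 otherwise.
check_disk() {
  local path="${1:-/mnt/local-ssd}" limit="${2:-80}" used
  used="$(disk_usage_pct "$path")"
  if [ -z "$used" ]; then
    echo "ERROR: could not read disk usage for $path" >&2
    return 2
  fi
  if [ "$used" -ge "$limit" ]; then
    echo "WARNING: $path at ${used}% (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path at ${used}%"
}

# Example usage:
#   check_disk /mnt/local-ssd 80
```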

Weekly Checks

Review Grafana for any alerts or anomalies
Verify TLS certificate is valid
Check logs for any errors:

kubectl logs -n observability -l app=prometheus --tail=100
kubectl logs -n observability -l app=loki --tail=100
kubectl logs -n observability -l app=tempo --tail=100
kubectl logs -n observability -l app=grafana --tail=100
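
Scanning 400 log lines by hand each week is tedious, so a rough count of suspicious lines per component can serve as a first pass. This is a heuristic sketch: the keyword list is an assumption and will also match harmless lines that merely contain the word "error". The function names and the `KUBECTL` override are hypothetical.

```shell
#!/usr/bin/env bash
KUBECTL="${KUBECTL:-kubectl}"

# Count lines on stdin that mention error, fatal, or panic (case-insensitive).
# grep -c exits non-zero on zero matches but still prints 0, hence `|| true`.
count_error_lines() {
  grep -ciE 'error|fatal|panic' || true
}

# Print a per-component count over the last 100 log lines.
scan_component_logs() {
  local component count
  for component in prometheus loki tempo grafana; do
    count="$("$KUBECTL" logs -n observability -l "app=$component" --tail=100 2>/dev/null \
      | count_error_lines)"
    echo "$component: $count matching line(s) in the last 100"
  done
}

# Example usage:
#   scan_component_logs
```

A non-zero count is a prompt to read the full logs for that component, not a verdict in itself.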

Monthly Checks

Review retention policies (confirm the 7-day default is still appropriate)
Check storage growth trends
Review and update dashboards
Backup Grafana dashboards and configs

Troubleshooting Guide

Pod Won't Start

Check events: kubectl describe pod <pod-name> -n observability
Check logs: kubectl logs <pod-name> -n observability
Check storage: kubectl get pv and kubectl get pvc -n observability
Verify node has space: SSH to hetzner-2 and run df -h

No Logs Appearing

Check Alloy pods: kubectl get pods -n observability -l app=alloy
Check Alloy logs: kubectl logs -n observability -l app=alloy
Check Loki is running: kubectl get pods -n observability -l app=loki
Test the Loki endpoint (http://loki.observability.svc.cluster.local:3100/ready) from an Alloy pod

No Metrics Appearing

Check Prometheus targets: run kubectl port-forward -n observability svc/prometheus 9090:9090 and visit http://localhost:9090/targets
Check service discovery: Look for "kubernetes-*" targets
Verify RBAC: kubectl get clusterrolebinding prometheus
Check kube-state-metrics: kubectl get pods -n observability -l app=kube-state-metrics

Grafana Can't Connect to Datasources

Test from Grafana pod:

kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy

Check datasource configuration in Grafana UI
Verify services exist: kubectl get svc -n observability

High Resource Usage

Check actual usage: kubectl top pods -n observability
Check node capacity: kubectl top nodes
Consider reducing retention periods
Review and adjust resource limits

Rollback Procedure

If something goes wrong:

Remove the deployment:

./cleanup.sh

Fix the issue in configuration files
Redeploy:

./deploy.sh

Success Criteria

The deployment is successful when every item below is checked:

All pods are in Running state
All PVCs are Bound
Grafana is accessible at https://grafana.betelgeusebytes.io
All three datasources (Prometheus, Loki, Tempo) test successfully
Prometheus shows targets as "up"
Loki shows logs from observability namespace
TLS certificate is valid and auto-renewing
Admin password has been changed
Resource usage is within acceptable limits

Documentation References

README.md: Comprehensive documentation
QUICKREF.md: Quick reference for common operations
demo-app.yaml: Example instrumented application
deploy.sh: Automated deployment script
cleanup.sh: Removal script
status.sh: Status checking script

Next Steps After Deployment

Import useful dashboards from Grafana.com
Configure alerts (requires Alertmanager - not included)
Instrument your applications to send logs/metrics/traces
Create custom dashboards for your specific needs
Set up backup procedures for Grafana dashboards
Document your team's observability practices

Notes

Default retention: 7 days for all components
Default resources are optimized for single-node cluster
Scale up resources if monitoring high-traffic applications
Always backup before making configuration changes
Test changes in a non-production environment first

Deployment Date: _______________
Deployed By: _______________
Grafana Version: 11.4.0
Stack Version: January 2025