# Observability Stack Quick Reference
## Before You Start
### Remove Old Monitoring Stack
If you have existing monitoring components, remove them first:
```bash
./remove-old-monitoring.sh
```
This will safely remove:
- Prometheus, Grafana, Loki, Tempo deployments
- Fluent Bit, Vector, or other log collectors
- Helm releases
- ConfigMaps, PVCs, RBAC resources
- Prometheus Operator CRDs
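To confirm the old stack is really gone before installing the new one, a quick check (a sketch; adjust the name patterns to whatever your previous install used):
```bash
# Look for leftover monitoring pods anywhere in the cluster
kubectl get pods -A | grep -Ei 'prometheus|grafana|loki|tempo|fluent|vector' || echo "no old monitoring pods found"
# Look for orphaned Prometheus Operator CRDs
kubectl get crd | grep -i monitoring.coreos.com || echo "no Prometheus Operator CRDs found"
```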
## Quick Access
- **Grafana UI**: https://grafana.betelgeusebytes.io
- **Default Login**: admin / admin (change immediately!)
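To change the admin password from the CLI instead of the UI, one option is `grafana-cli` inside the Grafana pod (a sketch assuming the pod is named `grafana-0`, as used elsewhere in this guide):
```bash
kubectl exec -it -n observability grafana-0 -- \
  grafana-cli admin reset-admin-password 'new-strong-password'
```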
## Essential Commands
### Check Status
```bash
# Quick status check
./status.sh
# View all pods
kubectl get pods -n observability -o wide
# Check specific component
kubectl get pods -n observability -l app=prometheus
kubectl get pods -n observability -l app=loki
kubectl get pods -n observability -l app=tempo
kubectl get pods -n observability -l app=grafana
# Check storage
kubectl get pv
kubectl get pvc -n observability
```
### View Logs
```bash
# Grafana
kubectl logs -n observability -l app=grafana -f
# Prometheus
kubectl logs -n observability -l app=prometheus -f
# Loki
kubectl logs -n observability -l app=loki -f
# Tempo
kubectl logs -n observability -l app=tempo -f
# Alloy (log collector)
kubectl logs -n observability -l app=alloy -f
```
### Restart Components
```bash
# Restart Prometheus
kubectl rollout restart statefulset/prometheus -n observability
# Restart Loki
kubectl rollout restart statefulset/loki -n observability
# Restart Tempo
kubectl rollout restart statefulset/tempo -n observability
# Restart Grafana
kubectl rollout restart statefulset/grafana -n observability
# Restart Alloy
kubectl rollout restart daemonset/alloy -n observability
```
### Update Configurations
```bash
# Edit Prometheus config
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability
# Edit Loki config
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability
# Edit Tempo config
kubectl edit configmap tempo-config -n observability
kubectl rollout restart statefulset/tempo -n observability
# Edit Alloy config
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
# Edit Grafana datasources
kubectl edit configmap grafana-datasources -n observability
kubectl rollout restart statefulset/grafana -n observability
```
## Common LogQL Queries (Loki)
### Basic Queries
```logql
# All logs from observability namespace
{namespace="observability"}
# Logs from specific app
{namespace="observability", app="prometheus"}
# Filter by log level
{namespace="default"} |= "error"
{namespace="default"} | json | level="error"
# Exclude certain logs
{namespace="default"} != "health check"
# Multiple filters
{namespace="default"} |= "error" != "ignore"
```
### Advanced Queries
```logql
# Rate of errors
rate({namespace="default"} |= "error" [5m])
# Count logs by level
sum by (level) (count_over_time({namespace="default"} | json [5m]))
# Top 10 error messages
topk(10, sum by (message) (
  count_over_time({namespace="default"} | json | level="error" [5m])
))
```
## Common PromQL Queries (Prometheus)
### Cluster Health
```promql
# All targets up/down
up
# Pods by phase
kube_pod_status_phase{namespace="observability"}
# Node memory available
node_memory_MemAvailable_bytes
# Node CPU usage (user mode)
rate(node_cpu_seconds_total{mode="user"}[5m])
```
### Container Metrics
```promql
# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])
# Memory usage by container
container_memory_usage_bytes
# Network traffic
rate(container_network_transmit_bytes_total[5m])
rate(container_network_receive_bytes_total[5m])
```
### Application Metrics
```promql
# HTTP request rate
rate(http_requests_total[5m])
# Request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
```
## Trace Search (Tempo)
In Grafana Explore with Tempo datasource:
- **Search by service**: Select from dropdown
- **Search by duration**: "> 1s", "< 100ms"
- **Search by tag**: `http.status_code=500`
- **TraceQL**: `{span.http.method="POST" && span.http.status_code>=400}`
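The same searches can be run against Tempo's HTTP API, which is handy when debugging without Grafana. A sketch using the in-cluster debug-pod pattern from the troubleshooting section, assuming a Tempo 2.x-style `/api/search` endpoint on port 3200:
```bash
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -G http://tempo.observability.svc.cluster.local:3200/api/search \
  --data-urlencode 'q={span.http.status_code>=400}' \
  --data-urlencode 'limit=10'
```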
## Correlations
### From Logs to Traces
1. View logs in Loki
2. Click on a log line with a trace ID
3. Click the "Tempo" link
4. Trace opens in Tempo
### From Traces to Logs
1. View trace in Tempo
2. Click on a span
3. Click "Logs for this span"
4. Related logs appear
### From Traces to Metrics
1. View trace in Tempo
2. Service graph shows metrics
3. Click service to see metrics
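Logs-to-traces linking depends on the Loki datasource having a derived field that extracts the trace ID from log lines. If the "Tempo" link is missing, inspect the provisioned datasource config (a sketch; `derivedFields` is the standard Grafana provisioning key, but your ConfigMap layout may differ):
```bash
kubectl get configmap grafana-datasources -n observability -o yaml | grep -B2 -A6 derivedFields
```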
## Demo Application
Deploy the demo app to test the stack:
```bash
kubectl apply -f demo-app.yaml
# Wait for it to start
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
# Test it
kubectl port-forward -n observability svc/demo-app 8080:8080
# In another terminal
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/item/0
curl http://localhost:8080/slow
curl http://localhost:8080/error
```
Now view in Grafana:
- **Logs**: Search `{app="demo-app"}` in Loki
- **Traces**: Search "demo-app" service in Tempo
- **Metrics**: Query `flask_http_request_total` in Prometheus
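To generate a steady stream of logs, traces, and metrics to look at, run a simple load loop against the port-forwarded demo app (a sketch; stop it with Ctrl-C):
```bash
# Mix of normal, slow, and failing requests
while true; do
  curl -s http://localhost:8080/items > /dev/null
  curl -s http://localhost:8080/slow > /dev/null
  curl -s http://localhost:8080/error > /dev/null
  sleep 1
done
```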
## Storage Management
### Check Disk Usage
```bash
# On hetzner-2 node
df -h /mnt/local-ssd/
# Detailed usage
du -sh /mnt/local-ssd/*
```
### Cleanup Old Data
Data is automatically deleted after 7 days. To manually adjust retention:
**Prometheus** (in 03-prometheus-config.yaml):
```yaml
args:
- '--storage.tsdb.retention.time=7d'
```
**Loki** (in 04-loki-config.yaml):
```yaml
limits_config:
  retention_period: 168h  # 7 days
```
**Tempo** (in 05-tempo-config.yaml):
```yaml
compactor:
  compaction:
    block_retention: 168h  # 7 days
```
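After changing a retention value, re-apply the manifest and restart the component so it picks up the new setting (a sketch using the file names above):
```bash
kubectl apply -f 03-prometheus-config.yaml
kubectl rollout restart statefulset/prometheus -n observability
# Same pattern for Loki (04-loki-config.yaml) and Tempo (05-tempo-config.yaml)
```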
## Troubleshooting
### No Logs Appearing
```bash
# Check Alloy is running
kubectl get pods -n observability -l app=alloy
# Check Alloy logs
kubectl logs -n observability -l app=alloy
# Check Loki
kubectl logs -n observability -l app=loki
# Test Loki endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
```
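To rule out the collector entirely, you can push a test line straight to Loki's push API and then look for it in Grafana (a sketch; `/loki/api/v1/push` is the standard Loki endpoint, and timestamps must be Unix epoch nanoseconds):
```bash
# Timestamp in nanoseconds, computed locally before the pod starts
TS="$(date +%s)000000000"
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s -X POST http://loki.observability.svc.cluster.local:3100/loki/api/v1/push \
  -H 'Content-Type: application/json' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"manual-test\"},\"values\":[[\"$TS\",\"hello from debug pod\"]]}]}"
# Then query {job="manual-test"} in Grafana Explore (Loki datasource)
```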
### No Traces Appearing
```bash
# Check Tempo is running
kubectl get pods -n observability -l app=tempo
# Check Tempo logs
kubectl logs -n observability -l app=tempo
# Test Tempo endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready
# Verify your app sends to correct endpoint
# Should be: tempo.observability.svc.cluster.local:4317 (gRPC)
# or: tempo.observability.svc.cluster.local:4318 (HTTP)
```
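To confirm spans are actually arriving, Tempo's own /metrics endpoint exposes ingestion counters (a sketch; exact metric names vary by Tempo version, `tempo_distributor_spans_received_total` is the usual one):
```bash
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://tempo.observability.svc.cluster.local:3200/metrics | grep spans_received
```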
### Grafana Can't Connect to Datasources
```bash
# Check all services are running
kubectl get svc -n observability
# Test from Grafana pod
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://loki.observability.svc.cluster.local:3100/ready
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://tempo.observability.svc.cluster.local:3200/ready
```
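You can also ask Grafana itself which datasources it has provisioned and whether their URLs look right (a sketch using basic auth with the admin account; an API token works too):
```bash
kubectl port-forward -n observability svc/grafana 3000:3000
# In another terminal
curl -s -u admin:<PASSWORD> http://localhost:3000/api/datasources | jq '.[] | {name, type, url}'
```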
### High Resource Usage
```bash
# Check resource usage
kubectl top pods -n observability
kubectl top nodes
# Scale down if needed (for testing)
kubectl scale statefulset/prometheus -n observability --replicas=0
kubectl scale statefulset/loki -n observability --replicas=0
```
## Backup and Restore
### Backup Grafana Dashboards
```bash
# List all dashboards via the API (port-forward first)
kubectl port-forward -n observability svc/grafana 3000:3000
# In another terminal
curl -s -H "Authorization: Bearer <API_KEY>" \
  "http://localhost:3000/api/search?type=dash-db" | jq
```
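The search call above only lists dashboards. To save each one as JSON, loop over the returned UIDs and fetch them from `/api/dashboards/uid/<uid>` (a sketch, still assuming the port-forward and API key from above):
```bash
for uid in $(curl -s -H "Authorization: Bearer <API_KEY>" \
    "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer <API_KEY>" \
    "http://localhost:3000/api/dashboards/uid/${uid}" | jq '.dashboard' > "dashboard-${uid}.json"
done
```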
### Backup Configurations
```bash
# Backup all ConfigMaps
kubectl get configmap -n observability -o yaml > configmaps-backup.yaml
# Backup specific config
kubectl get configmap prometheus-config -n observability -o yaml > prometheus-config-backup.yaml
```
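Restoring is the reverse: re-apply the saved manifests, then restart the affected components so they reload the ConfigMaps (a sketch):
```bash
kubectl apply -f configmaps-backup.yaml
# Restart components so they pick up the restored configs
kubectl rollout restart statefulset/prometheus -n observability
kubectl rollout restart statefulset/loki -n observability
kubectl rollout restart statefulset/tempo -n observability
kubectl rollout restart statefulset/grafana -n observability
kubectl rollout restart daemonset/alloy -n observability
```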
## Useful Dashboards in Grafana
After login, import these dashboard IDs:
- **315**: Kubernetes cluster monitoring
- **7249**: Kubernetes cluster
- **13639**: Loki dashboard
- **12611**: Tempo dashboard
- **3662**: Prometheus 2.0 stats
- **1860**: Node Exporter Full
Go to: Dashboards → Import → Enter ID → Load
## Performance Tuning
### For Higher Load
Increase resources in the respective YAML files, for example:
```yaml
resources:
  requests:
    cpu: 1000m     # from 500m
    memory: 4Gi    # from 2Gi
  limits:
    cpu: 4000m     # from 2000m
    memory: 8Gi    # from 4Gi
```
### For Lower Resource Usage
- Increase the Prometheus scrape interval (scrape less often; see the sketch after this list)
- Reduce log retention periods
- Reduce trace sampling rate
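For the first item, the global scrape interval lives in the Prometheus config; raising it (e.g. from 15s to 60s) cuts sample volume roughly proportionally (a sketch, assuming the setting is in the prometheus-config ConfigMap used elsewhere in this guide):
```bash
# See the current value
kubectl get configmap prometheus-config -n observability -o yaml | grep scrape_interval
# Edit it, then restart Prometheus
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability
```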
## Security Checklist
- [ ] Change Grafana admin password
- [ ] Review RBAC permissions
- [ ] Enable audit logging
- [ ] Consider adding NetworkPolicies
- [ ] Review ingress TLS configuration
- [ ] Backup configurations regularly
## Getting Help
1. Check component logs first
2. Review configurations
3. Test network connectivity
4. Check resource availability
5. Review Grafana datasource settings