# Observability Stack Quick Reference

## Before You Start

### Remove Old Monitoring Stack

If you have existing monitoring components, remove them first:

```bash
./remove-old-monitoring.sh
```

This will safely remove:

- Prometheus, Grafana, Loki, Tempo deployments
- Fluent Bit, Vector, or other log collectors
- Helm releases
- ConfigMaps, PVCs, RBAC resources
- Prometheus Operator CRDs

## Quick Access

- **Grafana UI**: https://grafana.betelgeusebytes.io
- **Default Login**: admin / admin (change immediately!)

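One way to change the default password from the command line is Grafana's bundled CLI. The sketch below is a minimal example, assuming Grafana runs as the pod `grafana-0` in the `observability` namespace and that the image has `grafana-cli` on its `PATH`. The function name and the `KUBECTL`/`NAMESPACE` variables are illustrative and not part of this stack.

```bash
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-observability}"

# Reset the Grafana admin password inside the running pod.
# Assumption: the pod is named grafana-0 and ships grafana-cli.
reset_grafana_admin_password() {
  local new_password="${1:-}"
  if [ -z "$new_password" ]; then
    echo "usage: reset_grafana_admin_password <new-password>" >&2
    return 2
  fi
  "$KUBECTL" exec -n "$NAMESPACE" grafana-0 -- \
    grafana-cli admin reset-admin-password "$new_password"
}
```

Call it as `reset_grafana_admin_password 'a-long-random-password'`. The password ends up in your shell history, so changing it through the Grafana UI may be preferable on shared machines.
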
## Essential Commands

### Check Status

```bash
# Quick status check
./status.sh

# View all pods
kubectl get pods -n observability -o wide

# Check specific component
kubectl get pods -n observability -l app=prometheus
kubectl get pods -n observability -l app=loki
kubectl get pods -n observability -l app=tempo
kubectl get pods -n observability -l app=grafana

# Check storage
kubectl get pv
kubectl get pvc -n observability
```

### View Logs

```bash
# Grafana
kubectl logs -n observability -l app=grafana -f

# Prometheus
kubectl logs -n observability -l app=prometheus -f

# Loki
kubectl logs -n observability -l app=loki -f

# Tempo
kubectl logs -n observability -l app=tempo -f

# Alloy (log collector)
kubectl logs -n observability -l app=alloy -f
```

### Restart Components

```bash
# Restart Prometheus
kubectl rollout restart statefulset/prometheus -n observability

# Restart Loki
kubectl rollout restart statefulset/loki -n observability

# Restart Tempo
kubectl rollout restart statefulset/tempo -n observability

# Restart Grafana
kubectl rollout restart statefulset/grafana -n observability

# Restart Alloy
kubectl rollout restart daemonset/alloy -n observability
```

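To restart the whole stack in one pass, the individual commands above can be wrapped in a loop that waits for each rollout to finish before moving on. This is a sketch, assuming the workload kinds and names listed above. `restart_stack` and the `KUBECTL`/`NAMESPACE` variables are illustrative names.

```bash
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-observability}"

# Restart every component in order and wait for each rollout to complete.
restart_stack() {
  local target
  for target in \
    statefulset/prometheus \
    statefulset/loki \
    statefulset/tempo \
    statefulset/grafana \
    daemonset/alloy; do
    "$KUBECTL" rollout restart "$target" -n "$NAMESPACE" || return 1
    # Blocks until the rollout finishes or the timeout expires.
    "$KUBECTL" rollout status "$target" -n "$NAMESPACE" --timeout=300s || return 1
  done
}
```

Run `restart_stack` after sourcing the snippet. It stops at the first component that fails to restart or does not become ready within five minutes.
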
### Update Configurations

```bash
# Edit Prometheus config
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability

# Edit Loki config
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability

# Edit Tempo config
kubectl edit configmap tempo-config -n observability
kubectl rollout restart statefulset/tempo -n observability

# Edit Alloy config
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability

# Edit Grafana datasources
kubectl edit configmap grafana-datasources -n observability
kubectl rollout restart statefulset/grafana -n observability
```

## Common LogQL Queries (Loki)

### Basic Queries

```logql
# All logs from observability namespace
{namespace="observability"}

# Logs from specific app
{namespace="observability", app="prometheus"}

# Filter by log level
{namespace="default"} |= "error"
{namespace="default"} | json | level="error"

# Exclude certain logs
{namespace="default"} != "health check"

# Multiple filters
{namespace="default"} |= "error" != "ignore"
```

### Advanced Queries

```logql
# Rate of errors
rate({namespace="default"} |= "error" [5m])

# Count logs by level
sum by (level) (count_over_time({namespace="default"} | json [5m]))

# Top 10 error messages over the last 5 minutes
topk(10, sum by (message) (
  count_over_time({namespace="default"} | json | level="error" [5m])
))
```

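The same queries can be run outside Grafana through Loki's HTTP API, which is handy for scripting. The sketch below assumes Loki is reachable on `http://localhost:3100`, for example via `kubectl port-forward -n observability svc/loki 3100:3100`. `loki_query`, `CURL`, and `LOKI_URL` are illustrative names.

```bash
CURL="${CURL:-curl}"
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Run a LogQL query against Loki's query_range endpoint.
# $1: LogQL expression, $2: maximum number of entries (default 20).
# Without explicit start/end parameters Loki searches the last hour.
loki_query() {
  local query="${1:-}" limit="${2:-20}"
  if [ -z "$query" ]; then
    echo "usage: loki_query <logql> [limit]" >&2
    return 2
  fi
  # --data-urlencode takes care of braces, quotes, and pipes in the query.
  "$CURL" -sS -G "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$query" \
    --data-urlencode "limit=$limit"
}
```

For example, `loki_query '{namespace="default"} |= "error"' 5 | jq .` prints the five most recent matching entries as JSON.
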
## Common PromQL Queries (Prometheus)

### Cluster Health

```promql
# All targets up/down
up

# Pods by phase
kube_pod_status_phase{namespace="observability"}

# Node memory available
node_memory_MemAvailable_bytes

# Node CPU usage (user mode only)
rate(node_cpu_seconds_total{mode="user"}[5m])
```

### Container Metrics

```promql
# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by container
container_memory_usage_bytes

# Network traffic
rate(container_network_transmit_bytes_total[5m])
rate(container_network_receive_bytes_total[5m])
```

### Application Metrics

```promql
# HTTP request rate
rate(http_requests_total[5m])

# Request duration (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])
```

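For quick checks from a terminal, any of the PromQL expressions above can be sent to Prometheus' instant-query endpoint. This is a minimal sketch, assuming Prometheus is reachable on `http://localhost:9090` (for instance via `kubectl port-forward -n observability svc/prometheus 9090:9090`). `prom_query`, `CURL`, and `PROMETHEUS_URL` are illustrative names.

```bash
CURL="${CURL:-curl}"
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
prom_query() {
  local query="${1:-}"
  if [ -z "$query" ]; then
    echo "usage: prom_query <promql>" >&2
    return 2
  fi
  "$CURL" -sS -G "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode "query=$query"
}
```

For example, `prom_query 'up == 0' | jq '.data.result[].metric'` lists the labels of every scrape target that is currently down.
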
## Trace Search (Tempo)

In Grafana Explore with Tempo datasource:

- **Search by service**: Select from dropdown
- **Search by duration**: "> 1s", "< 100ms"
- **Search by tag**: `http.status_code=500`
- **TraceQL**: `{span.http.method="POST" && span.http.status_code>=400}`

## Correlations

### From Logs to Traces

1. View logs in Loki
2. Click on a log line with a trace ID
3. Click the "Tempo" link
4. Trace opens in Tempo

### From Traces to Logs

1. View trace in Tempo
2. Click on a span
3. Click "Logs for this span"
4. Related logs appear

### From Traces to Metrics

1. View trace in Tempo
2. Service graph shows metrics
3. Click service to see metrics

## Demo Application

Deploy the demo app to test the stack:

```bash
kubectl apply -f demo-app.yaml

# Wait for it to start
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s

# Test it
kubectl port-forward -n observability svc/demo-app 8080:8080

# In another terminal
curl http://localhost:8080/
curl http://localhost:8080/items
curl http://localhost:8080/item/0
curl http://localhost:8080/slow
curl http://localhost:8080/error
```

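A handful of single requests produces very little data to look at. The sketch below loops over the same endpoints so that logs, traces, and metrics have something to show. It assumes the port-forward above is still running. `generate_traffic`, `CURL`, and `DEMO_URL` are illustrative names.

```bash
CURL="${CURL:-curl}"
DEMO_URL="${DEMO_URL:-http://localhost:8080}"

# Hit each demo endpoint $1 times (default 10) and print the HTTP status
# of every request.
generate_traffic() {
  local rounds="${1:-10}" i path
  for ((i = 1; i <= rounds; i++)); do
    for path in / /items /item/0 /slow /error; do
      # -o /dev/null discards the body; -w prints the status code instead.
      "$CURL" -s -o /dev/null -w "%{http_code} $path\n" "$DEMO_URL$path" || true
    done
  done
}
```

`generate_traffic 20` sends 100 requests in total, including calls to `/slow` and `/error`.
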
Now view in Grafana:

- **Logs**: Search `{app="demo-app"}` in Loki
- **Traces**: Search "demo-app" service in Tempo
- **Metrics**: Query `flask_http_request_total` in Prometheus

## Storage Management

### Check Disk Usage

```bash
# On hetzner-2 node
df -h /mnt/local-ssd/

# Detailed usage
du -sh /mnt/local-ssd/*
```

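The manual check can be turned into a small threshold test, for example to run from cron on the node. This is a sketch only: the 80% default is an arbitrary assumption, and `disk_usage_percent` and `check_disk` are illustrative names.

```bash
# Print the used percentage (integer, no % sign) of the filesystem holding $1.
disk_usage_percent() {
  # -P forces the portable one-line-per-filesystem output format.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage of path $1 reaches threshold $2 (default 80).
# Returns 0 when below the threshold, 1 when at or above it, 2 on read errors.
check_disk() {
  local path="$1" threshold="${2:-80}" used
  used="$(disk_usage_percent "$path")"
  case "$used" in
    ''|*[!0-9]*) echo "could not read usage for $path" >&2; return 2 ;;
  esac
  if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: $path is ${used}% full (threshold ${threshold}%)"
    return 1
  fi
  echo "OK: $path is ${used}% full"
}
```

On the storage node this would be called as `check_disk /mnt/local-ssd/ 80`.
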
### Cleanup Old Data

Data is automatically deleted after 7 days. To manually adjust retention:

**Prometheus** (in `03-prometheus-config.yaml`):

```yaml
args:
  - '--storage.tsdb.retention.time=7d'
```

**Loki** (in `04-loki-config.yaml`):

```yaml
limits_config:
  retention_period: 168h  # 7 days
```

**Tempo** (in `05-tempo-config.yaml`):

```yaml
compactor:
  compaction:
    block_retention: 168h  # 7 days
```

## Troubleshooting

### No Logs Appearing

```bash
# Check Alloy is running
kubectl get pods -n observability -l app=alloy

# Check Alloy logs
kubectl logs -n observability -l app=alloy

# Check Loki
kubectl logs -n observability -l app=loki

# Test Loki endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
```

### No Traces Appearing

```bash
# Check Tempo is running
kubectl get pods -n observability -l app=tempo

# Check Tempo logs
kubectl logs -n observability -l app=tempo

# Test Tempo endpoint
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready

# Verify your app sends to correct endpoint
# Should be: tempo.observability.svc.cluster.local:4317 (gRPC)
# or: tempo.observability.svc.cluster.local:4318 (HTTP)
```

### Grafana Can't Connect to Datasources

```bash
# Check all services are running
kubectl get svc -n observability

# Test from Grafana pod
kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy

kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://loki.observability.svc.cluster.local:3100/ready

kubectl exec -it -n observability grafana-0 -- \
  wget -O- http://tempo.observability.svc.cluster.local:3200/ready
```

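The three connectivity tests above can be combined into one pass that reports each datasource separately. The sketch assumes the same pod name (`grafana-0`) and service URLs used above. `check_datasources` and the `KUBECTL`/`NAMESPACE` variables are illustrative names.

```bash
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-observability}"

# Probe each datasource health endpoint from inside the Grafana pod.
# Returns 0 if all endpoints respond, 1 if any of them fails.
check_datasources() {
  local failed=0 url
  for url in \
    http://prometheus.observability.svc.cluster.local:9090/-/healthy \
    http://loki.observability.svc.cluster.local:3100/ready \
    http://tempo.observability.svc.cluster.local:3200/ready; do
    if "$KUBECTL" exec -n "$NAMESPACE" grafana-0 -- wget -q -O- "$url" >/dev/null; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

A `FAIL` line narrows the problem to one backend, whose pod status and logs can then be checked with the commands in the sections above.
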
### High Resource Usage

```bash
# Check resource usage
kubectl top pods -n observability
kubectl top nodes

# Scale down if needed (for testing)
kubectl scale statefulset/prometheus -n observability --replicas=0
kubectl scale statefulset/loki -n observability --replicas=0
```

## Backup and Restore

### Backup Grafana Dashboards

```bash
# Make Grafana reachable locally
kubectl port-forward -n observability svc/grafana 3000:3000

# In another terminal: list all dashboards via the API
# (the URL is quoted so the shell does not treat "?" as a glob)
curl -H "Authorization: Bearer <API_KEY>" \
  "http://localhost:3000/api/search?type=dash-db" | jq .
```

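The search call above returns dashboard metadata only. To save the full dashboard JSON, each UID from that listing can be fetched through Grafana's `/api/dashboards/uid/<uid>` endpoint. The following is a sketch under stated assumptions: the API key is read from a `GRAFANA_API_KEY` environment variable, Grafana is port-forwarded to `localhost:3000`, and all function and variable names are illustrative.

```bash
CURL="${CURL:-curl}"
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Save one dashboard (by UID) as <out_dir>/<uid>.json.
export_dashboard() {
  local uid="$1" out_dir="$2"
  mkdir -p "$out_dir" || return 1
  "$CURL" -sS -H "Authorization: Bearer ${GRAFANA_API_KEY:-}" \
    "$GRAFANA_URL/api/dashboards/uid/$uid" > "$out_dir/$uid.json"
}

# Export every dashboard returned by the search API (requires jq).
export_all_dashboards() {
  local out_dir="${1:-dashboards-backup}" uid uids
  uids="$("$CURL" -sS -H "Authorization: Bearer ${GRAFANA_API_KEY:-}" \
    "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid')" || return 1
  # Dashboard UIDs contain no whitespace, so word splitting is safe here.
  for uid in $uids; do
    export_dashboard "$uid" "$out_dir" || return 1
  done
}
```

With the key exported, `export_all_dashboards ./dashboards-backup` writes one JSON file per dashboard.
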
### Backup Configurations

```bash
# Backup all ConfigMaps
kubectl get configmap -n observability -o yaml > configmaps-backup.yaml

# Backup specific config
kubectl get configmap prometheus-config -n observability -o yaml > prometheus-config-backup.yaml
```

## Useful Dashboards in Grafana

After login, import these dashboard IDs:

- **315**: Kubernetes cluster monitoring
- **7249**: Kubernetes cluster
- **13639**: Loki dashboard
- **12611**: Tempo dashboard
- **3662**: Prometheus 2.0 stats
- **1860**: Node Exporter Full

Go to: Dashboards → Import → Enter ID → Load

## Performance Tuning

### For Higher Load

Increase resources in respective YAML files:

```yaml
resources:
  requests:
    cpu: 1000m      # from 500m
    memory: 4Gi     # from 2Gi
  limits:
    cpu: 4000m      # from 2000m
    memory: 8Gi     # from 4Gi
```

### For Lower Resource Usage

- Increase scrape intervals in the Prometheus config (scrape less often)
- Reduce log retention periods
- Reduce trace sampling rate

## Security Checklist

- [ ] Change Grafana admin password
- [ ] Review RBAC permissions
- [ ] Enable audit logging
- [ ] Consider adding NetworkPolicies
- [ ] Review ingress TLS configuration
- [ ] Backup configurations regularly

## Getting Help

1. Check component logs first
2. Review configurations
3. Test network connectivity
4. Check resource availability
5. Review Grafana datasource settings