# State-of-the-Art Observability Stack for Kubernetes

This deployment provides a comprehensive, production-ready observability solution built on the Grafana LGTM stack (Loki, Grafana, Tempo, and Prometheus in place of Mimir), with unified collection through Grafana Alloy.

## Architecture Overview

### Core Components

1. **Grafana** (v11.4.0) - Unified visualization platform
   - Pre-configured datasources for Prometheus, Loki, and Tempo
   - Automatic correlation between logs, metrics, and traces
   - Modern UI with TraceQL editor support

2. **Prometheus** (v2.54.1) - Metrics collection and storage
   - 7-day retention
   - Comprehensive Kubernetes service discovery
   - Scrapes the API server, nodes, cAdvisor, pods, and services

3. **Grafana Loki** (v3.2.1) - Log aggregation
   - 7-day retention with compaction
   - TSDB index for efficient queries
   - Automatic correlation with traces

4. **Grafana Tempo** (v2.6.1) - Distributed tracing
   - 7-day retention
   - Multiple protocol support: OTLP, Jaeger, Zipkin
   - Metrics generation from traces
   - Automatic correlation with logs and metrics

5. **Grafana Alloy** (v1.5.1) - Unified observability agent
   - Replaces Promtail, Vector, and Fluent Bit
   - Collects logs from all pods
   - OTLP receiver for traces
   - Runs as a DaemonSet on all nodes

6. **kube-state-metrics** (v2.13.0) - Kubernetes object metrics
   - Deployment, Pod, Service, and Node metrics
   - Essential for cluster monitoring

7. **node-exporter** (v1.8.2) - Node-level system metrics
   - CPU, memory, disk, and network metrics
   - Runs on all nodes via a DaemonSet

## Key Features

- **Unified Observability**: Logs, metrics, and traces in one platform
- **Automatic Correlation**: Click from logs to traces to metrics seamlessly
- **7-Day Retention**: Optimized for a single-node cluster
- **Local SSD Storage**: Fast, persistent storage on the hetzner-2 node
- **OTLP Support**: Modern OpenTelemetry protocol support
- **TLS Enabled**: Secure access via NGINX Ingress with Let's Encrypt
- **Low Resource Footprint**: Optimized for single-node deployment

## Storage Layout

All data is stored on the local SSD at `/mnt/local-ssd/`:

```
/mnt/local-ssd/
├── prometheus/ (50Gi) - Metrics data
├── loki/ (100Gi) - Log data
├── tempo/ (50Gi) - Trace data
└── grafana/ (10Gi) - Dashboards and settings
```

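These directories are backed by local PersistentVolumes pinned to hetzner-2 (see `01-persistent-volumes.yaml`). As a rough sketch of what one of those PVs looks like, with the volume name and `local-ssd` storage class name used purely for illustration (the shipped manifest may differ):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv           # illustrative name
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd   # assumed storage class name
  local:
    path: /mnt/local-ssd/prometheus
  nodeAffinity:                 # pins the volume to the node that owns the SSD
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["hetzner-2"]
```
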
## Deployment Instructions

### Prerequisites

1. Kubernetes cluster with the NGINX Ingress Controller
2. cert-manager installed with a Let's Encrypt issuer
3. DNS record: `grafana.betelgeusebytes.io` → your cluster IP
4. A node named `hetzner-2` (the PersistentVolumes select it via `kubernetes.io/hostname=hetzner-2`)

You can verify these with the commands sketched below.

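A quick pre-flight check, assuming NGINX Ingress lives in the `ingress-nginx` namespace and cert-manager in `cert-manager` (adjust the namespaces to your installation):

```bash
# Ingress controller and cert-manager pods are up
kubectl get pods -n ingress-nginx
kubectl get pods -n cert-manager

# A Let's Encrypt ClusterIssuer exists and is Ready
kubectl get clusterissuers

# The target node exists with the expected hostname label
kubectl get node hetzner-2 --show-labels | grep kubernetes.io/hostname

# DNS resolves to the cluster's public IP
dig +short grafana.betelgeusebytes.io
```
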
### Step 0: Remove Existing Monitoring (If Applicable)

If you have an existing monitoring stack (Prometheus, Grafana, Loki, Fluent Bit, etc.), remove it first to avoid conflicts:

```bash
./remove-old-monitoring.sh
```

This interactive script will help you safely remove:
- Existing Prometheus/Grafana/Loki/Tempo deployments
- Helm releases for monitoring components
- Fluent Bit, Vector, or other log collectors
- Related ConfigMaps, PVCs, and RBAC resources
- Prometheus Operator CRDs (if applicable)

**Note**: The main deployment script (`deploy.sh`) will also prompt you to run cleanup if needed.

### Step 1: Prepare Storage Directories

SSH into the hetzner-2 node and create the data directories with the ownership each component expects:

```bash
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus   # Prometheus runs as nobody (65534)
sudo chown -R 10001:10001 /mnt/local-ssd/loki         # Loki runs as UID 10001
sudo chown -R root:root /mnt/local-ssd/tempo
sudo chown -R 472:472 /mnt/local-ssd/grafana          # Grafana runs as UID 472
```

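As a quick sanity check on the node before deploying, the numeric owners reported by `ls -ln` should match the UIDs above:

```bash
ls -ln /mnt/local-ssd
```
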
### Step 2: Deploy the Stack

```bash
chmod +x deploy.sh
./deploy.sh
```

Or deploy manually:

```bash
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
```

### Step 3: Verify Deployment

```bash
kubectl get pods -n observability
kubectl get pv
kubectl get pvc -n observability
```

All pods should be in the `Running` state:
- grafana-0
- loki-0
- prometheus-0
- tempo-0
- alloy-xxxxx (one per node)
- kube-state-metrics-xxxxx
- node-exporter-xxxxx (one per node)

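If you want to block until everything is ready (for example in a script), a sketch using `kubectl rollout status`. The workload names are inferred from the pod names above; adjust if your manifests differ:

```bash
# StatefulSets: prometheus, loki, tempo, grafana
for sts in prometheus loki tempo grafana; do
  kubectl rollout status statefulset/"$sts" -n observability --timeout=300s
done

# DaemonSets and the kube-state-metrics Deployment (assumed resource kinds)
kubectl rollout status daemonset/alloy -n observability --timeout=300s
kubectl rollout status daemonset/node-exporter -n observability --timeout=300s
kubectl rollout status deployment/kube-state-metrics -n observability --timeout=300s
```
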
### Step 4: Access Grafana

1. Open: https://grafana.betelgeusebytes.io
2. Log in with the default credentials:
   - Username: `admin`
   - Password: `admin`
3. **IMPORTANT**: Change the password on first login! (A non-interactive reset is sketched below.)

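If you ever need to reset the admin password without the UI, one option is Grafana's built-in CLI inside the pod (assuming the pod is named `grafana-0` as listed in Step 3; depending on the image you may need to pass `--homepath /usr/share/grafana`):

```bash
# Resets the built-in admin user's password in place
kubectl exec -n observability grafana-0 -- grafana cli admin reset-admin-password 'NewStrongPassword'
```
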
## Using the Stack

### Exploring Logs (Loki)

1. In Grafana, go to **Explore**
2. Select the **Loki** datasource
3. Example queries:

```
{namespace="observability"}
{namespace="observability", app="prometheus"}
{namespace="default"} |= "error"
{pod="my-app-xxx"} | json | level="error"
```

### Exploring Metrics (Prometheus)

1. In Grafana, go to **Explore**
2. Select the **Prometheus** datasource
3. Example queries:

```
up
node_memory_MemAvailable_bytes
rate(container_cpu_usage_seconds_total[5m])
kube_pod_status_phase{namespace="observability"}
```

### Exploring Traces (Tempo)

1. In Grafana, go to **Explore**
2. Select the **Tempo** datasource
3. Search by:
   - Service name
   - Duration
   - Tags
4. Click on a trace to see the detailed span timeline; the TraceQL examples below are another way to find traces

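Since the Tempo datasource ships with the TraceQL editor, you can also search with a query. Two illustrative examples (the service name is a placeholder):

```
{ resource.service.name = "my-app" }
{ resource.service.name = "my-app" && duration > 200ms }
```
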
### Correlations

The stack automatically correlates:
- **Logs → Traces**: Click a traceID in the logs to view the trace
- **Traces → Logs**: Click on a trace to see related logs
- **Traces → Metrics**: Tempo generates metrics from traces

The logs-to-traces hop is wired up in the provisioned datasources; a sketch of that wiring follows.

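For reference, this is roughly what the Loki datasource's derived-field configuration looks like in Grafana provisioning (`07-grafana-datasources.yaml` in this repo). The regex, field name, and `tempo` datasource UID here are illustrative assumptions; the shipped file may differ:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki.observability.svc.cluster.local:3100
    jsonData:
      derivedFields:
        # Turn a trace_id field in JSON logs into a clickable link to Tempo
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'        # $$ escapes the $ in provisioning files
          datasourceUid: tempo          # assumed UID of the Tempo datasource
```
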
### Instrumenting Your Applications

#### For Logs
Logs are automatically collected from all pods by Alloy, so your application needs no extra agent. Emit structured JSON logs so fields can be parsed with LogQL:

```json
{"level":"info","message":"Request processed","duration_ms":42}
```

#### For Traces
Send traces to Tempo using OTLP:

```python
# Python with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans in batches to Tempo's OTLP gRPC endpoint
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://tempo.observability.svc.cluster.local:4317")
    )
)
trace.set_tracer_provider(provider)
```

#### For Metrics
Expose metrics in Prometheus format from your container and add scrape annotations to your pod (a minimal exporter is sketched after this block):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

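On the application side, any Prometheus client library works; a minimal sketch in Python using `prometheus_client` (metric names are illustrative):

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics; name them after your own application
REQUESTS = Counter("app_requests_total", "Total requests handled", ["route"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

if __name__ == "__main__":
    # Serves /metrics on the port referenced by the prometheus.io/port annotation
    start_http_server(8080)

    # ... your application loop; record metrics as work happens, e.g.:
    REQUESTS.labels(route="/healthz").inc()
    LATENCY.observe(0.042)
```
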
## Monitoring Endpoints

Internal service endpoints:

- **Prometheus**: `http://prometheus.observability.svc.cluster.local:9090`
- **Loki**: `http://loki.observability.svc.cluster.local:3100`
- **Tempo**:
  - HTTP: `http://tempo.observability.svc.cluster.local:3200`
  - OTLP gRPC: `tempo.observability.svc.cluster.local:4317`
  - OTLP HTTP: `tempo.observability.svc.cluster.local:4318`
- **Grafana**: `http://grafana.observability.svc.cluster.local:3000`

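A quick in-cluster readiness sweep of these endpoints (the health paths below are the standard ones for each component):

```bash
kubectl run -it --rm probe --image=curlimages/curl --restart=Never -- sh -c '
  curl -s -o /dev/null -w "prometheus %{http_code}\n" http://prometheus.observability.svc.cluster.local:9090/-/ready
  curl -s -o /dev/null -w "loki %{http_code}\n"       http://loki.observability.svc.cluster.local:3100/ready
  curl -s -o /dev/null -w "tempo %{http_code}\n"      http://tempo.observability.svc.cluster.local:3200/ready
  curl -s -o /dev/null -w "grafana %{http_code}\n"    http://grafana.observability.svc.cluster.local:3000/api/health
'
```
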
## Troubleshooting

### Check Pod Status
```bash
kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability
```

### View Logs
```bash
kubectl logs -n observability -l app=grafana
kubectl logs -n observability -l app=prometheus
kubectl logs -n observability -l app=loki
kubectl logs -n observability -l app=tempo
kubectl logs -n observability -l app=alloy
```

### Check Storage
```bash
kubectl get pv
kubectl get pvc -n observability
```

### Test Connectivity
```bash
# From inside the cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
```

### Common Issues

**Pods stuck in Pending**
- Check that the storage directories exist on hetzner-2
- Verify PV/PVC bindings: `kubectl describe pvc -n observability`

**Loki won't start**
- Check permissions on `/mnt/local-ssd/loki` (should be 10001:10001)
- View logs: `kubectl logs -n observability loki-0`

**No logs appearing**
- Check that the Alloy pods are running: `kubectl get pods -n observability -l app=alloy`
- View Alloy logs: `kubectl logs -n observability -l app=alloy`

**Grafana can't reach datasources**
- Verify services: `kubectl get svc -n observability`
- Check the datasource URLs in the Grafana UI

## Updating Configuration

### Update Prometheus Scrape Config
```bash
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability
```

An example of the kind of scrape job you might add is sketched below.

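For instance, to scrape a service that is not annotated for pod discovery, you could add a static job under `scrape_configs` (the job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: my-app
    metrics_path: /metrics
    static_configs:
      - targets: ["my-app.default.svc.cluster.local:8080"]
```
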
### Update Loki Retention
```bash
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability
```

### Update Alloy Collection Rules
```bash
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
```

## Resource Usage

Expected resource consumption:

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|-----------|-------------|-----------|----------------|--------------|
| Prometheus | 500m | 2000m | 2Gi | 4Gi |
| Loki | 500m | 2000m | 1Gi | 2Gi |
| Tempo | 500m | 2000m | 1Gi | 2Gi |
| Grafana | 250m | 1000m | 512Mi | 1Gi |
| Alloy (per node) | 100m | 500m | 256Mi | 512Mi |
| kube-state-metrics | 100m | 200m | 128Mi | 256Mi |
| node-exporter (per node) | 100m | 200m | 128Mi | 256Mi |

**Total requests (single node)**: ~2.1 CPU cores and ~5Gi memory; the memory limits add up to ~10Gi.

## Security Considerations

1. **Change the default Grafana password** immediately after deployment
2. Consider adding authentication for internal services if they are exposed
3. Review and restrict RBAC permissions as needed
4. Enable audit logging in Loki for sensitive namespaces
5. Consider adding NetworkPolicies to restrict traffic (see the sketch after this list)

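As a starting point for item 5, a hedged NetworkPolicy sketch that only admits traffic from within the `observability` namespace and from the ingress controller's namespace (assumed here to be `ingress-nginx`). Any application namespaces that push OTLP traces to Tempo would also need to be allowed:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: observability-restrict-ingress
  namespace: observability
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # Traffic between the observability components themselves
        - podSelector: {}
        # Traffic from the NGINX Ingress Controller (namespace name assumed)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```
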
## Documentation

This deployment includes comprehensive guides:

- **README.md**: Complete deployment and configuration guide (this file)
- **MONITORING-GUIDE.md**: URLs, access, and how to monitor new applications
- **DEPLOYMENT-CHECKLIST.md**: Step-by-step deployment checklist
- **QUICKREF.md**: Quick reference for daily operations
- **demo-app.yaml**: Example fully instrumented application
- **deploy.sh**: Automated deployment script
- **status.sh**: Health check script
- **cleanup.sh**: Complete stack removal
- **remove-old-monitoring.sh**: Remove existing monitoring before deployment
- **21-optional-ingresses.yaml**: Optional external access to Prometheus/Loki/Tempo

## Future Enhancements

- Add Alertmanager for alerting
- Configure Grafana SMTP for email notifications
- Add custom dashboards for your applications
- Implement Grafana RBAC for team access
- Consider Mimir for long-term metrics storage
- Add backup/restore procedures

## Support

For issues or questions:

1. Check pod logs first
2. Review the Grafana datasource configuration
3. Verify network connectivity between components
4. Check storage and resource availability

## Version Information

- Grafana: 11.4.0
- Prometheus: 2.54.1
- Loki: 3.2.1
- Tempo: 2.6.1
- Alloy: 1.5.1
- kube-state-metrics: 2.13.0
- node-exporter: 1.8.2

Last updated: January 2025