# State-of-the-Art Observability Stack for Kubernetes
This deployment provides a comprehensive, production-ready observability solution built on the Grafana LGTM stack (Loki, Grafana, Tempo, and Prometheus in place of Mimir), with unified collection through Grafana Alloy.
## Architecture Overview
### Core Components
1. **Grafana** (v11.4.0) - Unified visualization platform
   - Pre-configured datasources for Prometheus, Loki, and Tempo
   - Automatic correlation between logs, metrics, and traces
   - Modern UI with TraceQL editor support
2. **Prometheus** (v2.54.1) - Metrics collection and storage
   - 7-day retention
   - Comprehensive Kubernetes service discovery
   - Scrapes: API server, nodes, cAdvisor, pods, services
3. **Grafana Loki** (v3.2.1) - Log aggregation
   - 7-day retention with compaction
   - TSDB index for efficient queries
   - Automatic correlation with traces
4. **Grafana Tempo** (v2.6.1) - Distributed tracing
   - 7-day retention
   - Multiple protocol support: OTLP, Jaeger, Zipkin
   - Metrics generation from traces
   - Automatic correlation with logs and metrics
5. **Grafana Alloy** (v1.5.1) - Unified observability agent
   - Replaces Promtail, Vector, Fluent Bit
   - Collects logs from all pods
   - OTLP receiver for traces
   - Runs as a DaemonSet on all nodes
6. **kube-state-metrics** (v2.13.0) - Kubernetes object metrics
   - Deployment, Pod, Service, Node metrics
   - Essential for cluster monitoring
7. **node-exporter** (v1.8.2) - Node-level system metrics
   - CPU, memory, disk, network metrics
   - Runs on all nodes via a DaemonSet
## Key Features
- **Unified Observability**: Logs, metrics, and traces in one platform
- **Automatic Correlation**: Click from logs to traces to metrics seamlessly
- **7-Day Retention**: Optimized for single-node cluster
- **Local SSD Storage**: Fast, persistent storage on hetzner-2 node
- **OTLP Support**: Modern OpenTelemetry protocol support
- **TLS Enabled**: Secure access via NGINX Ingress with Let's Encrypt
- **Low Resource Footprint**: Optimized for single-node deployment
## Storage Layout
All data stored on local SSD at `/mnt/local-ssd/`:
```
/mnt/local-ssd/
├── prometheus/ (50Gi) - Metrics data
├── loki/ (100Gi) - Log data
├── tempo/ (50Gi) - Trace data
└── grafana/ (10Gi) - Dashboards and settings
```
## Deployment Instructions
### Prerequisites
1. Kubernetes cluster with NGINX Ingress Controller
2. cert-manager installed with Let's Encrypt issuer
3. DNS record: `grafana.betelgeusebytes.io` → your cluster IP
4. Node labeled: `kubernetes.io/hostname=hetzner-2`
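A quick pre-flight check for these prerequisites might look like this (the `ingress-nginx` namespace is an assumption; adjust it to wherever your controller runs):
```bash
# Verify ingress controller, cert-manager issuer, node label, and DNS in one pass
kubectl get pods -n ingress-nginx
kubectl get clusterissuer
kubectl get node -l kubernetes.io/hostname=hetzner-2
dig +short grafana.betelgeusebytes.io
```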
### Step 0: Remove Existing Monitoring (If Applicable)
If you have an existing monitoring stack (Prometheus, Grafana, Loki, Fluent Bit, etc.), remove it first to avoid conflicts:
```bash
./remove-old-monitoring.sh
```
This interactive script will help you safely remove:
- Existing Prometheus/Grafana/Loki/Tempo deployments
- Helm releases for monitoring components
- Fluent Bit, Vector, or other log collectors
- Related ConfigMaps, PVCs, and RBAC resources
- Prometheus Operator CRDs (if applicable)
**Note**: The main deployment script (`deploy.sh`) will also prompt you to run cleanup if needed.
### Step 1: Prepare Storage Directories
SSH into the hetzner-2 node and create directories:
```bash
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus  # Prometheus runs as nobody (65534)
sudo chown -R 10001:10001 /mnt/local-ssd/loki        # Loki container user
sudo chown -R root:root /mnt/local-ssd/tempo         # Tempo runs as root in this deployment
sudo chown -R 472:472 /mnt/local-ssd/grafana         # Grafana container user
```
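To double-check before deploying, `ls -ln` prints numeric UIDs so you can compare them against the values above:
```bash
# Each directory should show the UID:GID it was chowned to
ls -ln /mnt/local-ssd
```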
### Step 2: Deploy the Stack
```bash
chmod +x deploy.sh
./deploy.sh
```
Or deploy manually:
```bash
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-persistent-volumes.yaml
kubectl apply -f 02-persistent-volume-claims.yaml
kubectl apply -f 03-prometheus-config.yaml
kubectl apply -f 04-loki-config.yaml
kubectl apply -f 05-tempo-config.yaml
kubectl apply -f 06-alloy-config.yaml
kubectl apply -f 07-grafana-datasources.yaml
kubectl apply -f 08-rbac.yaml
kubectl apply -f 10-prometheus.yaml
kubectl apply -f 11-loki.yaml
kubectl apply -f 12-tempo.yaml
kubectl apply -f 13-grafana.yaml
kubectl apply -f 14-alloy.yaml
kubectl apply -f 15-kube-state-metrics.yaml
kubectl apply -f 16-node-exporter.yaml
kubectl apply -f 20-grafana-ingress.yaml
```
### Step 3: Verify Deployment
```bash
kubectl get pods -n observability
kubectl get pv
kubectl get pvc -n observability
```
All pods should be in `Running` state:
- grafana-0
- loki-0
- prometheus-0
- tempo-0
- alloy-xxxxx (one per node)
- kube-state-metrics-xxxxx
- node-exporter-xxxxx (one per node)
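Instead of polling manually, you can block until everything is Ready:
```bash
# Wait up to 5 minutes for all pods in the namespace to become Ready
kubectl wait --for=condition=Ready pods --all -n observability --timeout=300s
```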
### Step 4: Access Grafana
1. Open: https://grafana.betelgeusebytes.io
2. Login with default credentials:
   - Username: `admin`
   - Password: `admin`
3. **IMPORTANT**: Change the password on first login!
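If DNS or the ingress is not ready yet, a port-forward gives immediate access:
```bash
# Temporary local access that bypasses the ingress
kubectl port-forward -n observability svc/grafana 3000:3000
# then open http://localhost:3000
```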
## Using the Stack
### Exploring Logs (Loki)
1. In Grafana, go to **Explore**
2. Select **Loki** datasource
3. Example queries:
```
{namespace="observability"}
{namespace="observability", app="prometheus"}
{namespace="default"} |= "error"
{pod="my-app-xxx"} | json | level="error"
```
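To confirm logs are being ingested independently of Grafana, you can query Loki's HTTP API directly using the same debug-pod pattern as the Troubleshooting section below:
```bash
# Run a LogQL query against Loki's query_range endpoint from inside the cluster
kubectl run -it --rm lokitest --image=curlimages/curl --restart=Never -- \
  curl -G 'http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="observability"}' \
  --data-urlencode 'limit=5'
```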
### Exploring Metrics (Prometheus)
1. In Grafana, go to **Explore**
2. Select **Prometheus** datasource
3. Example queries:
```
up
node_memory_MemAvailable_bytes
rate(container_cpu_usage_seconds_total[5m])
kube_pod_status_phase{namespace="observability"}
```
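The same sanity check works against Prometheus's query API:
```bash
# Evaluate a PromQL expression via the HTTP API
kubectl run -it --rm promtest --image=curlimages/curl --restart=Never -- \
  curl -G 'http://prometheus.observability.svc.cluster.local:9090/api/v1/query' \
  --data-urlencode 'query=up'
```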
### Exploring Traces (Tempo)
1. In Grafana, go to **Explore**
2. Select **Tempo** datasource
3. Search by:
   - Service name
   - Duration
   - Tags
4. Click on a trace to see detailed span timeline
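If no traces show up, first verify Tempo itself is reachable:
```bash
# Tempo's query frontend exposes a readiness endpoint on its HTTP port
kubectl run -it --rm tempotest --image=curlimages/curl --restart=Never -- \
  curl http://tempo.observability.svc.cluster.local:3200/ready
```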
### Correlations
The stack automatically correlates:
- **Logs → Traces**: Click traceID in logs to view trace
- **Traces → Logs**: Click on trace to see related logs
- **Traces → Metrics**: Tempo generates metrics from traces
### Instrumenting Your Applications
#### For Logs
Logs are automatically collected from all pods by Alloy. Emit structured JSON logs:
```json
{"level":"info","message":"Request processed","duration_ms":42}
```
#### For Traces
Send traces to Tempo using OTLP:
```python
# Python with OpenTelemetry
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        # Tempo's in-cluster OTLP gRPC endpoint (http:// scheme implies plaintext)
        OTLPSpanExporter(endpoint="http://tempo.observability.svc.cluster.local:4317")
    )
)
trace.set_tracer_provider(provider)
```
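For applications already instrumented with an OpenTelemetry SDK or auto-instrumentation, the standard OTLP environment variables achieve the same result without code changes (the service name below is a placeholder):
```bash
# Point any OTel-instrumented process at Tempo's in-cluster OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo.observability.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="my-app"  # placeholder service name
```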
#### For Metrics
Expose metrics in Prometheus format and add annotations to your pod:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```
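Once the annotated pod is running, confirm Prometheus picked it up by listing its active targets:
```bash
# The pod should appear under activeTargets with health "up"
kubectl run -it --rm targets --image=curlimages/curl --restart=Never -- \
  curl -s 'http://prometheus.observability.svc.cluster.local:9090/api/v1/targets?state=active'
```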
## Monitoring Endpoints
Internal service endpoints:
- **Prometheus**: `http://prometheus.observability.svc.cluster.local:9090`
- **Loki**: `http://loki.observability.svc.cluster.local:3100`
- **Tempo**:
  - HTTP: `http://tempo.observability.svc.cluster.local:3200`
  - OTLP gRPC: `tempo.observability.svc.cluster.local:4317`
  - OTLP HTTP: `tempo.observability.svc.cluster.local:4318`
- **Grafana**: `http://grafana.observability.svc.cluster.local:3000`
## Troubleshooting
### Check Pod Status
```bash
kubectl get pods -n observability
kubectl describe pod <pod-name> -n observability
```
### View Logs
```bash
kubectl logs -n observability -l app=grafana
kubectl logs -n observability -l app=prometheus
kubectl logs -n observability -l app=loki
kubectl logs -n observability -l app=tempo
kubectl logs -n observability -l app=alloy
```
### Check Storage
```bash
kubectl get pv
kubectl get pvc -n observability
```
### Test Connectivity
```bash
# From inside cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
```
### Common Issues
**Pods stuck in Pending**
- Check if storage directories exist on hetzner-2
- Verify PV/PVC bindings: `kubectl describe pvc -n observability`
**Loki won't start**
- Check permissions on `/mnt/local-ssd/loki` (should be 10001:10001)
- View logs: `kubectl logs -n observability loki-0`
**No logs appearing**
- Check Alloy pods are running: `kubectl get pods -n observability -l app=alloy`
- View Alloy logs: `kubectl logs -n observability -l app=alloy`
**Grafana can't reach datasources**
- Verify services: `kubectl get svc -n observability`
- Check datasource URLs in Grafana UI
## Updating Configuration
### Update Prometheus Scrape Config
```bash
kubectl edit configmap prometheus-config -n observability
kubectl rollout restart statefulset/prometheus -n observability
```
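If Prometheus was started with `--web.enable-lifecycle` (an assumption; check the StatefulSet args), a hot reload avoids the restart entirely:
```bash
# POST to the reload endpoint instead of restarting the pod
kubectl run -it --rm reload --image=curlimages/curl --restart=Never -- \
  curl -X POST http://prometheus.observability.svc.cluster.local:9090/-/reload
```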
### Update Loki Retention
```bash
kubectl edit configmap loki-config -n observability
kubectl rollout restart statefulset/loki -n observability
```
### Update Alloy Collection Rules
```bash
kubectl edit configmap alloy-config -n observability
kubectl rollout restart daemonset/alloy -n observability
```
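In each case, watch the rollout finish before assuming the new configuration is live:
```bash
# Blocks until every Alloy pod has been recreated with the new ConfigMap
kubectl rollout status daemonset/alloy -n observability
```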
## Resource Usage
Expected resource consumption:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|-----------|-------------|-----------|----------------|--------------|
| Prometheus | 500m | 2000m | 2Gi | 4Gi |
| Loki | 500m | 2000m | 1Gi | 2Gi |
| Tempo | 500m | 2000m | 1Gi | 2Gi |
| Grafana | 250m | 1000m | 512Mi | 1Gi |
| Alloy (per node) | 100m | 500m | 256Mi | 512Mi |
| kube-state-metrics | 100m | 200m | 128Mi | 256Mi |
| node-exporter (per node) | 100m | 200m | 128Mi | 256Mi |
**Total (single node)**: ~2.1 CPU cores and ~5Gi memory requested; limits sum to ~7.9 cores and ~10Gi
## Security Considerations
1. **Change default Grafana password** immediately after deployment
2. Consider adding authentication for internal services if exposed
3. Review and restrict RBAC permissions as needed
4. Enable audit logging in Loki for sensitive namespaces
5. Consider adding NetworkPolicies to restrict traffic
## Documentation
This deployment includes the following documentation and helper files:
- **README.md**: Complete deployment and configuration guide (this file)
- **MONITORING-GUIDE.md**: URLs, access, and how to monitor new applications
- **DEPLOYMENT-CHECKLIST.md**: Step-by-step deployment checklist
- **QUICKREF.md**: Quick reference for daily operations
- **demo-app.yaml**: Example fully instrumented application
- **deploy.sh**: Automated deployment script
- **status.sh**: Health check script
- **cleanup.sh**: Complete stack removal
- **remove-old-monitoring.sh**: Remove existing monitoring before deployment
- **21-optional-ingresses.yaml**: Optional external access to Prometheus/Loki/Tempo
## Future Enhancements
- Add Alertmanager for alerting
- Configure Grafana SMTP for email notifications
- Add custom dashboards for your applications
- Implement Grafana RBAC for team access
- Consider Mimir for long-term metrics storage
- Add backup/restore procedures
## Support
For issues or questions:
1. Check pod logs first
2. Review Grafana datasource configuration
3. Verify network connectivity between components
4. Check storage and resource availability
## Version Information
- Grafana: 11.4.0
- Prometheus: 2.54.1
- Loki: 3.2.1
- Tempo: 2.6.1
- Alloy: 1.5.1
- kube-state-metrics: 2.13.0
- node-exporter: 1.8.2
Last updated: January 2025