betelgeusebytes/k8s/observability-stack/MONITORING-GUIDE.md


# Access URLs & Monitoring New Applications Guide
## 🌐 Access URLs
### Required (Already Configured)
**Grafana - Main Dashboard**
- **URL**: https://grafana.betelgeusebytes.io
- **DNS Required**: Yes - `grafana.betelgeusebytes.io` → your cluster IP
- **Login**: admin / admin (change on first login!)
- **Purpose**: Unified interface for logs, metrics, and traces
- **Ingress**: Already included in deployment (20-grafana-ingress.yaml)
### Optional (Direct Component Access)
You can optionally expose these components directly:
**Prometheus - Metrics UI**
- **URL**: https://prometheus.betelgeusebytes.io
- **DNS Required**: Yes - `prometheus.betelgeusebytes.io` → your cluster IP
- **Purpose**: Direct access to Prometheus UI, query metrics, check targets
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
- **Use Case**: Debugging metric collection, advanced PromQL queries
**Loki - Logs API**
- **URL**: https://loki.betelgeusebytes.io
- **DNS Required**: Yes - `loki.betelgeusebytes.io` → your cluster IP
- **Purpose**: Direct API access for log queries
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
- **Use Case**: External log forwarding, API integration
**Tempo - Traces API**
- **URL**: https://tempo.betelgeusebytes.io
- **DNS Required**: Yes - `tempo.betelgeusebytes.io` → your cluster IP
- **Purpose**: Direct API access for trace queries
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
- **Use Case**: External trace ingestion, API integration
### Internal Only (No DNS Required)
These are ClusterIP services accessible only from within the cluster:
```
http://prometheus.observability.svc.cluster.local:9090
http://loki.observability.svc.cluster.local:3100
http://tempo.observability.svc.cluster.local:3200
http://tempo.observability.svc.cluster.local:4317 # OTLP gRPC
http://tempo.observability.svc.cluster.local:4318 # OTLP HTTP
```
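To confirm these in-cluster endpoints respond, you can curl them from a throwaway pod. The health paths used here are the components' standard readiness endpoints (`/-/healthy` for Prometheus, `/ready` for Loki and Tempo); adjust if your versions differ:

```shell
# Run a one-off curl pod and probe each component's health endpoint
kubectl run healthcheck --rm -it --restart=Never --image=curlimages/curl -- sh -c '
  curl -s http://prometheus.observability.svc.cluster.local:9090/-/healthy
  curl -s http://loki.observability.svc.cluster.local:3100/ready
  curl -s http://tempo.observability.svc.cluster.local:3200/ready
'
```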
## 🎯 Recommendation
**For most users**: Just use Grafana (grafana.betelgeusebytes.io)
- Grafana provides unified access to all components
- No need to expose Prometheus, Loki, or Tempo directly
- Simpler DNS configuration (only one subdomain)
**For power users**: Add optional ingresses
- Direct Prometheus access is useful for debugging
- Helps verify targets and scrape configs
- Deploy with: `kubectl apply -f 21-optional-ingresses.yaml`
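Before pointing DNS at any of these hosts, a quick sanity check (assuming the ingresses live in the `observability` namespace):

```shell
# Confirm the ingress objects exist and note their assigned address
kubectl get ingress -n observability

# Confirm the DNS record resolves to your cluster IP
nslookup grafana.betelgeusebytes.io
```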
## 📊 Monitoring New Applications
### Automatic: Kubernetes Logs
**All pod logs are automatically collected!** No configuration needed.
Alloy runs as a DaemonSet and automatically:
1. Discovers all pods in the cluster
2. Reads logs from `/var/log/pods/`
3. Sends them to Loki with labels:
- `namespace`
- `pod`
- `container`
- `node`
- All pod labels
**View in Grafana:**
```logql
# All logs from your app
{namespace="your-namespace", pod=~"your-app.*"}
# Error logs only
{namespace="your-namespace"} |= "error"
# JSON logs parsed
{namespace="your-namespace"} | json | level="error"
```
**Best Practice for Logs:**
Emit structured JSON logs from your application:
```python
import json
import logging

# Python example
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

# Log as JSON
logger.info(json.dumps({
    "level": "info",
    "message": "User login successful",
    "user_id": "123",
    "ip": "1.2.3.4",
    "duration_ms": 42
}))
```
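Rather than calling `json.dumps` at every call site, you can push the serialization into a custom `logging.Formatter`. A minimal stdlib-only sketch (the `fields` key passed via `extra=` is a convention chosen here, not a logging built-in):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Merge any structured fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("my-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful", extra={"fields": {"user_id": "123"}})
```

Call sites stay plain `logger.info(...)`, and every line that reaches Loki is parseable with `| json`.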
### Manual: Application Metrics
#### Step 1: Expose Metrics Endpoint
Your application needs to expose metrics at `/metrics` in Prometheus format.
**Python (Flask) Example:**
```python
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)

# The /metrics endpoint is now available
# Automatic metrics: request count, duration, etc.
```
**Python (FastAPI) Example:**
```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)
# The /metrics endpoint is now available
```
**Go Example:**
```go
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
```
**Node.js Example:**
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create default metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
#### Step 2: Add Prometheus Annotations to Your Deployment
Add these annotations to your pod template:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"   # Enable scraping
        prometheus.io/port: "8080"     # Port where metrics are exposed
        prometheus.io/path: "/metrics" # Path to metrics (optional, /metrics is the default)
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - name: http
              containerPort: 8080
```
#### Step 3: Verify Metrics Collection
**Check in Prometheus:**
1. Access Prometheus UI (if exposed): https://prometheus.betelgeusebytes.io
2. Go to Status → Targets
3. Look for your pod under "kubernetes-pods"
4. Should show as "UP"
**Or via Grafana:**
1. Go to Explore → Prometheus
2. Query: `up{pod=~"my-app.*"}`
3. Should return value=1
**Query your metrics:**
```promql
# Request rate
rate(http_requests_total{namespace="my-namespace"}[5m])
# Request duration 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{namespace="my-namespace", status=~"5.."}[5m])
```
### Manual: Application Traces
#### Step 1: Add OpenTelemetry to Your Application
**Python Example:**
```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource

# Configure resource
resource = Resource.create({"service.name": "my-app"})

# Set up the tracer
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="http://tempo.observability.svc.cluster.local:4317",
            insecure=True
        )
    )
)
trace.set_tracer_provider(trace_provider)

# Auto-instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Manual spans
tracer = trace.get_tracer(__name__)

@app.route('/api/data')
def get_data():
    with tracer.start_as_current_span("fetch_data") as span:
        # Your code here
        span.set_attribute("rows", 100)
        return {"data": "..."}
```
**Install dependencies:**
```bash
pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-flask \
  opentelemetry-exporter-otlp-proto-grpc
```
**Go Example:**
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

exporter, _ := otlptracegrpc.New(
    context.Background(),
    otlptracegrpc.WithEndpoint("tempo.observability.svc.cluster.local:4317"),
    otlptracegrpc.WithInsecure(),
)
tp := trace.NewTracerProvider(
    trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
```
**Node.js Example:**
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({
  url: 'http://tempo.observability.svc.cluster.local:4317'
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```
#### Step 2: Add Trace IDs to Logs (Optional but Recommended)
This enables clicking from logs to traces in Grafana!
**Python Example:**
```python
import json
from opentelemetry import trace

def log_with_trace(message):
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, '032x')
    log_entry = {
        "message": message,
        "trace_id": trace_id,
        "level": "info"
    }
    print(json.dumps(log_entry))
```
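For the click-through to work, Grafana's Loki datasource also needs a derived field that extracts `trace_id` and links it to Tempo. A provisioning sketch — the `matcherRegex` and `datasourceUid: tempo` are assumptions to adapt to your setup:

```yaml
# Grafana datasource provisioning fragment (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki.observability.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":\s*"(\w+)"'
          url: '$${__value.raw}'       # $$ escapes $ in provisioning files
          datasourceUid: tempo         # uid of your Tempo datasource
```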
#### Step 3: Verify Traces
**In Grafana:**
1. Go to Explore → Tempo
2. Search for service: "my-app"
3. Click on a trace to view details
4. Click "Logs for this span" to see correlated logs
## 📋 Complete Example: Monitoring a New App
Here's a complete deployment with all monitoring configured:
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config
  namespace: my-namespace
data:
  app.py: |
    from flask import Flask
    import logging
    import json
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.sdk.resources import Resource
    from prometheus_flask_exporter import PrometheusMetrics

    # Setup logging
    logging.basicConfig(level=logging.INFO, format='%(message)s')
    logger = logging.getLogger(__name__)

    # Setup tracing
    resource = Resource.create({"service.name": "my-app"})
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint="http://tempo.observability.svc.cluster.local:4317",
                insecure=True
            )
        )
    )
    trace.set_tracer_provider(trace_provider)

    app = Flask(__name__)

    # Setup metrics
    metrics = PrometheusMetrics(app)

    # Auto-instrument with traces
    FlaskInstrumentor().instrument_app(app)

    @app.route('/')
    def index():
        span = trace.get_current_span()
        trace_id = format(span.get_span_context().trace_id, '032x')
        logger.info(json.dumps({
            "level": "info",
            "message": "Request received",
            "trace_id": trace_id,
            "endpoint": "/"
        }))
        return {"status": "ok", "trace_id": trace_id}

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Enable Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: my-app
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install flask opentelemetry-api opentelemetry-sdk \
                opentelemetry-instrumentation-flask \
                opentelemetry-exporter-otlp-proto-grpc \
                prometheus-flask-exporter && \
              python /app/app.py
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:
            - name: app-code
              mountPath: /app
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: app-code
          configMap:
            name: my-app-config
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: my-app
```
## 🔍 Verification Checklist
After deploying a new app with monitoring:
### Logs ✓ (Automatic)
```bash
# Check logs appear in Grafana
# Explore → Loki → {namespace="my-namespace", pod=~"my-app.*"}
```
### Metrics ✓ (If configured)
```bash
# Check Prometheus is scraping
# Explore → Prometheus → up{pod=~"my-app.*"}
# Should return 1
# Check your custom metrics
# Explore → Prometheus → flask_http_request_total{namespace="my-namespace"}
```
### Traces ✓ (If configured)
```bash
# Check traces appear in Tempo
# Explore → Tempo → Search for service "my-app"
# Should see traces
# Verify log-trace correlation
# Click on a log line with trace_id → should jump to trace
```
## 🎓 Quick Start for Common Frameworks
### Python Flask/FastAPI
```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp prometheus-flask-exporter
opentelemetry-bootstrap -a install
```
```bash
# Set environment variables in your deployment:
export OTEL_SERVICE_NAME=my-app
export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.observability.svc.cluster.local:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# Then run with auto-instrumentation:
opentelemetry-instrument python app.py
```
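In a Kubernetes Deployment, those variables go under the container's `env`. A hypothetical fragment (adjust the image and command to your build):

```yaml
containers:
  - name: my-app
    image: my-app:latest
    command: ["opentelemetry-instrument", "python", "app.py"]
    env:
      - name: OTEL_SERVICE_NAME
        value: my-app
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://tempo.observability.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
```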
### Go
```bash
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```
### Node.js
```bash
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc prom-client
```
## 📚 Summary
| Component | Automatic? | Configuration Needed |
|-----------|-----------|---------------------|
| **Logs** | ✅ Yes | None - just deploy your app |
| **Metrics** | ❌ No | Add /metrics endpoint + annotations |
| **Traces** | ❌ No | Add OpenTelemetry SDK + configure endpoint |
**Recommended Approach:**
1. **Start simple**: Deploy app, logs work automatically
2. **Add metrics**: Expose /metrics, add annotations
3. **Add traces**: Instrument with OpenTelemetry
4. **Correlate**: Add trace IDs to logs for full observability
## 🔗 Useful Links
- OpenTelemetry Python: https://opentelemetry.io/docs/instrumentation/python/
- OpenTelemetry Go: https://opentelemetry.io/docs/instrumentation/go/
- OpenTelemetry Node.js: https://opentelemetry.io/docs/instrumentation/js/
- Prometheus Client Libraries: https://prometheus.io/docs/instrumenting/clientlibs/
- Grafana Docs: https://grafana.com/docs/
## 🆘 Troubleshooting
**Logs not appearing:**
- Check Alloy is running: `kubectl get pods -n observability -l app=alloy`
- Check pod logs are being written to stdout/stderr
- View in real-time: `kubectl logs -f <pod-name> -n <namespace>`
**Metrics not being scraped:**
- Verify annotations are present: `kubectl get pod <pod> -o yaml | grep prometheus`
- Check /metrics endpoint: `kubectl port-forward pod/<pod> 8080:8080` then `curl localhost:8080/metrics`
- Check Prometheus targets: https://prometheus.betelgeusebytes.io/targets
**Traces not appearing:**
- Verify endpoint: `tempo.observability.svc.cluster.local:4317`
- Check Tempo logs: `kubectl logs -n observability tempo-0`
- Verify OTLP exporter is configured correctly in your app
- Check network policies allow traffic to observability namespace
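One quick way to rule out connectivity problems is to POST an empty OTLP/HTTP payload to Tempo's port 4318 from a throwaway pod; `/v1/traces` is the standard OTLP/HTTP path, and getting any HTTP status back (even a 400) proves the endpoint is reachable:

```shell
# Prints the HTTP status code Tempo returns for an empty OTLP payload
kubectl run otlp-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
    -X POST -H 'Content-Type: application/json' -d '{}' \
    http://tempo.observability.svc.cluster.local:4318/v1/traces
```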