Access URLs & Monitoring New Applications Guide

🌐 Access URLs

Required (Already Configured)

Grafana - Main Dashboard

  • URL: https://grafana.betelgeusebytes.io
  • DNS Required: Yes - grafana.betelgeusebytes.io → your cluster IP
  • Login: admin / admin (change on first login!)
  • Purpose: Unified interface for logs, metrics, and traces
  • Ingress: Already included in deployment (20-grafana-ingress.yaml)

Optional (Direct Component Access)

You can optionally expose these components directly:

Prometheus - Metrics UI

  • URL: https://prometheus.betelgeusebytes.io
  • DNS Required: Yes - prometheus.betelgeusebytes.io → your cluster IP
  • Purpose: Direct access to Prometheus UI, query metrics, check targets
  • Deploy: kubectl apply -f 21-optional-ingresses.yaml
  • Use Case: Debugging metric collection, advanced PromQL queries

Loki - Logs API

  • URL: https://loki.betelgeusebytes.io
  • DNS Required: Yes - loki.betelgeusebytes.io → your cluster IP
  • Purpose: Direct API access for log queries
  • Deploy: kubectl apply -f 21-optional-ingresses.yaml
  • Use Case: External log forwarding, API integration

Tempo - Traces API

  • URL: https://tempo.betelgeusebytes.io
  • DNS Required: Yes - tempo.betelgeusebytes.io → your cluster IP
  • Purpose: Direct API access for trace queries
  • Deploy: kubectl apply -f 21-optional-ingresses.yaml
  • Use Case: External trace ingestion, API integration

Internal Only (No DNS Required)

These are ClusterIP services accessible only from within the cluster:

http://prometheus.observability.svc.cluster.local:9090
http://loki.observability.svc.cluster.local:3100
http://tempo.observability.svc.cluster.local:3200
http://tempo.observability.svc.cluster.local:4317  # OTLP gRPC
http://tempo.observability.svc.cluster.local:4318  # OTLP HTTP
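
For example, an in-cluster job can call Loki's query API directly. A minimal sketch in Python using only the standard library; the LogQL query and limit are illustrative:

import json
import urllib.parse
import urllib.request

# In-cluster Loki service (see the list above)
LOKI = "http://loki.observability.svc.cluster.local:3100"

# Query recent logs for a namespace (illustrative query)
params = urllib.parse.urlencode({
    "query": '{namespace="my-namespace"}',
    "limit": 10,
})
with urllib.request.urlopen(f"{LOKI}/loki/api/v1/query_range?{params}") as resp:
    result = json.load(resp)

# Each stream carries (timestamp, line) pairs
for stream in result["data"]["result"]:
    for timestamp, line in stream["values"]:
        print(timestamp, line)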

🎯 Recommendation

For most users: Just use Grafana (grafana.betelgeusebytes.io)

  • Grafana provides unified access to all components
  • No need to expose Prometheus, Loki, or Tempo directly
  • Simpler DNS configuration (only one subdomain)

For power users: Add optional ingresses

  • Direct Prometheus access is useful for debugging
  • Helps verify targets and scrape configs
  • Deploy with: kubectl apply -f 21-optional-ingresses.yaml

📊 Monitoring New Applications

Automatic: Kubernetes Logs

All pod logs are automatically collected! No configuration needed.

Alloy runs as a DaemonSet and automatically:

  1. Discovers all pods in the cluster
  2. Reads logs from /var/log/pods/
  3. Sends them to Loki with labels:
    • namespace
    • pod
    • container
    • node
    • All pod labels

View in Grafana:

# All logs from your app
{namespace="your-namespace", pod=~"your-app.*"}

# Error logs only
{namespace="your-namespace"} |= "error"

# JSON logs parsed
{namespace="your-namespace"} | json | level="error"

Best Practice for Logs: Emit structured JSON logs from your application:

import json
import logging

# Python example
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO
)

logger = logging.getLogger(__name__)

# Log as JSON
logger.info(json.dumps({
    "level": "info",
    "message": "User login successful",
    "user_id": "123",
    "ip": "1.2.3.4",
    "duration_ms": 42
}))
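
Once logs are structured JSON, LogQL's json parser lets you filter on any field, for example (field names follow the example above):

{namespace="your-namespace"} | json | user_id="123"
{namespace="your-namespace"} | json | duration_ms > 1000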

Manual: Application Metrics

Step 1: Expose Metrics Endpoint

Your application needs to expose metrics at /metrics in Prometheus format.
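
If you are curious what "Prometheus format" means, it is a plain-text exposition format; a minimal (hypothetical) example of what /metrics returns:

# HELP http_requests_total Total HTTP requests processed.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 3

In practice you rarely write this by hand; the client libraries below generate it for you.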

Python (Flask) Example:

from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)

# The /metrics endpoint is now available with automatic
# metrics: request count, duration, etc.

Python (FastAPI) Example:

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)

# /metrics endpoint is now available

Go Example:

package main

import (
    "net/http"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Serve the default registry at /metrics
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Node.js Example:

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Collect default process/runtime metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
    res.set('Content-Type', register.contentType);
    res.end(await register.metrics());
});

app.listen(8080);
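
Beyond the defaults, you can record custom metrics. A minimal Python sketch with the prometheus_client library (metric names and labels are illustrative); libraries like prometheus_flask_exporter serve the default registry, so these appear on the same /metrics endpoint:

from prometheus_client import Counter, Histogram

# Illustrative custom metrics, registered in the default registry
ORDERS = Counter(
    "orders_processed_total",
    "Total orders processed",
    ["status"],
)
ORDER_LATENCY = Histogram(
    "order_processing_seconds",
    "Time spent processing an order",
)

def process_order(order):
    # Times the block and records the duration in the histogram
    with ORDER_LATENCY.time():
        # ... business logic ...
        ORDERS.labels(status="ok").inc()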

Step 2: Add Prometheus Annotations to Your Deployment

Add these annotations to your pod template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"   # Enable scraping
        prometheus.io/port: "8080"     # Port where metrics are exposed
        prometheus.io/path: "/metrics" # Path to metrics (optional, /metrics is default)
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - name: http
              containerPort: 8080

Step 3: Verify Metrics Collection

Check in Prometheus:

  1. Access Prometheus UI (if exposed): https://prometheus.betelgeusebytes.io
  2. Go to Status → Targets
  3. Look for your pod under "kubernetes-pods"
  4. Should show as "UP"

Or via Grafana:

  1. Go to Explore → Prometheus
  2. Query: up{pod=~"my-app.*"}
  3. Should return value=1

Query your metrics:

# Request rate
rate(http_requests_total{namespace="my-namespace"}[5m])

# Request duration 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{namespace="my-namespace", status=~"5.."}[5m])
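
A common follow-up is the error ratio, dividing the two rates above:

# Fraction of requests returning 5xx
sum(rate(http_requests_total{namespace="my-namespace", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{namespace="my-namespace"}[5m]))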

Manual: Application Traces

Step 1: Add OpenTelemetry to Your Application

Python Example:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource

# Configure resource
resource = Resource.create({"service.name": "my-app"})

# Setup tracer
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="http://tempo.observability.svc.cluster.local:4317",
            insecure=True
        )
    )
)
trace.set_tracer_provider(trace_provider)

# Auto-instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Manual spans
tracer = trace.get_tracer(__name__)

@app.route('/api/data')
def get_data():
    with tracer.start_as_current_span("fetch_data") as span:
        # Your code here
        span.set_attribute("rows", 100)
        return {"data": "..."}

Install dependencies:

pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-flask \
    opentelemetry-exporter-otlp-proto-grpc

Go Example:

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

// Create an OTLP gRPC exporter pointed at Tempo
exporter, err := otlptracegrpc.New(
    context.Background(),
    otlptracegrpc.WithEndpoint("tempo.observability.svc.cluster.local:4317"),
    otlptracegrpc.WithInsecure(),
)
if err != nil {
    log.Fatal(err)
}

tp := trace.NewTracerProvider(
    trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)

Node.js Example:

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({
    url: 'http://tempo.observability.svc.cluster.local:4317'
});
// Note: newer SDK versions drop addSpanProcessor; pass spanProcessors
// to the NodeTracerProvider constructor instead.
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

Step 2: Add Trace IDs to Your Logs

Emitting the current trace ID in each log line enables clicking from logs to traces in Grafana.

Python Example:

import json
from opentelemetry import trace

def log_with_trace(message):
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, '032x')
    
    log_entry = {
        "message": message,
        "trace_id": trace_id,
        "level": "info"
    }
    print(json.dumps(log_entry))
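
If you use the standard logging module throughout, a logging.Filter can inject the trace ID into every record instead of building each line by hand. A sketch, assuming the tracer from Step 1 is already configured:

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID (or "") to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
# Naive JSON via format string; a real JSON formatter is safer for
# messages containing quotes.
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", "trace_id": "%(trace_id)s"}'
))
logging.basicConfig(level=logging.INFO, handlers=[handler])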

Step 3: Verify Traces

In Grafana:

  1. Go to Explore → Tempo
  2. Search for service: "my-app"
  3. Click on a trace to view details
  4. Click "Logs for this span" to see correlated logs
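
You can also search with TraceQL in the Tempo query editor; a minimal query matching the service name from Step 1:

{ resource.service.name = "my-app" }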

📋 Complete Example: Monitoring a New App

Here's a complete deployment with all monitoring configured:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config
  namespace: my-namespace
data:
  app.py: |
    from flask import Flask
    import logging
    import json
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.sdk.resources import Resource
    from prometheus_flask_exporter import PrometheusMetrics
    
    # Setup logging
    logging.basicConfig(level=logging.INFO, format='%(message)s')
    logger = logging.getLogger(__name__)
    
    # Setup tracing
    resource = Resource.create({"service.name": "my-app"})
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint="http://tempo.observability.svc.cluster.local:4317",
                insecure=True
            )
        )
    )
    trace.set_tracer_provider(trace_provider)
    
    app = Flask(__name__)
    
    # Setup metrics
    metrics = PrometheusMetrics(app)
    
    # Auto-instrument with traces
    FlaskInstrumentor().instrument_app(app)
    
    @app.route('/')
    def index():
        span = trace.get_current_span()
        trace_id = format(span.get_span_context().trace_id, '032x')
        
        logger.info(json.dumps({
            "level": "info",
            "message": "Request received",
            "trace_id": trace_id,
            "endpoint": "/"
        }))
        
        return {"status": "ok", "trace_id": trace_id}
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)    

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Enable Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: my-app
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install flask opentelemetry-api opentelemetry-sdk \
                opentelemetry-instrumentation-flask \
                opentelemetry-exporter-otlp-proto-grpc \
                prometheus-flask-exporter && \
              python /app/app.py              
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:
            - name: app-code
              mountPath: /app
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: app-code
          configMap:
            name: my-app-config

---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: my-app
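
Save the manifests above to a file (for example my-app.yaml) and deploy:

kubectl apply -f my-app.yaml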

🔍 Verification Checklist

After deploying a new app with monitoring:

Logs ✓ (Automatic)

# Check logs appear in Grafana
# Explore → Loki → {namespace="my-namespace", pod=~"my-app.*"}

Metrics ✓ (If configured)

# Check Prometheus is scraping
# Explore → Prometheus → up{pod=~"my-app.*"}
# Should return 1

# Check your custom metrics
# Explore → Prometheus → flask_http_request_total{namespace="my-namespace"}

Traces ✓ (If configured)

# Check traces appear in Tempo
# Explore → Tempo → Search for service "my-app"
# Should see traces

# Verify log-trace correlation
# Click on a log line with trace_id → should jump to trace

🎓 Quick Start for Common Frameworks

Python Flask/FastAPI

pip install opentelemetry-distro opentelemetry-exporter-otlp prometheus-flask-exporter
opentelemetry-bootstrap -a install
# Set environment variables in your deployment:
OTEL_SERVICE_NAME=my-app
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.observability.svc.cluster.local:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# Then run with auto-instrumentation:
opentelemetry-instrument python app.py

Go

go get go.opentelemetry.io/otel
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc

Node.js

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
    @opentelemetry/exporter-trace-otlp-grpc prom-client
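
With the auto-instrumentations package installed, tracing can be enabled at startup without code changes (a sketch; verify the flag against the versions you install):

OTEL_SERVICE_NAME=my-app \
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.observability.svc.cluster.local:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
node --require @opentelemetry/auto-instrumentations-node/register app.js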

📚 Summary

| Component | Automatic? | Configuration Needed |
|-----------|------------|----------------------|
| Logs | Yes | None - just deploy your app |
| Metrics | No | Add /metrics endpoint + annotations |
| Traces | No | Add OpenTelemetry SDK + configure endpoint |

Recommended Approach:

  1. Start simple: Deploy app, logs work automatically
  2. Add metrics: Expose /metrics, add annotations
  3. Add traces: Instrument with OpenTelemetry
  4. Correlate: Add trace IDs to logs for full observability

🆘 Troubleshooting

Logs not appearing:

  • Check Alloy is running: kubectl get pods -n observability -l app=alloy
  • Check pod logs are being written to stdout/stderr
  • View in real-time: kubectl logs -f <pod-name> -n <namespace>
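
If pods are writing to stdout but nothing reaches Loki, Alloy's own logs usually explain why (using the label from the check above):

kubectl logs -n observability -l app=alloy --tail=50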

Metrics not being scraped:

  • Verify annotations are present: kubectl get pod <pod> -o yaml | grep prometheus
  • Check the /metrics endpoint: kubectl port-forward pod/<pod> 8080:8080, then curl localhost:8080/metrics
  • Check Prometheus targets: https://prometheus.betelgeusebytes.io/targets

Traces not appearing:

  • Verify endpoint: tempo.observability.svc.cluster.local:4317
  • Check Tempo logs: kubectl logs -n observability tempo-0
  • Verify OTLP exporter is configured correctly in your app
  • Check network policies allow traffic to observability namespace