betelgeusebytes/k8s/observability-stack/MONITORING-GUIDE.md


# Access URLs & Monitoring New Applications Guide
## 🌐 Access URLs
### Required (Already Configured)
**Grafana - Main Dashboard**
- **URL**: https://grafana.betelgeusebytes.io
- **DNS Required**: Yes - `grafana.betelgeusebytes.io` → your cluster IP
- **Login**: admin / admin (change on first login!)
- **Purpose**: Unified interface for logs, metrics, and traces
- **Ingress**: Already included in deployment (20-grafana-ingress.yaml)
### Optional (Direct Component Access)
You can optionally expose these components directly:
**Prometheus - Metrics UI**
- **URL**: https://prometheus.betelgeusebytes.io
- **DNS Required**: Yes - `prometheus.betelgeusebytes.io` → your cluster IP
- **Purpose**: Direct access to Prometheus UI, query metrics, check targets
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
- **Use Case**: Debugging metric collection, advanced PromQL queries
**Loki - Logs API**
- **URL**: https://loki.betelgeusebytes.io
- **DNS Required**: Yes - `loki.betelgeusebytes.io` → your cluster IP
- **Purpose**: Direct API access for log queries
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
- **Use Case**: External log forwarding, API integration
**Tempo - Traces API**
- **URL**: https://tempo.betelgeusebytes.io
- **DNS Required**: Yes - `tempo.betelgeusebytes.io` → your cluster IP
- **Purpose**: Direct API access for trace queries
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
- **Use Case**: External trace ingestion, API integration
### Internal Only (No DNS Required)
These are ClusterIP services accessible only from within the cluster:
```
http://prometheus.observability.svc.cluster.local:9090
http://loki.observability.svc.cluster.local:3100
http://tempo.observability.svc.cluster.local:3200
http://tempo.observability.svc.cluster.local:4317 # OTLP gRPC
http://tempo.observability.svc.cluster.local:4318 # OTLP HTTP
```
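To confirm these in-cluster endpoints respond, you can curl them from a throwaway pod. The health paths used here are the components' standard readiness endpoints (`/-/healthy` for Prometheus, `/ready` for Loki and Tempo); adjust if your versions differ:

```shell
# Run a one-off curl pod and probe each component's health endpoint
kubectl run healthcheck --rm -it --restart=Never --image=curlimages/curl -- sh -c '
  curl -s http://prometheus.observability.svc.cluster.local:9090/-/healthy
  curl -s http://loki.observability.svc.cluster.local:3100/ready
  curl -s http://tempo.observability.svc.cluster.local:3200/ready
'
```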
## 🎯 Recommendation
**For most users**: Just use Grafana (grafana.betelgeusebytes.io)
- Grafana provides unified access to all components
- No need to expose Prometheus, Loki, or Tempo directly
- Simpler DNS configuration (only one subdomain)
**For power users**: Add optional ingresses
- Direct Prometheus access is useful for debugging
- Helps verify targets and scrape configs
- Deploy with: `kubectl apply -f 21-optional-ingresses.yaml`
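Before pointing DNS at any of these hosts, a quick sanity check (assuming the ingresses live in the `observability` namespace):

```shell
# Confirm the ingress objects exist and note their assigned address
kubectl get ingress -n observability

# Confirm the DNS record resolves to your cluster IP
nslookup grafana.betelgeusebytes.io
```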
## 📊 Monitoring New Applications
### Automatic: Kubernetes Logs
**All pod logs are automatically collected!** No configuration needed.
Alloy runs as a DaemonSet and automatically:
1. Discovers all pods in the cluster
2. Reads logs from `/var/log/pods/`
3. Sends them to Loki with labels:
- `namespace`
- `pod`
- `container`
- `node`
- All pod labels
**View in Grafana:**
```logql
# All logs from your app
{namespace="your-namespace", pod=~"your-app.*"}
# Error logs only
{namespace="your-namespace"} |= "error"
# JSON logs parsed
{namespace="your-namespace"} | json | level="error"
```
**Best Practice for Logs:**
Emit structured JSON logs from your application:
```python
import json
import logging

# Python example
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

# Log as JSON
logger.info(json.dumps({
    "level": "info",
    "message": "User login successful",
    "user_id": "123",
    "ip": "1.2.3.4",
    "duration_ms": 42
}))
```
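Rather than calling `json.dumps` at every call site, you can push the serialization into a custom `logging.Formatter`. A minimal stdlib-only sketch (the `fields` key passed via `extra=` is a convention chosen here, not a logging built-in):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Merge any structured fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("my-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful", extra={"fields": {"user_id": "123"}})
```

Call sites stay plain `logger.info(...)`, and every line that reaches Loki is parseable with `| json`.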
### Manual: Application Metrics
#### Step 1: Expose Metrics Endpoint
Your application needs to expose metrics at `/metrics` in Prometheus format.
**Python (Flask) Example:**
```python
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)

# The /metrics endpoint is now available
# Automatic metrics: request count, duration, etc.
```
**Python (FastAPI) Example:**
```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)
# The /metrics endpoint is now available
```
**Go Example:**
```go
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
```
**Node.js Example:**
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create default metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
#### Step 2: Add Prometheus Annotations to Your Deployment
Add these annotations to your pod template:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"   # Enable scraping
        prometheus.io/port: "8080"     # Port where metrics are exposed
        prometheus.io/path: "/metrics" # Path to metrics (optional, /metrics is the default)
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - name: http
              containerPort: 8080
```
#### Step 3: Verify Metrics Collection
**Check in Prometheus:**
1. Access Prometheus UI (if exposed): https://prometheus.betelgeusebytes.io
2. Go to Status → Targets
3. Look for your pod under "kubernetes-pods"
4. Should show as "UP"
**Or via Grafana:**
1. Go to Explore → Prometheus
2. Query: `up{pod=~"my-app.*"}`
3. Should return value=1
**Query your metrics:**
```promql
# Request rate
rate(http_requests_total{namespace="my-namespace"}[5m])
# Request duration 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{namespace="my-namespace", status=~"5.."}[5m])
```
### Manual: Application Traces
#### Step 1: Add OpenTelemetry to Your Application
**Python Example:**
```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource

# Configure resource
resource = Resource.create({"service.name": "my-app"})

# Set up the tracer
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="http://tempo.observability.svc.cluster.local:4317",
            insecure=True
        )
    )
)
trace.set_tracer_provider(trace_provider)

# Auto-instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Manual spans
tracer = trace.get_tracer(__name__)

@app.route('/api/data')
def get_data():
    with tracer.start_as_current_span("fetch_data") as span:
        # Your code here
        span.set_attribute("rows", 100)
        return {"data": "..."}
```
**Install dependencies:**
```bash
pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-flask \
  opentelemetry-exporter-otlp-proto-grpc
```
**Go Example:**
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

exporter, _ := otlptracegrpc.New(
    context.Background(),
    otlptracegrpc.WithEndpoint("tempo.observability.svc.cluster.local:4317"),
    otlptracegrpc.WithInsecure(),
)
tp := trace.NewTracerProvider(
    trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
```
**Node.js Example:**
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({
  url: 'http://tempo.observability.svc.cluster.local:4317'
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```
#### Step 2: Add Trace IDs to Logs (Optional but Recommended)
This enables clicking from logs to traces in Grafana!
**Python Example:**
```python
import json
from opentelemetry import trace

def log_with_trace(message):
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, '032x')
    log_entry = {
        "message": message,
        "trace_id": trace_id,
        "level": "info"
    }
    print(json.dumps(log_entry))
```
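For the click-through to work, Grafana's Loki datasource also needs a derived field that extracts `trace_id` and links it to Tempo. A provisioning sketch — the `matcherRegex` and `datasourceUid: tempo` are assumptions to adapt to your setup:

```yaml
# Grafana datasource provisioning fragment (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki.observability.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":\s*"(\w+)"'
          url: '$${__value.raw}'       # $$ escapes $ in provisioning files
          datasourceUid: tempo         # uid of your Tempo datasource
```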
#### Step 3: Verify Traces
**In Grafana:**
1. Go to Explore → Tempo
2. Search for service: "my-app"
3. Click on a trace to view details
4. Click "Logs for this span" to see correlated logs
## 📋 Complete Example: Monitoring a New App
Here's a complete deployment with all monitoring configured:
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config
  namespace: my-namespace
data:
  app.py: |
    from flask import Flask
    import logging
    import json
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.sdk.resources import Resource
    from prometheus_flask_exporter import PrometheusMetrics

    # Setup logging
    logging.basicConfig(level=logging.INFO, format='%(message)s')
    logger = logging.getLogger(__name__)

    # Setup tracing
    resource = Resource.create({"service.name": "my-app"})
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint="http://tempo.observability.svc.cluster.local:4317",
                insecure=True
            )
        )
    )
    trace.set_tracer_provider(trace_provider)

    app = Flask(__name__)

    # Setup metrics
    metrics = PrometheusMetrics(app)

    # Auto-instrument with traces
    FlaskInstrumentor().instrument_app(app)

    @app.route('/')
    def index():
        span = trace.get_current_span()
        trace_id = format(span.get_span_context().trace_id, '032x')
        logger.info(json.dumps({
            "level": "info",
            "message": "Request received",
            "trace_id": trace_id,
            "endpoint": "/"
        }))
        return {"status": "ok", "trace_id": trace_id}

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Enable Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: my-app
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install flask opentelemetry-api opentelemetry-sdk \
                opentelemetry-instrumentation-flask \
                opentelemetry-exporter-otlp-proto-grpc \
                prometheus-flask-exporter && \
              python /app/app.py
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:
            - name: app-code
              mountPath: /app
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: app-code
          configMap:
            name: my-app-config
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: my-app
```
## 🔍 Verification Checklist
After deploying a new app with monitoring:
### Logs ✓ (Automatic)
```bash
# Check logs appear in Grafana
# Explore → Loki → {namespace="my-namespace", pod=~"my-app.*"}
```
### Metrics ✓ (If configured)
```bash
# Check Prometheus is scraping
# Explore → Prometheus → up{pod=~"my-app.*"}
# Should return 1
# Check your custom metrics
# Explore → Prometheus → flask_http_request_total{namespace="my-namespace"}
```
### Traces ✓ (If configured)
```bash
# Check traces appear in Tempo
# Explore → Tempo → Search for service "my-app"
# Should see traces
# Verify log-trace correlation
# Click on a log line with trace_id → should jump to trace
```
## 🎓 Quick Start for Common Frameworks
### Python Flask/FastAPI
```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp prometheus-flask-exporter
opentelemetry-bootstrap -a install
```
```bash
# Set environment variables in your deployment:
export OTEL_SERVICE_NAME=my-app
export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.observability.svc.cluster.local:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# Then run with auto-instrumentation:
opentelemetry-instrument python app.py
```
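In a Kubernetes Deployment, those variables go under the container's `env`. A hypothetical fragment (adjust the image and command to your build):

```yaml
containers:
  - name: my-app
    image: my-app:latest
    command: ["opentelemetry-instrument", "python", "app.py"]
    env:
      - name: OTEL_SERVICE_NAME
        value: my-app
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://tempo.observability.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
```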
### Go
```bash
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```
### Node.js
```bash
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc prom-client
```
## 📚 Summary
| Component | Automatic? | Configuration Needed |
|-----------|-----------|---------------------|
| **Logs** | ✅ Yes | None - just deploy your app |
| **Metrics** | ❌ No | Add /metrics endpoint + annotations |
| **Traces** | ❌ No | Add OpenTelemetry SDK + configure endpoint |
**Recommended Approach:**
1. **Start simple**: Deploy app, logs work automatically
2. **Add metrics**: Expose /metrics, add annotations
3. **Add traces**: Instrument with OpenTelemetry
4. **Correlate**: Add trace IDs to logs for full observability
## 🔗 Useful Links
- OpenTelemetry Python: https://opentelemetry.io/docs/instrumentation/python/
- OpenTelemetry Go: https://opentelemetry.io/docs/instrumentation/go/
- OpenTelemetry Node.js: https://opentelemetry.io/docs/instrumentation/js/
- Prometheus Client Libraries: https://prometheus.io/docs/instrumenting/clientlibs/
- Grafana Docs: https://grafana.com/docs/
## 🆘 Troubleshooting
**Logs not appearing:**
- Check Alloy is running: `kubectl get pods -n observability -l app=alloy`
- Check pod logs are being written to stdout/stderr
- View in real-time: `kubectl logs -f <pod-name> -n <namespace>`
**Metrics not being scraped:**
- Verify annotations are present: `kubectl get pod <pod> -o yaml | grep prometheus`
- Check /metrics endpoint: `kubectl port-forward pod/<pod> 8080:8080` then `curl localhost:8080/metrics`
- Check Prometheus targets: https://prometheus.betelgeusebytes.io/targets
**Traces not appearing:**
- Verify endpoint: `tempo.observability.svc.cluster.local:4317`
- Check Tempo logs: `kubectl logs -n observability tempo-0`
- Verify OTLP exporter is configured correctly in your app
- Check network policies allow traffic to observability namespace
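One quick way to rule out connectivity problems is to POST an empty OTLP/HTTP payload to Tempo's port 4318 from a throwaway pod; `/v1/traces` is the standard OTLP/HTTP path, and getting any HTTP status back (even a 400) proves the endpoint is reachable:

```shell
# Prints the HTTP status code Tempo returns for an empty OTLP payload
kubectl run otlp-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
    -X POST -H 'Content-Type: application/json' -d '{}' \
    http://tempo.observability.svc.cluster.local:4318/v1/traces
```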