Add observability stack and supporting scripts
- Introduced combine.sh script to aggregate .txt, .py, .yml, .yaml, .ini files into betelgeusebytes.txt.
- Updated Loki configuration to disable retention settings.
- Modified Tempo configuration to change storage paths from /tmp to /var.
- Refactored Alloy configuration to streamline Prometheus integration and removed unnecessary metrics export.
- Enhanced RBAC permissions to include pod log access.
- Added security context to Tempo deployment for improved security.
- Created README_old.md for documentation of the observability stack.
- Developed me.md as an authoritative guide for the AI infrastructure stack.
- Implemented test-loki-logs.sh script to validate Loki log collection and connectivity.
This commit is contained in:
parent
dfdd36db3f
commit
404deb1d52
@ -0,0 +1,93 @@
# BetelgeuseBytes – Architecture Overview

## High-Level Architecture

This platform is a **self-hosted, production-grade Kubernetes stack** designed for:

* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling

The architecture follows a **hub-and-spoke model**:

* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS

---

## Logical Architecture Diagram (Textual)

```
Internet
   │
   ▼
DNS (betelgeusebytes.io)
   │
   ▼
Ingress-NGINX (TLS via cert-manager)
   │
   ├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
   ├── ML UIs (Jupyter, Label Studio, MLflow)
   ├── Automation (n8n, Argo)
   └── APIs (Postgres TCP, Neo4j Bolt, Kafka)

Kubernetes Cluster
   ├── Control Plane
   ├── Worker Nodes
   ├── Stateful Workloads (local SSD)
   └── Observability Stack
```

---

## Key Design Principles

* **Bare‑metal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**

---

## Networking

* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services exposed via Ingress patch (Postgres, Neo4j Bolt)
* WireGuard mesh between nodes
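The TCP exposure can be sketched via ingress-nginx's `tcp-services` ConfigMap mechanism; the namespace/service targets and external ports below are assumptions for illustration, not the repo's actual manifest:

```shell
# Hypothetical sketch: ingress-nginx reads a ConfigMap that maps external
# ports to namespace/service:port targets. Service names are assumed.
cat > tcp-services.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "5432": "db/postgres:5432"
  "7687": "graph/neo4j:7687"
EOF
# kubectl apply -f tcp-services.yaml
# (the controller must run with --tcp-services-configmap=ingress-nginx/tcp-services)
echo "wrote tcp-services.yaml"
```

The same mechanism extends to any other raw TCP service (e.g. Kafka), one external port per entry.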

---

## Security Model

* TLS everywhere (cert-manager + Let’s Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO

---

## Scalability Notes

* Currently single control-plane + workers
* Designed to add:
  * More workers
  * Dedicated control-plane VPS nodes
  * GPU nodes (for vLLM / training)

---

## What This Enables

* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Production observability-first apps
@ -0,0 +1,46 @@
# Deployment & Operations Guide

## Deployment Model

* Declarative Kubernetes manifests
* Applied via `kubectl` or Argo CD
* No Helm dependency

---

## General Rules

* Stateless apps by default
* PVCs required for state
* Secrets via Kubernetes Secrets
* Config via environment variables

---

## Deployment Order (Recommended)

1. Networking (Cilium, Ingress)
2. cert-manager
3. Storage (PVs)
4. Databases (Postgres, Redis, Kafka)
5. Observability stack
6. ML tooling
7. Automation tools
8. Custom applications
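The order above could be driven by a small helper script; the directory names are assumptions modeled on the `k8s/` paths used elsewhere in this repo:

```shell
# Hypothetical apply-in-order helper; paths are illustrative, not authoritative.
ORDER="
k8s/cilium/
k8s/ingress-nginx/
k8s/cert-manager/
k8s/storage/
k8s/postgres/ k8s/redis/ k8s/kafka/
k8s/observability/
k8s/jupyter/ k8s/mlflow/
k8s/argo/ k8s/n8n/
"
for path in $ORDER; do
  # print rather than apply, so the plan can be reviewed before running for real
  echo "kubectl apply -f $path"
done
```

Dropping the `echo` turns the dry run into an actual ordered rollout.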

---

## Operations

* Monitor via Grafana
* Debug via logs & traces
* Upgrade via Git commits
* Rollback via Argo CD

---

## Backup Strategy

* MinIO buckets versioned
* Database snapshots
* Git repositories mirrored
@ -0,0 +1,34 @@
# Future Use Cases & Projects

This platform is intentionally **general‑purpose**.

## AI & ML

* RAG platforms
* Offline assistants
* Agent systems
* NLP research

## Knowledge Graphs

* Academic citation graphs
* Trust & provenance systems
* Dependency analysis

## Data Platforms

* Event‑driven ETL
* Feature stores
* Research data lakes

## Observability & Ops

* Internal platform monitoring
* Security analytics
* Audit systems

## Sovereign Deployments

* On‑prem AI for enterprises
* NGO / government tooling
* Privacy‑preserving analytics
@ -0,0 +1,102 @@
# BetelgeuseBytes – Infrastructure & Cluster Configuration

## Hosting Provider

* **Provider**: Hetzner
* **Server Type**: Dedicated servers
* **Region**: EU
* **Network**: Private LAN + WireGuard

---

## Nodes

### Current Nodes

| Node | Role | Notes |
| --------- | ---------------------- | ------------------- |
| hetzner-1 | control-plane + worker | runs core workloads |
| hetzner-2 | worker + storage | hosts local SSD PVs |

---

## Kubernetes Setup

* Kubernetes installed via kubeadm
* Single cluster
* Control plane is also schedulable

### CNI

* **Cilium**
  * eBPF dataplane
  * kube-proxy replacement
  * Network policy support

---

## Storage

### Persistent Volumes

* Backed by **local NVMe / SSD**
* Manually provisioned PVs
* Bound via PVCs

### Storage Layout

```
/mnt/local-ssd/
├── postgres/
├── neo4j/
├── elasticsearch/
├── prometheus/
├── loki/
├── tempo/
├── grafana/
├── minio/
└── qdrant/
```
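A manually provisioned PV on this layout might look like the following sketch; the capacity, `storageClassName`, and the hetzner-2 pinning are assumptions, not taken from the repo's manifests:

```shell
# Hypothetical local PersistentVolume for the postgres directory above,
# pinned to the storage node via nodeAffinity. All values are illustrative.
cat > pv-postgres.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/local-ssd/postgres
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["hetzner-2"]
EOF
echo "wrote pv-postgres.yaml"
```

A PVC with the same `storageClassName` and a matching size then binds to this volume; the nodeAffinity block is what makes Kubernetes schedule the consuming pod onto the node that actually holds the disk.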

---

## Networking

* Ingress Controller: nginx
* External DNS records → ingress IP
* TCP mappings for:
  * PostgreSQL
  * Neo4j Bolt

---

## TLS & Certificates

* cert-manager
* ClusterIssuer: Let’s Encrypt
* Automatic renewal

---

## Namespaces

| Namespace | Purpose |
| ------------- | ---------------------------------- |
| db | Databases (Postgres, Redis) |
| graph | Neo4j |
| broker | Kafka |
| ml | ML tooling (Jupyter, Argo, MLflow) |
| observability | Grafana, Prometheus, Loki, Tempo |
| automation | n8n |
| devops | Gitea, Argo CD |

---

## What This Infra Enables

* Full on‑prem AI platform
* Predictable performance
* Low-latency data access
* Independence from cloud providers
@ -0,0 +1,32 @@
# 🔭 Observability Stack

---

## Components

- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter

---

## Capabilities

- Logs ↔ traces ↔ metrics correlation
- OTLP-native instrumentation
- Centralized dashboards
- Alerting-ready

---

## Instrumentation Rules

All apps must:

- expose `/metrics`
- emit structured JSON logs
- export OTLP traces
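A minimal conformance check for these rules could look like this sketch; the `/metrics` probe assumes the standard Prometheus text format, and the log check is only a crude shape heuristic:

```shell
# Hypothetical smoke checks for the instrumentation rules above.
check_metrics() {
  # a Prometheus text-format endpoint normally serves "# HELP"/"# TYPE" lines
  curl -fsS "$1/metrics" | grep -q '^# HELP' \
    && echo "metrics: ok" || echo "metrics: FAIL"
}
is_json_log() {
  # crude structured-log check: the line should look like one JSON object
  case "$1" in
    '{'*'}') echo "log: ok" ;;
    *)       echo "log: FAIL" ;;
  esac
}
is_json_log '{"level":"info","msg":"service started"}'   # prints "log: ok"
```

`check_metrics http://my-service:8000` can be run in-cluster against any new deployment before it is considered observable.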
148
README.md
@ -1,43 +1,123 @@
```diff
-# BetelgeuseBytes K8s — Full Stack (kubectl-only)
+# 🧠 BetelgeuseBytes AI Platform — Documentation

-**Nodes**
-- Control-plane + worker: hetzner-1 (95.217.89.53)
-- Worker: hetzner-2 (138.201.254.97)
+This documentation describes a **self-hosted, CPU-first AI platform** running on Kubernetes,
+designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.

-## Bring up the cluster
-```bash
-ansible -i ansible/inventories/prod/hosts.ini all -m ping
-ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
-```
+## 📚 Documentation Index

-## Apply apps (edit secrets first)
-```bash
-kubectl apply -f k8s/00-namespaces.yaml
-kubectl apply -f k8s/01-secrets/
-kubectl apply -f k8s/storage/storageclass.yaml
+- [Architecture](ARCHITECTURE.md)
+- [Infrastructure](INFRASTRUCTURE.md)
+- [Full Stack Overview](STACK.md)
+- [Deployment & Operations](DEPLOYMENT.md)
+- [Observability](OBSERVABILITY.md)
+- [Roadmap & Next Steps](ROADMAP.md)
+- [Future Projects & Use Cases](FUTURE-PROJECTS.md)

-kubectl apply -f k8s/postgres/
-kubectl apply -f k8s/redis/
-kubectl apply -f k8s/elastic/elasticsearch.yaml
-kubectl apply -f k8s/elastic/kibana.yaml
+## 🎯 Current Focus

-kubectl apply -f k8s/gitea/
-kubectl apply -f k8s/jupyter/
-kubectl apply -f k8s/kafka/kafka.yaml
-kubectl apply -f k8s/kafka/kafka-ui.yaml
-kubectl apply -f k8s/neo4j/
+- Hadith sanad & matn extraction
+- Narrator relationship modeling
+- Knowledge graph construction
+- Human-in-the-loop verification
+- Explainable, sovereign AI

-kubectl apply -f k8s/otlp/
-kubectl apply -f k8s/observability/fluent-bit.yaml
-kubectl apply -f k8s/prometheus/
-kubectl apply -f k8s/grafana/
-```
+## 🧠 What each document gives you
+### ARCHITECTURE

-## DNS
-A records:
-- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97
+- Logical system architecture

-CNAMEs → apps.betelgeusebytes.io:
-- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.
+- Data & control flow

-(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.
+- Networking and security model
+
+- Design principles (CPU-first, sovereign, observable)
+
+- What the architecture enables long-term
+
+This is what you show to **architects and senior engineers.**
+
+### INFRASTRUCTURE
+
+- Hetzner setup (dedicated, CPU-only, SSD)
+- Node roles and responsibilities
+- Kubernetes topology
+- Cilium networking
+- Storage layout on disk
+- Namespaces and isolation strategy
+
+This is what you show to **ops / SRE / infra people.**
+
+### STACK
+
+- Exhaustive list of every deployed component
+- Grouped by domain:
+  - Core platform
+  - Databases & messaging
+  - Knowledge & vectors
+  - ML & AI
+  - Automation & DevOps
+  - Observability
+  - Authentication
+
+For each: **what it does now + what it can be reused for**
+
+This is the **master mental model** of your platform.
+
+### DEPLOYMENT
+
+- How the platform is deployed (kubectl + GitOps)
+- Deployment order
+- Operational rules
+- Backup strategy
+- Day-2 operations mindset
+
+This is your ***runbook starter.***
+
+### ROADMAP
+
+- Clear technical phases:
+  - Neo4j isnād schema
+  - Authenticity scoring
+  - Productization
+  - Scaling (GPU, multi-project)
+
+This keeps the project ***directionally sane.***
+
+### FUTURE-PROJECTS
+
+- Explicitly documents that this is **not just a Hadith stack**
+- Lists realistic reuse cases:
+  - RAG
+  - Knowledge graphs
+  - Sovereign AI
+  - Digital humanities
+  - Research platforms
+
+This justifies the ***investment in infra quality.***
```
@ -0,0 +1,43 @@
# BetelgeuseBytes K8s — Full Stack (kubectl-only)

**Nodes**
- Control-plane + worker: hetzner-1 (95.217.89.53)
- Worker: hetzner-2 (138.201.254.97)

## Bring up the cluster
```bash
ansible -i ansible/inventories/prod/hosts.ini all -m ping
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

## Apply apps (edit secrets first)
```bash
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/01-secrets/
kubectl apply -f k8s/storage/storageclass.yaml

kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

## DNS
A records:
- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97

CNAMEs → apps.betelgeusebytes.io:
- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.

(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.
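The DNS layout above can be spot-checked from any machine; this sketch assumes `dig` (bind-utils) is installed:

```shell
# Verify the DNS layout: one A record set plus CNAMEs pointing at apps.
HOSTS="gitea kibana grafana prometheus notebook broker neo4j otlp"
if command -v dig >/dev/null; then
  dig +short apps.betelgeusebytes.io A
  for host in $HOSTS; do
    echo "$host: $(dig +short "$host.betelgeusebytes.io" CNAME)"
  done
else
  echo "dig not installed; skipping DNS check"
fi
```

Each CNAME line should print `apps.betelgeusebytes.io.`; an empty result means the record is missing or not yet propagated.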
@ -0,0 +1,26 @@
# Roadmap & Next Steps

## Phase 1 – Knowledge Modeling

* Design Neo4j isnād schema
* Identity resolution
* Relationship typing

## Phase 2 – Authenticity Scoring

* Chain continuity analysis
* Narrator reliability
* Graph‑based scoring
* LLM‑assisted reasoning

## Phase 3 – Productization

* Admin dashboards
* APIs
* Provenance visualization

## Phase 4 – Scale & Extend

* GPU nodes
* vLLM integration
* Multi‑project tenancy
@ -0,0 +1,153 @@
# 🧠 BetelgeuseBytes – Full Stack Catalog

This document lists **every major component deployed in the cluster**, what it is used for today, and what it can be reused for.

---

## Core Platform

| Component | Namespace | Purpose | Reuse |
| ------------- | ------------- | --------------- | --------------- |
| Kubernetes | all | Orchestration | Any platform |
| Cilium | kube-system | Networking | Secure clusters |
| NGINX Ingress | ingress-nginx | Traffic routing | API gateway |
| cert-manager | cert-manager | TLS automation | PKI |

---

## Databases & Messaging

| Component | URL / Access | Purpose | Reuse |
| ------------- | --------------- | --------------- | ---------------- |
| PostgreSQL | TCP via Ingress | Relational DB | App backends |
| Redis | internal | Cache | Queues |
| Kafka | kafka-ui UI | Event streaming | Streaming ETL |
| Elasticsearch | Kibana UI | Search + logs | Full‑text search |

---

## Knowledge & Vector

| Component | URL | Purpose | Reuse |
| --------- | ------------------------- | --------------- | --------------- |
| Neo4j | neo4j.betelgeusebytes.io | Knowledge graph | Graph analytics |
| Qdrant | vector.betelgeusebytes.io | Vector search | RAG |

---

## ML & AI

| Component | URL | Purpose | Reuse |
| ------------ | ----------------------------- | --------------- | ---------------- |
| Jupyter | notebook UI | Experiments | Research |
| Label Studio | label.betelgeusebytes.io | Annotation | Dataset creation |
| MLflow | mlflow.betelgeusebytes.io | Model tracking | MLOps |
| Ollama / LLM | llm.betelgeusebytes.io | LLM inference | Agents |
| Embeddings | embeddings.betelgeusebytes.io | Text embeddings | Semantic search |

---

## Automation & DevOps

| Component | URL | Purpose | Reuse |
| -------------- | ----------------------- | ------------------- | ----------- |
| Argo Workflows | argo.betelgeusebytes.io | Pipelines | ETL |
| Argo CD | argocd UI | GitOps | CI/CD |
| Gitea | gitea UI | Git hosting | SCM |
| n8n | automation UI | Workflow automation | Integration |

---

## Observability (LGTM)

| Component | Purpose | Reuse |
| ---------- | --------------- | ---------------------- |
| Grafana | Dashboards | Ops center |
| Prometheus | Metrics | Monitoring |
| Loki | Logs | Debugging |
| Tempo | Traces | Distributed tracing |
| Alloy | Telemetry agent | Standardized telemetry |

---

## Authentication

| Component | Purpose | Reuse |
| --------- | ---------- | ----- |
| Keycloak | OIDC / SSO | IAM |

---

## Why This Stack Matters

* Covers **data → ML → serving → observability** end‑to‑end
* Suitable for research **and** production
* Modular and future‑proof

# 📚 Stack Catalog — Services, URLs, Access & Usage

This document lists **every deployed component**, how to access it, what it is used for **now**, and what it enables **in the future**.

---

## 🌐 Public Services (Ingress / HTTPS)

| Component | URL | Auth | What It Is | Current Usage | Future Usage |
|--------|-----|------|------------|---------------|--------------|
| LLM Inference | https://llm.betelgeusebytes.io | none / internal | CPU LLM server (Ollama / llama.cpp) | Extract sanad & matn as JSON | Agents, doc AI, RAG |
| Embeddings | https://embeddings.betelgeusebytes.io | none / internal | Text Embeddings Inference (HF) | Hadith & bio embeddings | Semantic search |
| Vector DB | https://vector.betelgeusebytes.io | none | Qdrant + UI | Similarity search | Recommendations |
| Graph DB | https://neo4j.betelgeusebytes.io | Basic Auth | Neo4j Browser | Isnād graph | Knowledge graphs |
| Orchestrator | https://hadith-api.betelgeusebytes.io | OIDC | FastAPI router | Core AI API | Any AI backend |
| Admin UI | https://hadith-admin.betelgeusebytes.io | OIDC | Next.js UI | Scholar review | Any internal tool |
| Labeling | https://label.betelgeusebytes.io | Local / OIDC | Label Studio | NER/RE annotation | Dataset curation |
| ML Tracking | https://mlflow.betelgeusebytes.io | OIDC | MLflow UI | Experiments & models | Governance |
| Object Storage | https://minio.betelgeusebytes.io | Access key | MinIO Console | Datasets & artifacts | Data lake |
| Pipelines | https://argo.betelgeusebytes.io | SA / OIDC | Argo Workflows UI | ML pipelines | ETL |
| Auth | https://auth.betelgeusebytes.io | Admin login | Keycloak | SSO & tokens | IAM |
| Observability | https://grafana.betelgeusebytes.io | Login | Grafana | Metrics/logs/traces | Ops center |

---

## 🔐 Authentication & Access Summary

| System | Auth Method | Who Uses It |
|-----|------------|-------------|
| Keycloak | Username / Password | Admins |
| Admin UI | OIDC (Keycloak) | Scholars |
| Orchestrator API | OIDC Bearer Token | Apps |
| MLflow | OIDC | ML engineers |
| Label Studio | Local / OIDC | Annotators |
| Neo4j | Basic Auth | Engineers |
| MinIO | Access / Secret key | Pipelines |
| Grafana | Login | Operators |

---

## 🧠 Internal Cluster Services (ClusterIP)

| Component | Namespace | Purpose |
|--------|-----------|--------|
| PostgreSQL | db | Relational storage |
| Redis | db | Cache / temp state |
| Kafka | broker | Event backbone |
| Prometheus | observability | Metrics |
| Loki | observability | Logs |
| Tempo | observability | Traces |
| Alloy | observability | Telemetry agent |

---

## 🗂 Storage Responsibilities

| Storage | Used By | Contains |
|------|--------|---------|
| MinIO | Pipelines, MLflow | Datasets, models |
| Neo4j PVC | Graph DB | Isnād graph |
| Qdrant PVC | Vector DB | Embeddings |
| PostgreSQL PVC | DB | Metadata |
| Observability PVCs | LGTM | Logs, metrics, traces |
File diff suppressed because it is too large
@ -0,0 +1,5 @@
# Aggregate text/config sources into one file. The -o tests are parenthesized
# so that -type f applies to all of them (without the parens, only *.txt was
# restricted to files), and the output file itself is excluded from the scan.
find . -type f ! -name "betelgeusebytes.txt" \( -name "*.txt" -o -name "*.py" -o -name "*.yml" -o -name "*.yaml" -o -name "*.YAML" -o -name "*.ini" \) | while IFS= read -r file; do
  echo "=== $file ===" >> betelgeusebytes.txt
  cat "$file" >> betelgeusebytes.txt
  echo "" >> betelgeusebytes.txt
done
@ -43,12 +43,9 @@ data:
```diff
   compactor:
     working_directory: /loki/compactor
     compaction_interval: 10m
-    retention_enabled: true
-    retention_delete_delay: 2h
-    retention_delete_worker_count: 150
+    retention_enabled: false

   limits_config:
     enforce_metric_name: false
     reject_old_samples: true
     reject_old_samples_max_age: 168h  # 7 days
     retention_period: 168h  # 7 days
```
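In the spirit of the `test-loki-logs.sh` script this commit adds, Loki connectivity can be sketched with two calls against its HTTP API; the in-cluster URL and the label selector are assumptions:

```shell
# Hypothetical Loki connectivity check; /ready and /loki/api/v1/query_range
# are Loki's standard readiness and range-query HTTP endpoints.
LOKI_URL="${LOKI_URL:-http://loki.observability.svc.cluster.local:3100}"
curl -fsS "$LOKI_URL/ready" >/dev/null 2>&1 \
  && echo "loki: ready" || echo "loki: unreachable"
curl -fsS -G "$LOKI_URL/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="observability"}' \
  --data-urlencode 'limit=5' 2>/dev/null \
  || echo "query failed (no cluster access?)"
```

Run it inside the cluster (or behind a `kubectl port-forward svc/loki 3100` with `LOKI_URL=http://localhost:3100`) so the service DNS name resolves.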
@ -39,7 +39,7 @@ data:
```diff
         source: tempo
         cluster: betelgeuse-k8s
       storage:
-        path: /tmp/tempo/generator/wal
+        path: /var/tempo/generator/wal
         remote_write:
           - url: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
             send_exemplars: true
```
@ -48,17 +48,14 @@ data:
```diff
     trace:
       backend: local
       wal:
-        path: /tmp/tempo/wal
+        path: /var/tempo/wal
       local:
-        path: /tmp/tempo/blocks
+        path: /var/tempo/blocks
       pool:
         max_workers: 100
         queue_depth: 10000

-  querier:
-    frontend_worker:
-      frontend_address: tempo.observability.svc.cluster.local:9095
-
+  # Single instance mode - no need for frontend/querier split
   query_frontend:
     search:
       duration_slo: 5s
```
@ -124,7 +124,6 @@ data:
```diff
     output {
       traces  = [otelcol.exporter.otlp.tempo.input]
-      metrics = [otelcol.exporter.prometheus.metrics.input]
     }
   }
```
@ -138,22 +137,7 @@ data:
```diff
     }
   }

-  // Export OTLP metrics to Prometheus
-  otelcol.exporter.prometheus "metrics" {
-    forward_to = [prometheus.remote_write.local.receiver]
-  }
-
-  // Remote write to Prometheus
-  prometheus.remote_write "local" {
-    endpoint {
-      url = "http://prometheus.observability.svc.cluster.local:9090/api/v1/write"
-    }
-  }
-
-  // Scrape local metrics (Alloy's own metrics)
-  prometheus.scrape "alloy" {
-    targets = [{
-      __address__ = "localhost:12345",
-    }]
-    forward_to = [prometheus.remote_write.local.receiver]
-  }
+  // Prometheus will scrape these via service discovery
+  prometheus.exporter.self "alloy" {
+  }
```
@ -66,6 +66,7 @@ rules:
```diff
         - services
         - endpoints
         - pods
+        - pods/log
       verbs: ["get", "list", "watch"]
     - apiGroups:
       - extensions
```
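With `pods/log` added, the grant can be verified through an impersonated `kubectl auth can-i` check; the service-account name and namespace below are assumptions:

```shell
# Hypothetical verification that the role now permits reading pod logs.
SA="system:serviceaccount:observability:alloy"
if command -v kubectl >/dev/null; then
  # prints "yes" when the bound role allows the verb/resource pair
  kubectl auth can-i get pods/log --as="$SA" || true
else
  echo "kubectl not available; run inside the cluster context"
fi
```

The same check with `--as` pointing at other service accounts is a quick way to confirm the permission was not granted more broadly than intended.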
@ -21,6 +21,11 @@ spec:
```diff
     spec:
       nodeSelector:
         kubernetes.io/hostname: hetzner-2
+      securityContext:
+        fsGroup: 10001
+        runAsGroup: 10001
+        runAsNonRoot: true
+        runAsUser: 10001
       containers:
         - name: tempo
           image: grafana/tempo:2.6.1
```
@ -70,7 +75,7 @@ spec:
```diff
             - name: tempo-config
               mountPath: /etc/tempo
             - name: tempo-data
-              mountPath: /tmp/tempo
+              mountPath: /var/tempo
       volumes:
         - name: tempo-config
           configMap:
```
@ -0,0 +1,388 @@
|
|||
# 🧠 BetelgeuseBytes — Full AI Infrastructure Stack
|
||||
## Authoritative README, Architecture & Onboarding Guide
|
||||
|
||||
This repository documents the **entire self-hosted AI infrastructure stack** running on a Kubernetes cluster hosted on **Hetzner dedicated servers**.
|
||||
|
||||
The stack currently powers an **Islamic Hadith Scholar AI**, but it is intentionally designed as a **general-purpose, sovereign AI, MLOps, and data platform** that can support many future projects.
|
||||
|
||||
This document is the **single source of truth** for:
|
||||
- architecture (logical & physical)
|
||||
- infrastructure configuration
|
||||
- networking & DNS
|
||||
- every deployed component
|
||||
- why each component exists
|
||||
- how to build new systems on top of the platform
|
||||
|
||||
---
|
||||
|
||||
## 1. Mission & Design Philosophy
|
||||
|
||||
### Current Mission
|
||||
Build an AI system that can:
|
||||
|
||||
- Parse classical Islamic texts
|
||||
- Extract **Sanad** (chains of narrators) and **Matn** (hadith text)
|
||||
- Identify narrators and their relationships:
|
||||
- teacher / student
|
||||
- familial lineage
|
||||
- Construct a **verifiable knowledge graph**
|
||||
- Support **human scholarly review**
|
||||
- Provide **transparent and explainable reasoning**
|
||||
- Operate **fully on-prem**, CPU-first, without SaaS or GPU dependency
|
||||
|
||||
### Core Principles
|
||||
- **Sovereignty** — no external cloud lock-in
|
||||
- **Explainability** — graph + provenance, not black boxes
|
||||
- **Human-in-the-loop** — scholars remain in control
|
||||
- **Observability-first** — everything is measurable and traceable
|
||||
- **Composable** — every part can be reused or replaced
|
||||
|
||||
---
|
||||
|
||||
## 2. Physical Infrastructure (Hetzner)
|
||||
|
||||
### Nodes
|
||||
- **Provider:** Hetzner
|
||||
- **Type:** Dedicated servers
|
||||
- **Architecture:** x86_64
|
||||
- **GPU:** None (CPU-only by design)
|
||||
- **Storage:** Local NVMe / SSD
|
||||
|
||||
### Node Roles (Logical)
|
||||
| Node Type | Responsibilities |
|
||||
|---------|------------------|
|
||||
| Control / Worker | Kubernetes control plane + workloads |
|
||||
| Storage-heavy | Databases, MinIO, observability data |
|
||||
| Compute-heavy | LLM inference, embeddings, pipelines |
|
||||
|
||||
> The cluster is intentionally **single-region and on-prem-like**, optimized for predictability and data locality.
|
||||
|
||||
---
|
||||
|
||||
## 3. Kubernetes Infrastructure Configuration
|
||||
|
||||
### Kubernetes
|
||||
- Runtime for **all services**
|
||||
- Namespaced isolation
|
||||
- Explicit PersistentVolumeClaims
|
||||
- Declarative configuration (GitOps)
|
||||
|
||||
### Namespaces (Conceptual)
|
||||
| Namespace | Purpose |
|
||||
|--------|--------|
|
||||
| `ai` | LLMs, embeddings, labeling |
|
||||
| `vec` | Vector database |
|
||||
| `graph` | Knowledge graph |
|
||||
| `db` | Relational databases |
|
||||
| `storage` | Object storage |
|
||||
| `mlops` | MLflow |
|
||||
| `ml` | Argo Workflows |
|
||||
| `auth` | Keycloak |
|
||||
| `observability` | LGTM stack |
|
||||
| `hadith` | Custom apps (orchestrator, UI) |
|
||||
|
||||
---
|
||||
|
||||
## 4. Networking & DNS
|
||||
|
||||
### Ingress
|
||||
- **NGINX Ingress Controller**
|
||||
- HTTPS termination at ingress
|
||||
- Internal services communicate via ClusterIP
|
||||
|
||||
### TLS
|
||||
- **cert-manager**
|
||||
- Let’s Encrypt
|
||||
- Automatic renewal
|
||||
|
||||
### Public Endpoints
|
||||
|
||||
| URL | Service |
|
||||
|----|--------|
|
||||
| https://llm.betelgeusebytes.io | LLM inference (Ollama / llama.cpp) |
|
||||
| https://embeddings.betelgeusebytes.io | Text Embeddings Inference |
|
||||
| https://vector.betelgeusebytes.io | Qdrant + UI |
|
||||
| https://neo4j.betelgeusebytes.io | Neo4j Browser |
|
||||
| https://hadith-api.betelgeusebytes.io | FastAPI Orchestrator |
|
||||
| https://hadith-admin.betelgeusebytes.io | Admin / Curation UI |
|
||||
| https://label.betelgeusebytes.io | Label Studio |
|
||||
| https://mlflow.betelgeusebytes.io | MLflow |
|
||||
| https://minio.betelgeusebytes.io | MinIO Console |
|
||||
| https://argo.betelgeusebytes.io | Argo Workflows |
|
||||
| https://auth.betelgeusebytes.io | Keycloak |
|
||||
| https://grafana.betelgeusebytes.io | Grafana |
|
||||
|
||||
---
|
||||
|
||||
## 5. Full Logical Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
User --> AdminUI --> Orchestrator
|
||||
|
||||
Orchestrator --> LLM
|
||||
Orchestrator --> TEI --> Qdrant
|
||||
Orchestrator --> Neo4j
|
||||
Orchestrator --> PostgreSQL
|
||||
Orchestrator --> Redis
|
||||
|
||||
LabelStudio --> MinIO
|
||||
MinIO --> ArgoWF --> MLflow
|
||||
MLflow --> Models --> Orchestrator
|
||||
|
||||
Kafka --> ArgoWF
|
||||
|
||||
Alloy --> Prometheus --> Grafana
|
||||
Alloy --> Loki --> Grafana
|
||||
Alloy --> Tempo --> Grafana
|
||||
```
|
||||
6. AI & Reasoning Layer
|
||||
Ollama / llama.cpp (CPU LLM)
|
||||
Current usage
|
||||
|
||||
JSON-structured extraction
|
||||
|
||||
Sanad / matn reasoning
|
||||
|
||||
Deterministic outputs
|
||||
|
||||
No GPU dependency
|
||||
|
||||
Future usage
|
||||
|
||||
Offline assistants
|
||||
|
||||
Document intelligence
|
||||
|
||||
Agent frameworks
|
||||
|
||||
Replaceable by vLLM when GPUs are added
|
||||
|
||||
Text Embeddings Inference (TEI)
|
||||
Current usage
|
||||
|
||||
Embeddings for hadith texts and biographies
|
||||
|
||||
Future usage
|
||||
|
||||
RAG systems
|
||||
|
||||
Semantic search
|
||||
|
||||
Deduplication
|
||||
|
||||
Similarity clustering
|
||||
|
||||
Qdrant (Vector Database)
|
||||
Current usage
|
||||
|
||||
Stores embeddings
|
||||
|
||||
Similarity search
|
||||
|
||||
Future usage
|
||||
|
||||
Recommendation systems
|
||||
|
||||
Agent memory
|
||||
|
||||
Multimodal retrieval
|
||||
|
||||
Includes Web UI.
|
||||
|
||||
7. Knowledge & Data Layer
|
||||
Neo4j (Graph Database)
|
||||
Current usage
|
||||
|
||||
Isnād chains
|
||||
|
||||
Narrator relationships
|
||||
|
||||
Future usage
|
||||
|
||||
Knowledge graphs
|
||||
|
||||
Trust networks
|
||||
|
||||
Provenance systems
|
||||
|
||||
PostgreSQL
|
||||
Current usage
|
||||
|
||||
App data
|
||||
|
||||
MLflow backend
|
||||
|
||||
Label Studio DB
|
||||
|
||||
Future usage
|
||||
|
||||
Feature stores
|
||||
|
||||
Metadata catalogs
|
||||
|
||||
Transactional apps
|
||||
|
||||
Redis
|
||||
Current usage
|
||||
|
||||
Caching
|
||||
|
||||
Temporary state
|
||||
|
||||
Future usage
|
||||
|
||||
Job queues
|
||||
|
||||
Rate limiting
|
||||
|
||||
Sessions
|
||||
|
||||
### Kafka

**Current usage**

* Optional async backbone

**Future usage**

* Streaming ingestion
* Event-driven ML
* Audit pipelines
### MinIO (S3)

**Current usage**

* Datasets
* Model artifacts
* Pipeline outputs

**Future usage**

* Data lake
* Backups
* Feature storage
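Datasets and model artifacts can be moved with the MinIO client's standard alias/cp/ls commands. A sketch; the alias name, bucket layout, URL, and the `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY` environment variables are assumptions, not this stack's actual configuration:

```shell
#!/bin/sh
# Hypothetical alias, buckets, and credentials; adjust to your MinIO deployment.
MINIO_URL="${MINIO_URL:-http://minio.data.svc.cluster.local:9000}"
echo "MinIO endpoint: $MINIO_URL"

# Register the deployment under an alias, push a dataset, list model artifacts.
mc alias set betelgeuse "$MINIO_URL" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" \
  && mc cp ./dataset.parquet betelgeuse/datasets/ \
  && mc ls betelgeuse/models/ \
  || echo "MinIO not reachable from here (run inside the cluster)"
```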
---

## 8. MLOps & Human-in-the-Loop

### Label Studio

**Current usage**

* Human annotation of narrators & relations

**Future usage**

* Any labeling task (text, image, audio)

### MLflow

**Current usage**

* Experiment tracking
* Model registry

**Future usage**

* Governance
* Model promotion
* Auditing

### Argo Workflows

**Current usage**

* ETL & training pipelines

**Future usage**

* Batch inference
* Scheduled automation
* Data engineering
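An ETL step in Argo Workflows is just a container template. A minimal single-step sketch, written as a heredoc so it can be piped straight into `kubectl create`; the `argo` namespace and `busybox` image are assumptions:

```shell
#!/bin/sh
# Minimal one-step Workflow; namespace and image are placeholders.
WF=$(cat <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etl-sketch-
  namespace: argo
spec:
  entrypoint: extract
  templates:
    - name: extract
      container:
        image: busybox
        command: [sh, -c]
        args: ["echo extracting hadith batch"]
EOF
)
echo "$WF"

# Submit only when a cluster is reachable from here.
echo "$WF" | kubectl create -f - 2>/dev/null || echo "no cluster reachable (dry sketch)"
```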
---

## 9. Authentication & Security

### Keycloak

**Current usage**

* SSO for Admin UI, MLflow, Label Studio

**Future usage**

* API authentication
* Multi-tenant access
* Organization-wide IAM
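API authentication via Keycloak reduces to fetching a token from the realm's OpenID Connect token endpoint. A sketch using the client-credentials grant; the hostname, `platform` realm, `mlflow` client, and `CLIENT_SECRET` variable are placeholders, not this stack's actual realm configuration:

```shell
#!/bin/sh
# Hypothetical realm and client; the token path is Keycloak's standard layout.
KEYCLOAK_URL="${KEYCLOAK_URL:-https://keycloak.betelgeusebytes.io}"
TOKEN_ENDPOINT="$KEYCLOAK_URL/realms/platform/protocol/openid-connect/token"
echo "$TOKEN_ENDPOINT"

# Machine-to-machine flow: exchange client credentials for an access token.
curl -s -m 10 "$TOKEN_ENDPOINT" \
  -d grant_type=client_credentials \
  -d client_id=mlflow \
  -d client_secret="$CLIENT_SECRET" \
  || echo "Keycloak not reachable from here"
```

Services then send the returned `access_token` as a `Bearer` header on every API call.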
---

## 10. Observability Stack (LGTM)

### Components

* Grafana
* Prometheus
* Loki
* Tempo
* Grafana Alloy
* kube-state-metrics
* node-exporter

### Capabilities

* Metrics, logs, traces
* Automatic correlation
* OTLP-native
* Local SSD persistence
---

## 11. Design Rules for All Custom Services

All services must:

* be stateless
* use env vars & Kubernetes Secrets
* authenticate via Keycloak
* emit:
  * Prometheus metrics
  * OTLP traces
  * structured JSON logs
* be deployable via kubectl & Argo CD
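The "structured JSON logs" rule is cheap to satisfy from any language; even a shell entrypoint can emit one JSON object per line on stdout, which Alloy ships to Loki unchanged and Loki's `json` pipeline stage can then parse. A minimal sketch:

```shell
#!/bin/sh
# Emit one JSON object per line: timestamp, level, message.
log() {
  level=$1; shift
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

log info "service started"
log error "upstream timeout"
```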
---

## 12. Future Use Cases (Beyond Hadith)

This platform can support:

* General Knowledge Graph AI
* Legal / scholarly document analysis
* Enterprise RAG systems
* Research data platforms
* Explainable AI systems
* Internal search engines
* Agent-based systems
* Provenance & trust scoring engines
* Digital humanities projects
* Offline sovereign AI deployments
---

## test-loki-logs.sh

```bash
#!/bin/bash

set -e

GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Loki Log Collection Test${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

PASS=0
FAIL=0

# Test 1: Check Alloy DaemonSet
echo -e "${YELLOW}Test 1: Checking Alloy DaemonSet...${NC}"
if kubectl get pods -n observability -l app=alloy --no-headers 2>/dev/null | grep -q "Running"; then
    ALLOY_COUNT=$(kubectl get pods -n observability -l app=alloy --no-headers | grep -c "Running")
    echo -e "${GREEN}✓ Alloy is running ($ALLOY_COUNT pod(s))${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Alloy is not running${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 2: Check Loki pod
echo -e "${YELLOW}Test 2: Checking Loki pod...${NC}"
if kubectl get pods -n observability -l app=loki --no-headers 2>/dev/null | grep -q "Running"; then
    echo -e "${GREEN}✓ Loki is running${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Loki is not running${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 3: Test Loki readiness endpoint
echo -e "${YELLOW}Test 3: Testing Loki readiness endpoint...${NC}"
READY=$(kubectl run test-loki-ready-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 5 http://loki.observability.svc.cluster.local:3100/ready 2>/dev/null || echo "failed")

if [ "$READY" = "ready" ]; then
    echo -e "${GREEN}✓ Loki is ready${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Loki is not ready (response: $READY)${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 4: Check Alloy can connect to Loki
echo -e "${YELLOW}Test 4: Checking Alloy → Loki connectivity...${NC}"
ALLOY_ERRORS=$(kubectl logs -n observability -l app=alloy --tail=50 2>/dev/null | grep -i "error.*loki" | wc -l)
if [ "$ALLOY_ERRORS" -eq 0 ]; then
    echo -e "${GREEN}✓ No Alloy → Loki connection errors${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Found $ALLOY_ERRORS error(s) in Alloy logs${NC}"
    # '|| true' keeps 'set -e' from aborting when the shorter tail has no match
    kubectl logs -n observability -l app=alloy --tail=20 | grep -i error || true
    FAIL=$((FAIL+1))
fi
echo ""

# Test 5: Create test pod and verify logs
echo -e "${YELLOW}Test 5: Creating test pod and verifying log collection...${NC}"

# Clean up any existing test pod
kubectl delete pod test-logger-verify --ignore-not-found 2>/dev/null

# Create test pod
echo " Creating test pod that logs every second..."
kubectl run test-logger-verify --image=busybox --restart=Never -- sh -c \
    'for i in 1 2 3 4 5 6 7 8 9 10; do echo "LOKI-TEST-LOG: Message number $i at $(date)"; sleep 1; done' \
    >/dev/null 2>&1

# Wait for pod to start and generate logs
echo " Waiting 15 seconds for logs to be collected..."
sleep 15

# Query Loki API for test logs
echo " Querying Loki for test logs..."
START_TIME=$(date -u -d '2 minutes ago' +%s)000000000
END_TIME=$(date -u +%s)000000000

QUERY_RESULT=$(kubectl run test-loki-query-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={pod="test-logger-verify"}' \
    --data-urlencode "start=$START_TIME" \
    --data-urlencode "end=$END_TIME" 2>/dev/null || echo "failed")

if echo "$QUERY_RESULT" | grep -q "LOKI-TEST-LOG"; then
    LOG_COUNT=$(echo "$QUERY_RESULT" | grep -o "LOKI-TEST-LOG" | wc -l)
    echo -e "${GREEN}✓ Found $LOG_COUNT test log messages in Loki${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Test logs not found in Loki${NC}"
    echo " Response: ${QUERY_RESULT:0:200}"
    FAIL=$((FAIL+1))
fi

# Clean up test pod
kubectl delete pod test-logger-verify --ignore-not-found >/dev/null 2>&1

echo ""

# Test 6: Check observability namespace logs
echo -e "${YELLOW}Test 6: Checking for observability namespace logs...${NC}"

OBS_QUERY=$(kubectl run test-loki-obs-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={namespace="observability"}' \
    --data-urlencode "start=$START_TIME" \
    --data-urlencode "end=$END_TIME" \
    --data-urlencode "limit=10" 2>/dev/null || echo "failed")

if echo "$OBS_QUERY" | grep -q '"values":\[\['; then
    echo -e "${GREEN}✓ Observability namespace logs found in Loki${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ No logs found for observability namespace${NC}"
    FAIL=$((FAIL+1))
fi

echo ""
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Test Results${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

TOTAL=$((PASS+FAIL))
echo -e "Passed: ${GREEN}$PASS${NC} / $TOTAL"
echo -e "Failed: ${RED}$FAIL${NC} / $TOTAL"
echo ""

if [ $FAIL -eq 0 ]; then
    echo -e "${GREEN}✓✓✓ All tests passed! Logs are flowing to Loki! ✓✓✓${NC}"
    echo ""
    echo "Next steps:"
    echo " 1. Open Grafana: https://grafana.betelgeusebytes.io"
    echo " 2. Go to Explore → Loki"
    echo " 3. Query: {namespace=\"observability\"}"
    echo ""
else
    echo -e "${RED}✗✗✗ Some tests failed. Check the output above for details. ✗✗✗${NC}"
    echo ""
    echo "Troubleshooting:"
    echo " - Check Alloy logs: kubectl logs -n observability -l app=alloy"
    echo " - Check Loki logs: kubectl logs -n observability loki-0"
    echo " - Verify services: kubectl get svc -n observability"
    echo " - See full guide: VERIFY-LOKI-LOGS.md"
    echo ""
    exit 1
fi
```