Add observability stack and supporting scripts

- Introduced combine.sh script to aggregate .txt, .py, .yml, .yaml, .ini files into betelgeusebytes.txt.
- Updated Loki configuration to disable retention settings.
- Modified Tempo configuration to change storage paths from /tmp to /var.
- Refactored Alloy configuration to streamline Prometheus integration and removed unnecessary metrics export.
- Enhanced RBAC permissions to include pod log access.
- Added security context to Tempo deployment for improved security.
- Preserved the previous README as README_old.md.
- Developed me.md as an authoritative guide for the AI infrastructure stack.
- Implemented test-loki-logs.sh script to validate Loki log collection and connectivity.
salah 2026-01-28 11:07:16 +01:00
parent dfdd36db3f
commit 404deb1d52
19 changed files with 7171 additions and 69 deletions

ARCHITECTURE.md

@ -0,0 +1,93 @@
# BetelgeuseBytes Architecture Overview
## High-Level Architecture
This platform is a **self-hosted, production-grade Kubernetes stack** designed for:
* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling
The architecture follows a **hub-and-spoke model**:
* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS
---
## Logical Architecture Diagram (Textual)
```
Internet
DNS (betelgeusebytes.io)
Ingress-NGINX (TLS via cert-manager)
├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
├── ML UIs (Jupyter, Label Studio, MLflow)
├── Automation (n8n, Argo)
└── APIs (Postgres TCP, Neo4j Bolt, Kafka)
Kubernetes Cluster
├── Control Plane
├── Worker Nodes
├── Stateful Workloads (local SSD)
└── Observability Stack
```
---
## Key Design Principles
* **Baremetal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**
---
## Networking
* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services exposed via Ingress patch (Postgres, Neo4j Bolt)
* WireGuard mesh between nodes
---
## Security Model
* TLS everywhere (cert-manager + Let's Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO
---
## Scalability Notes
* Currently single control-plane + workers
* Designed to add:
* More workers
* Dedicated control-plane VPS nodes
* GPU nodes (for vLLM / training)
---
## What This Enables
* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Production observability-first apps

DEPLOYMENT.md

@ -0,0 +1,46 @@
# Deployment & Operations Guide
## Deployment Model
* Declarative Kubernetes manifests
* Applied via `kubectl` or Argo CD
* No Helm dependency
---
## General Rules
* Stateless apps by default
* PVCs required for state
* Secrets via Kubernetes Secrets
* Config via environment variables
---
## Deployment Order (Recommended)
1. Networking (Cilium, Ingress)
2. cert-manager
3. Storage (PVs)
4. Databases (Postgres, Redis, Kafka)
5. Observability stack
6. ML tooling
7. Automation tools
8. Custom applications
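The order above can be scripted. A minimal sketch, assuming manifests are grouped in per-layer directories — the `k8s/<layer>/` paths are hypothetical placeholders, not the repo's actual layout (see README_old.md for the real apply sequence). With `DRY_RUN=1` it only prints the plan:

```shell
#!/bin/bash
# Sketch: drive the recommended deployment order from a list.
# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
layers=(
  networking cert-manager storage databases
  observability ml automation apps
)
plan=""
for layer in "${layers[@]}"; do
  cmd="kubectl apply -f k8s/$layer/"
  plan+="$cmd"$'\n'
  [ "$DRY_RUN" = 1 ] || $cmd   # only executed when DRY_RUN is unset/0
done
printf '%s' "$plan"
```

Keeping the order in one list makes it easy to reuse the same sequence for teardown (reversed) or for an Argo CD sync-wave annotation scheme.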
---
## Operations
* Monitor via Grafana
* Debug via logs & traces
* Upgrade via Git commits
* Rollback via Argo CD
---
## Backup Strategy
* MinIO buckets versioned
* Database snapshots
* Git repositories mirrored

FUTURE-PROJECTS.md

@ -0,0 +1,34 @@
# Future Use Cases & Projects
This platform is intentionally **general-purpose**.
## AI & ML
* RAG platforms
* Offline assistants
* Agent systems
* NLP research
## Knowledge Graphs
* Academic citation graphs
* Trust & provenance systems
* Dependency analysis
## Data Platforms
* Event-driven ETL
* Feature stores
* Research data lakes
## Observability & Ops
* Internal platform monitoring
* Security analytics
* Audit systems
## Sovereign Deployments
* On-prem AI for enterprises
* NGO / government tooling
* Privacy-preserving analytics

INFRASTRUCTURE.md

@ -0,0 +1,102 @@
# BetelgeuseBytes Infrastructure & Cluster Configuration
## Hosting Provider
* **Provider**: Hetzner
* **Server Type**: Dedicated servers
* **Region**: EU
* **Network**: Private LAN + WireGuard
---
## Nodes
### Current Nodes
| Node | Role | Notes |
| --------- | ---------------------- | ------------------- |
| hetzner-1 | control-plane + worker | runs core workloads |
| hetzner-2 | worker + storage | hosts local SSD PVs |
---
## Kubernetes Setup
* Kubernetes installed via kubeadm
* Single cluster
* Control plane is also schedulable
### CNI
* **Cilium**
* eBPF dataplane
* kube-proxy replacement
* Network policy support
---
## Storage
### Persistent Volumes
* Backed by **local NVMe / SSD**
* Manually provisioned PVs
* Bound via PVCs
### Storage Layout
```
/mnt/local-ssd/
├── postgres/
├── neo4j/
├── elasticsearch/
├── prometheus/
├── loki/
├── tempo/
├── grafana/
├── minio/
└── qdrant/
```
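Provisioning this layout is a one-liner loop; a sketch with the root parameterized so it can be tried outside the node (`STORAGE_ROOT` is an assumption of this sketch — the real nodes use `/mnt/local-ssd`):

```shell
#!/bin/bash
# Sketch: create the per-service storage layout shown above.
# Defaults to a throwaway temp dir; set STORAGE_ROOT=/mnt/local-ssd on a node.
ROOT=${STORAGE_ROOT:-$(mktemp -d)}
services=(postgres neo4j elasticsearch prometheus loki tempo grafana minio qdrant)
for svc in "${services[@]}"; do
  mkdir -p "$ROOT/$svc"
done
echo "created ${#services[@]} directories under $ROOT"
```

Each directory then backs one manually provisioned PV, bound to its workload via a PVC with a matching `storageClassName` and node affinity.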
---
## Networking
* Ingress Controller: nginx
* External DNS records → ingress IP
* TCP mappings for:
* PostgreSQL
* Neo4j Bolt
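TCP exposure through ingress-nginx is typically done via its `tcp-services` ConfigMap. A hedged sketch — the namespace/service names follow this document's tables, but the repo's actual patch may differ:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # "<external port>": "<namespace>/<service>:<service port>"
  "5432": "db/postgres:5432"
  "7687": "graph/neo4j:7687"
```

The controller must also be started with `--tcp-services-configmap=ingress-nginx/tcp-services`, and the ports opened on its Service.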
---
## TLS & Certificates
* cert-manager
* ClusterIssuer: Let's Encrypt
* Automatic renewal
---
## Namespaces
| Namespace | Purpose |
| ------------- | ---------------------------------- |
| db | Databases (Postgres, Redis) |
| graph | Neo4j |
| broker | Kafka |
| ml | ML tooling (Jupyter, Argo, MLflow) |
| observability | Grafana, Prometheus, Loki, Tempo |
| automation | n8n |
| devops | Gitea, Argo CD |
---
## What This Infra Enables
* Full on-prem AI platform
* Predictable performance
* Low-latency data access
* Independence from cloud providers

OBSERVABILITY.md

@ -0,0 +1,32 @@
# 🔭 Observability Stack
---
## Components
- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter
---
## Capabilities
- Logs ↔ traces ↔ metrics correlation
- OTLP-native instrumentation
- Centralized dashboards
- Alerting-ready
---
## Instrumentation Rules
All apps must:
- expose `/metrics`
- emit structured JSON logs
- export OTLP traces
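The structured-log rule in shell form — a minimal sketch of one-JSON-object-per-line logging; the field names (`ts`, `level`, `msg`) are a common convention assumed here, not mandated by the stack:

```shell
#!/bin/sh
# Sketch: emit a structured JSON log line per event.
# Note: printf does no JSON escaping, so this only suits simple messages.
log_json() {
  _level=$1; shift
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$_level" "$*"
}
line=$(log_json info "service started")
echo "$line"
```

Lines in this shape are parsed by Alloy/Loki without extra pipeline stages, and the `level` label can drive Grafana log filtering.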

README.md

@ -1,43 +1,123 @@
# 🧠 BetelgeuseBytes AI Platform — Documentation
This documentation describes a **self-hosted, CPU-first AI platform** running on Kubernetes,
designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.
## 📚 Documentation Index
- [Architecture](ARCHITECTURE.md)
- [Infrastructure](INFRASTRUCTURE.md)
- [Full Stack Overview](STACK.md)
- [Deployment & Operations](DEPLOYMENT.md)
- [Observability](OBSERVABILITY.md)
- [Roadmap & Next Steps](ROADMAP.md)
- [Future Projects & Use Cases](FUTURE-PROJECTS.md)
## 🎯 Current Focus
- Hadith sanad & matn extraction
- Narrator relationship modeling
- Knowledge graph construction
- Human-in-the-loop verification
- Explainable, sovereign AI
## 🧠 What each document gives you
### ARCHITECTURE
- Logical system architecture
- Data & control flow
- Networking and security model
- Design principles (CPU-first, sovereign, observable)
- What the architecture enables long-term
This is what you show to **architects and senior engineers.**
### INFRASTRUCTURE
- Hetzner setup (dedicated, CPU-only, SSD)
- Node roles and responsibilities
- Kubernetes topology
- Cilium networking
- Storage layout on disk
- Namespaces and isolation strategy
This is what you show to **ops / SRE / infra people.**
### STACK
- Exhaustive list of every deployed component
- Grouped by domain:
- Core platform
- Databases & messaging
- Knowledge & vectors
- ML & AI
- Automation & DevOps
- Observability
- Authentication
For each: **what it does now + what it can be reused for**
This is the **master mental model** of your platform.
### DEPLOYMENT
- How the platform is deployed (kubectl + GitOps)
- Deployment order
- Operational rules
- Backup strategy
- Day-2 operations mindset
This is your ***runbook starter.***
### ROADMAP
- Clear technical phases:
- Neo4j isnād schema
- Authenticity scoring
- Productization
- Scaling (GPU, multi-project)
This keeps the project ***directionally sane.***
### FUTURE-PROJECTS
- Explicitly documents that this is **not just a Hadith stack**
- Lists realistic reuse cases:
- RAG
- Knowledge graphs
- Sovereign AI
- Digital humanities
- Research platforms
This justifies the ***investment in infra quality.***

README_old.md

@ -0,0 +1,43 @@
# BetelgeuseBytes K8s — Full Stack (kubectl-only)
**Nodes**
- Control-plane + worker: hetzner-1 (95.217.89.53)
- Worker: hetzner-2 (138.201.254.97)
## Bring up the cluster
```bash
ansible -i ansible/inventories/prod/hosts.ini all -m ping
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```
## Apply apps (edit secrets first)
```bash
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/01-secrets/
kubectl apply -f k8s/storage/storageclass.yaml
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```
## DNS
A records:
- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97
CNAMEs → apps.betelgeusebytes.io:
- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.
(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.

ROADMAP.md

@ -0,0 +1,26 @@
# Roadmap & Next Steps
## Phase 1: Knowledge Modeling
* Design Neo4j isnād schema
* Identity resolution
* Relationship typing
## Phase 2: Authenticity Scoring
* Chain continuity analysis
* Narrator reliability
* Graph-based scoring
* LLM-assisted reasoning
## Phase 3: Productization
* Admin dashboards
* APIs
* Provenance visualization
## Phase 4: Scale & Extend
* GPU nodes
* vLLM integration
* Multi-project tenancy

STACK.md

@ -0,0 +1,153 @@
# 🧠 BetelgeuseBytes Full Stack Catalog
This document lists **every major component deployed in the cluster**, what it is used for today, and what it can be reused for.
---
## Core Platform
| Component | Namespace | Purpose | Reuse |
| ------------- | ------------- | --------------- | --------------- |
| Kubernetes | all | Orchestration | Any platform |
| Cilium | kube-system | Networking | Secure clusters |
| NGINX Ingress | ingress-nginx | Traffic routing | API gateway |
| cert-manager | cert-manager | TLS automation | PKI |
---
## Databases & Messaging
| Component | URL / Access | Purpose | Reuse |
| ------------- | --------------- | --------------- | ---------------- |
| PostgreSQL | TCP via Ingress | Relational DB | App backends |
| Redis | internal | Cache | Queues |
| Kafka | kafka-ui UI | Event streaming | Streaming ETL |
| Elasticsearch | Kibana UI | Search + logs | Full-text search |
---
## Knowledge & Vector
| Component | URL | Purpose | Reuse |
| --------- | ------------------------- | --------------- | --------------- |
| Neo4j | neo4j.betelgeusebytes.io | Knowledge graph | Graph analytics |
| Qdrant | vector.betelgeusebytes.io | Vector search | RAG |
---
## ML & AI
| Component | URL | Purpose | Reuse |
| ------------ | ----------------------------- | --------------- | ---------------- |
| Jupyter | notebook UI | Experiments | Research |
| Label Studio | label.betelgeusebytes.io | Annotation | Dataset creation |
| MLflow | mlflow.betelgeusebytes.io | Model tracking | MLOps |
| Ollama / LLM | llm.betelgeusebytes.io | LLM inference | Agents |
| Embeddings | embeddings.betelgeusebytes.io | Text embeddings | Semantic search |
---
## Automation & DevOps
| Component | URL | Purpose | Reuse |
| -------------- | ----------------------- | ------------------- | ----------- |
| Argo Workflows | argo.betelgeusebytes.io | Pipelines | ETL |
| Argo CD | argocd UI | GitOps | CI/CD |
| Gitea | gitea UI | Git hosting | SCM |
| n8n | automation UI | Workflow automation | Integration |
---
## Observability (LGTM)
| Component | Purpose | Reuse |
| ---------- | --------------- | ---------------------- |
| Grafana | Dashboards | Ops center |
| Prometheus | Metrics | Monitoring |
| Loki | Logs | Debugging |
| Tempo | Traces | Distributed tracing |
| Alloy | Telemetry agent | Standardized telemetry |
---
## Authentication
| Component | Purpose | Reuse |
| --------- | ---------- | ----- |
| Keycloak | OIDC / SSO | IAM |
---
## Why This Stack Matters
* Covers **data → ML → serving → observability** end-to-end
* Suitable for research **and** production
* Modular and future-proof
# 📚 Stack Catalog — Services, URLs, Access & Usage
This document lists **every deployed component**, how to access it,
what it is used for **now**, and what it enables **in the future**.
---
## 🌐 Public Services (Ingress / HTTPS)
| Component | URL | Auth | What It Is | Current Usage | Future Usage |
|--------|-----|------|------------|---------------|--------------|
| LLM Inference | https://llm.betelgeusebytes.io | none / internal | CPU LLM server (Ollama / llama.cpp) | Extract sanad & matn as JSON | Agents, doc AI, RAG |
| Embeddings | https://embeddings.betelgeusebytes.io | none / internal | Text Embeddings Inference (HF) | Hadith & bio embeddings | Semantic search |
| Vector DB | https://vector.betelgeusebytes.io | none | Qdrant + UI | Similarity search | Recommendations |
| Graph DB | https://neo4j.betelgeusebytes.io | Basic Auth | Neo4j Browser | Isnād graph | Knowledge graphs |
| Orchestrator | https://hadith-api.betelgeusebytes.io | OIDC | FastAPI router | Core AI API | Any AI backend |
| Admin UI | https://hadith-admin.betelgeusebytes.io | OIDC | Next.js UI | Scholar review | Any internal tool |
| Labeling | https://label.betelgeusebytes.io | Local / OIDC | Label Studio | NER/RE annotation | Dataset curation |
| ML Tracking | https://mlflow.betelgeusebytes.io | OIDC | MLflow UI | Experiments & models | Governance |
| Object Storage | https://minio.betelgeusebytes.io | Access key | MinIO Console | Datasets & artifacts | Data lake |
| Pipelines | https://argo.betelgeusebytes.io | SA / OIDC | Argo Workflows UI | ML pipelines | ETL |
| Auth | https://auth.betelgeusebytes.io | Admin login | Keycloak | SSO & tokens | IAM |
| Observability | https://grafana.betelgeusebytes.io | Login | Grafana | Metrics/logs/traces | Ops center |
---
## 🔐 Authentication & Access Summary
| System | Auth Method | Who Uses It |
|-----|------------|-------------|
| Keycloak | Username / Password | Admins |
| Admin UI | OIDC (Keycloak) | Scholars |
| Orchestrator API | OIDC Bearer Token | Apps |
| MLflow | OIDC | ML engineers |
| Label Studio | Local / OIDC | Annotators |
| Neo4j | Basic Auth | Engineers |
| MinIO | Access / Secret key | Pipelines |
| Grafana | Login | Operators |
---
## 🧠 Internal Cluster Services (ClusterIP)
| Component | Namespace | Purpose |
|--------|-----------|--------|
| PostgreSQL | db | Relational storage |
| Redis | db | Cache / temp state |
| Kafka | broker | Event backbone |
| Prometheus | observability | Metrics |
| Loki | observability | Logs |
| Tempo | observability | Traces |
| Alloy | observability | Telemetry agent |
---
## 🗂 Storage Responsibilities
| Storage | Used By | Contains |
|------|--------|---------|
| MinIO | Pipelines, MLflow | Datasets, models |
| Neo4j PVC | Graph DB | Isnād graph |
| Qdrant PVC | Vector DB | Embeddings |
| PostgreSQL PVC | DB | Metadata |
| Observability PVCs | LGTM | Logs, metrics, traces |

betelgeusebytes.txt

File diff suppressed because it is too large

combine.sh

@ -0,0 +1,5 @@
#!/bin/bash
# Aggregate text/config/source files into betelgeusebytes.txt.
# \( ... \) is required so -type f applies to every -name pattern,
# and the output file is excluded so it never aggregates itself.
: > betelgeusebytes.txt
find . -type f \( -name "*.txt" -o -name "*.py" -o -name "*.yml" -o -name "*.yaml" -o -name "*.YAML" -o -name "*.ini" \) ! -name "betelgeusebytes.txt" | while IFS= read -r file; do
  echo "=== $file ===" >> betelgeusebytes.txt
  cat "$file" >> betelgeusebytes.txt
  echo "" >> betelgeusebytes.txt
done
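A note on `find` semantics: `-o` binds looser than the implicit `-a`, so in an ungrouped chain `-type f` applies only to the first `-name` pattern. A self-contained demonstration with throwaway files (names here are hypothetical):

```shell
#!/bin/bash
# Demonstrate find operator precedence in a temp directory.
d=$(mktemp -d)
mkdir "$d/sub.py"              # a DIRECTORY whose name matches *.py
touch "$d/a.txt" "$d/b.py"
# Ungrouped: parsed as ( -type f -a -name "*.txt" ) -o ( -name "*.py" ),
# so the directory sub.py slips into the results.
ungrouped=$(find "$d" -type f -name "*.txt" -o -name "*.py" | wc -l)
# Grouped: -type f now applies to both patterns.
grouped=$(find "$d" -type f \( -name "*.txt" -o -name "*.py" \) | wc -l)
echo "ungrouped=$ungrouped grouped=$grouped"
rm -rf "$d"
```

The ungrouped count is one higher because the directory matches the bare `-name "*.py"` test.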


@ -43,12 +43,9 @@ data:
     compactor:
       working_directory: /loki/compactor
       compaction_interval: 10m
-      retention_enabled: true
-      retention_delete_delay: 2h
-      retention_delete_worker_count: 150
+      retention_enabled: false
     limits_config:
-      enforce_metric_name: false
       reject_old_samples: true
       reject_old_samples_max_age: 168h # 7 days
       retention_period: 168h # 7 days


@ -39,7 +39,7 @@ data:
         source: tempo
         cluster: betelgeuse-k8s
       storage:
-        path: /tmp/tempo/generator/wal
+        path: /var/tempo/generator/wal
       remote_write:
         - url: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
           send_exemplars: true
@ -48,17 +48,14 @@ data:
     trace:
       backend: local
       wal:
-        path: /tmp/tempo/wal
+        path: /var/tempo/wal
       local:
-        path: /tmp/tempo/blocks
+        path: /var/tempo/blocks
       pool:
         max_workers: 100
         queue_depth: 10000
-    querier:
-      frontend_worker:
-        frontend_address: tempo.observability.svc.cluster.local:9095
+    # Single instance mode - no need for frontend/querier split
     query_frontend:
       search:
         duration_slo: 5s


@ -124,7 +124,6 @@ data:
       output {
         traces  = [otelcol.exporter.otlp.tempo.input]
-        metrics = [otelcol.exporter.prometheus.metrics.input]
       }
     }
@ -138,22 +137,7 @@ data:
       }
     }
-    // Export OTLP metrics to Prometheus
-    otelcol.exporter.prometheus "metrics" {
-      forward_to = [prometheus.remote_write.local.receiver]
-    }
-    // Remote write to Prometheus
-    prometheus.remote_write "local" {
-      endpoint {
-        url = "http://prometheus.observability.svc.cluster.local:9090/api/v1/write"
-      }
-    }
     // Scrape local metrics (Alloy's own metrics)
-    prometheus.scrape "alloy" {
-      targets = [{
-        __address__ = "localhost:12345",
-      }]
-      forward_to = [prometheus.remote_write.local.receiver]
+    // Prometheus will scrape these via service discovery
+    prometheus.exporter.self "alloy" {
     }


@ -66,6 +66,7 @@ rules:
     - services
     - endpoints
     - pods
+    - pods/log
   verbs: ["get", "list", "watch"]
 - apiGroups:
   - extensions


@ -21,6 +21,11 @@ spec:
     spec:
       nodeSelector:
         kubernetes.io/hostname: hetzner-2
+      securityContext:
+        fsGroup: 10001
+        runAsGroup: 10001
+        runAsNonRoot: true
+        runAsUser: 10001
       containers:
       - name: tempo
         image: grafana/tempo:2.6.1
@ -70,7 +75,7 @@ spec:
         - name: tempo-config
           mountPath: /etc/tempo
         - name: tempo-data
-          mountPath: /tmp/tempo
+          mountPath: /var/tempo
       volumes:
       - name: tempo-config
         configMap:

me.md

@ -0,0 +1,388 @@
# 🧠 BetelgeuseBytes — Full AI Infrastructure Stack
## Authoritative README, Architecture & Onboarding Guide
This repository documents the **entire self-hosted AI infrastructure stack** running on a Kubernetes cluster hosted on **Hetzner dedicated servers**.
The stack currently powers an **Islamic Hadith Scholar AI**, but it is intentionally designed as a **general-purpose, sovereign AI, MLOps, and data platform** that can support many future projects.
This document is the **single source of truth** for:
- architecture (logical & physical)
- infrastructure configuration
- networking & DNS
- every deployed component
- why each component exists
- how to build new systems on top of the platform
---
## 1. Mission & Design Philosophy
### Current Mission
Build an AI system that can:
- Parse classical Islamic texts
- Extract **Sanad** (chains of narrators) and **Matn** (hadith text)
- Identify narrators and their relationships:
- teacher / student
- familial lineage
- Construct a **verifiable knowledge graph**
- Support **human scholarly review**
- Provide **transparent and explainable reasoning**
- Operate **fully on-prem**, CPU-first, without SaaS or GPU dependency
### Core Principles
- **Sovereignty** — no external cloud lock-in
- **Explainability** — graph + provenance, not black boxes
- **Human-in-the-loop** — scholars remain in control
- **Observability-first** — everything is measurable and traceable
- **Composable** — every part can be reused or replaced
---
## 2. Physical Infrastructure (Hetzner)
### Nodes
- **Provider:** Hetzner
- **Type:** Dedicated servers
- **Architecture:** x86_64
- **GPU:** None (CPU-only by design)
- **Storage:** Local NVMe / SSD
### Node Roles (Logical)
| Node Type | Responsibilities |
|---------|------------------|
| Control / Worker | Kubernetes control plane + workloads |
| Storage-heavy | Databases, MinIO, observability data |
| Compute-heavy | LLM inference, embeddings, pipelines |
> The cluster is intentionally **single-region and on-prem-like**, optimized for predictability and data locality.
---
## 3. Kubernetes Infrastructure Configuration
### Kubernetes
- Runtime for **all services**
- Namespaced isolation
- Explicit PersistentVolumeClaims
- Declarative configuration (GitOps)
### Namespaces (Conceptual)
| Namespace | Purpose |
|--------|--------|
| `ai` | LLMs, embeddings, labeling |
| `vec` | Vector database |
| `graph` | Knowledge graph |
| `db` | Relational databases |
| `storage` | Object storage |
| `mlops` | MLflow |
| `ml` | Argo Workflows |
| `auth` | Keycloak |
| `observability` | LGTM stack |
| `hadith` | Custom apps (orchestrator, UI) |
---
## 4. Networking & DNS
### Ingress
- **NGINX Ingress Controller**
- HTTPS termination at ingress
- Internal services communicate via ClusterIP
### TLS
- **cert-manager**
- Let's Encrypt
- Automatic renewal
### Public Endpoints
| URL | Service |
|----|--------|
| https://llm.betelgeusebytes.io | LLM inference (Ollama / llama.cpp) |
| https://embeddings.betelgeusebytes.io | Text Embeddings Inference |
| https://vector.betelgeusebytes.io | Qdrant + UI |
| https://neo4j.betelgeusebytes.io | Neo4j Browser |
| https://hadith-api.betelgeusebytes.io | FastAPI Orchestrator |
| https://hadith-admin.betelgeusebytes.io | Admin / Curation UI |
| https://label.betelgeusebytes.io | Label Studio |
| https://mlflow.betelgeusebytes.io | MLflow |
| https://minio.betelgeusebytes.io | MinIO Console |
| https://argo.betelgeusebytes.io | Argo Workflows |
| https://auth.betelgeusebytes.io | Keycloak |
| https://grafana.betelgeusebytes.io | Grafana |
---
## 5. Full Logical Architecture
```mermaid
flowchart LR
User --> AdminUI --> Orchestrator
Orchestrator --> LLM
Orchestrator --> TEI --> Qdrant
Orchestrator --> Neo4j
Orchestrator --> PostgreSQL
Orchestrator --> Redis
LabelStudio --> MinIO
MinIO --> ArgoWF --> MLflow
MLflow --> Models --> Orchestrator
Kafka --> ArgoWF
Alloy --> Prometheus --> Grafana
Alloy --> Loki --> Grafana
Alloy --> Tempo --> Grafana
```
## 6. AI & Reasoning Layer
### Ollama / llama.cpp (CPU LLM)
**Current usage**
- JSON-structured extraction
- Sanad / matn reasoning
- Deterministic outputs
- No GPU dependency
**Future usage**
- Offline assistants
- Document intelligence
- Agent frameworks
Replaceable by vLLM when GPUs are added.
### Text Embeddings Inference (TEI)
**Current usage**
- Embeddings for hadith texts and biographies
**Future usage**
- RAG systems
- Semantic search
- Deduplication
- Similarity clustering
### Qdrant (Vector Database)
**Current usage**
- Stores embeddings
- Similarity search
**Future usage**
- Recommendation systems
- Agent memory
- Multimodal retrieval
Includes a web UI.
## 7. Knowledge & Data Layer
### Neo4j (Graph Database)
**Current usage**
- Isnād chains
- Narrator relationships
**Future usage**
- Knowledge graphs
- Trust networks
- Provenance systems
### PostgreSQL
**Current usage**
- App data
- MLflow backend
- Label Studio DB
**Future usage**
- Feature stores
- Metadata catalogs
- Transactional apps
### Redis
**Current usage**
- Caching
- Temporary state
**Future usage**
- Job queues
- Rate limiting
- Sessions
### Kafka
**Current usage**
- Optional async backbone
**Future usage**
- Streaming ingestion
- Event-driven ML
- Audit pipelines
### MinIO (S3)
**Current usage**
- Datasets
- Model artifacts
- Pipeline outputs
**Future usage**
- Data lake
- Backups
- Feature storage
## 8. MLOps & Human-in-the-Loop
### Label Studio
**Current usage**
- Human annotation of narrators & relations
**Future usage**
- Any labeling task (text, image, audio)
### MLflow
**Current usage**
- Experiment tracking
- Model registry
**Future usage**
- Governance
- Model promotion
- Auditing
### Argo Workflows
**Current usage**
- ETL & training pipelines
**Future usage**
- Batch inference
- Scheduled automation
- Data engineering
## 9. Authentication & Security
### Keycloak
**Current usage**
- SSO for Admin UI, MLflow, Label Studio
**Future usage**
- API authentication
- Multi-tenant access
- Organization-wide IAM
## 10. Observability Stack (LGTM)
### Components
- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter
### Capabilities
- Metrics, logs, traces
- Automatic correlation
- OTLP-native
- Local SSD persistence
## 11. Design Rules for All Custom Services
All services must:
- be stateless
- use env vars & Kubernetes Secrets
- authenticate via Keycloak
- emit:
  - Prometheus metrics
  - OTLP traces
  - structured JSON logs
- be deployable via kubectl & Argo CD
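To make the metrics rule concrete, here is a sketch of the Prometheus text exposition format every service's `/metrics` endpoint must serve — the metric name and labels are illustrative, not real platform metrics:

```shell
#!/bin/sh
# Sketch: a /metrics payload in Prometheus text exposition format.
metrics='# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027'
# A sample line must match: name{labels} value
echo "$metrics" | grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*\{.*\} [0-9.]+$' \
  && echo "exposition format ok"
```

Every HTTP framework has a client library that emits this format; the HELP/TYPE comment pair per metric is what Prometheus uses to classify the series.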
## 12. Future Use Cases (Beyond Hadith)
This platform can support:
- General knowledge graph AI
- Legal / scholarly document analysis
- Enterprise RAG systems
- Research data platforms
- Explainable AI systems
- Internal search engines
- Agent-based systems
- Provenance & trust scoring engines
- Digital humanities projects
- Offline sovereign AI deployments

test-loki-logs.sh

@ -0,0 +1,158 @@
#!/bin/bash
set -e
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Loki Log Collection Test${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
PASS=0
FAIL=0
# Test 1: Check Alloy DaemonSet
echo -e "${YELLOW}Test 1: Checking Alloy DaemonSet...${NC}"
if kubectl get pods -n observability -l app=alloy --no-headers 2>/dev/null | grep -q "Running"; then
ALLOY_COUNT=$(kubectl get pods -n observability -l app=alloy --no-headers | grep -c "Running")
echo -e "${GREEN}✓ Alloy is running ($ALLOY_COUNT pod(s))${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Alloy is not running${NC}"
FAIL=$((FAIL+1))
fi
echo ""
# Test 2: Check Loki pod
echo -e "${YELLOW}Test 2: Checking Loki pod...${NC}"
if kubectl get pods -n observability -l app=loki --no-headers 2>/dev/null | grep -q "Running"; then
echo -e "${GREEN}✓ Loki is running${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Loki is not running${NC}"
FAIL=$((FAIL+1))
fi
echo ""
# Test 3: Test Loki readiness endpoint
echo -e "${YELLOW}Test 3: Testing Loki readiness endpoint...${NC}"
READY=$(kubectl run test-loki-ready-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
curl -s -m 5 http://loki.observability.svc.cluster.local:3100/ready 2>/dev/null || echo "failed")
if [ "$READY" = "ready" ]; then
echo -e "${GREEN}✓ Loki is ready${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Loki is not ready (response: $READY)${NC}"
FAIL=$((FAIL+1))
fi
echo ""
# Test 4: Check Alloy can connect to Loki
echo -e "${YELLOW}Test 4: Checking Alloy → Loki connectivity...${NC}"
ALLOY_ERRORS=$(kubectl logs -n observability -l app=alloy --tail=50 2>/dev/null | grep -i "error.*loki" | wc -l)
if [ "$ALLOY_ERRORS" -eq 0 ]; then
echo -e "${GREEN}✓ No Alloy → Loki connection errors${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Found $ALLOY_ERRORS error(s) in Alloy logs${NC}"
kubectl logs -n observability -l app=alloy --tail=20 | grep -i error
FAIL=$((FAIL+1))
fi
echo ""
# Test 5: Create test pod and verify logs
echo -e "${YELLOW}Test 5: Creating test pod and verifying log collection...${NC}"
# Clean up any existing test pod
kubectl delete pod test-logger-verify --ignore-not-found 2>/dev/null
# Create test pod
echo " Creating test pod that logs every second..."
kubectl run test-logger-verify --image=busybox --restart=Never -- sh -c \
'for i in 1 2 3 4 5 6 7 8 9 10; do echo "LOKI-TEST-LOG: Message number $i at $(date)"; sleep 1; done' \
>/dev/null 2>&1
# Wait for pod to start and generate logs
echo " Waiting 15 seconds for logs to be collected..."
sleep 15
# Query Loki API for test logs
echo " Querying Loki for test logs..."
START_TIME=$(date -u -d '2 minutes ago' +%s)000000000
END_TIME=$(date -u +%s)000000000
QUERY_RESULT=$(kubectl run test-loki-query-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
--data-urlencode 'query={pod="test-logger-verify"}' \
--data-urlencode "start=$START_TIME" \
--data-urlencode "end=$END_TIME" 2>/dev/null || echo "failed")
if echo "$QUERY_RESULT" | grep -q "LOKI-TEST-LOG"; then
LOG_COUNT=$(echo "$QUERY_RESULT" | grep -o "LOKI-TEST-LOG" | wc -l)
echo -e "${GREEN}✓ Found $LOG_COUNT test log messages in Loki${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Test logs not found in Loki${NC}"
echo " Response: ${QUERY_RESULT:0:200}"
FAIL=$((FAIL+1))
fi
# Clean up test pod
kubectl delete pod test-logger-verify --ignore-not-found >/dev/null 2>&1
echo ""
# Test 6: Check observability namespace logs
echo -e "${YELLOW}Test 6: Checking for observability namespace logs...${NC}"
OBS_QUERY=$(kubectl run test-loki-obs-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
--data-urlencode 'query={namespace="observability"}' \
--data-urlencode "start=$START_TIME" \
--data-urlencode "end=$END_TIME" \
--data-urlencode "limit=10" 2>/dev/null || echo "failed")
if echo "$OBS_QUERY" | grep -q '"values":\[\['; then
echo -e "${GREEN}✓ Observability namespace logs found in Loki${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ No logs found for observability namespace${NC}"
FAIL=$((FAIL+1))
fi
echo ""
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Test Results${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
TOTAL=$((PASS+FAIL))
echo -e "Passed: ${GREEN}$PASS${NC} / $TOTAL"
echo -e "Failed: ${RED}$FAIL${NC} / $TOTAL"
echo ""
if [ $FAIL -eq 0 ]; then
echo -e "${GREEN}✓✓✓ All tests passed! Logs are flowing to Loki! ✓✓✓${NC}"
echo ""
echo "Next steps:"
echo " 1. Open Grafana: https://grafana.betelgeusebytes.io"
echo " 2. Go to Explore → Loki"
echo " 3. Query: {namespace=\"observability\"}"
echo ""
else
echo -e "${RED}✗✗✗ Some tests failed. Check the output above for details. ✗✗✗${NC}"
echo ""
echo "Troubleshooting:"
echo " - Check Alloy logs: kubectl logs -n observability -l app=alloy"
echo " - Check Loki logs: kubectl logs -n observability loki-0"
echo " - Verify services: kubectl get svc -n observability"
echo " - See full guide: VERIFY-LOKI-LOGS.md"
echo ""
exit 1
fi