Add observability stack and supporting scripts

- Introduced combine.sh script to aggregate .txt, .py, .yml, .yaml, .ini files into betelgeusebytes.txt.
- Updated Loki configuration to disable retention settings.
- Modified Tempo configuration to change storage paths from /tmp to /var.
- Refactored Alloy configuration to streamline Prometheus integration and removed unnecessary metrics export.
- Enhanced RBAC permissions to include pod log access.
- Added security context to Tempo deployment for improved security.
- Preserved the previous README as README_old.md.
- Developed me.md as an authoritative guide for the AI infrastructure stack.
- Implemented test-loki-logs.sh script to validate Loki log collection and connectivity.
salah 2026-01-28 11:07:16 +01:00
parent dfdd36db3f
commit 404deb1d52
19 changed files with 7171 additions and 69 deletions

ARCHITECTURE.md

@ -0,0 +1,93 @@
# BetelgeuseBytes Architecture Overview
## High-Level Architecture
This platform is a **self-hosted, production-grade Kubernetes stack** designed for:
* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling
The architecture follows a **hub-and-spoke model**:
* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS
---
## Logical Architecture Diagram (Textual)
```
Internet
DNS (betelgeusebytes.io)
Ingress-NGINX (TLS via cert-manager)
├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
├── ML UIs (Jupyter, Label Studio, MLflow)
├── Automation (n8n, Argo)
└── APIs (Postgres TCP, Neo4j Bolt, Kafka)
Kubernetes Cluster
├── Control Plane
├── Worker Nodes
├── Stateful Workloads (local SSD)
└── Observability Stack
```
---
## Key Design Principles
* **Baremetal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**
---
## Networking
* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services exposed via Ingress patch (Postgres, Neo4j Bolt)
* WireGuard mesh between nodes
---
## Security Model
* TLS everywhere (cert-manager + Let's Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO
---
## Scalability Notes
* Currently single control-plane + workers
* Designed to add:
* More workers
* Dedicated control-plane VPS nodes
* GPU nodes (for vLLM / training)
---
## What This Enables
* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Production observability-first apps

DEPLOYMENT.md

@ -0,0 +1,46 @@
# Deployment & Operations Guide
## Deployment Model
* Declarative Kubernetes manifests
* Applied via `kubectl` or Argo CD
* No Helm dependency
---
## General Rules
* Stateless apps by default
* PVCs required for state
* Secrets via Kubernetes Secrets
* Config via environment variables
---
## Deployment Order (Recommended)
1. Networking (Cilium, Ingress)
2. cert-manager
3. Storage (PVs)
4. Databases (Postgres, Redis, Kafka)
5. Observability stack
6. ML tooling
7. Automation tools
8. Custom applications
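The order above can be scripted. A minimal sketch, assuming manifests are grouped in per-layer directories — the `k8s/<layer>/` paths are hypothetical placeholders, not the repo's actual layout (see README_old.md for the real apply sequence). With `DRY_RUN=1` it only prints the plan:

```shell
#!/bin/bash
# Sketch: drive the recommended deployment order from a list.
# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
layers=(
  networking cert-manager storage databases
  observability ml automation apps
)
plan=""
for layer in "${layers[@]}"; do
  cmd="kubectl apply -f k8s/$layer/"
  plan+="$cmd"$'\n'
  [ "$DRY_RUN" = 1 ] || $cmd   # only executed when DRY_RUN is unset/0
done
printf '%s' "$plan"
```

Keeping the order in one list makes it easy to reuse the same sequence for teardown (reversed) or for an Argo CD sync-wave annotation scheme.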
---
## Operations
* Monitor via Grafana
* Debug via logs & traces
* Upgrade via Git commits
* Rollback via Argo CD
---
## Backup Strategy
* MinIO buckets versioned
* Database snapshots
* Git repositories mirrored

FUTURE-PROJECTS.md

@ -0,0 +1,34 @@
# Future Use Cases & Projects
This platform is intentionally **general-purpose**.
## AI & ML
* RAG platforms
* Offline assistants
* Agent systems
* NLP research
## Knowledge Graphs
* Academic citation graphs
* Trust & provenance systems
* Dependency analysis
## Data Platforms
* Event-driven ETL
* Feature stores
* Research data lakes
## Observability & Ops
* Internal platform monitoring
* Security analytics
* Audit systems
## Sovereign Deployments
* On-prem AI for enterprises
* NGO / government tooling
* Privacy-preserving analytics

INFRASTRUCTURE.md

@ -0,0 +1,102 @@
# BetelgeuseBytes Infrastructure & Cluster Configuration
## Hosting Provider
* **Provider**: Hetzner
* **Server Type**: Dedicated servers
* **Region**: EU
* **Network**: Private LAN + WireGuard
---
## Nodes
### Current Nodes
| Node | Role | Notes |
| --------- | ---------------------- | ------------------- |
| hetzner-1 | control-plane + worker | runs core workloads |
| hetzner-2 | worker + storage | hosts local SSD PVs |
---
## Kubernetes Setup
* Kubernetes installed via kubeadm
* Single cluster
* Control plane is also schedulable
### CNI
* **Cilium**
* eBPF dataplane
* kube-proxy replacement
* Network policy support
---
## Storage
### Persistent Volumes
* Backed by **local NVMe / SSD**
* Manually provisioned PVs
* Bound via PVCs
### Storage Layout
```
/mnt/local-ssd/
├── postgres/
├── neo4j/
├── elasticsearch/
├── prometheus/
├── loki/
├── tempo/
├── grafana/
├── minio/
└── qdrant/
```
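Provisioning this layout is a one-liner loop; a sketch with the root parameterized so it can be tried outside the node (`STORAGE_ROOT` is an assumption of this sketch — the real nodes use `/mnt/local-ssd`):

```shell
#!/bin/bash
# Sketch: create the per-service storage layout shown above.
# Defaults to a throwaway temp dir; set STORAGE_ROOT=/mnt/local-ssd on a node.
ROOT=${STORAGE_ROOT:-$(mktemp -d)}
services=(postgres neo4j elasticsearch prometheus loki tempo grafana minio qdrant)
for svc in "${services[@]}"; do
  mkdir -p "$ROOT/$svc"
done
echo "created ${#services[@]} directories under $ROOT"
```

Each directory then backs one manually provisioned PV, bound to its workload via a PVC with a matching `storageClassName` and node affinity.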
---
## Networking
* Ingress Controller: nginx
* External DNS records → ingress IP
* TCP mappings for:
* PostgreSQL
* Neo4j Bolt
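TCP exposure through ingress-nginx is typically done via its `tcp-services` ConfigMap. A hedged sketch — the namespace/service names follow this document's tables, but the repo's actual patch may differ:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # "<external port>": "<namespace>/<service>:<service port>"
  "5432": "db/postgres:5432"
  "7687": "graph/neo4j:7687"
```

The controller must also be started with `--tcp-services-configmap=ingress-nginx/tcp-services`, and the ports opened on its Service.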
---
## TLS & Certificates
* cert-manager
* ClusterIssuer: Let's Encrypt
* Automatic renewal
---
## Namespaces
| Namespace | Purpose |
| ------------- | ---------------------------------- |
| db | Databases (Postgres, Redis) |
| graph | Neo4j |
| broker | Kafka |
| ml | ML tooling (Jupyter, Argo, MLflow) |
| observability | Grafana, Prometheus, Loki, Tempo |
| automation | n8n |
| devops | Gitea, Argo CD |
---
## What This Infra Enables
* Full on-prem AI platform
* Predictable performance
* Low-latency data access
* Independence from cloud providers

OBSERVABILITY.md

@ -0,0 +1,32 @@
# 🔭 Observability Stack
---
## Components
- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter
---
## Capabilities
- Logs ↔ traces ↔ metrics correlation
- OTLP-native instrumentation
- Centralized dashboards
- Alerting-ready
---
## Instrumentation Rules
All apps must:
- expose `/metrics`
- emit structured JSON logs
- export OTLP traces
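The structured-log rule in shell form — a minimal sketch of one-JSON-object-per-line logging; the field names (`ts`, `level`, `msg`) are a common convention assumed here, not mandated by the stack:

```shell
#!/bin/sh
# Sketch: emit a structured JSON log line per event.
# Note: printf does no JSON escaping, so this only suits simple messages.
log_json() {
  _level=$1; shift
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$_level" "$*"
}
line=$(log_json info "service started")
echo "$line"
```

Lines in this shape are parsed by Alloy/Loki without extra pipeline stages, and the `level` label can drive Grafana log filtering.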

README.md

@ -1,43 +1,123 @@
# 🧠 BetelgeuseBytes AI Platform — Documentation
This documentation describes a **self-hosted, CPU-first AI platform** running on Kubernetes,
designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.
## 📚 Documentation Index
- [Architecture](ARCHITECTURE.md)
- [Infrastructure](INFRASTRUCTURE.md)
- [Full Stack Overview](STACK.md)
- [Deployment & Operations](DEPLOYMENT.md)
- [Observability](OBSERVABILITY.md)
- [Roadmap & Next Steps](ROADMAP.md)
- [Future Projects & Use Cases](FUTURE-PROJECTS.md)
## 🎯 Current Focus
- Hadith sanad & matn extraction
- Narrator relationship modeling
- Knowledge graph construction
- Human-in-the-loop verification
- Explainable, sovereign AI
## 🧠 What each document gives you
### ARCHITECTURE
- Logical system architecture
- Data & control flow
- Networking and security model
- Design principles (CPU-first, sovereign, observable)
- What the architecture enables long-term
This is what you show to **architects and senior engineers.**
### INFRASTRUCTURE
- Hetzner setup (dedicated, CPU-only, SSD)
- Node roles and responsibilities
- Kubernetes topology
- Cilium networking
- Storage layout on disk
- Namespaces and isolation strategy
This is what you show to **ops / SRE / infra people.**
### STACK
- Exhaustive list of every deployed component
- Grouped by domain:
- Core platform
- Databases & messaging
- Knowledge & vectors
- ML & AI
- Automation & DevOps
- Observability
- Authentication
For each: **what it does now + what it can be reused for**
This is the **master mental model** of your platform.
### DEPLOYMENT
- How the platform is deployed (kubectl + GitOps)
- Deployment order
- Operational rules
- Backup strategy
- Day-2 operations mindset
This is your ***runbook starter.***
### ROADMAP
- Clear technical phases:
- Neo4j isnād schema
- Authenticity scoring
- Productization
- Scaling (GPU, multi-project)
This keeps the project ***directionally sane.***
### FUTURE-PROJECTS
- Explicitly documents that this is **not just a Hadith stack**
- Lists realistic reuse cases:
- RAG
- Knowledge graphs
- Sovereign AI
- Digital humanities
- Research platforms
This justifies the ***investment in infra quality.***

README_old.md

@ -0,0 +1,43 @@
# BetelgeuseBytes K8s — Full Stack (kubectl-only)
**Nodes**
- Control-plane + worker: hetzner-1 (95.217.89.53)
- Worker: hetzner-2 (138.201.254.97)
## Bring up the cluster
```bash
ansible -i ansible/inventories/prod/hosts.ini all -m ping
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```
## Apply apps (edit secrets first)
```bash
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/01-secrets/
kubectl apply -f k8s/storage/storageclass.yaml
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```
## DNS
A records:
- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97
CNAMEs → apps.betelgeusebytes.io:
- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.
(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.

ROADMAP.md

@ -0,0 +1,26 @@
# Roadmap & Next Steps
## Phase 1: Knowledge Modeling
* Design Neo4j isnād schema
* Identity resolution
* Relationship typing
## Phase 2: Authenticity Scoring
* Chain continuity analysis
* Narrator reliability
* Graph-based scoring
* LLM-assisted reasoning
## Phase 3: Productization
* Admin dashboards
* APIs
* Provenance visualization
## Phase 4: Scale & Extend
* GPU nodes
* vLLM integration
* Multi-project tenancy

STACK.md

@ -0,0 +1,153 @@
# 🧠 BetelgeuseBytes Full Stack Catalog
This document lists **every major component deployed in the cluster**, what it is used for today, and what it can be reused for.
---
## Core Platform
| Component | Namespace | Purpose | Reuse |
| ------------- | ------------- | --------------- | --------------- |
| Kubernetes | all | Orchestration | Any platform |
| Cilium | kube-system | Networking | Secure clusters |
| NGINX Ingress | ingress-nginx | Traffic routing | API gateway |
| cert-manager | cert-manager | TLS automation | PKI |
---
## Databases & Messaging
| Component | URL / Access | Purpose | Reuse |
| ------------- | --------------- | --------------- | ---------------- |
| PostgreSQL | TCP via Ingress | Relational DB | App backends |
| Redis | internal | Cache | Queues |
| Kafka | kafka-ui UI | Event streaming | Streaming ETL |
| Elasticsearch | Kibana UI | Search + logs | Full-text search |
---
## Knowledge & Vector
| Component | URL | Purpose | Reuse |
| --------- | ------------------------- | --------------- | --------------- |
| Neo4j | neo4j.betelgeusebytes.io | Knowledge graph | Graph analytics |
| Qdrant | vector.betelgeusebytes.io | Vector search | RAG |
---
## ML & AI
| Component | URL | Purpose | Reuse |
| ------------ | ----------------------------- | --------------- | ---------------- |
| Jupyter | notebook UI | Experiments | Research |
| Label Studio | label.betelgeusebytes.io | Annotation | Dataset creation |
| MLflow | mlflow.betelgeusebytes.io | Model tracking | MLOps |
| Ollama / LLM | llm.betelgeusebytes.io | LLM inference | Agents |
| Embeddings | embeddings.betelgeusebytes.io | Text embeddings | Semantic search |
---
## Automation & DevOps
| Component | URL | Purpose | Reuse |
| -------------- | ----------------------- | ------------------- | ----------- |
| Argo Workflows | argo.betelgeusebytes.io | Pipelines | ETL |
| Argo CD | argocd UI | GitOps | CI/CD |
| Gitea | gitea UI | Git hosting | SCM |
| n8n | automation UI | Workflow automation | Integration |
---
## Observability (LGTM)
| Component | Purpose | Reuse |
| ---------- | --------------- | ---------------------- |
| Grafana | Dashboards | Ops center |
| Prometheus | Metrics | Monitoring |
| Loki | Logs | Debugging |
| Tempo | Traces | Distributed tracing |
| Alloy | Telemetry agent | Standardized telemetry |
---
## Authentication
| Component | Purpose | Reuse |
| --------- | ---------- | ----- |
| Keycloak | OIDC / SSO | IAM |
---
## Why This Stack Matters
* Covers **data → ML → serving → observability** end-to-end
* Suitable for research **and** production
* Modular and future-proof
# 📚 Stack Catalog — Services, URLs, Access & Usage
This document lists **every deployed component**, how to access it,
what it is used for **now**, and what it enables **in the future**.
---
## 🌐 Public Services (Ingress / HTTPS)
| Component | URL | Auth | What It Is | Current Usage | Future Usage |
|--------|-----|------|------------|---------------|--------------|
| LLM Inference | https://llm.betelgeusebytes.io | none / internal | CPU LLM server (Ollama / llama.cpp) | Extract sanad & matn as JSON | Agents, doc AI, RAG |
| Embeddings | https://embeddings.betelgeusebytes.io | none / internal | Text Embeddings Inference (HF) | Hadith & bio embeddings | Semantic search |
| Vector DB | https://vector.betelgeusebytes.io | none | Qdrant + UI | Similarity search | Recommendations |
| Graph DB | https://neo4j.betelgeusebytes.io | Basic Auth | Neo4j Browser | Isnād graph | Knowledge graphs |
| Orchestrator | https://hadith-api.betelgeusebytes.io | OIDC | FastAPI router | Core AI API | Any AI backend |
| Admin UI | https://hadith-admin.betelgeusebytes.io | OIDC | Next.js UI | Scholar review | Any internal tool |
| Labeling | https://label.betelgeusebytes.io | Local / OIDC | Label Studio | NER/RE annotation | Dataset curation |
| ML Tracking | https://mlflow.betelgeusebytes.io | OIDC | MLflow UI | Experiments & models | Governance |
| Object Storage | https://minio.betelgeusebytes.io | Access key | MinIO Console | Datasets & artifacts | Data lake |
| Pipelines | https://argo.betelgeusebytes.io | SA / OIDC | Argo Workflows UI | ML pipelines | ETL |
| Auth | https://auth.betelgeusebytes.io | Admin login | Keycloak | SSO & tokens | IAM |
| Observability | https://grafana.betelgeusebytes.io | Login | Grafana | Metrics/logs/traces | Ops center |
---
## 🔐 Authentication & Access Summary
| System | Auth Method | Who Uses It |
|-----|------------|-------------|
| Keycloak | Username / Password | Admins |
| Admin UI | OIDC (Keycloak) | Scholars |
| Orchestrator API | OIDC Bearer Token | Apps |
| MLflow | OIDC | ML engineers |
| Label Studio | Local / OIDC | Annotators |
| Neo4j | Basic Auth | Engineers |
| MinIO | Access / Secret key | Pipelines |
| Grafana | Login | Operators |
---
## 🧠 Internal Cluster Services (ClusterIP)
| Component | Namespace | Purpose |
|--------|-----------|--------|
| PostgreSQL | db | Relational storage |
| Redis | db | Cache / temp state |
| Kafka | broker | Event backbone |
| Prometheus | observability | Metrics |
| Loki | observability | Logs |
| Tempo | observability | Traces |
| Alloy | observability | Telemetry agent |
---
## 🗂 Storage Responsibilities
| Storage | Used By | Contains |
|------|--------|---------|
| MinIO | Pipelines, MLflow | Datasets, models |
| Neo4j PVC | Graph DB | Isnād graph |
| Qdrant PVC | Vector DB | Embeddings |
| PostgreSQL PVC | DB | Metadata |
| Observability PVCs | LGTM | Logs, metrics, traces |

betelgeusebytes.txt

File diff suppressed because it is too large

combine.sh

@ -0,0 +1,5 @@
#!/bin/bash
# Aggregate text/config/source files into betelgeusebytes.txt.
# \( ... \) is required so -type f applies to every -name pattern,
# and the output file is excluded so it never aggregates itself.
: > betelgeusebytes.txt
find . -type f \( -name "*.txt" -o -name "*.py" -o -name "*.yml" -o -name "*.yaml" -o -name "*.YAML" -o -name "*.ini" \) ! -name "betelgeusebytes.txt" | while IFS= read -r file; do
  echo "=== $file ===" >> betelgeusebytes.txt
  cat "$file" >> betelgeusebytes.txt
  echo "" >> betelgeusebytes.txt
done
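A note on `find` semantics: `-o` binds looser than the implicit `-a`, so in an ungrouped chain `-type f` applies only to the first `-name` pattern. A self-contained demonstration with throwaway files (names here are hypothetical):

```shell
#!/bin/bash
# Demonstrate find operator precedence in a temp directory.
d=$(mktemp -d)
mkdir "$d/sub.py"              # a DIRECTORY whose name matches *.py
touch "$d/a.txt" "$d/b.py"
# Ungrouped: parsed as ( -type f -a -name "*.txt" ) -o ( -name "*.py" ),
# so the directory sub.py slips into the results.
ungrouped=$(find "$d" -type f -name "*.txt" -o -name "*.py" | wc -l)
# Grouped: -type f now applies to both patterns.
grouped=$(find "$d" -type f \( -name "*.txt" -o -name "*.py" \) | wc -l)
echo "ungrouped=$ungrouped grouped=$grouped"
rm -rf "$d"
```

The ungrouped count is one higher because the directory matches the bare `-name "*.py"` test.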


@ -43,12 +43,9 @@ data:
     compactor:
       working_directory: /loki/compactor
       compaction_interval: 10m
-      retention_enabled: true
-      retention_delete_delay: 2h
-      retention_delete_worker_count: 150
+      retention_enabled: false
     limits_config:
-      enforce_metric_name: false
       reject_old_samples: true
       reject_old_samples_max_age: 168h # 7 days
       retention_period: 168h # 7 days


@ -39,7 +39,7 @@ data:
         source: tempo
         cluster: betelgeuse-k8s
       storage:
-        path: /tmp/tempo/generator/wal
+        path: /var/tempo/generator/wal
       remote_write:
         - url: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
           send_exemplars: true
@ -48,17 +48,14 @@ data:
     trace:
       backend: local
       wal:
-        path: /tmp/tempo/wal
+        path: /var/tempo/wal
       local:
-        path: /tmp/tempo/blocks
+        path: /var/tempo/blocks
       pool:
         max_workers: 100
         queue_depth: 10000
-    querier:
-      frontend_worker:
-        frontend_address: tempo.observability.svc.cluster.local:9095
+    # Single instance mode - no need for frontend/querier split
     query_frontend:
       search:
         duration_slo: 5s


@ -124,7 +124,6 @@ data:
       output {
         traces  = [otelcol.exporter.otlp.tempo.input]
-        metrics = [otelcol.exporter.prometheus.metrics.input]
       }
     }
@ -138,22 +137,7 @@ data:
       }
     }
-    // Export OTLP metrics to Prometheus
-    otelcol.exporter.prometheus "metrics" {
-      forward_to = [prometheus.remote_write.local.receiver]
-    }
-    // Remote write to Prometheus
-    prometheus.remote_write "local" {
-      endpoint {
-        url = "http://prometheus.observability.svc.cluster.local:9090/api/v1/write"
-      }
-    }
     // Scrape local metrics (Alloy's own metrics)
-    prometheus.scrape "alloy" {
-      targets = [{
-        __address__ = "localhost:12345",
-      }]
-      forward_to = [prometheus.remote_write.local.receiver]
+    // Prometheus will scrape these via service discovery
+    prometheus.exporter.self "alloy" {
     }


@ -66,6 +66,7 @@ rules:
     - services
     - endpoints
     - pods
+    - pods/log
   verbs: ["get", "list", "watch"]
 - apiGroups:
   - extensions


@ -21,6 +21,11 @@ spec:
     spec:
       nodeSelector:
         kubernetes.io/hostname: hetzner-2
+      securityContext:
+        fsGroup: 10001
+        runAsGroup: 10001
+        runAsNonRoot: true
+        runAsUser: 10001
       containers:
       - name: tempo
         image: grafana/tempo:2.6.1
@ -70,7 +75,7 @@ spec:
         - name: tempo-config
           mountPath: /etc/tempo
         - name: tempo-data
-          mountPath: /tmp/tempo
+          mountPath: /var/tempo
       volumes:
       - name: tempo-config
         configMap:

me.md

@ -0,0 +1,388 @@
# 🧠 BetelgeuseBytes — Full AI Infrastructure Stack
## Authoritative README, Architecture & Onboarding Guide
This repository documents the **entire self-hosted AI infrastructure stack** running on a Kubernetes cluster hosted on **Hetzner dedicated servers**.
The stack currently powers an **Islamic Hadith Scholar AI**, but it is intentionally designed as a **general-purpose, sovereign AI, MLOps, and data platform** that can support many future projects.
This document is the **single source of truth** for:
- architecture (logical & physical)
- infrastructure configuration
- networking & DNS
- every deployed component
- why each component exists
- how to build new systems on top of the platform
---
## 1. Mission & Design Philosophy
### Current Mission
Build an AI system that can:
- Parse classical Islamic texts
- Extract **Sanad** (chains of narrators) and **Matn** (hadith text)
- Identify narrators and their relationships:
- teacher / student
- familial lineage
- Construct a **verifiable knowledge graph**
- Support **human scholarly review**
- Provide **transparent and explainable reasoning**
- Operate **fully on-prem**, CPU-first, without SaaS or GPU dependency
### Core Principles
- **Sovereignty** — no external cloud lock-in
- **Explainability** — graph + provenance, not black boxes
- **Human-in-the-loop** — scholars remain in control
- **Observability-first** — everything is measurable and traceable
- **Composable** — every part can be reused or replaced
---
## 2. Physical Infrastructure (Hetzner)
### Nodes
- **Provider:** Hetzner
- **Type:** Dedicated servers
- **Architecture:** x86_64
- **GPU:** None (CPU-only by design)
- **Storage:** Local NVMe / SSD
### Node Roles (Logical)
| Node Type | Responsibilities |
|---------|------------------|
| Control / Worker | Kubernetes control plane + workloads |
| Storage-heavy | Databases, MinIO, observability data |
| Compute-heavy | LLM inference, embeddings, pipelines |
> The cluster is intentionally **single-region and on-prem-like**, optimized for predictability and data locality.
---
## 3. Kubernetes Infrastructure Configuration
### Kubernetes
- Runtime for **all services**
- Namespaced isolation
- Explicit PersistentVolumeClaims
- Declarative configuration (GitOps)
### Namespaces (Conceptual)
| Namespace | Purpose |
|--------|--------|
| `ai` | LLMs, embeddings, labeling |
| `vec` | Vector database |
| `graph` | Knowledge graph |
| `db` | Relational databases |
| `storage` | Object storage |
| `mlops` | MLflow |
| `ml` | Argo Workflows |
| `auth` | Keycloak |
| `observability` | LGTM stack |
| `hadith` | Custom apps (orchestrator, UI) |
---
## 4. Networking & DNS
### Ingress
- **NGINX Ingress Controller**
- HTTPS termination at ingress
- Internal services communicate via ClusterIP
### TLS
- **cert-manager**
- Let's Encrypt
- Automatic renewal
### Public Endpoints
| URL | Service |
|----|--------|
| https://llm.betelgeusebytes.io | LLM inference (Ollama / llama.cpp) |
| https://embeddings.betelgeusebytes.io | Text Embeddings Inference |
| https://vector.betelgeusebytes.io | Qdrant + UI |
| https://neo4j.betelgeusebytes.io | Neo4j Browser |
| https://hadith-api.betelgeusebytes.io | FastAPI Orchestrator |
| https://hadith-admin.betelgeusebytes.io | Admin / Curation UI |
| https://label.betelgeusebytes.io | Label Studio |
| https://mlflow.betelgeusebytes.io | MLflow |
| https://minio.betelgeusebytes.io | MinIO Console |
| https://argo.betelgeusebytes.io | Argo Workflows |
| https://auth.betelgeusebytes.io | Keycloak |
| https://grafana.betelgeusebytes.io | Grafana |
---
## 5. Full Logical Architecture
```mermaid
flowchart LR
User --> AdminUI --> Orchestrator
Orchestrator --> LLM
Orchestrator --> TEI --> Qdrant
Orchestrator --> Neo4j
Orchestrator --> PostgreSQL
Orchestrator --> Redis
LabelStudio --> MinIO
MinIO --> ArgoWF --> MLflow
MLflow --> Models --> Orchestrator
Kafka --> ArgoWF
Alloy --> Prometheus --> Grafana
Alloy --> Loki --> Grafana
Alloy --> Tempo --> Grafana
```
## 6. AI & Reasoning Layer
### Ollama / llama.cpp (CPU LLM)
**Current usage**
- JSON-structured extraction
- Sanad / matn reasoning
- Deterministic outputs
- No GPU dependency
**Future usage**
- Offline assistants
- Document intelligence
- Agent frameworks
Replaceable by vLLM when GPUs are added.
### Text Embeddings Inference (TEI)
**Current usage**
- Embeddings for hadith texts and biographies
**Future usage**
- RAG systems
- Semantic search
- Deduplication
- Similarity clustering
### Qdrant (Vector Database)
**Current usage**
- Stores embeddings
- Similarity search
**Future usage**
- Recommendation systems
- Agent memory
- Multimodal retrieval
Includes a web UI.
## 7. Knowledge & Data Layer
### Neo4j (Graph Database)
**Current usage**
- Isnād chains
- Narrator relationships
**Future usage**
- Knowledge graphs
- Trust networks
- Provenance systems
### PostgreSQL
**Current usage**
- App data
- MLflow backend
- Label Studio DB
**Future usage**
- Feature stores
- Metadata catalogs
- Transactional apps
### Redis
**Current usage**
- Caching
- Temporary state
**Future usage**
- Job queues
- Rate limiting
- Sessions
### Kafka
**Current usage**
- Optional async backbone
**Future usage**
- Streaming ingestion
- Event-driven ML
- Audit pipelines
### MinIO (S3)
**Current usage**
- Datasets
- Model artifacts
- Pipeline outputs
**Future usage**
- Data lake
- Backups
- Feature storage
## 8. MLOps & Human-in-the-Loop
### Label Studio
**Current usage**
- Human annotation of narrators & relations
**Future usage**
- Any labeling task (text, image, audio)
### MLflow
**Current usage**
- Experiment tracking
- Model registry
**Future usage**
- Governance
- Model promotion
- Auditing
### Argo Workflows
**Current usage**
- ETL & training pipelines
**Future usage**
- Batch inference
- Scheduled automation
- Data engineering
## 9. Authentication & Security
### Keycloak
**Current usage**
- SSO for Admin UI, MLflow, Label Studio
**Future usage**
- API authentication
- Multi-tenant access
- Organization-wide IAM
## 10. Observability Stack (LGTM)
### Components
- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter
### Capabilities
- Metrics, logs, traces
- Automatic correlation
- OTLP-native
- Local SSD persistence
## 11. Design Rules for All Custom Services
All services must:
- be stateless
- use env vars & Kubernetes Secrets
- authenticate via Keycloak
- emit:
  - Prometheus metrics
  - OTLP traces
  - structured JSON logs
- be deployable via kubectl & Argo CD
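To make the metrics rule concrete, here is a sketch of the Prometheus text exposition format every service's `/metrics` endpoint must serve — the metric name and labels are illustrative, not real platform metrics:

```shell
#!/bin/sh
# Sketch: a /metrics payload in Prometheus text exposition format.
metrics='# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027'
# A sample line must match: name{labels} value
echo "$metrics" | grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*\{.*\} [0-9.]+$' \
  && echo "exposition format ok"
```

Every HTTP framework has a client library that emits this format; the HELP/TYPE comment pair per metric is what Prometheus uses to classify the series.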
## 12. Future Use Cases (Beyond Hadith)
This platform can support:
- General knowledge graph AI
- Legal / scholarly document analysis
- Enterprise RAG systems
- Research data platforms
- Explainable AI systems
- Internal search engines
- Agent-based systems
- Provenance & trust scoring engines
- Digital humanities projects
- Offline sovereign AI deployments

test-loki-logs.sh

@ -0,0 +1,158 @@
#!/bin/bash
set -e
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Loki Log Collection Test${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
PASS=0
FAIL=0
# Test 1: Check Alloy DaemonSet
echo -e "${YELLOW}Test 1: Checking Alloy DaemonSet...${NC}"
if kubectl get pods -n observability -l app=alloy --no-headers 2>/dev/null | grep -q "Running"; then
ALLOY_COUNT=$(kubectl get pods -n observability -l app=alloy --no-headers | grep -c "Running")
echo -e "${GREEN}✓ Alloy is running ($ALLOY_COUNT pod(s))${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Alloy is not running${NC}"
FAIL=$((FAIL+1))
fi
echo ""
# Test 2: Check Loki pod
echo -e "${YELLOW}Test 2: Checking Loki pod...${NC}"
if kubectl get pods -n observability -l app=loki --no-headers 2>/dev/null | grep -q "Running"; then
echo -e "${GREEN}✓ Loki is running${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Loki is not running${NC}"
FAIL=$((FAIL+1))
fi
echo ""
# Test 3: Test Loki readiness endpoint
echo -e "${YELLOW}Test 3: Testing Loki readiness endpoint...${NC}"
READY=$(kubectl run test-loki-ready-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
curl -s -m 5 http://loki.observability.svc.cluster.local:3100/ready 2>/dev/null || echo "failed")
if [ "$READY" = "ready" ]; then
echo -e "${GREEN}✓ Loki is ready${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Loki is not ready (response: $READY)${NC}"
FAIL=$((FAIL+1))
fi
echo ""
# Test 4: Check Alloy can connect to Loki
echo -e "${YELLOW}Test 4: Checking Alloy → Loki connectivity...${NC}"
ALLOY_ERRORS=$(kubectl logs -n observability -l app=alloy --tail=50 2>/dev/null | grep -i "error.*loki" | wc -l)
if [ "$ALLOY_ERRORS" -eq 0 ]; then
echo -e "${GREEN}✓ No Alloy → Loki connection errors${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Found $ALLOY_ERRORS error(s) in Alloy logs${NC}"
kubectl logs -n observability -l app=alloy --tail=20 | grep -i error
FAIL=$((FAIL+1))
fi
echo ""
# Test 5: Create test pod and verify logs
echo -e "${YELLOW}Test 5: Creating test pod and verifying log collection...${NC}"
# Clean up any existing test pod
kubectl delete pod test-logger-verify --ignore-not-found 2>/dev/null
# Create test pod
echo " Creating test pod that logs every second..."
kubectl run test-logger-verify --image=busybox --restart=Never -- sh -c \
'for i in 1 2 3 4 5 6 7 8 9 10; do echo "LOKI-TEST-LOG: Message number $i at $(date)"; sleep 1; done' \
>/dev/null 2>&1
# Wait for pod to start and generate logs
echo " Waiting 15 seconds for logs to be collected..."
sleep 15
# Query Loki API for test logs
echo " Querying Loki for test logs..."
START_TIME=$(date -u -d '2 minutes ago' +%s)000000000
END_TIME=$(date -u +%s)000000000
QUERY_RESULT=$(kubectl run test-loki-query-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
--data-urlencode 'query={pod="test-logger-verify"}' \
--data-urlencode "start=$START_TIME" \
--data-urlencode "end=$END_TIME" 2>/dev/null || echo "failed")
if echo "$QUERY_RESULT" | grep -q "LOKI-TEST-LOG"; then
LOG_COUNT=$(echo "$QUERY_RESULT" | grep -o "LOKI-TEST-LOG" | wc -l)
echo -e "${GREEN}✓ Found $LOG_COUNT test log messages in Loki${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ Test logs not found in Loki${NC}"
echo " Response: ${QUERY_RESULT:0:200}"
FAIL=$((FAIL+1))
fi
# Clean up test pod
kubectl delete pod test-logger-verify --ignore-not-found >/dev/null 2>&1
echo ""
# Test 6: Check observability namespace logs
echo -e "${YELLOW}Test 6: Checking for observability namespace logs...${NC}"
OBS_QUERY=$(kubectl run test-loki-obs-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
--data-urlencode 'query={namespace="observability"}' \
--data-urlencode "start=$START_TIME" \
--data-urlencode "end=$END_TIME" \
--data-urlencode "limit=10" 2>/dev/null || echo "failed")
if echo "$OBS_QUERY" | grep -q '"values":\[\['; then
echo -e "${GREEN}✓ Observability namespace logs found in Loki${NC}"
PASS=$((PASS+1))
else
echo -e "${RED}✗ No logs found for observability namespace${NC}"
FAIL=$((FAIL+1))
fi
echo ""
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Test Results${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
TOTAL=$((PASS+FAIL))
echo -e "Passed: ${GREEN}$PASS${NC} / $TOTAL"
echo -e "Failed: ${RED}$FAIL${NC} / $TOTAL"
echo ""
if [ $FAIL -eq 0 ]; then
echo -e "${GREEN}✓✓✓ All tests passed! Logs are flowing to Loki! ✓✓✓${NC}"
echo ""
echo "Next steps:"
echo " 1. Open Grafana: https://grafana.betelgeusebytes.io"
echo " 2. Go to Explore → Loki"
echo " 3. Query: {namespace=\"observability\"}"
echo ""
else
echo -e "${RED}✗✗✗ Some tests failed. Check the output above for details. ✗✗✗${NC}"
echo ""
echo "Troubleshooting:"
echo " - Check Alloy logs: kubectl logs -n observability -l app=alloy"
echo " - Check Loki logs: kubectl logs -n observability loki-0"
echo " - Verify services: kubectl get svc -n observability"
echo " - See full guide: VERIFY-LOKI-LOGS.md"
echo ""
exit 1
fi