# BetelgeuseBytes Architecture Overview
## High-Level Architecture
This platform is a **self-hosted, production-grade Kubernetes stack** designed for:
* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling
The architecture follows a **hub-and-spoke model**:
* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS
---
## Logical Architecture Diagram (Textual)
```
Internet
   │
   ▼
DNS (betelgeusebytes.io)
   │
   ▼
Ingress-NGINX (TLS via cert-manager)
   ├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
   ├── ML UIs (Jupyter, Label Studio, MLflow)
   ├── Automation (n8n, Argo)
   └── APIs (Postgres TCP, Neo4j Bolt, Kafka)
   │
   ▼
Kubernetes Cluster
   ├── Control Plane
   ├── Worker Nodes
   ├── Stateful Workloads (local SSD)
   └── Observability Stack
```
---
## Key Design Principles
* **Bare-metal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**
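
The local-SSD principle above is commonly realized in Kubernetes with statically provisioned local PersistentVolumes. The following is a minimal sketch under that assumption, building the manifests as plain Python dicts. The StorageClass name `local-ssd`, the volume name, the node name `worker-1`, the path, and the size are all hypothetical; the source does not state how volumes are actually provisioned. Each manifest could be dumped to JSON and passed to `kubectl apply -f -`.

```python
import json


def local_storage_class(name="local-ssd"):
    """StorageClass for statically provisioned local volumes (name is hypothetical)."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Local volumes are created by hand, so there is no dynamic provisioner.
        "provisioner": "kubernetes.io/no-provisioner",
        # Bind only once a pod is scheduled, so the volume's node pinning is honored.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def local_persistent_volume(name, node, path, size, storage_class="local-ssd"):
    """PersistentVolume backed by a directory or disk on one specific node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            # Keep the data on disk when the claim is deleted.
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},
            # Local volumes must declare which node physically holds them.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node],
                        }]
                    }]
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical volume for a Postgres data directory on one worker.
    pv = local_persistent_volume("pg-data-0", "worker-1", "/mnt/ssd/pg-data-0", "100Gi")
    print(json.dumps([local_storage_class(), pv], indent=2))
```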
---
## Networking
* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services (Postgres, Neo4j Bolt) exposed through a patch to the Ingress controller
* WireGuard mesh between nodes
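
To illustrate the TCP exposure mentioned above: ingress-nginx forwards raw TCP when its `--tcp-services-configmap` flag points at a ConfigMap whose entries have the form `"<external port>": "<namespace>/<service>:<port>"`. The sketch below builds such a ConfigMap as a Python dict. The namespaces `db` and `graph` come from this document, and 5432 and 7687 are the default Postgres and Bolt ports. The Service names `postgres` and `neo4j`, and the ConfigMap name and namespace, are assumptions. The controller's own Service would also need to expose the same ports.

```python
import json

# Default ports for the two TCP protocols named above.
POSTGRES_PORT = 5432
NEO4J_BOLT_PORT = 7687


def tcp_services_configmap(services, name="tcp-services", namespace="ingress-nginx"):
    """Build the ConfigMap that ingress-nginx reads for raw TCP forwarding.

    `services` maps an external port to (namespace, service name, service port).
    """
    data = {
        # ConfigMap keys and values must be strings.
        str(external_port): f"{ns}/{svc}:{port}"
        for external_port, (ns, svc, port) in sorted(services.items())
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": data,
    }


if __name__ == "__main__":
    # Service names are hypothetical; namespaces follow this document.
    configmap = tcp_services_configmap({
        POSTGRES_PORT: ("db", "postgres", POSTGRES_PORT),
        NEO4J_BOLT_PORT: ("graph", "neo4j", NEO4J_BOLT_PORT),
    })
    print(json.dumps(configmap, indent=2))
```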
---
## Security Model
* TLS everywhere (cert-manager + Let's Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO
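
As a rough sketch of how the TLS and optional Basic Auth items above might combine on a single UI, the function below builds an Ingress manifest as a Python dict. The annotation keys are the standard cert-manager and ingress-nginx ones. The ClusterIssuer name `letsencrypt-prod`, the Secret name `basic-auth`, the namespace, the Service name, the port, and the ingress class name `nginx` are assumptions for illustration, not values taken from the platform's manifests.

```python
import json


def protected_ingress(host, namespace, service, port,
                      cluster_issuer="letsencrypt-prod", auth_secret=None):
    """Ingress with a cert-manager certificate and optional Basic Auth."""
    # cert-manager watches this annotation and issues a certificate for the host.
    annotations = {"cert-manager.io/cluster-issuer": cluster_issuer}
    if auth_secret:
        # ingress-nginx Basic Auth; the Secret holds an htpasswd-style `auth` key.
        annotations.update({
            "nginx.ingress.kubernetes.io/auth-type": "basic",
            "nginx.ingress.kubernetes.io/auth-secret": auth_secret,
            "nginx.ingress.kubernetes.io/auth-realm": "Authentication Required",
        })
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": service,
            "namespace": namespace,
            "annotations": annotations,
        },
        "spec": {
            "ingressClassName": "nginx",
            # cert-manager stores the issued certificate in this Secret.
            "tls": [{"hosts": [host], "secretName": host.replace(".", "-") + "-tls"}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical values: Grafana Service on port 3000 in an `observability` namespace.
    manifest = protected_ingress(
        "grafana.betelgeusebytes.io", "observability", "grafana", 3000,
        auth_secret="basic-auth",
    )
    print(json.dumps(manifest, indent=2))
```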
---
## Scalability Notes
* Currently a single control-plane node plus workers
* Designed to add:
  * More workers
  * Dedicated control-plane VPS nodes
  * GPU nodes (for vLLM / training)
---
## What This Enables
* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Production observability-first apps
---
## Full Architecture Diagram (Mermaid)
```mermaid
flowchart TB
%% =========================
%% BetelgeuseBytes AI Platform Full Architecture (CPU-first, K8s)
%% =========================
%% ---- External / Users ----
subgraph EXT["External Users"]
U1["Scholar / Admin User"]
U2["API Client\n(curl / SDK / Bots)"]
U3["Annotator\n(Labeling UI)"]
end
%% ---- DNS + TLS + Ingress ----
subgraph EDGE["Edge: DNS → TLS → Ingress"]
DNS["DNS: betelgeusebytes.io\nA/AAAA records → Ingress IP"]
CM["cert-manager\nLet's Encrypt TLS"]
INGRESS["NGINX Ingress Controller\nHTTP(S) + SNI routing"]
TCPMAP["Ingress TCP Services\n(Postgres, Neo4j Bolt)"]
end
%% ---- Kubernetes Cluster ----
subgraph K8S["K8S Cluster"]
direction TB
subgraph NET["Networking"]
CILIUM["Cilium CNI\n(eBPF dataplane / policies)"]
WG["WireGuard\n(node mesh / private networking)"]
end
subgraph DEVOPS["Dev/GitOps"]
GITEA["Gitea\nGit repos"]
ARGOCD["Argo CD\nGitOps deployments"]
end
subgraph OBS["Observability"]
ALLOY["Grafana Alloy\n(collect logs+traces)"]
PROM["Prometheus\n(metrics)"]
LOKI["Loki\n(logs)"]
TEMPO["Tempo\n(traces)"]
GRAF["Grafana\n(dashboards)"]
KSM["kube-state-metrics"]
NODEX["node-exporter"]
end
subgraph DATA["Core Data Layer"]
PG["PostgreSQL\n(app DB / MLflow / Label Studio)\nNamespace: db"]
REDIS["Redis\n(cache)\nNamespace: db"]
ES["Elasticsearch\n(search/log store)\nNamespace: elastic"]
KIB["Kibana\nUI\nNamespace: elastic"]
KAFKA["Kafka\n(event bus)\nNamespace: broker"]
KAFKAUI["Kafka UI\nUI\nNamespace: broker"]
MINIO["MinIO (S3)\n(datasets & artifacts)\nNamespace: storage"]
end
subgraph KG["Knowledge & Retrieval"]
NEO4J["Neo4j\n(knowledge graph)\nNamespace: graph"]
QDRANT["Qdrant\n(vector DB + UI)\nNamespace: vec"]
TEI["Text Embeddings Inference\n(embeddings API)\nNamespace: ai"]
end
subgraph AI["AI / ML Services"]
LLM["LLM Server (CPU)\nOllama / llama.cpp\nNamespace: ai"]
JUP["Jupyter\n(research notebooks)\nNamespace: ml"]
LABEL["Label Studio\n(annotation UI)\nNamespace: ai"]
MLFLOW["MLflow\n(tracking + registry)\nNamespace: mlops/ml"]
end
subgraph PIPE["Automation / Pipelines"]
ARGO_WF["Argo Workflows\n(pipelines)\nNamespace: ml/argo"]
N8N["n8n\n(automation)\nNamespace: automation"]
end
subgraph AUTH["Authentication"]
KEYCLOAK["Keycloak\n(OIDC/SSO)\nNamespace: auth"]
end
subgraph APPS["Custom Applications (to build)"]
ORCH["Hadith Orchestrator API\nNamespace: hadith"]
ADMIN["Hadith Admin UI\nNamespace: hadith"]
NER["NER Service\nNamespace: hadith"]
RE["Relation Extraction Service\nNamespace: hadith"]
end
end
%% ---- Edge wiring ----
U1 --> DNS
U2 --> DNS
U3 --> DNS
DNS --> INGRESS
CM --> INGRESS
%% ---- Public HTTP(S) routes ----
INGRESS -->|hadith-admin.betelgeusebytes.io| ADMIN
INGRESS -->|hadith-api.betelgeusebytes.io| ORCH
INGRESS -->|llm.betelgeusebytes.io| LLM
INGRESS -->|embeddings.betelgeusebytes.io| TEI
INGRESS -->|vector.betelgeusebytes.io| QDRANT
INGRESS -->|neo4j.betelgeusebytes.io| NEO4J
INGRESS -->|label.betelgeusebytes.io| LABEL
INGRESS -->|mlflow.betelgeusebytes.io| MLFLOW
INGRESS -->|minio.betelgeusebytes.io| MINIO
INGRESS -->|argo.betelgeusebytes.io| ARGO_WF
INGRESS -->|auth.betelgeusebytes.io| KEYCLOAK
INGRESS -->|grafana.betelgeusebytes.io| GRAF
INGRESS -->|kibana.betelgeusebytes.io| KIB
INGRESS -->|broker.betelgeusebytes.io| KAFKAUI
%% ---- TCP routes (optional/external) ----
TCPMAP -.-> PG
TCPMAP -.-> NEO4J
%% ---- GitOps flow ----
GITEA -->|manifests + app code| ARGOCD
ARGOCD -->|sync/apply| K8S
%% ---- Auth flows ----
ADMIN -->|OIDC login| KEYCLOAK
ORCH -->|validate JWT / introspect| KEYCLOAK
LABEL -->|optional OIDC| KEYCLOAK
MLFLOW -->|OIDC| KEYCLOAK
%% ---- Orchestrator runtime data flows ----
ORCH -->|reasoning / JSON extraction| LLM
ORCH -->|embed queries/docs| TEI
ORCH -->|vector search| QDRANT
ORCH -->|graph read/write| NEO4J
ORCH -->|metadata/users/jobs| PG
ORCH -->|cache| REDIS
ORCH -->|full-text search| ES
%% ---- NER/RE services (future) ----
ORCH --> NER
ORCH --> RE
NER -->|entities| NEO4J
RE -->|relations| NEO4J
%% ---- Data curation loop ----
LABEL -->|labeled datasets| MINIO
ARGO_WF -->|training data| MINIO
ARGO_WF -->|log metrics| MLFLOW
ARGO_WF -->|publish artifacts| MINIO
MLFLOW -->|model versions| MINIO
ARGO_WF -->|deploy/update services| ARGOCD
%% ---- Event-driven (optional) ----
ORCH -->|events| KAFKA
ARGO_WF -->|consume triggers| KAFKA
N8N -->|integrations/alerts| KAFKA
%% ---- Observability wiring ----
ALLOY --> LOKI
ALLOY --> TEMPO
PROM --> GRAF
LOKI --> GRAF
TEMPO --> GRAF
KSM --> PROM
NODEX --> PROM
%% ---- Internal networking ----
CILIUM --- INGRESS
WG --- CILIUM