# BetelgeuseBytes Architecture Overview
## High-Level Architecture
This platform is a **self-hosted, production-grade Kubernetes stack** designed for:
* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling
The architecture follows a **hub-and-spoke model**:
* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS
---
## Logical Architecture Diagram (Textual)
```
Internet
   │
   ▼
DNS (betelgeusebytes.io)
   │
   ▼
Ingress-NGINX (TLS via cert-manager)
   ├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
   ├── ML UIs (Jupyter, Label Studio, MLflow)
   ├── Automation (n8n, Argo)
   └── APIs (Postgres TCP, Neo4j Bolt, Kafka)
   │
   ▼
Kubernetes Cluster
   ├── Control Plane
   ├── Worker Nodes
   ├── Stateful Workloads (local SSD)
   └── Observability Stack
```
---
## Key Design Principles
* **Bare-metal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**
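
The local-SSD principle can be sketched with Kubernetes local volumes. This is a minimal illustration that assumes statically provisioned local PersistentVolumes. This document does not say which provisioning mechanism the cluster actually uses, and the storage class name `local-ssd`, the node name, and the mount path are hypothetical.

```python
import json


def local_storage_class(name="local-ssd"):
    """StorageClass for statically provisioned local volumes.

    WaitForFirstConsumer delays binding until a pod is scheduled, so the
    scheduler can place the pod on the node that holds the disk.
    The class name is a hypothetical placeholder.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "kubernetes.io/no-provisioner",
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def local_volume(name, node, path, size, storage_class="local-ssd"):
    """PersistentVolume backed by a directory or disk on one specific node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},
            # Local volumes must declare which node they live on.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "kubernetes.io/hostname",
                                    "operator": "In",
                                    "values": [node],
                                }
                            ]
                        }
                    ]
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical node name and path, for illustration only.
    manifests = [
        local_storage_class(),
        local_volume("pg-data-0", "worker-1", "/mnt/ssd/pg", "100Gi"),
    ]
    print(json.dumps(manifests, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. The trade-off of local volumes is that a pod is pinned to the node holding its data, so node loss means restoring from backup or replication rather than rescheduling.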
---
## Networking
* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services exposed via Ingress patch (Postgres, Neo4j Bolt)
* WireGuard mesh between nodes
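
One common way to expose raw TCP services through ingress-nginx is a ConfigMap that maps an external port to `namespace/service:port`, referenced by the controller's `--tcp-services-configmap` flag. The sketch below assumes that this is the "Ingress patch" mentioned above, which this document does not confirm. The namespaces and service names (`db/postgres`, `graph/neo4j`) are hypothetical. The ports are the Postgres and Bolt defaults.

```python
import json


def tcp_services_configmap(routes, namespace="ingress-nginx", name="tcp-services"):
    """Build the ConfigMap ingress-nginx reads for TCP passthrough.

    routes maps an external port to a (namespace, service, port) triple.
    The ConfigMap name and namespace must match the value passed to the
    controller's --tcp-services-configmap flag.
    """
    data = {}
    for external_port, (svc_ns, svc_name, svc_port) in sorted(routes.items()):
        if not 1 <= external_port <= 65535:
            raise ValueError(f"invalid external port: {external_port}")
        # ConfigMap keys and values must be strings.
        data[str(external_port)] = f"{svc_ns}/{svc_name}:{svc_port}"
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": data,
    }


if __name__ == "__main__":
    # Hypothetical namespaces and service names, for illustration only.
    routes = {
        5432: ("db", "postgres", 5432),   # Postgres
        7687: ("graph", "neo4j", 7687),   # Neo4j Bolt
    }
    print(json.dumps(tcp_services_configmap(routes), indent=2))
```

With this mechanism the ingress controller's own Service also has to list each external port, otherwise traffic never reaches the controller.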
---
## Security Model
* TLS everywhere (cert-manager + Let's Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO
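
As an illustration of the TLS setup, an Ingress can request a certificate from cert-manager through the `cert-manager.io/cluster-issuer` annotation. This is a sketch under assumptions: the issuer name `letsencrypt-prod`, the ingress class `nginx`, the hostname, and the backend service are all hypothetical and not taken from the actual cluster.

```python
import json


def tls_ingress(name, namespace, host, service, port,
                issuer="letsencrypt-prod", ingress_class="nginx"):
    """Ingress manifest whose certificate is issued by cert-manager.

    cert-manager watches the annotation and stores the issued certificate
    in the Secret named under spec.tls. The issuer and ingress class
    defaults are hypothetical placeholders.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"cert-manager.io/cluster-issuer": issuer},
        },
        "spec": {
            "ingressClassName": ingress_class,
            "tls": [{"hosts": [host], "secretName": f"{name}-tls"}],
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service,
                                        "port": {"number": port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical host and service, for illustration only.
    manifest = tls_ingress(
        "grafana", "observability", "grafana.betelgeusebytes.io", "grafana", 80
    )
    print(json.dumps(manifest, indent=2))
```

Keeping each Ingress in the namespace of the service it fronts fits the per-domain namespace isolation listed above, since the TLS Secret then lives next to the workload it protects.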
---
## Scalability Notes
* Currently a single control-plane node plus workers
* Designed to add:
  * More workers
  * Dedicated control-plane VPS nodes
  * GPU nodes (for vLLM / training)
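
Once GPU nodes exist, workloads would need to be steered onto them. The sketch below shows one conventional approach, assuming the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource. The node label `betelgeusebytes.io/gpu`, the taint key, and the image name are hypothetical, and nothing here reflects the current CPU-only cluster.

```python
import json


def gpu_pod(name, image, gpus=1, node_label="betelgeusebytes.io/gpu"):
    """Pod manifest that requests GPUs and targets labeled GPU nodes.

    Assumes GPU nodes carry the (hypothetical) label node_label=true and
    a NoSchedule taint with key nvidia.com/gpu that keeps CPU-only
    workloads off them.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {node_label: "true"},
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through limits; they cannot be overcommitted.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    print(json.dumps(gpu_pod("llm-server", "example/llm-server:latest"), indent=2))
```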
---
## What This Enables
* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Observability-first production apps