Add observability stack and supporting scripts
- Introduced combine.sh script to aggregate .txt, .py, .yml, .yaml, and .ini files into betelgeusebytes.txt.
- Updated Loki configuration to disable retention settings.
- Modified Tempo configuration to change storage paths from /tmp to /var.
- Refactored Alloy configuration to streamline Prometheus integration and removed unnecessary metrics export.
- Enhanced RBAC permissions to include pod log access.
- Added a security context to the Tempo deployment for improved security.
- Created README_old.md to preserve the previous documentation of the observability stack.
- Developed me.md as an authoritative guide for the AI infrastructure stack.
- Implemented test-loki-logs.sh script to validate Loki log collection and connectivity.
This commit is contained in:
parent dfdd36db3f
commit 404deb1d52

@@ -0,0 +1,93 @@
# BetelgeuseBytes – Architecture Overview

## High-Level Architecture

This platform is a **self-hosted, production-grade Kubernetes stack** designed for:

* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling

The architecture follows a **hub-and-spoke model**:

* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS

---

## Logical Architecture Diagram (Textual)

```
Internet
   │
   ▼
DNS (betelgeusebytes.io)
   │
   ▼
Ingress-NGINX (TLS via cert-manager)
   │
   ├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
   ├── ML UIs (Jupyter, Label Studio, MLflow)
   ├── Automation (n8n, Argo)
   └── APIs (Postgres TCP, Neo4j Bolt, Kafka)

Kubernetes Cluster
   ├── Control Plane
   ├── Worker Nodes
   ├── Stateful Workloads (local SSD)
   └── Observability Stack
```

---

## Key Design Principles

* **Bare‑metal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**

---

## Networking

* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services exposed via Ingress patch (Postgres, Neo4j Bolt)
* WireGuard mesh between nodes
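Raw TCP exposure through ingress-nginx is typically wired via its `tcp-services` ConfigMap; the sketch below is only an illustration of that mechanism, and the namespace, service names, and ports are assumptions rather than values taken from the actual manifests.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # "<external-port>": "<namespace>/<service>:<service-port>"
  "5432": "db/postgres:5432"   # PostgreSQL
  "7687": "graph/neo4j:7687"   # Neo4j Bolt
```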
---

## Security Model

* TLS everywhere (cert-manager + Let’s Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO

---

## Scalability Notes

* Currently single control-plane + workers
* Designed to add:
  * More workers
  * Dedicated control-plane VPS nodes
  * GPU nodes (for vLLM / training)

---

## What This Enables

* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Production observability-first apps
@@ -0,0 +1,46 @@

# Deployment & Operations Guide

## Deployment Model

* Declarative Kubernetes manifests
* Applied via `kubectl` or Argo CD
* No Helm dependency

---

## General Rules

* Stateless apps by default
* PVCs required for state
* Secrets via Kubernetes Secrets
* Config via environment variables

---

## Deployment Order (Recommended)

1. Networking (Cilium, Ingress)
2. cert-manager
3. Storage (PVs)
4. Databases (Postgres, Redis, Kafka)
5. Observability stack
6. ML tooling
7. Automation tools
8. Custom applications
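The order above can be sketched as a small script. It only prints the `kubectl apply` commands in sequence so they can be reviewed before piping to `sh`; the manifest paths are taken from the repository README and are assumptions if the layout has moved.

```shell
# Print kubectl apply commands in the recommended order.
# Review the output, then pipe to `sh` to execute.
manifests="
k8s/00-namespaces.yaml
k8s/01-secrets/
k8s/storage/storageclass.yaml
k8s/postgres/
k8s/redis/
k8s/kafka/kafka.yaml
k8s/elastic/elasticsearch.yaml
k8s/observability/fluent-bit.yaml
k8s/prometheus/
k8s/grafana/
k8s/jupyter/
"
for m in $manifests; do
  echo "kubectl apply -f $m"
done
```

Keeping the list in one variable makes it easy to review or reorder before anything touches the cluster.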
---

## Operations

* Monitor via Grafana
* Debug via logs & traces
* Upgrade via Git commits
* Rollback via Argo CD

---

## Backup Strategy

* MinIO buckets versioned
* Database snapshots
* Git repositories mirrored
@@ -0,0 +1,34 @@

# Future Use Cases & Projects

This platform is intentionally **general‑purpose**.

## AI & ML

* RAG platforms
* Offline assistants
* Agent systems
* NLP research

## Knowledge Graphs

* Academic citation graphs
* Trust & provenance systems
* Dependency analysis

## Data Platforms

* Event‑driven ETL
* Feature stores
* Research data lakes

## Observability & Ops

* Internal platform monitoring
* Security analytics
* Audit systems

## Sovereign Deployments

* On‑prem AI for enterprises
* NGO / government tooling
* Privacy‑preserving analytics
@@ -0,0 +1,102 @@

# BetelgeuseBytes – Infrastructure & Cluster Configuration

## Hosting Provider

* **Provider**: Hetzner
* **Server Type**: Dedicated servers
* **Region**: EU
* **Network**: Private LAN + WireGuard

---

## Nodes

### Current Nodes

| Node | Role | Notes |
| --------- | ---------------------- | ------------------- |
| hetzner-1 | control-plane + worker | runs core workloads |
| hetzner-2 | worker + storage | hosts local SSD PVs |

---

## Kubernetes Setup

* Kubernetes installed via kubeadm
* Single cluster
* Control plane is also schedulable

### CNI

* **Cilium**
  * eBPF dataplane
  * kube-proxy replacement
  * Network policy support

---

## Storage

### Persistent Volumes

* Backed by **local NVMe / SSD**
* Manually provisioned PVs
* Bound via PVCs

### Storage Layout

```
/mnt/local-ssd/
├── postgres/
├── neo4j/
├── elasticsearch/
├── prometheus/
├── loki/
├── tempo/
├── grafana/
├── minio/
└── qdrant/
```
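A manually provisioned PV for one of these directories might look like the following sketch. Capacity, the StorageClass name, and the reclaim policy are assumptions; the actual manifests live under `k8s/storage/` in the repository. A `local` volume requires the node affinity shown, which pins it to the node that physically holds the disk.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-local-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/local-ssd/postgres
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["hetzner-2"]
```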
---

## Networking

* Ingress Controller: nginx
* External DNS records → ingress IP
* TCP mappings for:
  * PostgreSQL
  * Neo4j Bolt

---

## TLS & Certificates

* cert-manager
* ClusterIssuer: Let’s Encrypt
* Automatic renewal
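The Let's Encrypt setup is typically a single ClusterIssuer; a minimal sketch follows, where the issuer name, contact email, and solver class are assumptions rather than values from this cluster's manifests.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@betelgeusebytes.io   # assumed contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```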
---

## Namespaces

| Namespace | Purpose |
| ------------- | ---------------------------------- |
| db | Databases (Postgres, Redis) |
| graph | Neo4j |
| broker | Kafka |
| ml | ML tooling (Jupyter, Argo, MLflow) |
| observability | Grafana, Prometheus, Loki, Tempo |
| automation | n8n |
| devops | Gitea, Argo CD |

---

## What This Infra Enables

* Full on‑prem AI platform
* Predictable performance
* Low-latency data access
* Independence from cloud providers
@@ -0,0 +1,32 @@

# 🔭 Observability Stack

---

## Components

- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter

---

## Capabilities

- Logs ↔ traces ↔ metrics correlation
- OTLP-native instrumentation
- Centralized dashboards
- Alerting-ready

---

## Instrumentation Rules

All apps must:

- expose `/metrics`
- emit structured JSON logs
- export OTLP traces
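For the structured-JSON-logs rule, a line of roughly the following shape is what Loki ingests and indexes well; the service name and field names here are assumptions, to be aligned with your own labels.

```shell
# Emit one structured JSON log line; service and field names are
# illustrative, not prescribed by the stack.
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
printf '{"ts":"%s","level":"info","service":"example-api","msg":"request handled"}\n' "$ts"
```

One JSON object per line, with a stable timestamp key, keeps Loki label extraction and Grafana log queries simple.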
README.md (148 lines changed)

@@ -1,43 +1,123 @@
# 🧠 BetelgeuseBytes AI Platform — Documentation

This documentation describes a **self-hosted, CPU-first AI platform** running on Kubernetes,
designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.

## 📚 Documentation Index

- [Architecture](ARCHITECTURE.md)
- [Infrastructure](INFRASTRUCTURE.md)
- [Full Stack Overview](STACK.md)
- [Deployment & Operations](DEPLOYMENT.md)
- [Observability](OBSERVABILITY.md)
- [Roadmap & Next Steps](ROADMAP.md)
- [Future Projects & Use Cases](FUTURE-PROJECTS.md)

## 🎯 Current Focus

- Hadith sanad & matn extraction
- Narrator relationship modeling
- Knowledge graph construction
- Human-in-the-loop verification
- Explainable, sovereign AI

## 🧠 What each document gives you

### ARCHITECTURE

- Logical system architecture
- Data & control flow
- Networking and security model
- Design principles (CPU-first, sovereign, observable)
- What the architecture enables long-term

This is what you show to **architects and senior engineers.**

### INFRASTRUCTURE

- Hetzner setup (dedicated, CPU-only, SSD)
- Node roles and responsibilities
- Kubernetes topology
- Cilium networking
- Storage layout on disk
- Namespaces and isolation strategy

This is what you show to **ops / SRE / infra people.**

### STACK

- Exhaustive list of every deployed component
- Grouped by domain:
  - Core platform
  - Databases & messaging
  - Knowledge & vectors
  - ML & AI
  - Automation & DevOps
  - Observability
  - Authentication

For each: **what it does now + what it can be reused for**

This is the **master mental model** of your platform.

### DEPLOYMENT

- How the platform is deployed (kubectl + GitOps)
- Deployment order
- Operational rules
- Backup strategy
- Day-2 operations mindset

This is your ***runbook starter.***

### ROADMAP

- Clear technical phases:
  - Neo4j isnād schema
  - Authenticity scoring
  - Productization
  - Scaling (GPU, multi-project)

This keeps the project ***directionally sane.***

### FUTURE-PROJECTS

- Explicitly documents that this is **not just a Hadith stack**
- Lists realistic reuse cases:
  - RAG
  - Knowledge graphs
  - Sovereign AI
  - Digital humanities
  - Research platforms

This justifies the ***investment in infra quality.***
@@ -0,0 +1,43 @@

# BetelgeuseBytes K8s — Full Stack (kubectl-only)

**Nodes**

- Control-plane + worker: hetzner-1 (95.217.89.53)
- Worker: hetzner-2 (138.201.254.97)

## Bring up the cluster

```bash
ansible -i ansible/inventories/prod/hosts.ini all -m ping
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

## Apply apps (edit secrets first)

```bash
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/01-secrets/
kubectl apply -f k8s/storage/storageclass.yaml

kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

## DNS

A records:

- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97

CNAMEs → apps.betelgeusebytes.io:

- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.

(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.
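Once the HA endpoint exists, the endpoint would be carried in kubeadm's ClusterConfiguration; a minimal sketch, where the port and API version are assumptions:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Stable DNS name fronting all control-plane nodes
controlPlaneEndpoint: "cp.k8s.betelgeusebytes.io:6443"
```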
@@ -0,0 +1,26 @@

# Roadmap & Next Steps

## Phase 1 – Knowledge Modeling

* Design Neo4j isnād schema
* Identity resolution
* Relationship typing

## Phase 2 – Authenticity Scoring

* Chain continuity analysis
* Narrator reliability
* Graph‑based scoring
* LLM‑assisted reasoning

## Phase 3 – Productization

* Admin dashboards
* APIs
* Provenance visualization

## Phase 4 – Scale & Extend

* GPU nodes
* vLLM integration
* Multi‑project tenancy
@@ -0,0 +1,153 @@

# 🧠 BetelgeuseBytes – Full Stack Catalog

This document lists **every major component deployed in the cluster**, what it is used for today, and what it can be reused for.

---

## Core Platform

| Component | Namespace | Purpose | Reuse |
| ------------- | ------------- | --------------- | --------------- |
| Kubernetes | all | Orchestration | Any platform |
| Cilium | kube-system | Networking | Secure clusters |
| NGINX Ingress | ingress-nginx | Traffic routing | API gateway |
| cert-manager | cert-manager | TLS automation | PKI |

---

## Databases & Messaging

| Component | URL / Access | Purpose | Reuse |
| ------------- | --------------- | --------------- | ---------------- |
| PostgreSQL | TCP via Ingress | Relational DB | App backends |
| Redis | internal | Cache | Queues |
| Kafka | kafka-ui UI | Event streaming | Streaming ETL |
| Elasticsearch | Kibana UI | Search + logs | Full‑text search |

---

## Knowledge & Vector

| Component | URL | Purpose | Reuse |
| --------- | ------------------------- | --------------- | --------------- |
| Neo4j | neo4j.betelgeusebytes.io | Knowledge graph | Graph analytics |
| Qdrant | vector.betelgeusebytes.io | Vector search | RAG |

---

## ML & AI

| Component | URL | Purpose | Reuse |
| ------------ | ----------------------------- | --------------- | ---------------- |
| Jupyter | notebook UI | Experiments | Research |
| Label Studio | label.betelgeusebytes.io | Annotation | Dataset creation |
| MLflow | mlflow.betelgeusebytes.io | Model tracking | MLOps |
| Ollama / LLM | llm.betelgeusebytes.io | LLM inference | Agents |
| Embeddings | embeddings.betelgeusebytes.io | Text embeddings | Semantic search |

---

## Automation & DevOps

| Component | URL | Purpose | Reuse |
| -------------- | ----------------------- | ------------------- | ----------- |
| Argo Workflows | argo.betelgeusebytes.io | Pipelines | ETL |
| Argo CD | argocd UI | GitOps | CI/CD |
| Gitea | gitea UI | Git hosting | SCM |
| n8n | automation UI | Workflow automation | Integration |

---

## Observability (LGTM)

| Component | Purpose | Reuse |
| ---------- | --------------- | ---------------------- |
| Grafana | Dashboards | Ops center |
| Prometheus | Metrics | Monitoring |
| Loki | Logs | Debugging |
| Tempo | Traces | Distributed tracing |
| Alloy | Telemetry agent | Standardized telemetry |

---

## Authentication

| Component | Purpose | Reuse |
| --------- | ---------- | ----- |
| Keycloak | OIDC / SSO | IAM |

---

## Why This Stack Matters

* Covers **data → ML → serving → observability** end‑to‑end
* Suitable for research **and** production
* Modular and future‑proof

# 📚 Stack Catalog — Services, URLs, Access & Usage

This document lists **every deployed component**, how to access it,
what it is used for **now**, and what it enables **in the future**.

---

## 🌐 Public Services (Ingress / HTTPS)

| Component | URL | Auth | What It Is | Current Usage | Future Usage |
|--------|-----|------|------------|---------------|--------------|
| LLM Inference | https://llm.betelgeusebytes.io | none / internal | CPU LLM server (Ollama / llama.cpp) | Extract sanad & matn as JSON | Agents, doc AI, RAG |
| Embeddings | https://embeddings.betelgeusebytes.io | none / internal | Text Embeddings Inference (HF) | Hadith & bio embeddings | Semantic search |
| Vector DB | https://vector.betelgeusebytes.io | none | Qdrant + UI | Similarity search | Recommendations |
| Graph DB | https://neo4j.betelgeusebytes.io | Basic Auth | Neo4j Browser | Isnād graph | Knowledge graphs |
| Orchestrator | https://hadith-api.betelgeusebytes.io | OIDC | FastAPI router | Core AI API | Any AI backend |
| Admin UI | https://hadith-admin.betelgeusebytes.io | OIDC | Next.js UI | Scholar review | Any internal tool |
| Labeling | https://label.betelgeusebytes.io | Local / OIDC | Label Studio | NER/RE annotation | Dataset curation |
| ML Tracking | https://mlflow.betelgeusebytes.io | OIDC | MLflow UI | Experiments & models | Governance |
| Object Storage | https://minio.betelgeusebytes.io | Access key | MinIO Console | Datasets & artifacts | Data lake |
| Pipelines | https://argo.betelgeusebytes.io | SA / OIDC | Argo Workflows UI | ML pipelines | ETL |
| Auth | https://auth.betelgeusebytes.io | Admin login | Keycloak | SSO & tokens | IAM |
| Observability | https://grafana.betelgeusebytes.io | Login | Grafana | Metrics/logs/traces | Ops center |

---

## 🔐 Authentication & Access Summary

| System | Auth Method | Who Uses It |
|-----|------------|-------------|
| Keycloak | Username / Password | Admins |
| Admin UI | OIDC (Keycloak) | Scholars |
| Orchestrator API | OIDC Bearer Token | Apps |
| MLflow | OIDC | ML engineers |
| Label Studio | Local / OIDC | Annotators |
| Neo4j | Basic Auth | Engineers |
| MinIO | Access / Secret key | Pipelines |
| Grafana | Login | Operators |

---

## 🧠 Internal Cluster Services (ClusterIP)

| Component | Namespace | Purpose |
|--------|-----------|--------|
| PostgreSQL | db | Relational storage |
| Redis | db | Cache / temp state |
| Kafka | broker | Event backbone |
| Prometheus | observability | Metrics |
| Loki | observability | Logs |
| Tempo | observability | Traces |
| Alloy | observability | Telemetry agent |

---

## 🗂 Storage Responsibilities

| Storage | Used By | Contains |
|------|--------|---------|
| MinIO | Pipelines, MLflow | Datasets, models |
| Neo4j PVC | Graph DB | Isnād graph |
| Qdrant PVC | Vector DB | Embeddings |
| PostgreSQL PVC | DB | Metadata |
| Observability PVCs | LGTM | Logs, metrics, traces |
(File diff suppressed because it is too large.)

@@ -0,0 +1,5 @@
# Aggregate text/config sources into betelgeusebytes.txt. The parentheses make
# -type f apply to every -name clause, and the output file is excluded so that
# reruns don't re-ingest it into itself.
find . -type f ! -name "betelgeusebytes.txt" \( -name "*.txt" -o -name "*.py" \
  -o -name "*.yml" -o -name "*.yaml" -o -name "*.YAML" -o -name "*.ini" \) | while read -r file; do
  echo "=== $file ===" >> betelgeusebytes.txt
  cat "$file" >> betelgeusebytes.txt
  echo "" >> betelgeusebytes.txt
done
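The grouping parentheses in the `find` expression matter: without them, `-type f` binds only to the first `-name` clause. A throwaway check of the expression in a temporary directory (the file names here are purely illustrative):

```shell
# Demonstrate the find expression in a scratch directory; the subshell
# keeps the working directory of the caller unchanged.
tmp="$(mktemp -d)"
(
  cd "$tmp"
  echo a > one.txt
  echo b > two.py
  mkdir sub && echo c > sub/three.yaml
  find . -type f \( -name "*.txt" -o -name "*.py" -o -name "*.yml" \
    -o -name "*.yaml" -o -name "*.ini" \) | sort
)
```

All three files, including the one in the subdirectory, match; with the parentheses removed, the `-type f` test would no longer constrain the later `-name` alternatives.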
@@ -43,12 +43,9 @@ data:
     compactor:
       working_directory: /loki/compactor
       compaction_interval: 10m
-      retention_enabled: true
-      retention_delete_delay: 2h
-      retention_delete_worker_count: 150
+      retention_enabled: false

     limits_config:
-      enforce_metric_name: false
       reject_old_samples: true
       reject_old_samples_max_age: 168h  # 7 days
       retention_period: 168h  # 7 days

@@ -91,4 +88,4 @@ data:
       replay_memory_ceiling: 1GB

     analytics:
       reporting_enabled: false
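This commit also adds `test-loki-logs.sh` to validate Loki log collection and connectivity; a minimal probe in that spirit is sketched below. The in-cluster service URL is an assumption, and the actual script's checks may differ.

```shell
# Probe Loki's readiness endpoint; prints a fallback message when the
# cluster DNS name is not resolvable from where this runs.
LOKI_URL="${LOKI_URL:-http://loki.observability.svc.cluster.local:3100}"
probe() {
  curl -fsS --max-time 5 "$1/ready" 2>/dev/null || echo "loki not reachable at $1"
}
probe "$LOKI_URL"
```

`/ready` answers `ready` once Loki can serve queries; `/loki/api/v1/labels` is the natural next check, since an empty label list usually means nothing is being ingested.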
@@ -39,7 +39,7 @@ data:
       source: tempo
       cluster: betelgeuse-k8s
     storage:
-      path: /tmp/tempo/generator/wal
+      path: /var/tempo/generator/wal
       remote_write:
         - url: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
           send_exemplars: true

@@ -48,17 +48,14 @@ data:
     trace:
       backend: local
       wal:
-        path: /tmp/tempo/wal
+        path: /var/tempo/wal
       local:
-        path: /tmp/tempo/blocks
+        path: /var/tempo/blocks
       pool:
         max_workers: 100
         queue_depth: 10000

-  querier:
-    frontend_worker:
-      frontend_address: tempo.observability.svc.cluster.local:9095
-
+  # Single instance mode - no need for frontend/querier split
   query_frontend:
     search:
       duration_slo: 5s

@@ -69,4 +66,4 @@ data:
   overrides:
     defaults:
       metrics_generator:
         processors: [service-graphs, span-metrics]
@@ -124,7 +124,6 @@ data:

        output {
          traces  = [otelcol.exporter.otlp.tempo.input]
-         metrics = [otelcol.exporter.prometheus.metrics.input]
        }
      }

@@ -138,22 +137,7 @@ data:
      }
    }

-   // Export OTLP metrics to Prometheus
-   otelcol.exporter.prometheus "metrics" {
-     forward_to = [prometheus.remote_write.local.receiver]
-   }
-
-   // Remote write to Prometheus
-   prometheus.remote_write "local" {
-     endpoint {
-       url = "http://prometheus.observability.svc.cluster.local:9090/api/v1/write"
-     }
-   }
-
    // Scrape local metrics (Alloy's own metrics)
-   prometheus.scrape "alloy" {
-     targets = [{
-       __address__ = "localhost:12345",
-     }]
-     forward_to = [prometheus.remote_write.local.receiver]
-   }
+   // Prometheus will scrape these via service discovery
+   prometheus.exporter.self "alloy" {
+   }
@@ -66,6 +66,7 @@ rules:
       - services
       - endpoints
       - pods
+      - pods/log
     verbs: ["get", "list", "watch"]
   - apiGroups:
       - extensions

@@ -175,4 +176,4 @@ roleRef:
 subjects:
   - kind: ServiceAccount
     name: kube-state-metrics
     namespace: observability
@@ -21,6 +21,11 @@ spec:
     spec:
       nodeSelector:
         kubernetes.io/hostname: hetzner-2
+      securityContext:
+        fsGroup: 10001
+        runAsGroup: 10001
+        runAsNonRoot: true
+        runAsUser: 10001
       containers:
         - name: tempo
           image: grafana/tempo:2.6.1

@@ -70,7 +75,7 @@ spec:
         - name: tempo-config
           mountPath: /etc/tempo
         - name: tempo-data
-          mountPath: /tmp/tempo
+          mountPath: /var/tempo
       volumes:
         - name: tempo-config
           configMap:

@@ -115,4 +120,4 @@ spec:
       protocol: TCP
       name: zipkin
   selector:
     app: tempo
@ -0,0 +1,388 @@
|
||||||
|
# 🧠 BetelgeuseBytes — Full AI Infrastructure Stack
|
||||||
|
## Authoritative README, Architecture & Onboarding Guide
|
||||||
|
|
||||||
|
This repository documents the **entire self-hosted AI infrastructure stack** running on a Kubernetes cluster hosted on **Hetzner dedicated servers**.
|
||||||
|
|
||||||
|
The stack currently powers an **Islamic Hadith Scholar AI**, but it is intentionally designed as a **general-purpose, sovereign AI, MLOps, and data platform** that can support many future projects.
|
||||||
|
|
||||||
|
This document is the **single source of truth** for:
|
||||||
|
- architecture (logical & physical)
|
||||||
|
- infrastructure configuration
|
||||||
|
- networking & DNS
|
||||||
|
- every deployed component
|
||||||
|
- why each component exists
|
||||||
|
- how to build new systems on top of the platform
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Mission & Design Philosophy
|
||||||
|
|
||||||
|
### Current Mission
|
||||||
|
Build an AI system that can:
|
||||||
|
|
||||||
|
- Parse classical Islamic texts
|
||||||
|
- Extract **Sanad** (chains of narrators) and **Matn** (hadith text)
|
||||||
|
- Identify narrators and their relationships:
|
||||||
|
- teacher / student
|
||||||
|
- familial lineage
|
||||||
|
- Construct a **verifiable knowledge graph**
|
||||||
|
- Support **human scholarly review**
|
||||||
|
- Provide **transparent and explainable reasoning**
|
||||||
|
- Operate **fully on-prem**, CPU-first, without SaaS or GPU dependency
|
||||||
|
|
||||||
|
### Core Principles
|
||||||
|
- **Sovereignty** — no external cloud lock-in
|
||||||
|
- **Explainability** — graph + provenance, not black boxes
|
||||||
|
- **Human-in-the-loop** — scholars remain in control
|
||||||
|
- **Observability-first** — everything is measurable and traceable
|
||||||
|
- **Composable** — every part can be reused or replaced
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Physical Infrastructure (Hetzner)

### Nodes

- **Provider:** Hetzner
- **Type:** Dedicated servers
- **Architecture:** x86_64
- **GPU:** None (CPU-only by design)
- **Storage:** Local NVMe / SSD

### Node Roles (Logical)

| Node Type | Responsibilities |
|---------|------------------|
| Control / Worker | Kubernetes control plane + workloads |
| Storage-heavy | Databases, MinIO, observability data |
| Compute-heavy | LLM inference, embeddings, pipelines |

> The cluster is intentionally **single-region and on-prem-like**, optimized for predictability and data locality.

---
## 3. Kubernetes Infrastructure Configuration

### Kubernetes

- Runtime for **all services**
- Namespaced isolation
- Explicit PersistentVolumeClaims
- Declarative configuration (GitOps)

### Namespaces (Conceptual)

| Namespace | Purpose |
|--------|--------|
| `ai` | LLMs, embeddings, labeling |
| `vec` | Vector database |
| `graph` | Knowledge graph |
| `db` | Relational databases |
| `storage` | Object storage |
| `mlops` | MLflow |
| `ml` | Argo Workflows |
| `auth` | Keycloak |
| `observability` | LGTM stack |
| `hadith` | Custom apps (orchestrator, UI) |
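The namespace-plus-explicit-PVC convention can be sketched as a single manifest. This is a hedged illustration only: the `local-path` storage class and the `loki-data` / `50Gi` values are assumptions, not the cluster's real configuration.

```shell
# Prints a minimal namespace + PVC manifest in the declarative style described
# above. storageClassName "local-path" and the 50Gi size are assumptions.
MANIFEST=$(cat <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: observability
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-data
  namespace: observability
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 50Gi
EOF
)
echo "$MANIFEST"
# kubectl apply -f <(echo "$MANIFEST")   # requires cluster access
```

In the GitOps flow, a manifest like this lives in the repo and is applied declaratively rather than created imperatively.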
---
## 4. Networking & DNS

### Ingress

- **NGINX Ingress Controller**
- HTTPS termination at ingress
- Internal services communicate via ClusterIP

### TLS

- **cert-manager**
- Let’s Encrypt
- Automatic renewal

### Public Endpoints

| URL | Service |
|----|--------|
| https://llm.betelgeusebytes.io | LLM inference (Ollama / llama.cpp) |
| https://embeddings.betelgeusebytes.io | Text Embeddings Inference |
| https://vector.betelgeusebytes.io | Qdrant + UI |
| https://neo4j.betelgeusebytes.io | Neo4j Browser |
| https://hadith-api.betelgeusebytes.io | FastAPI Orchestrator |
| https://hadith-admin.betelgeusebytes.io | Admin / Curation UI |
| https://label.betelgeusebytes.io | Label Studio |
| https://mlflow.betelgeusebytes.io | MLflow |
| https://minio.betelgeusebytes.io | MinIO Console |
| https://argo.betelgeusebytes.io | Argo Workflows |
| https://auth.betelgeusebytes.io | Keycloak |
| https://grafana.betelgeusebytes.io | Grafana |
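Each public endpoint above is typically one Ingress object with a cert-manager annotation; TLS terminates at NGINX and traffic continues to a ClusterIP service. A sketch for the Grafana endpoint, where the issuer name `letsencrypt-prod`, the secret name, and the backend port are illustrative assumptions:

```shell
# Prints an Ingress sketch wiring one hostname from the table above through
# NGINX with a cert-manager-issued certificate. Issuer/secret/port are assumed.
INGRESS=$(cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: observability
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [grafana.betelgeusebytes.io]
      secretName: grafana-tls
  rules:
    - host: grafana.betelgeusebytes.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
EOF
)
echo "$INGRESS"
```

cert-manager watches the annotation, obtains the Let’s Encrypt certificate, stores it in the named secret, and renews it automatically.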
---
## 5. Full Logical Architecture

```mermaid
flowchart LR
    User --> AdminUI --> Orchestrator

    Orchestrator --> LLM
    Orchestrator --> TEI --> Qdrant
    Orchestrator --> Neo4j
    Orchestrator --> PostgreSQL
    Orchestrator --> Redis

    LabelStudio --> MinIO
    MinIO --> ArgoWF --> MLflow
    MLflow --> Models --> Orchestrator

    Kafka --> ArgoWF

    Alloy --> Prometheus --> Grafana
    Alloy --> Loki --> Grafana
    Alloy --> Tempo --> Grafana
```

---
## 6. AI & Reasoning Layer

### Ollama / llama.cpp (CPU LLM)

**Current usage**

- JSON-structured extraction
- Sanad / matn reasoning
- Deterministic outputs
- No GPU dependency

**Future usage**

- Offline assistants
- Document intelligence
- Agent frameworks
- Replaceable by vLLM when GPUs are added

### Text Embeddings Inference (TEI)

**Current usage**

- Embeddings for hadith texts and biographies

**Future usage**

- RAG systems
- Semantic search
- Deduplication
- Similarity clustering

### Qdrant (Vector Database)

**Current usage**

- Stores embeddings
- Similarity search

**Future usage**

- Recommendation systems
- Agent memory
- Multimodal retrieval

Includes a Web UI.
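The TEI → Qdrant retrieval path can be sketched as two HTTP calls: embed a query, then search with the resulting vector. The `/embed` and `/collections/…/points/search` paths follow the public TEI and Qdrant HTTP APIs, but the `hadith` collection name is an assumption; the curl calls are left commented because they need cluster access.

```shell
# Build the TEI request body for a query (this part runs anywhere); the
# actual embed + search calls are shown as commented usage below.
QUERY="teacher of Malik ibn Anas"
TEI_BODY=$(printf '{"inputs":["%s"]}' "$QUERY")
echo "$TEI_BODY"

# VEC=$(curl -s https://embeddings.betelgeusebytes.io/embed \
#     -H 'Content-Type: application/json' -d "$TEI_BODY" | jq -c '.[0]')
# curl -s https://vector.betelgeusebytes.io/collections/hadith/points/search \
#     -H 'Content-Type: application/json' \
#     -d "{\"vector\":$VEC,\"limit\":5}"
```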
---

## 7. Knowledge & Data Layer

### Neo4j (Graph Database)

**Current usage**

- Isnād chains
- Narrator relationships

**Future usage**

- Knowledge graphs
- Trust networks
- Provenance systems

### PostgreSQL

**Current usage**

- App data
- MLflow backend
- Label Studio DB

**Future usage**

- Feature stores
- Metadata catalogs
- Transactional apps

### Redis

**Current usage**

- Caching
- Temporary state

**Future usage**

- Job queues
- Rate limiting
- Sessions

### Kafka

**Current usage**

- Optional async backbone

**Future usage**

- Streaming ingestion
- Event-driven ML
- Audit pipelines

### MinIO (S3)

**Current usage**

- Datasets
- Model artifacts
- Pipeline outputs

**Future usage**

- Data lake
- Backups
- Feature storage
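An isnād chain in Neo4j reduces to narrator nodes linked by a teacher → student relationship. A hedged Cypher sketch, where the `Narrator` label, `name` property, and `NARRATED_TO` relationship type are illustrative, not the deployed schema:

```shell
# Prints a Cypher sketch of the isnād model described above; labels,
# properties, and relationship names are assumptions.
CYPHER=$(cat <<'EOF'
MERGE (t:Narrator {name: "Nafi"})
MERGE (s:Narrator {name: "Malik ibn Anas"})
MERGE (t)-[:NARRATED_TO]->(s);

// All teachers of a given narrator
MATCH (t:Narrator)-[:NARRATED_TO]->(s:Narrator {name: "Malik ibn Anas"})
RETURN t.name;
EOF
)
echo "$CYPHER"
```

Queries like the `MATCH` above are what make the graph verifiable: every edge can be traced back to the source text it was extracted from.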
---

## 8. MLOps & Human-in-the-Loop

### Label Studio

**Current usage**

- Human annotation of narrators & relations

**Future usage**

- Any labeling task (text, image, audio)

### MLflow

**Current usage**

- Experiment tracking
- Model registry

**Future usage**

- Governance
- Model promotion
- Auditing

### Argo Workflows

**Current usage**

- ETL & training pipelines

**Future usage**

- Batch inference
- Scheduled automation
- Data engineering
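An ETL or training pipeline in Argo Workflows is a `Workflow` resource with named templates. A minimal sketch, where the `hadith-etl-` name, namespace, image, and single step are placeholders rather than a real pipeline from the repo:

```shell
# Prints a minimal Argo Workflow manifest; all names and the image are
# illustrative placeholders.
WORKFLOW=$(cat <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hadith-etl-
  namespace: ml
spec:
  entrypoint: etl
  templates:
    - name: etl
      container:
        image: python:3.12-slim
        command: [python, -c, "print('extract, transform, load')"]
EOF
)
echo "$WORKFLOW"
# kubectl create -f <(echo "$WORKFLOW")   # or: argo submit, with cluster access
```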
---

## 9. Authentication & Security

### Keycloak

**Current usage**

- SSO for Admin UI, MLflow, Label Studio

**Future usage**

- API authentication
- Multi-tenant access
- Organization-wide IAM
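For API authentication, a backend service would obtain a token from Keycloak machine-to-machine via the OIDC client-credentials grant. Only the token-endpoint path is standard Keycloak; the `platform` realm and `hadith-orchestrator` client are assumptions for illustration.

```shell
# Builds the Keycloak token-endpoint URL; the realm name is an assumption.
REALM=platform
TOKEN_URL="https://auth.betelgeusebytes.io/realms/${REALM}/protocol/openid-connect/token"
echo "$TOKEN_URL"

# curl -s "$TOKEN_URL" \
#     -d grant_type=client_credentials \
#     -d client_id=hadith-orchestrator \
#     -d client_secret="$CLIENT_SECRET"
```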
---

## 10. Observability Stack (LGTM)

### Components

- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter

### Capabilities

- Metrics, logs, traces
- Automatic correlation
- OTLP-native
- Local SSD persistence
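Concretely, the logs and metrics above are reached through Grafana Explore with LogQL and PromQL. Two example queries; the `namespace` label matches what the Loki test script in this commit queries, and the `ai` namespace filter is illustrative:

```shell
# Prints example LogQL and PromQL queries for the stack described above.
QUERIES=$(cat <<'EOF'
# Loki (LogQL): error logs from one namespace
{namespace="observability"} |= "error"

# Prometheus (PromQL): per-pod CPU in the ai namespace
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="ai"}[5m]))
EOF
)
echo "$QUERIES"
```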
---

## 11. Design Rules for All Custom Services

All services must:

- be stateless
- use env vars & Kubernetes Secrets
- authenticate via Keycloak
- emit:
  - Prometheus metrics
  - OTLP traces
  - structured JSON logs
- be deployable via kubectl & Argo CD
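A minimal sketch of the "structured JSON logs" rule: services write one JSON object per line to stdout, so Alloy can ship them to Loki and fields stay queryable. The field names (`ts`, `level`, `msg`) are illustrative, not a mandated schema.

```shell
# Emit one structured JSON log line per event to stdout. Field names are
# illustrative assumptions, not the platform's required schema.
log_json() {
    # $1 = level, $2 = message; timestamp is RFC 3339 UTC
    printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}

log_json info "orchestrator started"
log_json error "embedding request failed"
```

In Grafana Explore these lines can then be filtered with LogQL (e.g. by the `level` field) instead of free-text matching.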
---

## 12. Future Use Cases (Beyond Hadith)

This platform can support:

- General Knowledge Graph AI
- Legal / scholarly document analysis
- Enterprise RAG systems
- Research data platforms
- Explainable AI systems
- Internal search engines
- Agent-based systems
- Provenance & trust scoring engines
- Digital humanities projects
- Offline sovereign AI deployments
#!/bin/bash

set -e

GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Loki Log Collection Test${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

PASS=0
FAIL=0

# Test 1: Check Alloy DaemonSet
echo -e "${YELLOW}Test 1: Checking Alloy DaemonSet...${NC}"
if kubectl get pods -n observability -l app=alloy --no-headers 2>/dev/null | grep -q "Running"; then
    ALLOY_COUNT=$(kubectl get pods -n observability -l app=alloy --no-headers | grep -c "Running")
    echo -e "${GREEN}✓ Alloy is running ($ALLOY_COUNT pod(s))${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Alloy is not running${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 2: Check Loki pod
echo -e "${YELLOW}Test 2: Checking Loki pod...${NC}"
if kubectl get pods -n observability -l app=loki --no-headers 2>/dev/null | grep -q "Running"; then
    echo -e "${GREEN}✓ Loki is running${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Loki is not running${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 3: Test Loki readiness endpoint
echo -e "${YELLOW}Test 3: Testing Loki readiness endpoint...${NC}"
READY=$(kubectl run test-loki-ready-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 5 http://loki.observability.svc.cluster.local:3100/ready 2>/dev/null || echo "failed")

if [ "$READY" = "ready" ]; then
    echo -e "${GREEN}✓ Loki is ready${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Loki is not ready (response: $READY)${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 4: Check Alloy can connect to Loki
echo -e "${YELLOW}Test 4: Checking Alloy → Loki connectivity...${NC}"
ALLOY_ERRORS=$(kubectl logs -n observability -l app=alloy --tail=50 2>/dev/null | grep -i "error.*loki" | wc -l)
if [ "$ALLOY_ERRORS" -eq 0 ]; then
    echo -e "${GREEN}✓ No Alloy → Loki connection errors${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Found $ALLOY_ERRORS error(s) in Alloy logs${NC}"
    kubectl logs -n observability -l app=alloy --tail=20 | grep -i error
    FAIL=$((FAIL+1))
fi
echo ""

# Test 5: Create test pod and verify logs
echo -e "${YELLOW}Test 5: Creating test pod and verifying log collection...${NC}"

# Clean up any existing test pod
kubectl delete pod test-logger-verify --ignore-not-found 2>/dev/null

# Create test pod
echo "  Creating test pod that logs every second..."
kubectl run test-logger-verify --image=busybox --restart=Never -- sh -c \
    'for i in 1 2 3 4 5 6 7 8 9 10; do echo "LOKI-TEST-LOG: Message number $i at $(date)"; sleep 1; done' \
    >/dev/null 2>&1

# Wait for pod to start and generate logs
echo "  Waiting 15 seconds for logs to be collected..."
sleep 15

# Query Loki API for test logs
echo "  Querying Loki for test logs..."
START_TIME=$(date -u -d '2 minutes ago' +%s)000000000
END_TIME=$(date -u +%s)000000000

QUERY_RESULT=$(kubectl run test-loki-query-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={pod="test-logger-verify"}' \
    --data-urlencode "start=$START_TIME" \
    --data-urlencode "end=$END_TIME" 2>/dev/null || echo "failed")

if echo "$QUERY_RESULT" | grep -q "LOKI-TEST-LOG"; then
    LOG_COUNT=$(echo "$QUERY_RESULT" | grep -o "LOKI-TEST-LOG" | wc -l)
    echo -e "${GREEN}✓ Found $LOG_COUNT test log messages in Loki${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Test logs not found in Loki${NC}"
    echo "  Response: ${QUERY_RESULT:0:200}"
    FAIL=$((FAIL+1))
fi

# Clean up test pod
kubectl delete pod test-logger-verify --ignore-not-found >/dev/null 2>&1

echo ""

# Test 6: Check observability namespace logs
echo -e "${YELLOW}Test 6: Checking for observability namespace logs...${NC}"

OBS_QUERY=$(kubectl run test-loki-obs-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={namespace="observability"}' \
    --data-urlencode "start=$START_TIME" \
    --data-urlencode "end=$END_TIME" \
    --data-urlencode "limit=10" 2>/dev/null || echo "failed")

if echo "$OBS_QUERY" | grep -q '"values":\[\['; then
    echo -e "${GREEN}✓ Observability namespace logs found in Loki${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ No logs found for observability namespace${NC}"
    FAIL=$((FAIL+1))
fi

echo ""
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Test Results${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

TOTAL=$((PASS+FAIL))
echo -e "Passed: ${GREEN}$PASS${NC} / $TOTAL"
echo -e "Failed: ${RED}$FAIL${NC} / $TOTAL"
echo ""

if [ $FAIL -eq 0 ]; then
    echo -e "${GREEN}✓✓✓ All tests passed! Logs are flowing to Loki! ✓✓✓${NC}"
    echo ""
    echo "Next steps:"
    echo "  1. Open Grafana: https://grafana.betelgeusebytes.io"
    echo "  2. Go to Explore → Loki"
    echo "  3. Query: {namespace=\"observability\"}"
    echo ""
else
    echo -e "${RED}✗✗✗ Some tests failed. Check the output above for details. ✗✗✗${NC}"
    echo ""
    echo "Troubleshooting:"
    echo "  - Check Alloy logs: kubectl logs -n observability -l app=alloy"
    echo "  - Check Loki logs: kubectl logs -n observability loki-0"
    echo "  - Verify services: kubectl get svc -n observability"
    echo "  - See full guide: VERIFY-LOKI-LOGS.md"
    echo ""
    exit 1
fi