Add observability stack and supporting scripts
- Introduced combine.sh script to aggregate .txt, .py, .yml, .yaml, .ini files into betelgeusebytes.txt.
- Updated Loki configuration to disable retention settings.
- Modified Tempo configuration to change storage paths from /tmp to /var.
- Refactored Alloy configuration to streamline Prometheus integration and removed unnecessary metrics export.
- Enhanced RBAC permissions to include pod log access.
- Added security context to Tempo deployment for improved security.
- Created README_old.md for documentation of the observability stack.
- Developed me.md as an authoritative guide for the AI infrastructure stack.
- Implemented test-loki-logs.sh script to validate Loki log collection and connectivity.
This commit is contained in:
parent
dfdd36db3f
commit
404deb1d52
@ -0,0 +1,93 @@
# BetelgeuseBytes – Architecture Overview

## High-Level Architecture

This platform is a **self-hosted, production-grade Kubernetes stack** designed for:

* AI / ML experimentation and serving
* Data engineering & observability
* Knowledge graphs & vector search
* Automation, workflows, and research tooling

The architecture follows a **hub-and-spoke model**:

* **Core Infrastructure**: Kubernetes + networking + storage
* **Platform Services**: databases, messaging, auth, observability
* **ML / AI Services**: labeling, embeddings, LLM serving, notebooks
* **Automation & Workflows**: Argo Workflows, n8n
* **Access Layer**: DNS, Ingress, TLS

---

## Logical Architecture Diagram (Textual)

```
Internet
   │
   ▼
DNS (betelgeusebytes.io)
   │
   ▼
Ingress-NGINX (TLS via cert-manager)
   │
   ├── Platform UIs (Grafana, Kibana, Gitea, Neo4j, MinIO, etc.)
   ├── ML UIs (Jupyter, Label Studio, MLflow)
   ├── Automation (n8n, Argo)
   └── APIs (Postgres TCP, Neo4j Bolt, Kafka)

Kubernetes Cluster
   ├── Control Plane
   ├── Worker Nodes
   ├── Stateful Workloads (local SSD)
   └── Observability Stack
```

---

## Key Design Principles

* **Bare‑metal friendly** (Hetzner dedicated servers)
* **Local SSD storage** for stateful workloads
* **Everything observable** (logs, metrics, traces)
* **CPU-first ML** with optional GPU expansion
* **Single-tenant but multi-project ready**

---

## Networking

* Cilium CNI (eBPF-based networking)
* NGINX Ingress Controller
* TCP services exposed via Ingress patch (Postgres, Neo4j Bolt)
* WireGuard mesh between nodes
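The TCP exposure can be sketched via ingress-nginx's `tcp-services` ConfigMap mechanism; the namespace/service targets and external ports below are assumptions for illustration, not the repo's actual manifest:

```shell
# Hypothetical sketch: ingress-nginx reads a ConfigMap that maps external
# ports to namespace/service:port targets. Service names are assumed.
cat > tcp-services.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "5432": "db/postgres:5432"
  "7687": "graph/neo4j:7687"
EOF
# kubectl apply -f tcp-services.yaml
# (the controller must run with --tcp-services-configmap=ingress-nginx/tcp-services)
echo "wrote tcp-services.yaml"
```

The same mechanism extends to any other raw TCP service (e.g. Kafka), one external port per entry.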

---

## Security Model

* TLS everywhere (cert-manager + Let’s Encrypt)
* Namespace isolation per domain (db, ml, graph, observability…)
* Secrets stored in Kubernetes Secrets
* Optional Basic Auth on sensitive UIs
* Keycloak available for future SSO

---

## Scalability Notes

* Currently single control-plane + workers
* Designed to add:
  * More workers
  * Dedicated control-plane VPS nodes
  * GPU nodes (for vLLM / training)

---

## What This Enables

* Research platforms
* Knowledge graph + LLM pipelines
* End-to-end ML lifecycle
* Automated data pipelines
* Production observability-first apps
@ -0,0 +1,46 @@
# Deployment & Operations Guide

## Deployment Model

* Declarative Kubernetes manifests
* Applied via `kubectl` or Argo CD
* No Helm dependency

---

## General Rules

* Stateless apps by default
* PVCs required for state
* Secrets via Kubernetes Secrets
* Config via environment variables

---

## Deployment Order (Recommended)

1. Networking (Cilium, Ingress)
2. cert-manager
3. Storage (PVs)
4. Databases (Postgres, Redis, Kafka)
5. Observability stack
6. ML tooling
7. Automation tools
8. Custom applications
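The order above could be driven by a small helper script; the directory names are assumptions modeled on the `k8s/` paths used elsewhere in this repo:

```shell
# Hypothetical apply-in-order helper; paths are illustrative, not authoritative.
ORDER="
k8s/cilium/
k8s/ingress-nginx/
k8s/cert-manager/
k8s/storage/
k8s/postgres/ k8s/redis/ k8s/kafka/
k8s/observability/
k8s/jupyter/ k8s/mlflow/
k8s/argo/ k8s/n8n/
"
for path in $ORDER; do
  # print rather than apply, so the plan can be reviewed before running for real
  echo "kubectl apply -f $path"
done
```

Dropping the `echo` turns the dry run into an actual ordered rollout.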

---

## Operations

* Monitor via Grafana
* Debug via logs & traces
* Upgrade via Git commits
* Rollback via Argo CD

---

## Backup Strategy

* MinIO buckets versioned
* Database snapshots
* Git repositories mirrored
@ -0,0 +1,34 @@
# Future Use Cases & Projects

This platform is intentionally **general‑purpose**.

## AI & ML

* RAG platforms
* Offline assistants
* Agent systems
* NLP research

## Knowledge Graphs

* Academic citation graphs
* Trust & provenance systems
* Dependency analysis

## Data Platforms

* Event‑driven ETL
* Feature stores
* Research data lakes

## Observability & Ops

* Internal platform monitoring
* Security analytics
* Audit systems

## Sovereign Deployments

* On‑prem AI for enterprises
* NGO / government tooling
* Privacy‑preserving analytics
@ -0,0 +1,102 @@
# BetelgeuseBytes – Infrastructure & Cluster Configuration

## Hosting Provider

* **Provider**: Hetzner
* **Server Type**: Dedicated servers
* **Region**: EU
* **Network**: Private LAN + WireGuard

---

## Nodes

### Current Nodes

| Node | Role | Notes |
| --------- | ---------------------- | ------------------- |
| hetzner-1 | control-plane + worker | runs core workloads |
| hetzner-2 | worker + storage | hosts local SSD PVs |

---

## Kubernetes Setup

* Kubernetes installed via kubeadm
* Single cluster
* Control plane is also schedulable

### CNI

* **Cilium**
  * eBPF dataplane
  * kube-proxy replacement
  * Network policy support

---

## Storage

### Persistent Volumes

* Backed by **local NVMe / SSD**
* Manually provisioned PVs
* Bound via PVCs

### Storage Layout

```
/mnt/local-ssd/
├── postgres/
├── neo4j/
├── elasticsearch/
├── prometheus/
├── loki/
├── tempo/
├── grafana/
├── minio/
└── qdrant/
```
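A manually provisioned PV on this layout might look like the following sketch; the capacity, `storageClassName`, and the hetzner-2 pinning are assumptions, not taken from the repo's manifests:

```shell
# Hypothetical local PersistentVolume for the postgres directory above,
# pinned to the storage node via nodeAffinity. All values are illustrative.
cat > pv-postgres.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/local-ssd/postgres
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["hetzner-2"]
EOF
echo "wrote pv-postgres.yaml"
```

A PVC with the same `storageClassName` and a matching size then binds to this volume; the nodeAffinity block is what makes Kubernetes schedule the consuming pod onto the node that actually holds the disk.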

---

## Networking

* Ingress Controller: nginx
* External DNS records → ingress IP
* TCP mappings for:
  * PostgreSQL
  * Neo4j Bolt

---

## TLS & Certificates

* cert-manager
* ClusterIssuer: Let’s Encrypt
* Automatic renewal

---

## Namespaces

| Namespace | Purpose |
| ------------- | ---------------------------------- |
| db | Databases (Postgres, Redis) |
| graph | Neo4j |
| broker | Kafka |
| ml | ML tooling (Jupyter, Argo, MLflow) |
| observability | Grafana, Prometheus, Loki, Tempo |
| automation | n8n |
| devops | Gitea, Argo CD |

---

## What This Infra Enables

* Full on‑prem AI platform
* Predictable performance
* Low-latency data access
* Independence from cloud providers
@ -0,0 +1,32 @@
# 🔭 Observability Stack

---

## Components

- Grafana
- Prometheus
- Loki
- Tempo
- Grafana Alloy
- kube-state-metrics
- node-exporter

---

## Capabilities

- Logs ↔ traces ↔ metrics correlation
- OTLP-native instrumentation
- Centralized dashboards
- Alerting-ready

---

## Instrumentation Rules

All apps must:

- expose `/metrics`
- emit structured JSON logs
- export OTLP traces
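A minimal conformance check for these rules could look like this sketch; the `/metrics` probe assumes the standard Prometheus text format, and the log check is only a crude shape heuristic:

```shell
# Hypothetical smoke checks for the instrumentation rules above.
check_metrics() {
  # a Prometheus text-format endpoint normally serves "# HELP"/"# TYPE" lines
  curl -fsS "$1/metrics" | grep -q '^# HELP' \
    && echo "metrics: ok" || echo "metrics: FAIL"
}
is_json_log() {
  # crude structured-log check: the line should look like one JSON object
  case "$1" in
    '{'*'}') echo "log: ok" ;;
    *)       echo "log: FAIL" ;;
  esac
}
is_json_log '{"level":"info","msg":"service started"}'   # prints "log: ok"
```

`check_metrics http://my-service:8000` can be run in-cluster against any new deployment before it is considered observable.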
148
README.md
@ -1,43 +1,123 @@
```diff
-# BetelgeuseBytes K8s — Full Stack (kubectl-only)
+# 🧠 BetelgeuseBytes AI Platform — Documentation

-**Nodes**
-- Control-plane + worker: hetzner-1 (95.217.89.53)
-- Worker: hetzner-2 (138.201.254.97)
+This documentation describes a **self-hosted, CPU-first AI platform** running on Kubernetes,
+designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.

-## Bring up the cluster
-```bash
-ansible -i ansible/inventories/prod/hosts.ini all -m ping
-ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
-```
+## 📚 Documentation Index

-## Apply apps (edit secrets first)
-```bash
-kubectl apply -f k8s/00-namespaces.yaml
-kubectl apply -f k8s/01-secrets/
-kubectl apply -f k8s/storage/storageclass.yaml
+- [Architecture](ARCHITECTURE.md)
+- [Infrastructure](INFRASTRUCTURE.md)
+- [Full Stack Overview](STACK.md)
+- [Deployment & Operations](DEPLOYMENT.md)
+- [Observability](OBSERVABILITY.md)
+- [Roadmap & Next Steps](ROADMAP.md)
+- [Future Projects & Use Cases](FUTURE-PROJECTS.md)

-kubectl apply -f k8s/postgres/
-kubectl apply -f k8s/redis/
-kubectl apply -f k8s/elastic/elasticsearch.yaml
-kubectl apply -f k8s/elastic/kibana.yaml
+## 🎯 Current Focus

-kubectl apply -f k8s/gitea/
-kubectl apply -f k8s/jupyter/
-kubectl apply -f k8s/kafka/kafka.yaml
-kubectl apply -f k8s/kafka/kafka-ui.yaml
-kubectl apply -f k8s/neo4j/
+- Hadith sanad & matn extraction
+- Narrator relationship modeling
+- Knowledge graph construction
+- Human-in-the-loop verification
+- Explainable, sovereign AI

-kubectl apply -f k8s/otlp/
-kubectl apply -f k8s/observability/fluent-bit.yaml
-kubectl apply -f k8s/prometheus/
-kubectl apply -f k8s/grafana/
-```
+## 🧠 What each document gives you
+### ARCHITECTURE

-## DNS
-A records:
-- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97
+- Logical system architecture

-CNAMEs → apps.betelgeusebytes.io:
-- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.
+- Data & control flow

-(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.
+- Networking and security model
+
+- Design principles (CPU-first, sovereign, observable)
+
+- What the architecture enables long-term
+
+This is what you show to **architects and senior engineers.**
+
+### INFRASTRUCTURE
+
+- Hetzner setup (dedicated, CPU-only, SSD)
+- Node roles and responsibilities
+- Kubernetes topology
+- Cilium networking
+- Storage layout on disk
+- Namespaces and isolation strategy
+
+This is what you show to **ops / SRE / infra people.**
+
+### STACK
+
+- Exhaustive list of every deployed component
+- Grouped by domain:
+  - Core platform
+  - Databases & messaging
+  - Knowledge & vectors
+  - ML & AI
+  - Automation & DevOps
+  - Observability
+  - Authentication
+
+For each: **what it does now + what it can be reused for**
+
+This is the **master mental model** of your platform.
+
+### DEPLOYMENT
+
+- How the platform is deployed (kubectl + GitOps)
+- Deployment order
+- Operational rules
+- Backup strategy
+- Day-2 operations mindset
+
+This is your ***runbook starter.***
+
+### ROADMAP
+
+- Clear technical phases:
+  - Neo4j isnād schema
+  - Authenticity scoring
+  - Productization
+  - Scaling (GPU, multi-project)
+
+This keeps the project ***directionally sane.***
+
+### FUTURE-PROJECTS
+
+- Explicitly documents that this is **not just a Hadith stack**
+- Lists realistic reuse cases:
+  - RAG
+  - Knowledge graphs
+  - Sovereign AI
+  - Digital humanities
+  - Research platforms
+
+This justifies the ***investment in infra quality.***
```
@ -0,0 +1,43 @@
# BetelgeuseBytes K8s — Full Stack (kubectl-only)

**Nodes**
- Control-plane + worker: hetzner-1 (95.217.89.53)
- Worker: hetzner-2 (138.201.254.97)

## Bring up the cluster
```bash
ansible -i ansible/inventories/prod/hosts.ini all -m ping
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

## Apply apps (edit secrets first)
```bash
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/01-secrets/
kubectl apply -f k8s/storage/storageclass.yaml

kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

## DNS
A records:
- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97

CNAMEs → apps.betelgeusebytes.io:
- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.

(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.
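The DNS layout above can be spot-checked from any machine; this sketch assumes `dig` (bind-utils) is installed:

```shell
# Verify the DNS layout: one A record set plus CNAMEs pointing at apps.
HOSTS="gitea kibana grafana prometheus notebook broker neo4j otlp"
if command -v dig >/dev/null; then
  dig +short apps.betelgeusebytes.io A
  for host in $HOSTS; do
    echo "$host: $(dig +short "$host.betelgeusebytes.io" CNAME)"
  done
else
  echo "dig not installed; skipping DNS check"
fi
```

Each CNAME line should print `apps.betelgeusebytes.io.`; an empty result means the record is missing or not yet propagated.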
@ -0,0 +1,26 @@
# Roadmap & Next Steps

## Phase 1 – Knowledge Modeling

* Design Neo4j isnād schema
* Identity resolution
* Relationship typing

## Phase 2 – Authenticity Scoring

* Chain continuity analysis
* Narrator reliability
* Graph‑based scoring
* LLM‑assisted reasoning

## Phase 3 – Productization

* Admin dashboards
* APIs
* Provenance visualization

## Phase 4 – Scale & Extend

* GPU nodes
* vLLM integration
* Multi‑project tenancy
@ -0,0 +1,153 @@
# 🧠 BetelgeuseBytes – Full Stack Catalog

This document lists **every major component deployed in the cluster**, what it is used for today, and what it can be reused for.

---

## Core Platform

| Component | Namespace | Purpose | Reuse |
| ------------- | ------------- | --------------- | --------------- |
| Kubernetes | all | Orchestration | Any platform |
| Cilium | kube-system | Networking | Secure clusters |
| NGINX Ingress | ingress-nginx | Traffic routing | API gateway |
| cert-manager | cert-manager | TLS automation | PKI |

---

## Databases & Messaging

| Component | URL / Access | Purpose | Reuse |
| ------------- | --------------- | --------------- | ---------------- |
| PostgreSQL | TCP via Ingress | Relational DB | App backends |
| Redis | internal | Cache | Queues |
| Kafka | kafka-ui UI | Event streaming | Streaming ETL |
| Elasticsearch | Kibana UI | Search + logs | Full‑text search |

---

## Knowledge & Vector

| Component | URL | Purpose | Reuse |
| --------- | ------------------------- | --------------- | --------------- |
| Neo4j | neo4j.betelgeusebytes.io | Knowledge graph | Graph analytics |
| Qdrant | vector.betelgeusebytes.io | Vector search | RAG |

---

## ML & AI

| Component | URL | Purpose | Reuse |
| ------------ | ----------------------------- | --------------- | ---------------- |
| Jupyter | notebook UI | Experiments | Research |
| Label Studio | label.betelgeusebytes.io | Annotation | Dataset creation |
| MLflow | mlflow.betelgeusebytes.io | Model tracking | MLOps |
| Ollama / LLM | llm.betelgeusebytes.io | LLM inference | Agents |
| Embeddings | embeddings.betelgeusebytes.io | Text embeddings | Semantic search |

---

## Automation & DevOps

| Component | URL | Purpose | Reuse |
| -------------- | ----------------------- | ------------------- | ----------- |
| Argo Workflows | argo.betelgeusebytes.io | Pipelines | ETL |
| Argo CD | argocd UI | GitOps | CI/CD |
| Gitea | gitea UI | Git hosting | SCM |
| n8n | automation UI | Workflow automation | Integration |

---

## Observability (LGTM)

| Component | Purpose | Reuse |
| ---------- | --------------- | ---------------------- |
| Grafana | Dashboards | Ops center |
| Prometheus | Metrics | Monitoring |
| Loki | Logs | Debugging |
| Tempo | Traces | Distributed tracing |
| Alloy | Telemetry agent | Standardized telemetry |

---

## Authentication

| Component | Purpose | Reuse |
| --------- | ---------- | ----- |
| Keycloak | OIDC / SSO | IAM |

---

## Why This Stack Matters

* Covers **data → ML → serving → observability** end‑to‑end
* Suitable for research **and** production
* Modular and future‑proof

# 📚 Stack Catalog — Services, URLs, Access & Usage

This document lists **every deployed component**, how to access it, what it is used for **now**, and what it enables **in the future**.

---

## 🌐 Public Services (Ingress / HTTPS)

| Component | URL | Auth | What It Is | Current Usage | Future Usage |
|--------|-----|------|------------|---------------|--------------|
| LLM Inference | https://llm.betelgeusebytes.io | none / internal | CPU LLM server (Ollama / llama.cpp) | Extract sanad & matn as JSON | Agents, doc AI, RAG |
| Embeddings | https://embeddings.betelgeusebytes.io | none / internal | Text Embeddings Inference (HF) | Hadith & bio embeddings | Semantic search |
| Vector DB | https://vector.betelgeusebytes.io | none | Qdrant + UI | Similarity search | Recommendations |
| Graph DB | https://neo4j.betelgeusebytes.io | Basic Auth | Neo4j Browser | Isnād graph | Knowledge graphs |
| Orchestrator | https://hadith-api.betelgeusebytes.io | OIDC | FastAPI router | Core AI API | Any AI backend |
| Admin UI | https://hadith-admin.betelgeusebytes.io | OIDC | Next.js UI | Scholar review | Any internal tool |
| Labeling | https://label.betelgeusebytes.io | Local / OIDC | Label Studio | NER/RE annotation | Dataset curation |
| ML Tracking | https://mlflow.betelgeusebytes.io | OIDC | MLflow UI | Experiments & models | Governance |
| Object Storage | https://minio.betelgeusebytes.io | Access key | MinIO Console | Datasets & artifacts | Data lake |
| Pipelines | https://argo.betelgeusebytes.io | SA / OIDC | Argo Workflows UI | ML pipelines | ETL |
| Auth | https://auth.betelgeusebytes.io | Admin login | Keycloak | SSO & tokens | IAM |
| Observability | https://grafana.betelgeusebytes.io | Login | Grafana | Metrics/logs/traces | Ops center |

---

## 🔐 Authentication & Access Summary

| System | Auth Method | Who Uses It |
|-----|------------|-------------|
| Keycloak | Username / Password | Admins |
| Admin UI | OIDC (Keycloak) | Scholars |
| Orchestrator API | OIDC Bearer Token | Apps |
| MLflow | OIDC | ML engineers |
| Label Studio | Local / OIDC | Annotators |
| Neo4j | Basic Auth | Engineers |
| MinIO | Access / Secret key | Pipelines |
| Grafana | Login | Operators |

---

## 🧠 Internal Cluster Services (ClusterIP)

| Component | Namespace | Purpose |
|--------|-----------|--------|
| PostgreSQL | db | Relational storage |
| Redis | db | Cache / temp state |
| Kafka | broker | Event backbone |
| Prometheus | observability | Metrics |
| Loki | observability | Logs |
| Tempo | observability | Traces |
| Alloy | observability | Telemetry agent |

---

## 🗂 Storage Responsibilities

| Storage | Used By | Contains |
|------|--------|---------|
| MinIO | Pipelines, MLflow | Datasets, models |
| Neo4j PVC | Graph DB | Isnād graph |
| Qdrant PVC | Vector DB | Embeddings |
| PostgreSQL PVC | DB | Metadata |
| Observability PVCs | LGTM | Logs, metrics, traces |
File diff suppressed because it is too large
@ -0,0 +1,5 @@
# Aggregate text/config sources into one file. The -o tests are parenthesized
# so that -type f applies to all of them (without the parens, only *.txt was
# restricted to files), and the output file itself is excluded from the scan.
find . -type f ! -name "betelgeusebytes.txt" \( -name "*.txt" -o -name "*.py" -o -name "*.yml" -o -name "*.yaml" -o -name "*.YAML" -o -name "*.ini" \) | while IFS= read -r file; do
  echo "=== $file ===" >> betelgeusebytes.txt
  cat "$file" >> betelgeusebytes.txt
  echo "" >> betelgeusebytes.txt
done
@ -43,12 +43,9 @@ data:
```diff
   compactor:
     working_directory: /loki/compactor
     compaction_interval: 10m
-    retention_enabled: true
-    retention_delete_delay: 2h
-    retention_delete_worker_count: 150
+    retention_enabled: false

   limits_config:
     enforce_metric_name: false
     reject_old_samples: true
     reject_old_samples_max_age: 168h  # 7 days
     retention_period: 168h  # 7 days
```
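In the spirit of the `test-loki-logs.sh` script this commit adds, Loki connectivity can be sketched with two calls against its HTTP API; the in-cluster URL and the label selector are assumptions:

```shell
# Hypothetical Loki connectivity check; /ready and /loki/api/v1/query_range
# are Loki's standard readiness and range-query HTTP endpoints.
LOKI_URL="${LOKI_URL:-http://loki.observability.svc.cluster.local:3100}"
curl -fsS "$LOKI_URL/ready" >/dev/null 2>&1 \
  && echo "loki: ready" || echo "loki: unreachable"
curl -fsS -G "$LOKI_URL/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="observability"}' \
  --data-urlencode 'limit=5' 2>/dev/null \
  || echo "query failed (no cluster access?)"
```

Run it inside the cluster (or behind a `kubectl port-forward svc/loki 3100` with `LOKI_URL=http://localhost:3100`) so the service DNS name resolves.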
@ -39,7 +39,7 @@ data:
```diff
         source: tempo
         cluster: betelgeuse-k8s
       storage:
-        path: /tmp/tempo/generator/wal
+        path: /var/tempo/generator/wal
         remote_write:
           - url: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
             send_exemplars: true
```
@ -48,17 +48,14 @@ data:
```diff
     trace:
       backend: local
       wal:
-        path: /tmp/tempo/wal
+        path: /var/tempo/wal
       local:
-        path: /tmp/tempo/blocks
+        path: /var/tempo/blocks
       pool:
         max_workers: 100
         queue_depth: 10000

-  querier:
-    frontend_worker:
-      frontend_address: tempo.observability.svc.cluster.local:9095
-
+  # Single instance mode - no need for frontend/querier split
   query_frontend:
     search:
       duration_slo: 5s
```
@ -124,7 +124,6 @@ data:
```diff
     output {
       traces  = [otelcol.exporter.otlp.tempo.input]
-      metrics = [otelcol.exporter.prometheus.metrics.input]
     }
   }
```
@ -138,22 +137,7 @@ data:
```diff
     }
   }

-  // Export OTLP metrics to Prometheus
-  otelcol.exporter.prometheus "metrics" {
-    forward_to = [prometheus.remote_write.local.receiver]
-  }
-
-  // Remote write to Prometheus
-  prometheus.remote_write "local" {
-    endpoint {
-      url = "http://prometheus.observability.svc.cluster.local:9090/api/v1/write"
-    }
-  }
-
-  // Scrape local metrics (Alloy's own metrics)
-  prometheus.scrape "alloy" {
-    targets = [{
-      __address__ = "localhost:12345",
-    }]
-    forward_to = [prometheus.remote_write.local.receiver]
-  }
+  // Prometheus will scrape these via service discovery
+  prometheus.exporter.self "alloy" {
+  }
```
@ -66,6 +66,7 @@ rules:
```diff
         - services
         - endpoints
         - pods
+        - pods/log
       verbs: ["get", "list", "watch"]
     - apiGroups:
       - extensions
```
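With `pods/log` added, the grant can be verified through an impersonated `kubectl auth can-i` check; the service-account name and namespace below are assumptions:

```shell
# Hypothetical verification that the role now permits reading pod logs.
SA="system:serviceaccount:observability:alloy"
if command -v kubectl >/dev/null; then
  # prints "yes" when the bound role allows the verb/resource pair
  kubectl auth can-i get pods/log --as="$SA" || true
else
  echo "kubectl not available; run inside the cluster context"
fi
```

The same check with `--as` pointing at other service accounts is a quick way to confirm the permission was not granted more broadly than intended.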
@ -21,6 +21,11 @@ spec:
```diff
     spec:
       nodeSelector:
         kubernetes.io/hostname: hetzner-2
+      securityContext:
+        fsGroup: 10001
+        runAsGroup: 10001
+        runAsNonRoot: true
+        runAsUser: 10001
       containers:
         - name: tempo
           image: grafana/tempo:2.6.1
```
@ -70,7 +75,7 @@ spec:
```diff
             - name: tempo-config
               mountPath: /etc/tempo
             - name: tempo-data
-              mountPath: /tmp/tempo
+              mountPath: /var/tempo
       volumes:
         - name: tempo-config
           configMap:
```
@ -0,0 +1,388 @@
|
|||
# 🧠 BetelgeuseBytes — Full AI Infrastructure Stack
|
||||
## Authoritative README, Architecture & Onboarding Guide
|
||||
|
||||
This repository documents the **entire self-hosted AI infrastructure stack** running on a Kubernetes cluster hosted on **Hetzner dedicated servers**.
|
||||
|
||||
The stack currently powers an **Islamic Hadith Scholar AI**, but it is intentionally designed as a **general-purpose, sovereign AI, MLOps, and data platform** that can support many future projects.
|
||||
|
||||
This document is the **single source of truth** for:
|
||||
- architecture (logical & physical)
|
||||
- infrastructure configuration
|
||||
- networking & DNS
|
||||
- every deployed component
|
||||
- why each component exists
|
||||
- how to build new systems on top of the platform
|
||||
|
||||
---
|
||||
|
||||
## 1. Mission & Design Philosophy
|
||||
|
||||
### Current Mission
|
||||
Build an AI system that can:
|
||||
|
||||
- Parse classical Islamic texts
|
||||
- Extract **Sanad** (chains of narrators) and **Matn** (hadith text)
|
||||
- Identify narrators and their relationships:
|
||||
- teacher / student
|
||||
- familial lineage
|
||||
- Construct a **verifiable knowledge graph**
|
||||
- Support **human scholarly review**
|
||||
- Provide **transparent and explainable reasoning**
|
||||
- Operate **fully on-prem**, CPU-first, without SaaS or GPU dependency
|
||||
|
||||
### Core Principles
|
||||
- **Sovereignty** — no external cloud lock-in
|
||||
- **Explainability** — graph + provenance, not black boxes
|
||||
- **Human-in-the-loop** — scholars remain in control
|
||||
- **Observability-first** — everything is measurable and traceable
|
||||
- **Composable** — every part can be reused or replaced
|
||||
|
||||
---
|
||||
|
||||
## 2. Physical Infrastructure (Hetzner)
|
||||
|
||||
### Nodes
|
||||
- **Provider:** Hetzner
|
||||
- **Type:** Dedicated servers
|
||||
- **Architecture:** x86_64
|
||||
- **GPU:** None (CPU-only by design)
|
||||
- **Storage:** Local NVMe / SSD
|
||||
|
||||
### Node Roles (Logical)
|
||||
| Node Type | Responsibilities |
|
||||
|---------|------------------|
|
||||
| Control / Worker | Kubernetes control plane + workloads |
|
||||
| Storage-heavy | Databases, MinIO, observability data |
|
||||
| Compute-heavy | LLM inference, embeddings, pipelines |
|
||||
|
||||
> The cluster is intentionally **single-region and on-prem-like**, optimized for predictability and data locality.
|
||||
|
||||
---
|
||||
|
||||
## 3. Kubernetes Infrastructure Configuration
|
||||
|
||||
### Kubernetes
|
||||
- Runtime for **all services**
|
||||
- Namespaced isolation
|
||||
- Explicit PersistentVolumeClaims
|
||||
- Declarative configuration (GitOps)
|
||||
|
||||
### Namespaces (Conceptual)
|
||||
| Namespace | Purpose |
|
||||
|--------|--------|
|
||||
| `ai` | LLMs, embeddings, labeling |
|
||||
| `vec` | Vector database |
|
||||
| `graph` | Knowledge graph |
|
||||
| `db` | Relational databases |
|
||||
| `storage` | Object storage |
|
||||
| `mlops` | MLflow |
|
||||
| `ml` | Argo Workflows |
|
||||
| `auth` | Keycloak |
|
||||
| `observability` | LGTM stack |
|
||||
| `hadith` | Custom apps (orchestrator, UI) |
|
||||
|
||||
---
|
||||
|
||||
## 4. Networking & DNS
|
||||
|
||||
### Ingress
|
||||
- **NGINX Ingress Controller**
|
||||
- HTTPS termination at ingress
|
||||
- Internal services communicate via ClusterIP
|
||||
|
||||
### TLS
|
||||
- **cert-manager**
|
||||
- Let’s Encrypt
|
||||
- Automatic renewal
|
||||
|
||||
### Public Endpoints
|
||||
|
||||
| URL | Service |
|
||||
|----|--------|
|
||||
| https://llm.betelgeusebytes.io | LLM inference (Ollama / llama.cpp) |
|
||||
| https://embeddings.betelgeusebytes.io | Text Embeddings Inference |
|
||||
| https://vector.betelgeusebytes.io | Qdrant + UI |
|
||||
| https://neo4j.betelgeusebytes.io | Neo4j Browser |
|
||||
| https://hadith-api.betelgeusebytes.io | FastAPI Orchestrator |
|
||||
| https://hadith-admin.betelgeusebytes.io | Admin / Curation UI |
|
||||
| https://label.betelgeusebytes.io | Label Studio |
|
||||
| https://mlflow.betelgeusebytes.io | MLflow |
|
||||
| https://minio.betelgeusebytes.io | MinIO Console |
|
||||
| https://argo.betelgeusebytes.io | Argo Workflows |
|
||||
| https://auth.betelgeusebytes.io | Keycloak |
|
||||
| https://grafana.betelgeusebytes.io | Grafana |
|
||||
|
||||
---
|
||||
|
||||
## 5. Full Logical Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
User --> AdminUI --> Orchestrator
|
||||
|
||||
Orchestrator --> LLM
|
||||
Orchestrator --> TEI --> Qdrant
|
||||
Orchestrator --> Neo4j
|
||||
Orchestrator --> PostgreSQL
|
||||
Orchestrator --> Redis
|
||||
|
||||
LabelStudio --> MinIO
|
||||
MinIO --> ArgoWF --> MLflow
|
||||
MLflow --> Models --> Orchestrator
|
||||
|
||||
Kafka --> ArgoWF
|
||||
|
||||
Alloy --> Prometheus --> Grafana
|
||||
Alloy --> Loki --> Grafana
|
||||
Alloy --> Tempo --> Grafana
|
||||
```
|
||||
6. AI & Reasoning Layer
|
||||
Ollama / llama.cpp (CPU LLM)
|
||||
Current usage
|
||||
|
||||
JSON-structured extraction
|
||||
|
||||
Sanad / matn reasoning
|
||||
|
||||
Deterministic outputs
|
||||
|
||||
No GPU dependency
|
||||
|
||||
Future usage
|
||||
|
||||
Offline assistants
|
||||
|
||||
Document intelligence
|
||||
|
||||
Agent frameworks
|
||||
|
||||
Replaceable by vLLM when GPUs are added
|
||||
|
||||
Text Embeddings Inference (TEI)
|
||||
Current usage
|
||||
|
||||
Embeddings for hadith texts and biographies
|
||||
|
||||
Future usage
|
||||
|
||||
RAG systems
|
||||
|
||||
Semantic search
|
||||
|
||||
Deduplication
|
||||
|
||||
Similarity clustering
|
||||
|
||||
Qdrant (Vector Database)
|
||||
Current usage
|
||||
|
||||
Stores embeddings
|
||||
|
||||
Similarity search
|
||||
|
||||
Future usage
|
||||
|
||||
Recommendation systems
|
||||
|
||||
Agent memory
|
||||
|
||||
Multimodal retrieval
|
||||
|
||||
Includes Web UI.
|
||||
|
||||
7. Knowledge & Data Layer
|
||||
Neo4j (Graph Database)
|
||||
Current usage
|
||||
|
||||
Isnād chains
|
||||
|
||||
Narrator relationships
|
||||
|
||||
Future usage
|
||||
|
||||
Knowledge graphs
|
||||
|
||||
Trust networks
|
||||
|
||||
Provenance systems
|
||||
|
||||
PostgreSQL
|
||||
Current usage
|
||||
|
||||
App data
|
||||
|
||||
MLflow backend
|
||||
|
||||
Label Studio DB
|
||||
|
||||
Future usage
|
||||
|
||||
Feature stores
|
||||
|
||||
Metadata catalogs
|
||||
|
||||
Transactional apps
|
||||
|
||||
Redis
|
||||
Current usage
|
||||
|
||||
Caching
|
||||
|
||||
Temporary state
|
||||
|
||||
Future usage
|
||||
|
||||
Job queues
|
||||
|
||||
Rate limiting
|
||||
|
||||
Sessions
|
||||
|
||||
### Kafka

**Current usage**

* Optional async backbone

**Future usage**

* Streaming ingestion
* Event-driven ML
* Audit pipelines
### MinIO (S3)

**Current usage**

* Datasets
* Model artifacts
* Pipeline outputs

**Future usage**

* Data lake
* Backups
* Feature storage
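Datasets and model artifacts can be moved with the MinIO client's standard alias/cp/ls commands. A sketch; the alias name, bucket layout, URL, and the `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY` environment variables are assumptions, not this stack's actual configuration:

```shell
#!/bin/sh
# Hypothetical alias, buckets, and credentials; adjust to your MinIO deployment.
MINIO_URL="${MINIO_URL:-http://minio.data.svc.cluster.local:9000}"
echo "MinIO endpoint: $MINIO_URL"

# Register the deployment under an alias, push a dataset, list model artifacts.
mc alias set betelgeuse "$MINIO_URL" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" \
  && mc cp ./dataset.parquet betelgeuse/datasets/ \
  && mc ls betelgeuse/models/ \
  || echo "MinIO not reachable from here (run inside the cluster)"
```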
---

## 8. MLOps & Human-in-the-Loop

### Label Studio

**Current usage**

* Human annotation of narrators & relations

**Future usage**

* Any labeling task (text, image, audio)

### MLflow

**Current usage**

* Experiment tracking
* Model registry

**Future usage**

* Governance
* Model promotion
* Auditing

### Argo Workflows

**Current usage**

* ETL & training pipelines

**Future usage**

* Batch inference
* Scheduled automation
* Data engineering
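An ETL step in Argo Workflows is just a container template. A minimal single-step sketch, written as a heredoc so it can be piped straight into `kubectl create`; the `argo` namespace and `busybox` image are assumptions:

```shell
#!/bin/sh
# Minimal one-step Workflow; namespace and image are placeholders.
WF=$(cat <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etl-sketch-
  namespace: argo
spec:
  entrypoint: extract
  templates:
    - name: extract
      container:
        image: busybox
        command: [sh, -c]
        args: ["echo extracting hadith batch"]
EOF
)
echo "$WF"

# Submit only when a cluster is reachable from here.
echo "$WF" | kubectl create -f - 2>/dev/null || echo "no cluster reachable (dry sketch)"
```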
---

## 9. Authentication & Security

### Keycloak

**Current usage**

* SSO for Admin UI, MLflow, Label Studio

**Future usage**

* API authentication
* Multi-tenant access
* Organization-wide IAM
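API authentication via Keycloak reduces to fetching a token from the realm's OpenID Connect token endpoint. A sketch using the client-credentials grant; the hostname, `platform` realm, `mlflow` client, and `CLIENT_SECRET` variable are placeholders, not this stack's actual realm configuration:

```shell
#!/bin/sh
# Hypothetical realm and client; the token path is Keycloak's standard layout.
KEYCLOAK_URL="${KEYCLOAK_URL:-https://keycloak.betelgeusebytes.io}"
TOKEN_ENDPOINT="$KEYCLOAK_URL/realms/platform/protocol/openid-connect/token"
echo "$TOKEN_ENDPOINT"

# Machine-to-machine flow: exchange client credentials for an access token.
curl -s -m 10 "$TOKEN_ENDPOINT" \
  -d grant_type=client_credentials \
  -d client_id=mlflow \
  -d client_secret="$CLIENT_SECRET" \
  || echo "Keycloak not reachable from here"
```

Services then send the returned `access_token` as a `Bearer` header on every API call.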
---

## 10. Observability Stack (LGTM)

### Components

* Grafana
* Prometheus
* Loki
* Tempo
* Grafana Alloy
* kube-state-metrics
* node-exporter

### Capabilities

* Metrics, logs, traces
* Automatic correlation
* OTLP-native
* Local SSD persistence
---

## 11. Design Rules for All Custom Services

All services must:

* be stateless
* use env vars & Kubernetes Secrets
* authenticate via Keycloak
* emit:
  * Prometheus metrics
  * OTLP traces
  * structured JSON logs
* be deployable via kubectl & Argo CD
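The "structured JSON logs" rule is cheap to satisfy from any language; even a shell entrypoint can emit one JSON object per line on stdout, which Alloy ships to Loki unchanged and Loki's `json` pipeline stage can then parse. A minimal sketch:

```shell
#!/bin/sh
# Emit one JSON object per line: timestamp, level, message.
log() {
  level=$1; shift
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

log info "service started"
log error "upstream timeout"
```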
---

## 12. Future Use Cases (Beyond Hadith)

This platform can support:

* General Knowledge Graph AI
* Legal / scholarly document analysis
* Enterprise RAG systems
* Research data platforms
* Explainable AI systems
* Internal search engines
* Agent-based systems
* Provenance & trust scoring engines
* Digital humanities projects
* Offline sovereign AI deployments
---

## test-loki-logs.sh

```bash
#!/bin/bash

set -e

GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Loki Log Collection Test${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

PASS=0
FAIL=0

# Test 1: Check Alloy DaemonSet
echo -e "${YELLOW}Test 1: Checking Alloy DaemonSet...${NC}"
if kubectl get pods -n observability -l app=alloy --no-headers 2>/dev/null | grep -q "Running"; then
    ALLOY_COUNT=$(kubectl get pods -n observability -l app=alloy --no-headers | grep -c "Running")
    echo -e "${GREEN}✓ Alloy is running ($ALLOY_COUNT pod(s))${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Alloy is not running${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 2: Check Loki pod
echo -e "${YELLOW}Test 2: Checking Loki pod...${NC}"
if kubectl get pods -n observability -l app=loki --no-headers 2>/dev/null | grep -q "Running"; then
    echo -e "${GREEN}✓ Loki is running${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Loki is not running${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 3: Test Loki readiness endpoint
echo -e "${YELLOW}Test 3: Testing Loki readiness endpoint...${NC}"
READY=$(kubectl run test-loki-ready-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 5 http://loki.observability.svc.cluster.local:3100/ready 2>/dev/null || echo "failed")

if [ "$READY" = "ready" ]; then
    echo -e "${GREEN}✓ Loki is ready${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Loki is not ready (response: $READY)${NC}"
    FAIL=$((FAIL+1))
fi
echo ""

# Test 4: Check Alloy can connect to Loki
echo -e "${YELLOW}Test 4: Checking Alloy → Loki connectivity...${NC}"
ALLOY_ERRORS=$(kubectl logs -n observability -l app=alloy --tail=50 2>/dev/null | grep -i "error.*loki" | wc -l)
if [ "$ALLOY_ERRORS" -eq 0 ]; then
    echo -e "${GREEN}✓ No Alloy → Loki connection errors${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Found $ALLOY_ERRORS error(s) in Alloy logs${NC}"
    # '|| true' keeps 'set -e' from aborting when the shorter tail has no match
    kubectl logs -n observability -l app=alloy --tail=20 | grep -i error || true
    FAIL=$((FAIL+1))
fi
echo ""

# Test 5: Create test pod and verify logs
echo -e "${YELLOW}Test 5: Creating test pod and verifying log collection...${NC}"

# Clean up any existing test pod
kubectl delete pod test-logger-verify --ignore-not-found 2>/dev/null

# Create test pod
echo " Creating test pod that logs every second..."
kubectl run test-logger-verify --image=busybox --restart=Never -- sh -c \
    'for i in 1 2 3 4 5 6 7 8 9 10; do echo "LOKI-TEST-LOG: Message number $i at $(date)"; sleep 1; done' \
    >/dev/null 2>&1

# Wait for pod to start and generate logs
echo " Waiting 15 seconds for logs to be collected..."
sleep 15

# Query Loki API for test logs
echo " Querying Loki for test logs..."
START_TIME=$(date -u -d '2 minutes ago' +%s)000000000
END_TIME=$(date -u +%s)000000000

QUERY_RESULT=$(kubectl run test-loki-query-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={pod="test-logger-verify"}' \
    --data-urlencode "start=$START_TIME" \
    --data-urlencode "end=$END_TIME" 2>/dev/null || echo "failed")

if echo "$QUERY_RESULT" | grep -q "LOKI-TEST-LOG"; then
    LOG_COUNT=$(echo "$QUERY_RESULT" | grep -o "LOKI-TEST-LOG" | wc -l)
    echo -e "${GREEN}✓ Found $LOG_COUNT test log messages in Loki${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ Test logs not found in Loki${NC}"
    echo " Response: ${QUERY_RESULT:0:200}"
    FAIL=$((FAIL+1))
fi

# Clean up test pod
kubectl delete pod test-logger-verify --ignore-not-found >/dev/null 2>&1

echo ""

# Test 6: Check observability namespace logs
echo -e "${YELLOW}Test 6: Checking for observability namespace logs...${NC}"

OBS_QUERY=$(kubectl run test-loki-obs-$RANDOM --rm -i --restart=Never --image=curlimages/curl:latest -- \
    curl -s -m 10 "http://loki.observability.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={namespace="observability"}' \
    --data-urlencode "start=$START_TIME" \
    --data-urlencode "end=$END_TIME" \
    --data-urlencode "limit=10" 2>/dev/null || echo "failed")

if echo "$OBS_QUERY" | grep -q '"values":\[\['; then
    echo -e "${GREEN}✓ Observability namespace logs found in Loki${NC}"
    PASS=$((PASS+1))
else
    echo -e "${RED}✗ No logs found for observability namespace${NC}"
    FAIL=$((FAIL+1))
fi

echo ""
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} Test Results${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

TOTAL=$((PASS+FAIL))
echo -e "Passed: ${GREEN}$PASS${NC} / $TOTAL"
echo -e "Failed: ${RED}$FAIL${NC} / $TOTAL"
echo ""

if [ $FAIL -eq 0 ]; then
    echo -e "${GREEN}✓✓✓ All tests passed! Logs are flowing to Loki! ✓✓✓${NC}"
    echo ""
    echo "Next steps:"
    echo " 1. Open Grafana: https://grafana.betelgeusebytes.io"
    echo " 2. Go to Explore → Loki"
    echo " 3. Query: {namespace=\"observability\"}"
    echo ""
else
    echo -e "${RED}✗✗✗ Some tests failed. Check the output above for details. ✗✗✗${NC}"
    echo ""
    echo "Troubleshooting:"
    echo " - Check Alloy logs: kubectl logs -n observability -l app=alloy"
    echo " - Check Loki logs: kubectl logs -n observability loki-0"
    echo " - Verify services: kubectl get svc -n observability"
    echo " - See full guide: VERIFY-LOKI-LOGS.md"
    echo ""
    exit 1
fi
```