# CLAUDE.md - BetelgeuseBytes Full Stack
## Project Overview
Kubernetes cluster deployment for BetelgeuseBytes, using Ansible for infrastructure automation and kubectl for application deployment. The cluster hosts a complete data science/ML platform with integrated observability, databases, and ML tooling.

**Infrastructure:**
- 2-node Kubernetes cluster on Hetzner Cloud
- Control plane + worker: hetzner-1 (95.217.89.53)
- Worker node: hetzner-2 (138.201.254.97)
- Kubernetes v1.30.3 with Cilium CNI
## Directory Structure
```
.
├── ansible/                        # Infrastructure-as-Code for cluster setup
│   ├── inventories/prod/           # Hetzner nodes inventory & group vars
│   │   ├── hosts.ini               # Node definitions
│   │   └── group_vars/all.yml      # Global K8s config (versions, CIDRs)
│   ├── playbooks/
│   │   ├── site.yml                # Main cluster bootstrap playbook
│   │   └── add-control-planes.yml  # HA control plane expansion
│   └── roles/                      # 16 reusable Ansible roles (subset shown)
│       ├── common/                 # Swap disable, kernel modules, sysctl
│       ├── containerd/             # Container runtime
│       ├── kubernetes/             # kubeadm, kubelet, kubectl
│       ├── kubeadm_init/           # Primary control plane init
│       ├── kubeadm_join/           # Worker node join
│       ├── cilium/                 # CNI plugin
│       ├── ingress/                # NGINX Ingress Controller
│       ├── cert_manager/           # Let's Encrypt integration
│       ├── labels/                 # Node labeling
│       └── storage_local_path/     # Local storage provisioning
└── k8s/                            # Kubernetes manifests
    ├── 00-namespaces.yaml          # 9 namespaces
    ├── 01-secrets/                 # Basic auth secrets
    ├── storage/                    # StorageClass, PersistentVolumes
    ├── postgres/                   # PostgreSQL 16 with extensions
    ├── redis/                      # Redis 7 cache
    ├── elastic/                    # Elasticsearch 8.14 + Kibana
    ├── gitea/                      # Git repository service
    ├── jupyter/                    # JupyterLab notebook
    ├── kafka/                      # Apache Kafka broker
    ├── neo4j/                      # Neo4j graph database
    ├── prometheus/                 # Prometheus monitoring
    ├── grafana/                    # Grafana dashboards
    ├── minio/                      # S3-compatible object storage
    ├── mlflow/                     # ML lifecycle tracking
    ├── vllm/                       # LLM inference (Ollama)
    ├── label_studio/               # Data annotation platform
    ├── argoflow/                   # Argo Workflows
    ├── otlp/                       # OpenTelemetry collector
    └── observability/              # Fluent-Bit log aggregation
```
## Build & Deployment Commands
### Phase 1: Cluster Infrastructure
```bash
# Validate connectivity
ansible -i ansible/inventories/prod/hosts.ini all -m ping
# Bootstrap Kubernetes cluster
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```
### Phase 2: Kubernetes Applications (order matters)
```bash
# 1. Namespaces & storage
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/storage/storageclass.yaml
# 2. Secrets & auth
kubectl apply -f k8s/01-secrets/
# 3. Infrastructure (databases, cache, search)
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml
# 4. Application layer
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/
# 5. Observability & telemetry
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```
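Because the apply order matters, the sequence above can also be driven from a single ordered list. The following is a minimal sketch, not a script that exists in this repository: the function names `phase2_manifests` and `apply_phase2` and the `DRY_RUN` variable are hypothetical, while the paths are copied from the commands above. With `DRY_RUN=1` (the default in this sketch) it only prints the commands.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): applies the Phase 2
# manifests in the documented order.

# Ordered manifest list, one path per line, copied from the commands above.
phase2_manifests() {
  cat <<'EOF'
k8s/00-namespaces.yaml
k8s/storage/storageclass.yaml
k8s/01-secrets/
k8s/postgres/
k8s/redis/
k8s/elastic/elasticsearch.yaml
k8s/elastic/kibana.yaml
k8s/gitea/
k8s/jupyter/
k8s/kafka/kafka.yaml
k8s/kafka/kafka-ui.yaml
k8s/neo4j/
k8s/otlp/
k8s/observability/fluent-bit.yaml
k8s/prometheus/
k8s/grafana/
EOF
}

apply_phase2() {
  local manifest
  # Paths contain no whitespace, so word splitting is safe here.
  for manifest in $(phase2_manifests); do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run: print the command instead of executing it.
      echo "kubectl apply -f ${manifest}"
    else
      # Stop at the first failure so later layers are not applied
      # on top of a broken earlier layer.
      kubectl apply -f "${manifest}" || return 1
    fi
  done
}
```

Calling `apply_phase2` previews the commands; `DRY_RUN=0 apply_phase2` runs them against the current kubectl context.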
## Namespace Organization
| Namespace | Purpose | Services |
|-----------|---------|----------|
| `db` | Databases & cache | PostgreSQL, Redis |
| `scm` | Source control | Gitea |
| `ml` | Machine Learning | JupyterLab, MLflow, Argo, Label Studio, Ollama |
| `elastic` | Search & logging | Elasticsearch, Kibana |
| `broker` | Message brokers | Kafka |
| `graph` | Graph databases | Neo4j |
| `monitoring` | Observability | Prometheus, Grafana |
| `observability` | Telemetry | OpenTelemetry, Fluent-Bit |
| `storage` | Object storage | MinIO |
## Key Configuration
**Kubernetes:**
- Pod CIDR: 10.244.0.0/16
- Service CIDR: 10.96.0.0/12
- CNI: Cilium v1.15.7

**Storage:**
- StorageClass: `local-ssd-hetzner` (local volumes)
- All stateful workloads pinned to hetzner-2
- Local path: `/mnt/local-ssd/{service-name}`

**Networking:**
- Internal DNS: `{service}.{namespace}.svc.cluster.local`
- External: `{service}.betelgeusebytes.io` via NGINX Ingress
- TLS: Let's Encrypt via cert-manager
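
The storage path and internal DNS patterns above can be expressed as two small shell helpers. This is an illustrative sketch only: the function names `svc_fqdn` and `local_volume_path` are hypothetical, and the Service name passed in the usage example (`postgres`) is an assumption about how the manifest names its Service.

```bash
# Hypothetical helpers (not part of the repository) that spell out the
# naming patterns listed under Key Configuration.

# Internal DNS name of a Service: <service>.<namespace>.svc.cluster.local
svc_fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc.cluster.local"
}

# Host directory backing a service's local volume on hetzner-2.
local_volume_path() {
  local service="$1"
  echo "/mnt/local-ssd/${service}"
}
```

For example, `svc_fqdn postgres db` yields the name other pods would use to reach a Service called `postgres` in the `db` namespace, and `local_volume_path postgres` yields the directory expected on the node.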
## DNS Records
A records point to both nodes:
- `apps.betelgeusebytes.io` → 95.217.89.53, 138.201.254.97

CNAMEs to `apps.betelgeusebytes.io`:
- gitea, kibana, grafana, prometheus, notebook, broker, neo4j, otlp, label, llm, mlflow, minio
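
One way to check that these records exist is to enumerate the expected hostnames and resolve each one. The sketch below assumes `dig` is installed; the function names `expected_hosts` and `check_dns` are hypothetical and the label list is copied from the records above.

```bash
# Hypothetical helpers (not part of the repository) for checking DNS.

# Print every hostname this section says should resolve.
expected_hosts() {
  local label
  echo "apps.betelgeusebytes.io"
  for label in gitea kibana grafana prometheus notebook broker neo4j otlp label llm mlflow minio; do
    echo "${label}.betelgeusebytes.io"
  done
}

# Resolve each hostname; `dig +short` prints only the answer records.
check_dns() {
  local host
  for host in $(expected_hosts); do
    printf '%s -> %s\n' "$host" "$(dig +short "$host" | tr '\n' ' ')"
  done
}
```

Running `check_dns` should show each CNAME resolving through `apps.betelgeusebytes.io` to the two node addresses.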
## Secrets Location
- `k8s/01-secrets/basic-auth.yaml` - HTTP basic auth for protected services
- Service-specific secrets are defined inline in their respective manifests (e.g., postgres-auth, redis-auth)
## Manifest Conventions
1. Compact YAML style: `metadata: { name: xyz, namespace: ns }`
2. StatefulSets for persistent services (databases, brokers)
3. Deployments for stateless services (web UIs, workers)
4. DaemonSets for node-level agents (Fluent-Bit)
5. Services expose port 80 for ingress routing; `targetPort` maps to the container port
6. Ingress with TLS + basic auth annotations where needed
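
To make conventions 1, 5, and 6 concrete, the sketch below prints a Service and Ingress pair in that style. It is an illustration under stated assumptions, not a copy of any manifest in `k8s/`: the function name `render_web_manifests`, the `app` label key, the `nginx` ingress class, the ClusterIssuer name `letsencrypt-prod`, the Secret name `basic-auth`, and the `<name>-tls` secret naming are all hypothetical.

```bash
# Hypothetical generator (not part of the repository): prints a Service and
# an Ingress following the manifest conventions above.
# Assumed names: label key `app`, ingress class `nginx`, ClusterIssuer
# `letsencrypt-prod`, basic-auth Secret `basic-auth`, TLS secret `<name>-tls`.
render_web_manifests() {
  local name="$1" namespace="$2" container_port="$3"
  cat <<EOF
apiVersion: v1
kind: Service
metadata: { name: ${name}, namespace: ${namespace} }
spec:
  selector: { app: ${name} }
  ports: [{ port: 80, targetPort: ${container_port} }]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${name}
  namespace: ${namespace}
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  ingressClassName: nginx
  tls: [{ hosts: [${name}.betelgeusebytes.io], secretName: ${name}-tls }]
  rules:
    - host: ${name}.betelgeusebytes.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: ${name}, port: { number: 80 } } }
EOF
}
```

For instance, `render_web_manifests gitea scm 3000` prints a pair for a service listening on container port 3000; the port value here is only an example argument.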
## Common Operations
```bash
# Check cluster status
kubectl get nodes
kubectl get pods -A
# View logs for a service
kubectl logs -n <namespace> -l app=<service-name>
# Scale a deployment
kubectl scale -n <namespace> deployment/<name> --replicas=<n>
# Apply changes to a specific service
kubectl apply -f k8s/<service>/
# Delete and recreate a service (destructive: removes every resource defined in the directory)
kubectl delete -f k8s/<service>/ && kubectl apply -f k8s/<service>/
```
## Notes
- This is a development/test setup; passwords are hardcoded in manifests
- Elasticsearch security is disabled for development
- GPU support for vLLM is commented out (requires nvidia.com/gpu resources)
- Neo4j Bolt protocol (7687) requires manual ingress-nginx TCP patch
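
The Bolt note refers to ingress-nginx's TCP passthrough, which is configured through a ConfigMap that maps an external port to `namespace/service:port`. The sketch below prints such a ConfigMap. The function name `render_bolt_tcp_configmap`, the ConfigMap name `tcp-services`, the controller namespace `ingress-nginx`, and the Service name `neo4j` are assumptions; only the `graph` namespace and port 7687 come from this document.

```bash
# Hypothetical generator (not part of the repository): prints a TCP services
# ConfigMap for ingress-nginx that forwards port 7687 to Neo4j.
# Assumed names: ConfigMap `tcp-services`, namespace `ingress-nginx`,
# Service `neo4j` in the `graph` namespace.
render_bolt_tcp_configmap() {
  local ingress_ns="${1:-ingress-nginx}" target="${2:-graph/neo4j:7687}"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata: { name: tcp-services, namespace: ${ingress_ns} }
data:
  "7687": "${target}"
EOF
}
```

The output can be piped to `kubectl apply -f -`. The controller must also be started with `--tcp-services-configmap` pointing at this ConfigMap, and port 7687 has to be exposed on the controller's Service or host network, which is likely the manual part the note refers to.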