# CLAUDE.md - BetelgeuseBytes Full Stack

## Project Overview

Kubernetes cluster deployment for BetelgeuseBytes using Ansible for infrastructure automation and kubectl for application deployment. This is a complete data science/ML platform with integrated observability, databases, and ML tools.

**Infrastructure:**
- 2-node Kubernetes cluster on Hetzner Cloud
- Control plane + worker: hetzner-1 (95.217.89.53)
- Worker node: hetzner-2 (138.201.254.97)
- Kubernetes v1.30.3 with Cilium CNI

## Directory Structure

```
.
├── ansible/                        # Infrastructure-as-Code for cluster setup
│   ├── inventories/prod/           # Hetzner nodes inventory & group vars
│   │   ├── hosts.ini               # Node definitions
│   │   └── group_vars/all.yml      # Global K8s config (versions, CIDRs)
│   ├── playbooks/
│   │   ├── site.yml                # Main cluster bootstrap playbook
│   │   └── add-control-planes.yml  # HA control plane expansion
│   └── roles/                      # 16 reusable Ansible roles
│       ├── common/                 # Swap disable, kernel modules, sysctl
│       ├── containerd/             # Container runtime
│       ├── kubernetes/             # kubeadm, kubelet, kubectl
│       ├── kubeadm_init/           # Primary control plane init
│       ├── kubeadm_join/           # Worker node join
│       ├── cilium/                 # CNI plugin
│       ├── ingress/                # NGINX Ingress Controller
│       ├── cert_manager/           # Let's Encrypt integration
│       ├── labels/                 # Node labeling
│       └── storage_local_path/     # Local storage provisioning
└── k8s/                            # Kubernetes manifests
    ├── 00-namespaces.yaml          # 8 namespaces
    ├── 01-secrets/                 # Basic auth secrets
    ├── storage/                    # StorageClass, PersistentVolumes
    ├── postgres/                   # PostgreSQL 16 with extensions
    ├── redis/                      # Redis 7 cache
    ├── elastic/                    # Elasticsearch 8.14 + Kibana
    ├── gitea/                      # Git repository service
    ├── jupyter/                    # JupyterLab notebook
    ├── kafka/                      # Apache Kafka broker
    ├── neo4j/                      # Neo4j graph database
    ├── prometheus/                 # Prometheus monitoring
    ├── grafana/                    # Grafana dashboards
    ├── minio/                      # S3-compatible object storage
    ├── mlflow/                     # ML lifecycle tracking
    ├── vllm/                       # LLM inference (Ollama)
    ├── label_studio/               # Data annotation platform
    ├── argoflow/                   # Argo Workflows
    ├── otlp/                       # OpenTelemetry collector
    └── observability/              # Fluent-Bit log aggregation
```

## Build & Deployment Commands

### Phase 1: Cluster Infrastructure

```bash
# Validate connectivity
ansible -i ansible/inventories/prod/hosts.ini all -m ping

# Bootstrap Kubernetes cluster
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

### Phase 2: Kubernetes Applications (order matters)

```bash
# 1. Namespaces & storage
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/storage/storageclass.yaml

# 2. Secrets & auth
kubectl apply -f k8s/01-secrets/

# 3. Infrastructure (databases, cache, search)
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

# 4. Application layer
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

# 5. Observability & telemetry
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

## Namespace Organization

| Namespace | Purpose | Services |
|-----------|---------|----------|
| `db` | Databases & cache | PostgreSQL, Redis |
| `scm` | Source control | Gitea |
| `ml` | Machine Learning | JupyterLab, MLflow, Argo, Label Studio, Ollama |
| `elastic` | Search & logging | Elasticsearch, Kibana |
| `broker` | Message brokers | Kafka |
| `graph` | Graph databases | Neo4j |
| `monitoring` | Observability | Prometheus, Grafana |
| `observability` | Telemetry | OpenTelemetry, Fluent-Bit |
| `storage` | Object storage | MinIO |

## Key Configuration

**Kubernetes:**
- Pod CIDR: 10.244.0.0/16
- Service CIDR: 10.96.0.0/12
- CNI: Cilium v1.15.7

**Storage:**
- StorageClass: `local-ssd-hetzner` (local volumes)
- All stateful workloads pinned to hetzner-2
- Local path: `/mnt/local-ssd/{service-name}`

**Networking:**
- Internal DNS: `service.namespace.svc.cluster.local`
- External: `{service}.betelgeusebytes.io` via NGINX Ingress
- TLS: Let's Encrypt via cert-manager

## DNS Records

A records point to both nodes:
- `apps.betelgeusebytes.io` → 95.217.89.53, 138.201.254.97

CNAMEs to `apps.betelgeusebytes.io`:
- gitea, kibana, grafana, prometheus, notebook, broker, neo4j, otlp, label, llm, mlflow, minio

## Secrets Location

- `k8s/01-secrets/basic-auth.yaml` - HTTP basic auth for protected services
- Service-specific secrets inline in respective manifests (e.g., postgres-auth, redis-auth)

## Manifest Conventions

1. Compact YAML style: `metadata: { name: xyz, namespace: ns }`
2. StatefulSets for persistent services (databases, brokers)
3. Deployments for stateless services (web UIs, workers)
4. DaemonSets for node-level agents (Fluent-Bit)
5. Service port=80 for ingress routing, backend maps to container port
6. Ingress with TLS + basic auth annotations where needed

## Common Operations

```bash
# Check cluster status
kubectl get nodes
kubectl get pods -A

# View logs for a service
kubectl logs -n <namespace> -l app=<service>

# Scale a deployment
kubectl scale -n <namespace> deployment/<name> --replicas=N

# Apply changes to a specific service
kubectl apply -f k8s/<service>/

# Delete and recreate a service
kubectl delete -f k8s/<service>/ && kubectl apply -f k8s/<service>/
```

## Notes

- This is a development/test setup; passwords are hardcoded in manifests
- Elasticsearch security is disabled for development
- GPU support for vLLM is commented out (requires nvidia.com/gpu resources)
- Neo4j Bolt protocol (7687) requires manual ingress-nginx TCP patch
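The Phase 2 order under Build & Deployment Commands can live in one script so that no step is skipped or reordered. This is a minimal sketch. It assumes the exact paths listed in Phase 2 and a hypothetical `DRY_RUN=1` switch that prints commands instead of running them. It stops at the first failed `kubectl apply`, but it does not wait for pods to become Ready between layers.

```bash
# Sketch: apply the Phase 2 manifests in the documented order.
# Source this file, then call apply_phase2. DRY_RUN=1 (hypothetical switch)
# prints each command instead of running it.

# Ordered list copied from Phase 2; edit here if the order changes.
PHASE2_MANIFESTS=(
  k8s/00-namespaces.yaml
  k8s/storage/storageclass.yaml
  k8s/01-secrets/
  k8s/postgres/
  k8s/redis/
  k8s/elastic/elasticsearch.yaml
  k8s/elastic/kibana.yaml
  k8s/gitea/
  k8s/jupyter/
  k8s/kafka/kafka.yaml
  k8s/kafka/kafka-ui.yaml
  k8s/neo4j/
  k8s/otlp/
  k8s/observability/fluent-bit.yaml
  k8s/prometheus/
  k8s/grafana/
)

apply_phase2() {
  local path
  for path in "${PHASE2_MANIFESTS[@]}"; do
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
      echo "kubectl apply -f ${path}"
    else
      # Stop at the first failure so later layers are not applied on a broken base.
      kubectl apply -f "${path}" || return 1
    fi
  done
}
```

Preview the commands with `DRY_RUN=1 apply_phase2`, then run `apply_phase2`. If later layers fail while PostgreSQL is still starting, a `kubectl rollout status` after the database step is a natural extension.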
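The Pod CIDR (10.244.0.0/16) and Service CIDR (10.96.0.0/12) under Key Configuration must not overlap. A /12 leaves 20 host bits, so 10.96.0.0/12 spans 10.96.0.0 to 10.111.255.255 and stays clear of 10.244.x.x. The bash sketch below does the same check with integer arithmetic. `ip_to_int`, `cidr_range` and `cidrs_overlap` are hypothetical helper names, and only plain dotted-quad IPv4 input is handled.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first and last address of a CIDR block as integers.
cidr_range() {
  local ip=${1%/*} bits=${1#*/}
  local base size
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - bits) ))
  # Clear host bits so a non-aligned address still yields the block start.
  base=$(( base & ~(size - 1) & 0xFFFFFFFF ))
  echo "$base $(( base + size - 1 ))"
}

# Return 0 if the two CIDR blocks share at least one address, 1 otherwise.
cidrs_overlap() {
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_range "$1")"
  read -r s2 e2 <<< "$(cidr_range "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}
```

For the documented values, `cidrs_overlap 10.244.0.0/16 10.96.0.0/12 && echo overlap || echo ok` prints `ok`.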
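Under Storage, stateful workloads are pinned to hetzner-2, with data in `/mnt/local-ssd/{service-name}`. Kubernetes requires a `nodeAffinity` on `local` PersistentVolumes, and that affinity is what does the pinning. The sketch below prints one such PV in the compact style from Manifest Conventions. The following details are illustrative, not read from `k8s/storage/`: the `<service>-pv` name, the capacity argument, `ReadWriteOnce`, the `Retain` policy, and the assumption that hetzner-2's `kubernetes.io/hostname` label is `hetzner-2`.

```bash
# Print a local PersistentVolume for one service.
# Usage: local_pv <service-name> <capacity>, e.g. local_pv postgres 20Gi
# The nodeAffinity block (required for local volumes) pins the data,
# and therefore any pod bound to it, to hetzner-2 (assumed hostname label).
local_pv() {
  local svc=$1 size=$2
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata: { name: ${svc}-pv }
spec:
  capacity: { storage: ${size} }
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd-hetzner
  local:
    path: /mnt/local-ssd/${svc}
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [hetzner-2]
EOF
}
```

Example: `local_pv postgres 20Gi | kubectl apply -f -`. The directory must already exist on hetzner-2, because a `local` volume does not create it.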
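The Networking names follow two patterns: `service.namespace.svc.cluster.local` inside the cluster and `{service}.betelgeusebytes.io` through the Ingress. Two small hypothetical helpers build them. `CLUSTER_DOMAIN` defaults to `cluster.local`, the kubeadm default; that default is an assumption if the cluster was initialized with a different domain.

```bash
# In-cluster DNS name of a Service: <service>.<namespace>.svc.<cluster domain>
svc_fqdn() {
  local svc=$1 ns=$2
  echo "${svc}.${ns}.svc.${CLUSTER_DOMAIN:-cluster.local}"
}

# Public hostname routed through NGINX Ingress: <service>.betelgeusebytes.io
svc_external_host() {
  echo "${1}.betelgeusebytes.io"
}
```

For example, from a pod you could run `redis-cli -h "$(svc_fqdn redis db)" ping`, assuming the Redis Service is named `redis`.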
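The DNS Records section expects `apps.betelgeusebytes.io` and every CNAME to end at both node IPs. The verification sketch below is built on `dig +short`. For a CNAME, `dig +short` prints the target name before the A records, so the sketch compares only lines shaped like IPv4 addresses. The function and variable names are hypothetical.

```bash
EXPECTED_IPS="95.217.89.53 138.201.254.97"
CNAME_HOSTS="gitea kibana grafana prometheus notebook broker neo4j otlp label llm mlflow minio"

# Sorted IPv4 addresses a name resolves to; CNAME target lines are filtered out.
resolved_ips() {
  dig +short "$1" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort
}

# Compare one name against EXPECTED_IPS; returns non-zero on mismatch.
check_host() {
  local host=$1 want got
  want=$(tr ' ' '\n' <<< "$EXPECTED_IPS" | sort)
  got=$(resolved_ips "$host")
  if [[ "$got" == "$want" ]]; then
    echo "OK       $host"
  else
    echo "MISMATCH $host: ${got//$'\n'/ }"
    return 1
  fi
}

# Check the apex A record and every CNAME; keep going after a failure.
check_all_hosts() {
  local h rc=0
  check_host apps.betelgeusebytes.io || rc=1
  for h in $CNAME_HOSTS; do
    check_host "${h}.betelgeusebytes.io" || rc=1
  done
  return $rc
}
```

`check_all_hosts` prints one line per name and returns non-zero if any record is missing or points elsewhere.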
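Manifest Conventions item 5 has Services listen on port 80 for ingress routing and forward to the container port. The sketch below generates that shape with a heredoc. It assumes pods carry an `app: <name>` label, and `3000` in the example is only an illustrative container port.

```bash
# Print a Service that listens on port 80 and forwards to the container port.
# Usage: web_service <name> <namespace> <container-port>
# Assumes the backing pods are labeled app=<name>.
web_service() {
  local name=$1 ns=$2 cport=$3
  cat <<EOF
apiVersion: v1
kind: Service
metadata: { name: ${name}, namespace: ${ns} }
spec:
  selector: { app: ${name} }
  ports:
    - { name: http, port: 80, targetPort: ${cport} }
EOF
}
```

For a quick validation without changing the cluster, run `web_service gitea scm 3000 | kubectl apply --dry-run=client -f -`.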
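Manifest Conventions item 6 (TLS plus basic auth where needed) maps onto standard cert-manager and ingress-nginx annotations. Several details in this sketch are assumptions to check against the actual manifests:
- the `ClusterIssuer` name `letsencrypt-prod`
- the `nginx` ingress class
- the `<name>-tls` secret name
- a `basic-auth` Secret present in the Ingress's own namespace

```bash
# Print an Ingress for <name>.betelgeusebytes.io with cert-manager TLS and,
# when the third argument is "auth", ingress-nginx basic auth.
# Usage: web_ingress <name> <namespace> [auth]
web_ingress() {
  local name=$1 ns=$2 auth=${3:-}
  local host="${name}.betelgeusebytes.io"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${name}
  namespace: ${ns}
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
EOF
  if [[ "$auth" == "auth" ]]; then
    # The referenced Secret must hold an htpasswd file under the key "auth".
    cat <<EOF
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
EOF
  fi
  # Backend targets the Service on port 80, per convention 5.
  cat <<EOF
spec:
  ingressClassName: nginx
  tls:
    - hosts: [${host}]
      secretName: ${name}-tls
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: ${name}, port: { number: 80 } }
EOF
}
```

`web_ingress grafana monitoring auth` produces the protected variant. Omit the third argument for public services.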
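For the basic-auth Secret under Secrets Location, note that ingress-nginx basic auth reads an htpasswd-format file from the `auth` key of the referenced Secret. Assuming `k8s/01-secrets/basic-auth.yaml` follows that layout, a per-namespace copy could be generated as below. `basic_auth_secret` is a hypothetical helper, and `stringData` leaves the base64 encoding to the API server.

```bash
# Print a basic-auth Secret for one namespace.
# Usage: basic_auth_secret <namespace> "<htpasswd line>"
# The htpasswd line (user:hash) is stored under the key "auth", as ingress-nginx expects.
basic_auth_secret() {
  local ns=$1 line=$2
  cat <<EOF
apiVersion: v1
kind: Secret
metadata: { name: basic-auth, namespace: ${ns} }
type: Opaque
stringData:
  auth: '${line}'
EOF
}
```

Example: `basic_auth_secret monitoring "$(htpasswd -nb admin 'change-me')" | kubectl apply -f -`. `htpasswd` comes from Apache's tools (`apache2-utils` on Debian/Ubuntu).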
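The Notes mention a manual ingress-nginx TCP patch for Neo4j Bolt (7687). ingress-nginx forwards raw TCP through a ConfigMap passed with `--tcp-services-configmap`. Each entry maps `"<external port>"` to `"<namespace>/<service>:<port>"`. All of the following are hypothetical until checked against the `ingress` role:
- the controller runs in `ingress-nginx`
- a `tcp-services` ConfigMap already exists and is wired to that flag
- the controller is exposed through a Service named `ingress-nginx-controller`
- Neo4j's Service is `neo4j` in `graph`

```bash
# Sketch of the manual TCP exposure for Neo4j Bolt through ingress-nginx.
INGRESS_NS=${INGRESS_NS:-ingress-nginx}   # assumed controller namespace
BOLT_PORT=7687

# Merge patch for the tcp-services ConfigMap:
# "<external port>": "<namespace>/<service>:<service port>"
tcp_services_patch() {
  local ns=$1 svc=$2 port=$3
  printf '{"data":{"%s":"%s/%s:%s"}}' "$port" "$ns" "$svc" "$port"
}

# JSON patch that appends the port to the controller Service's port list.
controller_port_patch() {
  local port=$1
  printf '[{"op":"add","path":"/spec/ports/-","value":{"name":"bolt","port":%s,"targetPort":%s,"protocol":"TCP"}}]' "$port" "$port"
}

# Apply both patches; DRY_RUN=1 (hypothetical switch) only prints the commands.
expose_bolt() {
  local run=()
  [[ "${DRY_RUN:-0}" == "1" ]] && run=(echo)
  "${run[@]}" kubectl -n "$INGRESS_NS" patch configmap tcp-services --type merge \
    -p "$(tcp_services_patch graph neo4j "$BOLT_PORT")"
  "${run[@]}" kubectl -n "$INGRESS_NS" patch service ingress-nginx-controller --type json \
    -p "$(controller_port_patch "$BOLT_PORT")"
}
```

Run `DRY_RUN=1 expose_bolt` to inspect the two `kubectl patch` commands before applying them.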