CLAUDE.md - BetelgeuseBytes Full Stack

Project Overview

Kubernetes cluster deployment for BetelgeuseBytes, using Ansible for infrastructure automation and plain kubectl manifests for the applications. The result is a complete data science/ML platform with integrated observability, databases, and ML tools.

Infrastructure:

  • 2-node Kubernetes cluster on Hetzner Cloud
  • Control plane + worker: hetzner-1 (95.217.89.53)
  • Worker node: hetzner-2 (138.201.254.97)
  • Kubernetes v1.30.3 with Cilium CNI

Directory Structure

.
├── ansible/                    # Infrastructure-as-Code for cluster setup
│   ├── inventories/prod/       # Hetzner nodes inventory & group vars
│   │   ├── hosts.ini           # Node definitions
│   │   └── group_vars/all.yml  # Global K8s config (versions, CIDRs)
│   ├── playbooks/
│   │   ├── site.yml            # Main cluster bootstrap playbook
│   │   └── add-control-planes.yml  # HA control plane expansion
│   └── roles/                  # 16 reusable Ansible roles
│       ├── common/             # Swap disable, kernel modules, sysctl
│       ├── containerd/         # Container runtime
│       ├── kubernetes/         # kubeadm, kubelet, kubectl
│       ├── kubeadm_init/       # Primary control plane init
│       ├── kubeadm_join/       # Worker node join
│       ├── cilium/             # CNI plugin
│       ├── ingress/            # NGINX Ingress Controller
│       ├── cert_manager/       # Let's Encrypt integration
│       ├── labels/             # Node labeling
│       └── storage_local_path/ # Local storage provisioning
└── k8s/                        # Kubernetes manifests
    ├── 00-namespaces.yaml      # 8 namespaces
    ├── 01-secrets/             # Basic auth secrets
    ├── storage/                # StorageClass, PersistentVolumes
    ├── postgres/               # PostgreSQL 16 with extensions
    ├── redis/                  # Redis 7 cache
    ├── elastic/                # Elasticsearch 8.14 + Kibana
    ├── gitea/                  # Git repository service
    ├── jupyter/                # JupyterLab notebook
    ├── kafka/                  # Apache Kafka broker
    ├── neo4j/                  # Neo4j graph database
    ├── prometheus/             # Prometheus monitoring
    ├── grafana/                # Grafana dashboards
    ├── minio/                  # S3-compatible object storage
    ├── mlflow/                 # ML lifecycle tracking
    ├── vllm/                   # LLM inference (Ollama)
    ├── label_studio/           # Data annotation platform
    ├── argoflow/               # Argo Workflows
    ├── otlp/                   # OpenTelemetry collector
    └── observability/          # Fluent-Bit log aggregation
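
The inventory and group vars referenced above drive the whole bootstrap. As a rough sketch (group and variable names are assumptions, not copied from the repo), inventories/prod/hosts.ini for this two-node layout would look something like:

# Hypothetical inventory layout; the real group names may differ
[control_plane]
hetzner-1 ansible_host=95.217.89.53

[workers]
hetzner-2 ansible_host=138.201.254.97

[all:vars]
# connection user is an assumption
ansible_user=root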

Build & Deployment Commands

Phase 1: Cluster Infrastructure

# Validate connectivity
ansible -i ansible/inventories/prod/hosts.ini all -m ping

# Bootstrap Kubernetes cluster
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
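
The add-control-planes.yml playbook shown in the directory tree is there for growing to an HA control plane later; presumably it is run against the same inventory:

# Expand the control plane (only once additional control-plane nodes exist in the inventory)
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/add-control-planes.yml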

Phase 2: Kubernetes Applications (order matters)

# 1. Namespaces & storage
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/storage/storageclass.yaml

# 2. Secrets & auth
kubectl apply -f k8s/01-secrets/

# 3. Infrastructure (databases, cache, search)
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

# 4. Application layer
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

# 5. Observability & telemetry
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
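
The remaining services under k8s/ (MinIO, MLflow, vLLM/Ollama, Label Studio, Argo Workflows) are not part of the numbered sequence above; they can presumably be applied the same way once the infrastructure layer is up:

# Remaining ML & storage services (ordering among these is not documented here)
kubectl apply -f k8s/minio/
kubectl apply -f k8s/mlflow/
kubectl apply -f k8s/vllm/
kubectl apply -f k8s/label_studio/
kubectl apply -f k8s/argoflow/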

Namespace Organization

Namespace       Purpose             Services
db              Databases & cache   PostgreSQL, Redis
scm             Source control      Gitea
ml              Machine Learning    JupyterLab, MLflow, Argo, Label Studio, Ollama
elastic         Search & logging    Elasticsearch, Kibana
broker          Message brokers     Kafka
graph           Graph databases     Neo4j
monitoring      Observability       Prometheus, Grafana
observability   Telemetry           OpenTelemetry, Fluent-Bit
storage         Object storage      MinIO
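
Each namespace is presumably declared in 00-namespaces.yaml as a plain Namespace object in the compact style described under Manifest Conventions, e.g.:

apiVersion: v1
kind: Namespace
metadata: { name: db }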

Key Configuration

Kubernetes:

  • Pod CIDR: 10.244.0.0/16
  • Service CIDR: 10.96.0.0/12
  • CNI: Cilium v1.15.7

Storage:

  • StorageClass: local-ssd-hetzner (local volumes)
  • All stateful workloads pinned to hetzner-2
  • Local path: /mnt/local-ssd/{service-name}
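
A local PersistentVolume pinned this way pairs the local path with node affinity; a sketch of the pattern (the name and size are illustrative, not taken from k8s/storage/):

apiVersion: v1
kind: PersistentVolume
metadata: { name: postgres-pv }                  # illustrative name
spec:
  capacity: { storage: 50Gi }                    # illustrative size
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd-hetzner
  local: { path: /mnt/local-ssd/postgres }
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - { key: kubernetes.io/hostname, operator: In, values: [hetzner-2] }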

Networking:

  • Internal DNS: service.namespace.svc.cluster.local
  • External: {service}.betelgeusebytes.io via NGINX Ingress
  • TLS: Let's Encrypt via cert-manager
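
Concretely, clients inside the cluster use the internal DNS name while external clients go through the ingress hostname; for example (the Service name postgres and the database user are assumptions):

# In-cluster, via internal DNS
psql -h postgres.db.svc.cluster.local -p 5432 -U postgres

# From outside, via the ingress with Let's Encrypt TLS
curl https://grafana.betelgeusebytes.io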

DNS Records

A records point to both nodes:

  • apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97

CNAMEs to apps.betelgeusebytes.io:

  • gitea, kibana, grafana, prometheus, notebook, broker, neo4j, otlp, label, llm, mlflow, minio

Secrets Location

  • k8s/01-secrets/basic-auth.yaml - HTTP basic auth for protected services
  • Service-specific secrets inline in respective manifests (e.g., postgres-auth, redis-auth)
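
If the basic auth credentials ever need to be regenerated, the usual NGINX Ingress pattern is an htpasswd file stored under the key auth; a sketch (secret name, user, and namespace are illustrative):

# Create an htpasswd file named "auth" with one user, then wrap it in a Secret
htpasswd -c auth admin
kubectl create secret generic basic-auth --from-file=auth -n <namespace>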

Manifest Conventions

  1. Compact YAML style: metadata: { name: xyz, namespace: ns }
  2. StatefulSets for persistent services (databases, brokers)
  3. Deployments for stateless services (web UIs, workers)
  4. DaemonSets for node-level agents (Fluent-Bit)
  5. Services expose port 80 for ingress routing; targetPort maps to the container port
  6. Ingress with TLS + basic auth annotations where needed
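
As an illustration of conventions 1, 5, and 6, a sketch of a typical Ingress (host, issuer, and service names are placeholders, not copied from the repo):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  namespace: ml
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt            # issuer name is an assumption
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
spec:
  ingressClassName: nginx
  tls:
    - { hosts: [example.betelgeusebytes.io], secretName: example-tls }
  rules:
    - host: example.betelgeusebytes.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: example, port: { number: 80 } }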

Common Operations

# Check cluster status
kubectl get nodes
kubectl get pods -A

# View logs for a service
kubectl logs -n <namespace> -l app=<service-name>

# Scale a deployment
kubectl scale -n <namespace> deployment/<name> --replicas=N

# Apply changes to a specific service
kubectl apply -f k8s/<service>/

# Delete and recreate a service
kubectl delete -f k8s/<service>/ && kubectl apply -f k8s/<service>/

Notes

  • This is a development/test setup; passwords are hardcoded in manifests
  • Elasticsearch security is disabled for development
  • GPU support for vLLM is commented out (requires nvidia.com/gpu resources)
  • Neo4j Bolt protocol (7687) requires manual ingress-nginx TCP patch
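
For the Bolt note above: ingress-nginx only proxies HTTP by default, so TCP 7687 is usually exposed through the tcp-services ConfigMap plus an extra port on the controller Service. A sketch of that manual patch (the ConfigMap and Service names, and the ingress-nginx namespace, depend on how the controller was installed, and the controller must be started with --tcp-services-configmap pointing at the ConfigMap):

# Map TCP 7687 to the Neo4j service (Service name "neo4j" is an assumption)
kubectl patch configmap tcp-services -n ingress-nginx \
  --type merge -p '{"data":{"7687":"graph/neo4j:7687"}}'

# Expose 7687 on the ingress controller Service as well
kubectl patch service ingress-nginx-controller -n ingress-nginx --type json \
  -p '[{"op":"add","path":"/spec/ports/-","value":{"name":"bolt","port":7687,"targetPort":7687,"protocol":"TCP"}}]'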