# CLAUDE.md - BetelgeuseBytes Full Stack

## Project Overview

Kubernetes cluster deployment for BetelgeuseBytes using Ansible for infrastructure automation and kubectl for application deployment. This is a complete data science/ML platform with integrated observability, databases, and ML tools.

**Infrastructure:**
- 2-node Kubernetes cluster on Hetzner Cloud
- Control plane + worker: hetzner-1 (95.217.89.53)
- Worker node: hetzner-2 (138.201.254.97)
- Kubernetes v1.30.3 with Cilium CNI

## Directory Structure

```
.
├── ansible/                        # Infrastructure-as-Code for cluster setup
│   ├── inventories/prod/           # Hetzner nodes inventory & group vars
│   │   ├── hosts.ini               # Node definitions
│   │   └── group_vars/all.yml      # Global K8s config (versions, CIDRs)
│   ├── playbooks/
│   │   ├── site.yml                # Main cluster bootstrap playbook
│   │   └── add-control-planes.yml  # HA control plane expansion
│   └── roles/                      # 16 reusable Ansible roles
│       ├── common/                 # Swap disable, kernel modules, sysctl
│       ├── containerd/             # Container runtime
│       ├── kubernetes/             # kubeadm, kubelet, kubectl
│       ├── kubeadm_init/           # Primary control plane init
│       ├── kubeadm_join/           # Worker node join
│       ├── cilium/                 # CNI plugin
│       ├── ingress/                # NGINX Ingress Controller
│       ├── cert_manager/           # Let's Encrypt integration
│       ├── labels/                 # Node labeling
│       └── storage_local_path/     # Local storage provisioning
└── k8s/                            # Kubernetes manifests
    ├── 00-namespaces.yaml          # 8 namespaces
    ├── 01-secrets/                 # Basic auth secrets
    ├── storage/                    # StorageClass, PersistentVolumes
    ├── postgres/                   # PostgreSQL 16 with extensions
    ├── redis/                      # Redis 7 cache
    ├── elastic/                    # Elasticsearch 8.14 + Kibana
    ├── gitea/                      # Git repository service
    ├── jupyter/                    # JupyterLab notebook
    ├── kafka/                      # Apache Kafka broker
    ├── neo4j/                      # Neo4j graph database
    ├── prometheus/                 # Prometheus monitoring
    ├── grafana/                    # Grafana dashboards
    ├── minio/                      # S3-compatible object storage
    ├── mlflow/                     # ML lifecycle tracking
    ├── vllm/                       # LLM inference (Ollama)
    ├── label_studio/               # Data annotation platform
    ├── argoflow/                   # Argo Workflows
    ├── otlp/                       # OpenTelemetry collector
    └── observability/              # Fluent-Bit log aggregation
```

## Build & Deployment Commands

### Phase 1: Cluster Infrastructure

```bash
# Validate connectivity
ansible -i ansible/inventories/prod/hosts.ini all -m ping

# Bootstrap Kubernetes cluster
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

### Phase 2: Kubernetes Applications (order matters)

```bash
# 1. Namespaces & storage
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/storage/storageclass.yaml

# 2. Secrets & auth
kubectl apply -f k8s/01-secrets/

# 3. Infrastructure (databases, cache, search)
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

# 4. Application layer
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

# 5. Observability & telemetry
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

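Because the apply order matters, the Phase 2 sequence can be wrapped in a small script that stops at the first failure. This is a minimal sketch, not part of the repository: the function name `apply_in_order` and the `DRY_RUN` variable are hypothetical, and the path list simply mirrors the commands above. It defaults to printing the commands rather than running them.

```bash
# Sketch: apply the Phase 2 manifests in the documented order.
# apply_in_order and DRY_RUN are illustrative names, not part of the repo.
apply_in_order() {
  for path in \
    k8s/00-namespaces.yaml \
    k8s/storage/storageclass.yaml \
    k8s/01-secrets/ \
    k8s/postgres/ \
    k8s/redis/ \
    k8s/elastic/elasticsearch.yaml \
    k8s/elastic/kibana.yaml \
    k8s/gitea/ \
    k8s/jupyter/ \
    k8s/kafka/kafka.yaml \
    k8s/kafka/kafka-ui.yaml \
    k8s/neo4j/ \
    k8s/otlp/ \
    k8s/observability/fluent-bit.yaml \
    k8s/prometheus/ \
    k8s/grafana/
  do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default mode: only print the command that would run.
      echo "kubectl apply -f ${path}"
    else
      # Stop at the first failure so later layers are not applied
      # on top of a broken dependency.
      kubectl apply -f "${path}" || return 1
    fi
  done
}
```

Run `apply_in_order` to preview the sequence, or `DRY_RUN=0 apply_in_order` to apply it against the current kubectl context.
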
## Namespace Organization

| Namespace | Purpose | Services |
|-----------|---------|----------|
| `db` | Databases & cache | PostgreSQL, Redis |
| `scm` | Source control | Gitea |
| `ml` | Machine Learning | JupyterLab, MLflow, Argo, Label Studio, Ollama |
| `elastic` | Search & logging | Elasticsearch, Kibana |
| `broker` | Message brokers | Kafka |
| `graph` | Graph databases | Neo4j |
| `monitoring` | Observability | Prometheus, Grafana |
| `observability` | Telemetry | OpenTelemetry, Fluent-Bit |
| `storage` | Object storage | MinIO |

## Key Configuration

**Kubernetes:**
- Pod CIDR: 10.244.0.0/16
- Service CIDR: 10.96.0.0/12
- CNI: Cilium v1.15.7

**Storage:**
- StorageClass: `local-ssd-hetzner` (local volumes)
- All stateful workloads pinned to hetzner-2
- Local path: `/mnt/local-ssd/{service-name}`

**Networking:**
- Internal DNS: `{service}.{namespace}.svc.cluster.local`
- External: `{service}.betelgeusebytes.io` via NGINX Ingress
- TLS: Let's Encrypt via cert-manager

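The naming patterns above are easy to mistype in scripts, so a pair of helpers can build them from their parts. This is a sketch only: the function names are hypothetical, and the example arguments (`postgres`, `db`) assume the Service is named after its manifest directory, which the manifests themselves would need to confirm.

```bash
# Hypothetical helpers; the function names are illustrative only.

# Internal cluster DNS name for a Service.
# usage: svc_fqdn <service> <namespace>
svc_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

# Host path backing a stateful service's local volume on hetzner-2.
# usage: local_volume_path <service-name>
local_volume_path() {
  printf '/mnt/local-ssd/%s\n' "$1"
}
```

For example, `svc_fqdn postgres db` prints the in-cluster address a client pod would use, and `local_volume_path postgres` prints the directory expected to exist on hetzner-2 before the PersistentVolume binds.
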
## DNS Records

A records point to both nodes:
- `apps.betelgeusebytes.io` → 95.217.89.53, 138.201.254.97

CNAMEs to `apps.betelgeusebytes.io`:
- gitea, kibana, grafana, prometheus, notebook, broker, neo4j, otlp, label, llm, mlflow, minio

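To compare the intended records against what a DNS provider actually holds, it can help to print them in a flat, diff-friendly form. The sketch below only restates the records listed above; the function name `print_dns_records` and the zone-file-like output layout are assumptions for illustration.

```bash
# Sketch: print the records described above in a zone-file-like form.
# The output layout is illustrative; adapt it to your DNS provider.
APEX="betelgeusebytes.io"

print_dns_records() {
  # A records: apps.<apex> resolves to both nodes.
  for ip in 95.217.89.53 138.201.254.97; do
    printf 'apps.%s. A %s\n' "$APEX" "$ip"
  done
  # CNAMEs: every service hostname aliases apps.<apex>.
  for name in gitea kibana grafana prometheus notebook broker \
              neo4j otlp label llm mlflow minio; do
    printf '%s.%s. CNAME apps.%s.\n' "$name" "$APEX" "$APEX"
  done
}
```

Calling `print_dns_records` yields one line per record, which can be sorted and diffed against an export from the provider.
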
## Secrets Location

- `k8s/01-secrets/basic-auth.yaml` - HTTP basic auth for protected services
- Service-specific secrets inline in respective manifests (e.g., postgres-auth, redis-auth)

## Manifest Conventions

1. Compact YAML style: `metadata: { name: xyz, namespace: ns }`
2. StatefulSets for persistent services (databases, brokers)
3. Deployments for stateless services (web UIs, workers)
4. DaemonSets for node-level agents (Fluent-Bit)
5. Services expose port 80 for ingress routing; `targetPort` maps to the container port
6. Ingress with TLS + basic auth annotations where needed

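Conventions 1, 5, and 6 can be illustrated by generating a Service and Ingress pair from a shell heredoc. This is a sketch under stated assumptions, not a copy of any manifest in the repo: the function name `render_web_manifests`, the ClusterIssuer name `letsencrypt-prod`, the Secret name `basic-auth`, the `nginx` ingress class, and the `<name>-tls` secret naming are all hypothetical. The annotation keys are the standard ingress-nginx and cert-manager ones.

```bash
# Sketch: render a Service + Ingress pair in the compact style.
# usage: render_web_manifests <name> <namespace> <container-port>
render_web_manifests() {
  name="$1"; ns="$2"; port="$3"
  cat <<EOF
apiVersion: v1
kind: Service
metadata: { name: ${name}, namespace: ${ns} }
spec:
  selector: { app: ${name} }
  ports: [{ name: http, port: 80, targetPort: ${port} }]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${name}
  namespace: ${ns}
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod      # assumed issuer name
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth   # assumed secret name
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  ingressClassName: nginx                                 # assumed class name
  tls: [{ hosts: [${name}.betelgeusebytes.io], secretName: ${name}-tls }]
  rules:
    - host: ${name}.betelgeusebytes.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: ${name}, port: { number: 80 } } }
EOF
}
```

One way to check the result without touching the cluster is `render_web_manifests grafana monitoring 3000 | kubectl apply --dry-run=client -f -`; the port 3000 here is only an example value.
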
## Common Operations

```bash
# Check cluster status
kubectl get nodes
kubectl get pods -A

# View logs for a service
kubectl logs -n <namespace> -l app=<service-name>

# Scale a deployment
kubectl scale -n <namespace> deployment/<name> --replicas=N

# Apply changes to a specific service
kubectl apply -f k8s/<service>/

# Delete and recreate a service
kubectl delete -f k8s/<service>/ && kubectl apply -f k8s/<service>/
```

## Notes

- This is a development/test setup; passwords are hardcoded in manifests
- Elasticsearch security is disabled for development
- GPU support for vLLM is commented out (requires nvidia.com/gpu resources)
- Neo4j Bolt protocol (7687) requires manual ingress-nginx TCP patch
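
For the Bolt note, ingress-nginx forwards raw TCP through a ConfigMap that the controller reads via its `--tcp-services-configmap` flag, with entries of the form `"<port>": "<namespace>/<service>:<port>"`. The exact patch used for this cluster is not recorded here, so the following is only a sketch: the function name, the `ingress-nginx` controller namespace, the `tcp-services` ConfigMap name, and the `neo4j` Service name are assumptions, while the `graph` namespace comes from the table above.

```bash
# Sketch: render a tcp-services ConfigMap that forwards Bolt (7687).
# usage: render_bolt_tcp_configmap [<ingress-controller-namespace>]
render_bolt_tcp_configmap() {
  ingress_ns="${1:-ingress-nginx}"   # assumed controller namespace
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata: { name: tcp-services, namespace: ${ingress_ns} }
data:
  "7687": "graph/neo4j:7687"   # <namespace>/<service>:<port>; Service name assumed
EOF
}
```

The controller also has to be started with the matching `--tcp-services-configmap` argument and expose port 7687 on its own Service or host ports, which is likely what makes this a manual step.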