adding betelgeusebytes.io devops part
Commit dfdd36db3f
@@ -0,0 +1,177 @@
# CLAUDE.md - BetelgeuseBytes Full Stack

## Project Overview

Kubernetes cluster deployment for BetelgeuseBytes using Ansible for infrastructure automation and kubectl for application deployment. This is a complete data science/ML platform with integrated observability, databases, and ML tools.

**Infrastructure:**
- 2-node Kubernetes cluster on Hetzner Cloud
- Control plane + worker: hetzner-1 (95.217.89.53)
- Worker node: hetzner-2 (138.201.254.97)
- Kubernetes v1.30.3 with Cilium CNI

## Directory Structure

```
.
├── ansible/                       # Infrastructure-as-Code for cluster setup
│   ├── inventories/prod/          # Hetzner nodes inventory & group vars
│   │   ├── hosts.ini              # Node definitions
│   │   └── group_vars/all.yml     # Global K8s config (versions, CIDRs)
│   ├── playbooks/
│   │   ├── site.yml               # Main cluster bootstrap playbook
│   │   └── add-control-planes.yml # HA control plane expansion
│   └── roles/                     # 16 reusable Ansible roles
│       ├── common/                # Swap disable, kernel modules, sysctl
│       ├── containerd/            # Container runtime
│       ├── kubernetes/            # kubeadm, kubelet, kubectl
│       ├── kubeadm_init/          # Primary control plane init
│       ├── kubeadm_join/          # Worker node join
│       ├── cilium/                # CNI plugin
│       ├── ingress/               # NGINX Ingress Controller
│       ├── cert_manager/          # Let's Encrypt integration
│       ├── labels/                # Node labeling
│       └── storage_local_path/    # Local storage provisioning
└── k8s/                           # Kubernetes manifests
    ├── 00-namespaces.yaml         # 8 namespaces
    ├── 01-secrets/                # Basic auth secrets
    ├── storage/                   # StorageClass, PersistentVolumes
    ├── postgres/                  # PostgreSQL 16 with extensions
    ├── redis/                     # Redis 7 cache
    ├── elastic/                   # Elasticsearch 8.14 + Kibana
    ├── gitea/                     # Git repository service
    ├── jupyter/                   # JupyterLab notebook
    ├── kafka/                     # Apache Kafka broker
    ├── neo4j/                     # Neo4j graph database
    ├── prometheus/                # Prometheus monitoring
    ├── grafana/                   # Grafana dashboards
    ├── minio/                     # S3-compatible object storage
    ├── mlflow/                    # ML lifecycle tracking
    ├── vllm/                      # LLM inference (Ollama)
    ├── label_studio/              # Data annotation platform
    ├── argoflow/                  # Argo Workflows
    ├── otlp/                      # OpenTelemetry collector
    └── observability/             # Fluent-Bit log aggregation
```

## Build & Deployment Commands

### Phase 1: Cluster Infrastructure

```bash
# Validate connectivity
ansible -i ansible/inventories/prod/hosts.ini all -m ping

# Bootstrap Kubernetes cluster
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

### Phase 2: Kubernetes Applications (order matters)

```bash
# 1. Namespaces & storage
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/storage/storageclass.yaml

# 2. Secrets & auth
kubectl apply -f k8s/01-secrets/

# 3. Infrastructure (databases, cache, search)
kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

# 4. Application layer
kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

# 5. Observability & telemetry
kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

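Because the order matters, it helps to let each tier settle before applying the next. A minimal readiness sketch (the `app=` labels and resource names follow the conventions used in these manifests; the postgres selector is an assumption since its manifest is not shown here):

```bash
# Wait for the data tier before moving on to the application layer
kubectl -n db wait --for=condition=ready pod -l app=postgres --timeout=300s   # label assumed
kubectl -n elastic wait --for=condition=ready pod -l app=elasticsearch --timeout=300s

# Application layer example: Gitea is a StatefulSet named 'gitea' in namespace 'scm'
kubectl -n scm rollout status statefulset/gitea --timeout=300s
```
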
## Namespace Organization

| Namespace | Purpose | Services |
|-----------|---------|----------|
| `db` | Databases & cache | PostgreSQL, Redis |
| `scm` | Source control | Gitea |
| `ml` | Machine Learning | JupyterLab, MLflow, Argo, Label Studio, Ollama |
| `elastic` | Search & logging | Elasticsearch, Kibana |
| `broker` | Message brokers | Kafka |
| `graph` | Graph databases | Neo4j |
| `monitoring` | Observability | Prometheus, Grafana |
| `observability` | Telemetry | OpenTelemetry, Fluent-Bit |
| `storage` | Object storage | MinIO |

## Key Configuration

**Kubernetes:**
- Pod CIDR: 10.244.0.0/16
- Service CIDR: 10.96.0.0/12
- CNI: Cilium v1.15.7

**Storage:**
- StorageClass: `local-ssd-hetzner` (local volumes)
- All stateful workloads pinned to hetzner-2
- Local path: `/mnt/local-ssd/{service-name}`

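A quick way to confirm the storage wiring (a sketch; assumes an admin kubeconfig on the control plane):

```bash
# StorageClass and statically provisioned local PVs
kubectl get storageclass local-ssd-hetzner
kubectl get pv -o wide

# Stateful pods should all land on hetzner-2
kubectl get pods -A -o wide --field-selector spec.nodeName=hetzner-2
```
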
**Networking:**
- Internal DNS: `service.namespace.svc.cluster.local`
- External: `{service}.betelgeusebytes.io` via NGINX Ingress
- TLS: Let's Encrypt via cert-manager

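Both paths can be exercised quickly; the service and host names below are taken from the manifests in this commit (a sketch, not a full health check):

```bash
# Internal DNS: resolve a Service by its cluster-local name
kubectl run dns-check --rm -i --restart=Never --image=busybox:1.36 -- \
  nslookup elasticsearch.elastic.svc.cluster.local

# External: ingress + Let's Encrypt certificate (a 401 from basic auth still proves TLS works)
curl -sSI https://grafana.betelgeusebytes.io | head -n 1
echo | openssl s_client -connect grafana.betelgeusebytes.io:443 \
  -servername grafana.betelgeusebytes.io 2>/dev/null | openssl x509 -noout -issuer -dates
```
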
## DNS Records

A records point to both nodes:
- `apps.betelgeusebytes.io` → 95.217.89.53, 138.201.254.97

CNAMEs to `apps.betelgeusebytes.io`:
- gitea, kibana, grafana, prometheus, notebook, broker, neo4j, otlp, label, llm, mlflow, minio

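The zone can be verified from any machine with `dig` (sketch):

```bash
# Shared entry point: should return both node IPs
dig +short apps.betelgeusebytes.io A

# CNAMEs resolve through apps.betelgeusebytes.io to the same two addresses
dig +short grafana.betelgeusebytes.io
dig +short gitea.betelgeusebytes.io
```
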
## Secrets Location

- `k8s/01-secrets/basic-auth.yaml` - HTTP basic auth for protected services
- Service-specific secrets inline in respective manifests (e.g., postgres-auth, redis-auth)

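The basic-auth secrets expect an htpasswd-formatted `auth` entry (the comment in `k8s/01-secrets/basic-auth.yaml` shows the same command). A sketch for generating a pair and loading it without hand-editing YAML, using the Grafana secret as the example:

```bash
# Generate a bcrypt htpasswd pair for user 'admin' (requires apache2-utils / httpd-tools)
htpasswd -nbBC 10 admin 'Str0ngP@ss'

# Or write it to a file and create/update the secret directly
htpasswd -cbBC 10 ./auth admin 'Str0ngP@ss'
kubectl -n monitoring create secret generic basic-auth-grafana \
  --from-file=auth=./auth --dry-run=client -o yaml | kubectl apply -f -
```
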
## Manifest Conventions

1. Compact YAML style: `metadata: { name: xyz, namespace: ns }`
2. StatefulSets for persistent services (databases, brokers)
3. Deployments for stateless services (web UIs, workers)
4. DaemonSets for node-level agents (Fluent-Bit)
5. Service port=80 for ingress routing, backend maps to container port
6. Ingress with TLS + basic auth annotations where needed

## Common Operations

```bash
# Check cluster status
kubectl get nodes
kubectl get pods -A

# View logs for a service
kubectl logs -n <namespace> -l app=<service-name>

# Scale a deployment
kubectl scale -n <namespace> deployment/<name> --replicas=N

# Apply changes to a specific service
kubectl apply -f k8s/<service>/

# Delete and recreate a service
kubectl delete -f k8s/<service>/ && kubectl apply -f k8s/<service>/
```

## Notes

- This is a development/test setup; passwords are hardcoded in manifests
- Elasticsearch security is disabled for development
- GPU support for vLLM is commented out (requires nvidia.com/gpu resources)
- Neo4j Bolt protocol (7687) requires manual ingress-nginx TCP patch

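For the Neo4j Bolt note: this commit ships an ingress-nginx kustomization that adds a `tcp-services` ConfigMap and exposes 5432/7687 on the controller Service. A sketch of applying and checking it; the kustomization directory path is an assumption, point it at wherever that `kustomization.yaml` lives:

```bash
# Apply the TCP-services patch (directory path assumed)
kubectl apply -k k8s/ingress-nginx/

# Confirm the controller Service and ConfigMap picked up the extra TCP ports
kubectl -n ingress-nginx get svc ingress-nginx-controller
kubectl -n ingress-nginx get configmap tcp-services -o yaml
```
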
@@ -0,0 +1,10 @@
apps.betelgeusebytes.io. 300 IN A 95.217.89.53
apps.betelgeusebytes.io. 300 IN A 138.201.254.97
gitea.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
kibana.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
grafana.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
prometheus.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
notebook.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
broker.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
neo4j.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.
otlp.betelgeusebytes.io. 300 IN CNAME apps.betelgeusebytes.io.

@@ -0,0 +1,43 @@
# BetelgeuseBytes K8s — Full Stack (kubectl-only)

**Nodes**
- Control-plane + worker: hetzner-1 (95.217.89.53)
- Worker: hetzner-2 (138.201.254.97)

## Bring up the cluster
```bash
ansible -i ansible/inventories/prod/hosts.ini all -m ping
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/site.yml
```

## Apply apps (edit secrets first)
```bash
kubectl apply -f k8s/00-namespaces.yaml
kubectl apply -f k8s/01-secrets/
kubectl apply -f k8s/storage/storageclass.yaml

kubectl apply -f k8s/postgres/
kubectl apply -f k8s/redis/
kubectl apply -f k8s/elastic/elasticsearch.yaml
kubectl apply -f k8s/elastic/kibana.yaml

kubectl apply -f k8s/gitea/
kubectl apply -f k8s/jupyter/
kubectl apply -f k8s/kafka/kafka.yaml
kubectl apply -f k8s/kafka/kafka-ui.yaml
kubectl apply -f k8s/neo4j/

kubectl apply -f k8s/otlp/
kubectl apply -f k8s/observability/fluent-bit.yaml
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
```

## DNS
A records:
- apps.betelgeusebytes.io → 95.217.89.53, 138.201.254.97

CNAMEs → apps.betelgeusebytes.io:
- gitea., kibana., grafana., prometheus., notebook., broker., neo4j., otlp.

(HA later) cp.k8s.betelgeusebytes.io → <VPS_IP>, 95.217.89.53, 138.201.254.97; then set control_plane_endpoint accordingly.

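For the HA step, the commit already includes `ansible/playbooks/add-control-planes.yml` and a `[new_control_planes]` inventory group. A sketch of the sequence, assuming the `cp.k8s.betelgeusebytes.io` record and `control_plane_endpoint` in `group_vars/all.yml` have been updated first:

```bash
# 1. Add the VPS host(s) under [new_control_planes] in hosts.ini and verify reachability
ansible -i ansible/inventories/prod/hosts.ini new_control_planes -m ping

# 2. Join them as additional control planes
ansible-playbook -i ansible/inventories/prod/hosts.ini ansible/playbooks/add-control-planes.yml
```
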
@@ -0,0 +1,13 @@
cluster_name: prod
k8s_version: "v1.30.3"
control_plane_endpoint: "95.217.89.53:6443"   # switch later to cp.k8s.betelgeusebytes.io:6443

pod_cidr: "10.244.0.0/16"
service_cidr: "10.96.0.0/12"
cilium_version: "1.15.7"

local_path_dir: "/srv/k8s"
local_sc_name: "local-ssd-hetzner"

stateful_node_label_key: "node"
stateful_node_label_val: "hetzner-2"

@@ -0,0 +1,19 @@
[k8s_control_plane]
hetzner-1 ansible_host=95.217.89.53 public_ip=95.217.89.53 wg_address=10.66.0.11

[k8s_workers]
hetzner-1 ansible_host=95.217.89.53 public_ip=95.217.89.53 wg_address=10.66.0.11
hetzner-2 ansible_host=138.201.254.97 public_ip=138.201.254.97 wg_address=10.66.0.12

[k8s_nodes:children]
k8s_control_plane
k8s_workers

# add tiny VPS control-planes here when ready
[new_control_planes]
# cp-a ansible_host=<VPS1_IP> public_ip=<VPS1_IP> wg_address=10.66.0.10

[all:vars]
ansible_user=root
ansible_password=3Lcd0504
ansible_become=true

@@ -0,0 +1,19 @@
- hosts: k8s_control_plane[0]
  become: yes
  roles:
    - kubeadm_cp_discovery

- hosts: new_control_planes
  become: yes
  roles:
    - common
    - wireguard
    - containerd
    - kubernetes

- hosts: new_control_planes
  become: yes
  roles:
    - kubeadm_join_cp
  vars:
    kubeadm_cp_join_cmd: "{{ hostvars[groups['k8s_control_plane'][0]].kubeadm_cp_join_cmd | default(kubeadm_cp_join_cmd) }}"

@@ -0,0 +1,31 @@
- hosts: k8s_nodes
  become: yes
  # serial: 1
  roles:
    # - ../roles/common
    #- ../roles/wireguard
    #- ../roles/containerd
    #- ../roles/kubernetes

- hosts: k8s_control_plane
  become: yes
  roles:
    - ../roles/kubeadm_init

# - hosts: k8s_workers
#   become: yes
#   roles:
#     - ../roles/kubeadm_join

- hosts: k8s_control_plane
  become: yes
  roles:
    # - ../roles/cilium
    # - ../roles/ingress
    #- ../roles/cert_manager

- hosts: k8s_nodes
  become: yes
  roles:
    #- ../roles/storage_local_path
    - ../roles/labels

@@ -0,0 +1,66 @@
- name: Install cert-manager
  shell: kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml

- name: Wait for cert-manager pods to be ready
  shell: kubectl wait --for=condition=ready --timeout=300s pod -l app.kubernetes.io/instance=cert-manager -n cert-manager

- name: Wait for webhook endpoint to be ready
  shell: |
    for i in {1..30}; do
      if kubectl get endpoints cert-manager-webhook -n cert-manager -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -q .; then
        echo "Webhook endpoint is ready"
        exit 0
      fi
      echo "Waiting for webhook endpoint... attempt $i/30"
      sleep 2
    done
    exit 1

- name: Test webhook connectivity
  shell: kubectl run test-webhook --image=curlimages/curl:latest --rm -i --restart=Never -- curl -k https://cert-manager-webhook.cert-manager.svc:443/healthz
  register: webhook_test
  ignore_errors: yes

- name: Display webhook test result
  debug:
    var: webhook_test

- name: ClusterIssuer
  copy:
    dest: /root/cluster-issuer-prod.yaml
    content: |
      apiVersion: cert-manager.io/v1
      kind: ClusterIssuer
      metadata:
        name: letsencrypt-prod
      spec:
        acme:
          email: admin@betelgeusebytes.io
          server: https://acme-v02.api.letsencrypt.org/directory
          privateKeySecretRef:
            name: letsencrypt-prod-key
          solvers:
            - http01:
                ingress:
                  class: nginx

- name: Temporarily disable cert-manager webhook
  shell: |
    kubectl delete validatingwebhookconfiguration cert-manager-webhook || true
  ignore_errors: yes

- name: Apply ClusterIssuer
  command: kubectl apply -f /root/cluster-issuer-prod.yaml

- name: Reinstall cert-manager to restore webhook
  shell: kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml

@@ -0,0 +1,9 @@
- name: Install cilium CLI
  shell: |
    curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
    tar xzf cilium-linux-amd64.tar.gz -C /usr/local/bin
  args: { creates: /usr/local/bin/cilium }

- name: Deploy cilium
  shell: |
    cilium install --version {{ cilium_version }} --set kubeProxyReplacement=strict --set bpf.masquerade=true

@@ -0,0 +1,31 @@
- name: Disable swap
  command: swapoff -a
  when: ansible_swaptotal_mb|int > 0

- name: Ensure swap disabled on boot
  replace:
    path: /etc/fstab
    regexp: '^([^#].*\sswap\s)'
    replace: '# \1'

- name: Kernel modules
  copy:
    dest: /etc/modules-load.d/containerd.conf
    content: |
      overlay
      br_netfilter

- name: Load modules
  command: modprobe {{ item }}
  loop: [overlay, br_netfilter]

- name: Sysctl for k8s
  copy:
    dest: /etc/sysctl.d/99-kubernetes.conf
    content: |
      net.bridge.bridge-nf-call-iptables = 1
      net.bridge.bridge-nf-call-ip6tables = 1
      net.ipv4.ip_forward = 1
      vm.max_map_count = 262144
- name: Apply sysctl
  command: sysctl --system

@@ -0,0 +1,27 @@
- name: Install containerd
  apt:
    name: containerd
    state: present
    update_cache: yes

- name: Ensure containerd config directory
  file:
    path: /etc/containerd
    state: directory
    mode: '0755'

- name: Generate default config
  shell: containerd config default > /etc/containerd/config.toml
  args: { creates: /etc/containerd/config.toml }

- name: Ensure SystemdCgroup=true
  replace:
    path: /etc/containerd/config.toml
    regexp: 'SystemdCgroup = false'
    replace: 'SystemdCgroup = true'

- name: Restart containerd
  service:
    name: containerd
    state: restarted
    enabled: yes

@@ -0,0 +1,2 @@
- name: Deploy ingress-nginx (baremetal)
  shell: kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

@@ -0,0 +1,24 @@
- name: Upload certs and get certificate key
  shell: kubeadm init phase upload-certs --upload-certs | tail -n 1
  register: cert_key

- name: Compute CA cert hash
  shell: |
    openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | awk '{print $2}'
  register: ca_hash

- name: Create short-lived token
  shell: kubeadm token create --ttl 30m
  register: join_token

- name: Determine control-plane endpoint
  set_fact:
    cp_endpoint: "{{ hostvars[inventory_hostname].control_plane_endpoint | default(ansible_host ~ ':6443') }}"

- set_fact:
    kubeadm_cp_join_cmd: >-
      kubeadm join {{ cp_endpoint }}
      --token {{ join_token.stdout }}
      --discovery-token-ca-cert-hash sha256:{{ ca_hash.stdout }}
      --control-plane
      --certificate-key {{ cert_key.stdout }}

@@ -0,0 +1,24 @@
# - name: Write kubeadm config
#   template:
#     src: kubeadm-config.yaml.j2
#     dest: /etc/kubernetes/kubeadm-config.yaml

# - name: Pre-pull images
#   command: kubeadm config images pull

# - name: Init control-plane
#   command: kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml
#   args: { creates: /etc/kubernetes/admin.conf }

# - name: Setup kubeconfig
#   shell: |
#     mkdir -p $HOME/.kube
#     cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
#     chown $(id -u):$(id -g) $HOME/.kube/config

- name: Save join command
  shell: kubeadm token create --print-join-command
  register: join_cmd

- set_fact:
    kubeadm_join_command_all: "{{ join_cmd.stdout }}"

@@ -0,0 +1,14 @@
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: {{ k8s_version }}
clusterName: {{ cluster_name }}
controlPlaneEndpoint: "{{ control_plane_endpoint }}"
networking:
  podSubnet: "{{ pod_cidr }}"
  serviceSubnet: "{{ service_cidr }}"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    node-ip: "{{ hostvars[inventory_hostname].wg_address | default(hostvars[inventory_hostname].public_ip) }}"

@@ -0,0 +1,2 @@
- name: Join node to cluster
  command: "{{ hostvars[groups['k8s_control_plane'][0]].kubeadm_join_command_all }} --ignore-preflight-errors=FileAvailable--etc-kubernetes-kubelet.conf,FileAvailable--etc-kubernetes-pki-ca.crt,Port-10250"

@@ -0,0 +1,9 @@
- name: Ensure join command provided
  fail:
    msg: "Set kubeadm_cp_join_cmd variable (string)"
  when: kubeadm_cp_join_cmd is not defined

- name: Join node as control-plane
  command: "{{ kubeadm_cp_join_cmd }}"
  args:
    creates: /etc/kubernetes/kubelet.conf

@@ -0,0 +1,17 @@
- name: Install Kubernetes apt key
  shell: curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
  args: { creates: /etc/apt/keyrings/kubernetes-apt-keyring.gpg }

- name: Add Kubernetes repo
  apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /"
    state: present

- name: Install kubeadm, kubelet, kubectl
  apt:
    name: [kubeadm, kubelet, kubectl]
    state: present
    update_cache: yes

- name: Hold kube packages
  command: apt-mark hold kubeadm kubelet kubectl

@@ -0,0 +1,4 @@
- name: Label hetzner-2 for stateful
  command: kubectl label node hetzner-2 {{ stateful_node_label_key }}={{ stateful_node_label_val }} --overwrite
  delegate_to: "{{ groups['k8s_control_plane'][0] }}"
  run_once: true

@@ -0,0 +1,55 @@
- name: Ensure local path dir
  file:
    path: "{{ local_path_dir }}"
    state: directory
    mode: '0777'

- name: StorageClass local-ssd-hetzner
  copy:
    dest: /root/local-sc.yaml
    content: |
      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
        name: {{ local_sc_name }}
      provisioner: kubernetes.io/no-provisioner
      volumeBindingMode: WaitForFirstConsumer
  when: inventory_hostname in groups['k8s_control_plane']

- name: Apply SC
  command: kubectl apply -f /root/local-sc.yaml
  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf
  when: inventory_hostname in groups['k8s_control_plane']

- name: Create local-path directory
  file:
    path: /mnt/local-ssd
    state: directory
    mode: '0755'

- name: Create subdirectories for each PV
  file:
    path: "/mnt/local-ssd/{{ item }}"
    state: directory
    mode: '0755'
  loop:
    - postgres
    - prometheus
    - elasticsearch
    - grafana

- name: Copy PV manifest
  template:
    src: local-ssd-pv.yaml
    dest: /tmp/local-ssd-pv.yaml

- name: Apply PV
  command: kubectl apply -f /tmp/local-ssd-pv.yaml
  run_once: true
  delegate_to: "{{ groups['k8s_control_plane'][0] }}"

- name: Apply SC
  command: kubectl apply -f /tmp/local-ssd-sc.yaml
  run_once: true
  delegate_to: "{{ groups['k8s_control_plane'][0] }}"

@@ -0,0 +1,65 @@
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-postgres
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd-hetzner
  local:
    path: /mnt/local-ssd/postgres
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - hetzner-2
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-prometheus
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd-hetzner
  local:
    path: /mnt/local-ssd/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - hetzner-2
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-elasticsearch
spec:
  capacity:
    storage: 300Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd-hetzner
  local:
    path: /mnt/local-ssd/elasticsearch
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - hetzner-2

@@ -0,0 +1,62 @@
- name: Install wireguard
  apt:
    name: [wireguard, qrencode]
    state: present
    update_cache: yes

- name: Ensure key dir
  file: { path: /etc/wireguard/keys, state: directory, mode: '0700' }

- name: Generate private key if missing
  shell: "[ -f /etc/wireguard/keys/privatekey ] || (umask 077 && wg genkey > /etc/wireguard/keys/privatekey)"
  args: { creates: /etc/wireguard/keys/privatekey }

- name: Generate public key
  shell: "wg pubkey < /etc/wireguard/keys/privatekey > /etc/wireguard/keys/publickey"
  args: { creates: /etc/wireguard/keys/publickey }

- name: Read pubkey
  slurp: { src: /etc/wireguard/keys/publickey }
  register: pubkey_raw

- name: Read private key
  slurp: { src: /etc/wireguard/keys/privatekey }
  register: privkey_raw

- set_fact:
    wg_public_key: "{{ pubkey_raw.content | b64decode | trim }}"
    wg_private_key: "{{ privkey_raw.content | b64decode | trim }}"

- name: Gather facts from all hosts
  setup:
  delegate_to: "{{ item }}"
  delegate_facts: true
  loop: "{{ groups['k8s_nodes'] }}"
  run_once: true

- name: Pretty print hostvars
  debug:
    msg: "{{ hostvars['hetzner-1']['wg_public_key'] }}"

- name: Render config
  template:
    src: wg0.conf.j2
    dest: /etc/wireguard/wg0.conf
    mode: '0600'

- name: Enable IP forward
  sysctl:
    name: net.ipv4.ip_forward
    value: "1"
    sysctl_set: yes
    state: present
    reload: yes

- name: Enable wg-quick
  service:
    name: wg-quick@wg0
    enabled: yes
    state: started

- debug:
    var: wg_show.stdout

@@ -0,0 +1,12 @@
[Interface]
Address = {{ wg_nodes[inventory_hostname].address }}/24
ListenPort = {{ wg_port }}
PrivateKey = {{ wg_private_key }}

{% for h in groups['k8s_nodes'] if h != inventory_hostname %}
[Peer]
PublicKey = {{ hostvars[h].wg_public_key }}
AllowedIPs = {{ wg_nodes[h].address }}/32
Endpoint = {{ wg_nodes[h].public_ip }}:{{ wg_port }}
PersistentKeepalive = 25
{% endfor %}

@@ -0,0 +1,6 @@
wg_interface: wg0
wg_port: 51820
wg_cidr: 10.66.0.0/24
wg_nodes:
  hetzner-1: { address: 10.66.0.11, public_ip: "95.217.89.53" }
  hetzner-2: { address: 10.66.0.12, public_ip: "138.201.254.97" }

@@ -0,0 +1,31 @@
apiVersion: v1
kind: Namespace
metadata: { name: db }
---
apiVersion: v1
kind: Namespace
metadata: { name: scm }
---
apiVersion: v1
kind: Namespace
metadata: { name: ml }
---
apiVersion: v1
kind: Namespace
metadata: { name: monitoring }
---
apiVersion: v1
kind: Namespace
metadata: { name: elastic }
---
apiVersion: v1
kind: Namespace
metadata: { name: broker }
---
apiVersion: v1
kind: Namespace
metadata: { name: graph }
---
apiVersion: v1
kind: Namespace
metadata: { name: observability }

@@ -0,0 +1,38 @@
# Replace each 'auth' line with a real htpasswd pair:
#   htpasswd -nbBC 10 admin 'Str0ngP@ss'   (copy 'admin:...' to value below)

apiVersion: v1
kind: Secret
metadata: { name: basic-auth-kibana, namespace: elastic }
type: Opaque
stringData: { auth: "admin:$2y$10$MBLgALyI7xwFrQh2PHqZruX.EzaTUGagmJODwpBEvF27snFAxCBvq" }
---
apiVersion: v1
kind: Secret
metadata: { name: basic-auth-grafana, namespace: monitoring }
type: Opaque
stringData: { auth: "admin:$2y$10$MBLgALyI7xwFrQh2PHqZruX.EzaTUGagmJODwpBEvF27snFAxCBvq" }
---
apiVersion: v1
kind: Secret
metadata: { name: basic-auth-prometheus, namespace: monitoring }
type: Opaque
stringData: { auth: "admin:$2y$10$MBLgALyI7xwFrQh2PHqZruX.EzaTUGagmJODwpBEvF27snFAxCBvq" }
---
apiVersion: v1
kind: Secret
metadata: { name: basic-auth-notebook, namespace: ml }
type: Opaque
stringData: { auth: "admin:$2y$10$MBLgALyI7xwFrQh2PHqZruX.EzaTUGagmJODwpBEvF27snFAxCBvq" }
---
apiVersion: v1
kind: Secret
metadata: { name: basic-auth-broker, namespace: broker }
type: Opaque
stringData: { auth: "admin:$2y$10$MBLgALyI7xwFrQh2PHqZruX.EzaTUGagmJODwpBEvF27snFAxCBvq" }
---
apiVersion: v1
kind: Secret
metadata: { name: basic-auth-neo4j, namespace: graph }
type: Opaque
stringData: { auth: "admin:$2y$10$MBLgALyI7xwFrQh2PHqZruX.EzaTUGagmJODwpBEvF27snFAxCBvq" }

@ -0,0 +1,146 @@
|
|||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: argo-artifacts
|
||||
namespace: ml
|
||||
type: Opaque
|
||||
stringData:
|
||||
accesskey: "minioadmin" # <-- change
|
||||
secretkey: "minioadmin" # <-- change
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: workflow-controller-configmap
|
||||
namespace: ml
|
||||
data:
|
||||
config: |
|
||||
artifactRepository:
|
||||
s3:
|
||||
bucket: argo-artifacts
|
||||
endpoint: minio.betelgeusebytes.io # no scheme here
|
||||
insecure: false # https via Ingress
|
||||
accessKeySecret:
|
||||
name: argo-artifacts
|
||||
key: accesskey
|
||||
secretKeySecret:
|
||||
name: argo-artifacts
|
||||
key: secretkey
|
||||
keyFormat: "{{workflow.namespace}}/{{workflow.name}}/{{pod.name}}"
|
||||
|
||||
---
|
||||
# k8s/argo/workflows/ns-rbac.yaml
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: argo-server
|
||||
namespace: ml
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: Role
|
||||
metadata:
|
||||
name: argo-namespaced
|
||||
namespace: ml
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources: ["pods","pods/log","secrets","configmaps","events","persistentvolumeclaims","serviceaccounts"]
|
||||
verbs: ["get","list","watch","create","delete","patch","update"]
|
||||
- apiGroups: ["coordination.k8s.io"]
|
||||
resources: ["leases"]
|
||||
verbs: ["get","list","watch","create","delete","patch","update"]
|
||||
- apiGroups: ["argoproj.io"]
|
||||
resources: ["workflows","workflowtemplates","cronworkflows","workfloweventbindings","sensors","eventsources","workflowtasksets","workflowartifactgctasks","workflowtaskresults"]
|
||||
verbs: ["get","list","watch","create","delete","patch","update"]
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: RoleBinding
|
||||
metadata:
|
||||
name: argo-namespaced-binding
|
||||
namespace: ml
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: argo-server
|
||||
namespace: ml
|
||||
roleRef:
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
kind: Role
|
||||
name: argo-namespaced
|
||||
|
||||
---
|
||||
# k8s/argo/workflows/controller.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: workflow-controller, namespace: ml }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: workflow-controller } }
|
||||
template:
|
||||
metadata: { labels: { app: workflow-controller } }
|
||||
spec:
|
||||
serviceAccountName: argo-server
|
||||
containers:
|
||||
- name: controller
|
||||
image: quay.io/argoproj/workflow-controller:latest
|
||||
args: ["--namespaced"]
|
||||
env:
|
||||
- name: LEADER_ELECTION_IDENTITY
|
||||
valueFrom:
|
||||
fieldRef:
|
||||
fieldPath: metadata.name
|
||||
ports: [{ containerPort: 9090 }]
|
||||
readinessProbe:
|
||||
httpGet: { path: /metrics, port: 9090, scheme: HTTPS }
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 10
|
||||
livenessProbe:
|
||||
httpGet: { path: /metrics, port: 9090, scheme: HTTPS }
|
||||
initialDelaySeconds: 20
|
||||
periodSeconds: 20
|
||||
|
||||
---
|
||||
# k8s/argo/workflows/server.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: argo-server, namespace: ml }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: argo-server } }
|
||||
template:
|
||||
metadata: { labels: { app: argo-server } }
|
||||
spec:
|
||||
serviceAccountName: argo-server
|
||||
containers:
|
||||
- name: server
|
||||
image: quay.io/argoproj/argocli:latest
|
||||
args: ["server","--auth-mode","server","--namespaced","--secure=false"]
|
||||
ports: [{ containerPort: 2746 }]
|
||||
readinessProbe:
|
||||
httpGet: { path: /, port: 2746, scheme: HTTP }
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 10
|
||||
livenessProbe:
|
||||
httpGet: { path: /, port: 2746, scheme: HTTP }
|
||||
initialDelaySeconds: 20
|
||||
periodSeconds: 20
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: argo-server, namespace: ml }
|
||||
spec: { selector: { app: argo-server }, ports: [ { port: 80, targetPort: 2746 } ] }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: argo
|
||||
namespace: ml
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["argo.betelgeusebytes.io"], secretName: argo-tls }]
|
||||
rules:
|
||||
- host: argo.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: argo-server, port: { number: 80 } } }
|
||||
|
|
@ -0,0 +1,217 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: automation
|
||||
labels:
|
||||
name: automation
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: n8n-pv
|
||||
labels:
|
||||
app: n8n
|
||||
spec:
|
||||
capacity:
|
||||
storage: 20Gi
|
||||
volumeMode: Filesystem
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd
|
||||
local:
|
||||
path: /mnt/local-ssd/n8n
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: n8n-data
|
||||
namespace: automation
|
||||
labels:
|
||||
app: n8n
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: local-ssd
|
||||
resources:
|
||||
requests:
|
||||
storage: 20Gi
|
||||
selector:
|
||||
matchLabels:
|
||||
app: n8n
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: n8n-secrets
|
||||
namespace: automation
|
||||
type: Opaque
|
||||
stringData:
|
||||
# Generate a strong encryption key with: openssl rand -base64 32
|
||||
N8N_ENCRYPTION_KEY: "G/US0ePajEpWwRUjlchyOs6+6I/AT+0bisXmE2fugSU="
|
||||
# Optional: Database connection if using PostgreSQL
|
||||
DB_TYPE: "postgresdb"
|
||||
DB_POSTGRESDB_HOST: "pg.betelgeusebytes.io"
|
||||
DB_POSTGRESDB_PORT: "5432"
|
||||
DB_POSTGRESDB_DATABASE: "n8n"
|
||||
DB_POSTGRESDB_USER: "app"
|
||||
DB_POSTGRESDB_PASSWORD: "pa$$word"
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: n8n
|
||||
namespace: automation
|
||||
spec:
|
||||
serviceName: n8n
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: n8n
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: n8n
|
||||
spec:
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
containers:
|
||||
- name: n8n
|
||||
image: n8nio/n8n:latest
|
||||
ports:
|
||||
- containerPort: 5678
|
||||
name: http
|
||||
env:
|
||||
- name: N8N_HOST
|
||||
value: "n8n.betelgeusebytes.io"
|
||||
- name: N8N_PORT
|
||||
value: "5678"
|
||||
- name: N8N_PROTOCOL
|
||||
value: "https"
|
||||
- name: WEBHOOK_URL
|
||||
value: "https://n8n.betelgeusebytes.io/"
|
||||
- name: GENERIC_TIMEZONE
|
||||
value: "UTC"
|
||||
- name: N8N_ENCRYPTION_KEY
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: N8N_ENCRYPTION_KEY
|
||||
# Uncomment if using PostgreSQL
|
||||
- name: DB_TYPE
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: DB_TYPE
|
||||
- name: DB_POSTGRESDB_HOST
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: DB_POSTGRESDB_HOST
|
||||
- name: DB_POSTGRESDB_PORT
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: DB_POSTGRESDB_PORT
|
||||
- name: DB_POSTGRESDB_DATABASE
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: DB_POSTGRESDB_DATABASE
|
||||
- name: DB_POSTGRESDB_USER
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: DB_POSTGRESDB_USER
|
||||
- name: DB_POSTGRESDB_PASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: n8n-secrets
|
||||
key: DB_POSTGRESDB_PASSWORD
|
||||
volumeMounts:
|
||||
- name: n8n-data
|
||||
mountPath: /home/node/.n8n
|
||||
resources:
|
||||
requests:
|
||||
memory: "512Mi"
|
||||
cpu: "250m"
|
||||
limits:
|
||||
memory: "2Gi"
|
||||
cpu: "1000m"
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /healthz
|
||||
port: 5678
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 30
|
||||
timeoutSeconds: 10
|
||||
failureThreshold: 5
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /healthz
|
||||
port: 5678
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 3
|
||||
volumes:
|
||||
- name: n8n-data
|
||||
persistentVolumeClaim:
|
||||
claimName: n8n-data
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: n8n
|
||||
namespace: automation
|
||||
labels:
|
||||
app: n8n
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 5678
|
||||
targetPort: 5678
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
app: n8n
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: n8n
|
||||
namespace: automation
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: "letsencrypt-prod"
|
||||
# nginx.ingress.kubernetes.io/proxy-body-size: "50m"
|
||||
# nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
|
||||
# nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
|
||||
# Uncomment below if you want basic auth protection in addition to n8n's auth
|
||||
# nginx.ingress.kubernetes.io/auth-type: basic
|
||||
# nginx.ingress.kubernetes.io/auth-secret: n8n-basic-auth
|
||||
# nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required'
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls:
|
||||
- hosts:
|
||||
- n8n.betelgeusebytes.io
|
||||
secretName: wildcard-betelgeusebytes-tls
|
||||
rules:
|
||||
- host: n8n.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: n8n
|
||||
port:
|
||||
number: 5678
|
||||
|
|
@ -0,0 +1,10 @@
|
|||
apiVersion: cert-manager.io/v1
|
||||
kind: ClusterIssuer
|
||||
metadata: { name: letsencrypt-prod }
|
||||
spec:
|
||||
acme:
|
||||
email: angal.salah@gmail.com
|
||||
server: https://acme-v02.api.letsencrypt.org/directory
|
||||
privateKeySecretRef: { name: letsencrypt-prod-key }
|
||||
solvers:
|
||||
- http01: { ingress: { class: nginx } }
|
||||
|
|
@ -0,0 +1,21 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-elasticsearch
|
||||
spec:
|
||||
capacity:
|
||||
storage: 80Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/elasticsearch
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
|
@ -0,0 +1,38 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: elasticsearch, namespace: elastic }
|
||||
spec:
|
||||
ports:
|
||||
- { name: http, port: 9200, targetPort: 9200 }
|
||||
- { name: transport, port: 9300, targetPort: 9300 }
|
||||
selector: { app: elasticsearch }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata: { name: elasticsearch, namespace: elastic }
|
||||
spec:
|
||||
serviceName: elasticsearch
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: elasticsearch } }
|
||||
template:
|
||||
metadata: { labels: { app: elasticsearch } }
|
||||
spec:
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: es
|
||||
image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0
|
||||
env:
|
||||
- { name: discovery.type, value: single-node }
|
||||
- { name: xpack.security.enabled, value: "false" }
|
||||
- { name: ES_JAVA_OPTS, value: "-Xms2g -Xmx2g" }
|
||||
ports:
|
||||
- { containerPort: 9200 }
|
||||
- { containerPort: 9300 }
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /usr/share/elasticsearch/data }
|
||||
volumeClaimTemplates:
|
||||
- metadata: { name: data }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 80Gi } }
|
||||
|
|
@ -0,0 +1,44 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: kibana, namespace: elastic }
|
||||
spec:
|
||||
ports: [{ port: 5601, targetPort: 5601 }]
|
||||
selector: { app: kibana }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: kibana, namespace: elastic }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: kibana } }
|
||||
template:
|
||||
metadata: { labels: { app: kibana } }
|
||||
spec:
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: kibana
|
||||
image: docker.elastic.co/kibana/kibana:8.14.0
|
||||
env:
|
||||
- { name: ELASTICSEARCH_HOSTS, value: "http://elasticsearch.elastic.svc.cluster.local:9200" }
|
||||
ports: [{ containerPort: 5601 }]
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: kibana
|
||||
namespace: elastic
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
# nginx.ingress.kubernetes.io/auth-type: basic
|
||||
# nginx.ingress.kubernetes.io/auth-secret: basic-auth-kibana
|
||||
# nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["kibana.betelgeusebytes.io"], secretName: kibana-tls }]
|
||||
rules:
|
||||
- host: kibana.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: kibana, port: { number: 5601 } } }
|
||||
|
|
@ -0,0 +1,21 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-gitea
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/gitea
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
|
@ -0,0 +1,54 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: gitea, namespace: scm }
|
||||
spec:
|
||||
ports: [{ port: 80, targetPort: 3000 }]
|
||||
selector: { app: gitea }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata: { name: gitea, namespace: scm }
|
||||
spec:
|
||||
serviceName: gitea
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: gitea } }
|
||||
template:
|
||||
metadata: { labels: { app: gitea } }
|
||||
spec:
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: gitea
|
||||
image: gitea/gitea:1.21.11
|
||||
env:
|
||||
- { name: GITEA__server__ROOT_URL, value: "https://gitea.betelgeusebytes.io" }
|
||||
- { name: GITEA__database__DB_TYPE, value: "postgres" }
|
||||
- { name: GITEA__database__HOST, value: "postgres.db.svc.cluster.local:5432" }
|
||||
- { name: GITEA__database__NAME, value: "gitea" }
|
||||
- { name: GITEA__database__USER, value: "app" }
|
||||
- { name: GITEA__database__PASSWD, value: "pa$$word" }
|
||||
ports: [{ containerPort: 3000 }]
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /data }
|
||||
volumeClaimTemplates:
|
||||
- metadata: { name: data }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 50Gi } }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: gitea
|
||||
namespace: scm
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["gitea.betelgeusebytes.io"], secretName: gitea-tls }]
|
||||
rules:
|
||||
- host: gitea.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: gitea, port: { number: 80 } } }
|
||||
|
|
@ -0,0 +1,45 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: grafana, namespace: monitoring }
|
||||
spec:
|
||||
ports: [{ port: 80, targetPort: 3000 }]
|
||||
selector: { app: grafana }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: grafana, namespace: monitoring }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: grafana } }
|
||||
template:
|
||||
metadata: { labels: { app: grafana } }
|
||||
spec:
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: grafana
|
||||
image: grafana/grafana:10.4.3
|
||||
env:
|
||||
- { name: GF_SECURITY_ADMIN_USER, value: admin }
|
||||
- { name: GF_SECURITY_ADMIN_PASSWORD, value: "ADMINclaude-GRAFANA" }
|
||||
ports: [{ containerPort: 3000 }]
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: grafana
|
||||
namespace: monitoring
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
nginx.ingress.kubernetes.io/auth-type: basic
|
||||
nginx.ingress.kubernetes.io/auth-secret: basic-auth-grafana
|
||||
nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["grafana.betelgeusebytes.io"], secretName: grafana-tls }]
|
||||
rules:
|
||||
- host: grafana.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: grafana, port: { number: 80 } } }
|
||||
|
|
@ -0,0 +1,49 @@
|
|||
apiVersion: kustomize.config.k8s.io/v1beta1
|
||||
kind: Kustomization
|
||||
namespace: ingress-nginx
|
||||
|
||||
# Create the tcp-services ConfigMap from *quoted* literals
|
||||
configMapGenerator:
|
||||
- name: tcp-services
|
||||
literals:
|
||||
- "5432=db/postgres:5432"
|
||||
- "7687=graph/neo4j:7687"
|
||||
|
||||
generatorOptions:
|
||||
disableNameSuffixHash: true
|
||||
|
||||
# Inline JSON6902 patches
|
||||
patches:
|
||||
# 1) Add controller arg for tcp-services
|
||||
- target:
|
||||
group: apps
|
||||
version: v1
|
||||
kind: Deployment
|
||||
name: ingress-nginx-controller
|
||||
namespace: ingress-nginx
|
||||
patch: |-
|
||||
- op: add
|
||||
path: /spec/template/spec/containers/0/args/-
|
||||
value: --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
|
||||
|
||||
# 2) Expose Service ports 5432 and 7687 (keeps 80/443)
|
||||
- target:
|
||||
version: v1
|
||||
kind: Service
|
||||
name: ingress-nginx-controller
|
||||
namespace: ingress-nginx
|
||||
patch: |-
|
||||
- op: add
|
||||
path: /spec/ports/-
|
||||
value:
|
||||
name: tcp-5432
|
||||
port: 5432
|
||||
protocol: TCP
|
||||
targetPort: 5432
|
||||
- op: add
|
||||
path: /spec/ports/-
|
||||
value:
|
||||
name: tcp-7687
|
||||
port: 7687
|
||||
protocol: TCP
|
||||
targetPort: 7687
|
||||
|
|
@ -0,0 +1,68 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: notebook, namespace: ml }
|
||||
spec:
|
||||
selector: { app: jupyterlab }
|
||||
ports: [{ port: 80, targetPort: 8888 }]
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: jupyterlab, namespace: ml }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: jupyterlab } }
|
||||
template:
|
||||
metadata: { labels: { app: jupyterlab } }
|
||||
spec:
|
||||
securityContext:
|
||||
runAsUser: 1000
|
||||
fsGroup: 100
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: jupyter
|
||||
image: jupyter/base-notebook:latest
|
||||
args: ["start-notebook.sh", "--NotebookApp.token=$(PASSWORD)"]
|
||||
env:
|
||||
- name: PASSWORD
|
||||
valueFrom: { secretKeyRef: { name: jupyter-auth, key: PASSWORD } }
|
||||
ports: [{ containerPort: 8888 }]
|
||||
volumeMounts:
|
||||
- { name: work, mountPath: /home/jovyan/work }
|
||||
volumes:
|
||||
- name: work
|
||||
persistentVolumeClaim: { claimName: jupyter-pvc }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata: { name: jupyter-pvc, namespace: ml }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 20Gi } }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: jupyter-auth, namespace: ml }
|
||||
type: Opaque
|
||||
stringData: { PASSWORD: "notebook" }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: notebook
|
||||
namespace: ml
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
# nginx.ingress.kubernetes.io/auth-type: basic
|
||||
# nginx.ingress.kubernetes.io/auth-secret: basic-auth-notebook
|
||||
# nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["notebook.betelgeusebytes.io"], secretName: notebook-tls }]
|
||||
rules:
|
||||
- host: notebook.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: notebook, port: { number: 80 } } }
|
||||
|
|
@ -0,0 +1,65 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-kafka
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/kafka
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-zookeeper-data
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/zookeeper-data
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-zookeeper-log
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/zookeeper-log
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
|
@ -0,0 +1,44 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: kafka-ui, namespace: broker }
|
||||
spec:
|
||||
ports: [{ port: 80, targetPort: 8080 }]
|
||||
selector: { app: kafka-ui }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: kafka-ui, namespace: broker }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: kafka-ui } }
|
||||
template:
|
||||
metadata: { labels: { app: kafka-ui } }
|
||||
spec:
|
||||
containers:
|
||||
- name: ui
|
||||
image: provectuslabs/kafka-ui:latest
|
||||
env:
|
||||
- { name: KAFKA_CLUSTERS_0_NAME, value: "local" }
|
||||
- { name: KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS, value: "kafka.broker.svc.cluster.local:9092" }
|
||||
ports: [{ containerPort: 8080 }]
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: kafka-ui
|
||||
namespace: broker
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
# nginx.ingress.kubernetes.io/auth-type: basic
|
||||
# nginx.ingress.kubernetes.io/auth-secret: basic-auth-broker
|
||||
# nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["broker.betelgeusebytes.io"], secretName: broker-tls }]
|
||||
rules:
|
||||
- host: broker.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: kafka-ui, port: { number: 80 } } }
|
||||
|
|
@ -0,0 +1,45 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: kafka, namespace: broker }
|
||||
spec:
|
||||
ports: [{ name: kafka, port: 9092, targetPort: 9092 }]
|
||||
selector: { app: kafka }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata: { name: kafka, namespace: broker }
|
||||
spec:
|
||||
serviceName: kafka
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: kafka } }
|
||||
template:
|
||||
metadata: { labels: { app: kafka } }
|
||||
spec:
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: kafka
|
||||
image: apache/kafka:latest
|
||||
env:
|
||||
- { name: KAFKA_NODE_ID, value: "1" }
|
||||
- { name: KAFKA_PROCESS_ROLES, value: "broker,controller" }
|
||||
- { name: KAFKA_LISTENERS, value: "PLAINTEXT://:9092,CONTROLLER://:9093" }
|
||||
- { name: KAFKA_ADVERTISED_LISTENERS, value: "PLAINTEXT://kafka.broker.svc.cluster.local:9092" }
|
||||
- { name: KAFKA_CONTROLLER_LISTENER_NAMES, value: "CONTROLLER" }
|
||||
- { name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP, value: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT" }
|
||||
- { name: KAFKA_CONTROLLER_QUORUM_VOTERS, value: "1@localhost:9093" }
|
||||
- { name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR, value: "1" }
|
||||
- { name: KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR, value: "1" }
|
||||
- { name: KAFKA_TRANSACTION_STATE_LOG_MIN_ISR, value: "1" }
|
||||
- { name: KAFKA_LOG_DIRS, value: "/var/lib/kafka/data" }
|
||||
- { name: CLUSTER_ID, value: "MkU3OEVBNTcwNTJENDM2Qk" }
|
||||
ports:
|
||||
- { containerPort: 9092 }
|
||||
- { containerPort: 9093 }
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /var/lib/kafka/data }
|
||||
volumeClaimTemplates:
|
||||
- metadata: { name: data }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 50Gi } }
|
||||
|
|
@ -0,0 +1,74 @@
|
|||
# k8s/ai/label-studio/secret-pg.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: labelstudio-pg, namespace: ml }
|
||||
type: Opaque
|
||||
stringData: { POSTGRES_PASSWORD: "admin" }
|
||||
|
||||
---
|
||||
# k8s/ai/label-studio/secret-minio.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: minio-label, namespace: ml }
|
||||
type: Opaque
|
||||
stringData:
|
||||
accesskey: "minioadmin"
|
||||
secretkey: "minioadmin"
|
||||
|
||||
---
|
||||
# k8s/ai/label-studio/deploy.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: label-studio, namespace: ml }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: label-studio } }
|
||||
template:
|
||||
metadata: { labels: { app: label-studio } }
|
||||
spec:
|
||||
containers:
|
||||
- name: app
|
||||
image: heartexlabs/label-studio:latest
|
||||
env:
|
||||
- { name: POSTGRE_NAME, value: "labelstudio" }
|
||||
- { name: POSTGRE_USER, value: "admin" }
|
||||
- name: POSTGRE_PASSWORD
|
||||
valueFrom: { secretKeyRef: { name: labelstudio-pg, key: POSTGRES_PASSWORD } }
|
||||
- { name: POSTGRE_HOST, value: "postgres.db.svc.cluster.local" }
|
||||
- { name: POSTGRE_PORT, value: "5432" }
|
||||
- { name: S3_ENDPOINT, value: "https://minio.betelgeusebytes.io" }
|
||||
- name: AWS_ACCESS_KEY_ID
|
||||
valueFrom: { secretKeyRef: { name: minio-label, key: accesskey } }
|
||||
- name: AWS_SECRET_ACCESS_KEY
|
||||
valueFrom: { secretKeyRef: { name: minio-label, key: secretkey } }
|
||||
- name: ALLOWED_HOSTS
|
||||
value: "label.betelgeusebytes.io"
|
||||
- name: CSRF_TRUSTED_ORIGINS
|
||||
value: "https://label.betelgeusebytes.io"
|
||||
- name: CSRF_COOKIE_SECURE
|
||||
value: "1"
|
||||
- name: SESSION_COOKIE_SECURE
|
||||
value: "1"
|
||||
ports: [{ containerPort: 8080 }]
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: label-studio, namespace: ml }
|
||||
spec: { selector: { app: label-studio }, ports: [ { port: 80, targetPort: 8080 } ] }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: label-studio
|
||||
namespace: ml
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["label.betelgeusebytes.io"], secretName: label-tls }]
|
||||
rules:
|
||||
- host: label.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: label-studio, port: { number: 80 } } }
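
# Pre-deploy sketch (assumptions: the Postgres Deployment in the db namespace is named
# "postgres" and the "admin" role may create databases) - the labelstudio database must
# exist before the app first starts:
#   kubectl -n db exec -it deploy/postgres -- psql -U admin -d postgres \
#     -c 'CREATE DATABASE labelstudio;'
# Also worth double-checking against the Label Studio docs whether DJANGO_DB=default is
# needed for the POSTGRE_* variables above to take effect.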
|
||||
|
|
@@ -0,0 +1,96 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata: { name: storage }
|
||||
---
|
||||
# k8s/storage/minio/secret.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: minio-root, namespace: storage }
|
||||
type: Opaque
|
||||
stringData:
|
||||
MINIO_ROOT_USER: "minioadmin"
|
||||
MINIO_ROOT_PASSWORD: "minioadmin"
|
||||
|
||||
---
|
||||
# k8s/storage/minio/pvc.yaml
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata: { name: minio-data, namespace: storage }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 20Gi } }
|
||||
|
||||
---
|
||||
# k8s/storage/minio/deploy.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: minio, namespace: storage }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: minio } }
|
||||
template:
|
||||
metadata: { labels: { app: minio } }
|
||||
spec:
|
||||
containers:
|
||||
- name: minio
|
||||
image: minio/minio:latest
|
||||
args: ["server","/data","--console-address",":9001"]
|
||||
envFrom: [{ secretRef: { name: minio-root } }]
|
||||
ports:
|
||||
- { containerPort: 9000 } # S3
|
||||
- { containerPort: 9001 } # Console
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /data }
|
||||
volumes:
|
||||
- name: data
|
||||
persistentVolumeClaim: { claimName: minio-data }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: minio, namespace: storage }
|
||||
spec:
|
||||
selector: { app: minio }
|
||||
ports:
|
||||
- { name: s3, port: 9000, targetPort: 9000 }
|
||||
- { name: console, port: 9001, targetPort: 9001 }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: minio
|
||||
namespace: storage
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["minio.betelgeusebytes.io"], secretName: minio-tls }]
|
||||
rules:
|
||||
- host: minio.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: minio, port: { number: 9001 } } }
|
||||
---
|
||||
# PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-minio
|
||||
spec:
|
||||
capacity:
|
||||
storage: 20Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/minio
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
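
# Post-deploy sketch: create the buckets the other services expect (MLflow writes
# artifacts to s3://mlflow). Note the Ingress above routes "/" to the console on 9001;
# S3 clients need the API on 9000, so this goes through a port-forward. Assumes the
# MinIO client "mc" is installed on the workstation.
#   kubectl -n storage port-forward svc/minio 9000:9000 &
#   mc alias set bb http://localhost:9000 minioadmin minioadmin
#   mc mb bb/mlflow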
|
||||
|
|
@@ -0,0 +1,64 @@
|
|||
# k8s/mlops/mlflow/secret-pg.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: mlflow-pg, namespace: ml }
|
||||
type: Opaque
|
||||
stringData: { POSTGRES_PASSWORD: "pa$$word" }
|
||||
|
||||
---
|
||||
# k8s/mlops/mlflow/secret-minio.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: mlflow-minio, namespace: ml }
|
||||
type: Opaque
|
||||
stringData:
|
||||
accesskey: "minioadmin"
|
||||
secretkey: "minioadmin"
|
||||
|
||||
---
|
||||
# k8s/mlops/mlflow/deploy.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: mlflow, namespace: ml }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: mlflow } }
|
||||
template:
|
||||
metadata: { labels: { app: mlflow } }
|
||||
spec:
|
||||
containers:
|
||||
- name: mlflow
|
||||
# image: ghcr.io/mlflow/mlflow:v3.6.0
|
||||
image: axxs/mlflow-pg
|
||||
env:
|
||||
- { name: MLFLOW_BACKEND_STORE_URI,
|
||||
value: "postgresql://admin:admin@postgres.db.svc.cluster.local:5432/mlflow" }
|
||||
- { name: POSTGRES_PASSWORD, valueFrom: { secretKeyRef: { name: mlflow-pg, key: POSTGRES_PASSWORD } } }
|
||||
- { name: MLFLOW_S3_ENDPOINT_URL, value: "https://minio.betelgeusebytes.io" }
|
||||
- { name: AWS_ACCESS_KEY_ID, valueFrom: { secretKeyRef: { name: mlflow-minio, key: accesskey } } }
|
||||
- { name: AWS_SECRET_ACCESS_KEY, valueFrom: { secretKeyRef: { name: mlflow-minio, key: secretkey } } }
|
||||
args: ["mlflow","server","--host","0.0.0.0","--port","5000","--artifacts-destination","s3://mlflow", "--allowed-hosts", "*.betelgeusebytes.io"]
|
||||
ports: [{ containerPort: 5000 }]
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: mlflow, namespace: ml }
|
||||
spec: { selector: { app: mlflow }, ports: [ { port: 80, targetPort: 5000 } ] }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: mlflow
|
||||
namespace: ml
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["mlflow.betelgeusebytes.io"], secretName: mlflow-tls }]
|
||||
rules:
|
||||
- host: mlflow.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: mlflow, port: { number: 80 } } }
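
# Quick check (a sketch): once the certificate is issued, the tracking server's REST API
# should answer through the Ingress, e.g. the experiments/search endpoint:
#   curl -s -X POST https://mlflow.betelgeusebytes.io/api/2.0/mlflow/experiments/search \
#     -H 'Content-Type: application/json' -d '{"max_results": 10}'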
|
||||
|
||||
|
|
@@ -0,0 +1,21 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-neo4j
|
||||
spec:
|
||||
capacity:
|
||||
storage: 20Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/neo4j
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
|
@@ -0,0 +1,107 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: neo4j, namespace: graph }
|
||||
spec:
|
||||
selector: { app: neo4j }
|
||||
ports:
|
||||
- { name: http, port: 7474, targetPort: 7474 }
|
||||
- { name: bolt, port: 7687, targetPort: 7687 }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata: { name: neo4j, namespace: graph }
|
||||
spec:
|
||||
serviceName: neo4j
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: neo4j } }
|
||||
template:
|
||||
metadata: { labels: { app: neo4j } }
|
||||
spec:
|
||||
enableServiceLinks: false
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: neo4j
|
||||
image: neo4j:5.20
|
||||
env:
|
||||
- name: NEO4J_AUTH
|
||||
valueFrom: { secretKeyRef: { name: neo4j-auth, key: NEO4J_AUTH } }
|
||||
- name: NEO4J_dbms_ssl_policy_bolt_enabled
|
||||
value: "true"
|
||||
- name: NEO4J_dbms_ssl_policy_bolt_base__directory
|
||||
value: "/certs/bolt"
|
||||
- name: NEO4J_dbms_ssl_policy_bolt_private__key
|
||||
value: "tls.key"
|
||||
- name: NEO4J_dbms_ssl_policy_bolt_public__certificate
|
||||
value: "tls.crt"
|
||||
- name: NEO4J_dbms_connector_bolt_tls__level
|
||||
value: "REQUIRED"
|
||||
# Advertise public hostname so the Browser uses the external FQDN for Bolt
|
||||
- name: NEO4J_dbms_connector_bolt_advertised__address
|
||||
value: "neo4j.betelgeusebytes.io:7687"
|
||||
# also set a default advertised address (recommended)
|
||||
- name: NEO4J_dbms_default__advertised__address
|
||||
value: "neo4j.betelgeusebytes.io"
|
||||
ports:
|
||||
- { containerPort: 7474 }
|
||||
- { containerPort: 7687 }
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /data }
|
||||
- { name: bolt-certs, mountPath: /certs/bolt }
|
||||
volumes:
|
||||
- name: bolt-certs
|
||||
secret:
|
||||
secretName: neo4j-tls
|
||||
items:
|
||||
- key: tls.crt
|
||||
path: tls.crt
|
||||
- key: tls.key
|
||||
path: tls.key
|
||||
volumeClaimTemplates:
|
||||
- metadata: { name: data }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 20Gi } }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: neo4j-auth, namespace: graph }
|
||||
type: Opaque
|
||||
stringData: { NEO4J_AUTH: "neo4j/NEO4J-PASS" }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: neo4j-http
|
||||
namespace: graph
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||
# nginx.ingress.kubernetes.io/auth-type: basic
|
||||
# nginx.ingress.kubernetes.io/auth-secret: basic-auth-neo4j
|
||||
# nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["neo4j.betelgeusebytes.io"], secretName: neo4j-tls }]
|
||||
rules:
|
||||
- host: neo4j.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: neo4j, port: { number: 7474 } } }
|
||||
|
||||
# create or update the tcp-services configmap
|
||||
# kubectl -n ingress-nginx create configmap tcp-services \
|
||||
# --from-literal="7687=graph/neo4j:7687" \
|
||||
# -o yaml --dry-run=client | kubectl apply -f -
|
||||
|
||||
# kubectl -n ingress-nginx patch deploy ingress-nginx-controller \
|
||||
# --type='json' -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--tcp-services-configmap=$(POD_NAMESPACE)/tcp-services"}]'
|
||||
|
||||
|
||||
# kubectl -n ingress-nginx patch deployment ingress-nginx-controller \
|
||||
# --type='json' -p='[
|
||||
# {"op":"add","path":"/spec/template/spec/containers/0/ports/-","value":{"name":"tcp-7687","containerPort":7687,"hostPort":7687,"protocol":"TCP"}}
|
||||
# ]'
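
# Once the TCP passthrough above is in place, a quick handshake check from outside the
# cluster (a sketch; assumes openssl and cypher-shell on the workstation):
# openssl s_client -connect neo4j.betelgeusebytes.io:7687 -servername neo4j.betelgeusebytes.io </dev/null
# cypher-shell -a neo4j+s://neo4j.betelgeusebytes.io:7687 -u neo4j -p '<password>' "RETURN 1;"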
|
||||
|
|
@@ -0,0 +1,7 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: observability
|
||||
labels:
|
||||
name: observability
|
||||
monitoring: "true"
|
||||
|
|
@@ -0,0 +1,95 @@
|
|||
---
|
||||
# Prometheus PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: prometheus-data-pv
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-storage
|
||||
local:
|
||||
path: /mnt/local-ssd/prometheus
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
||||
---
|
||||
# Loki PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: loki-data-pv
|
||||
spec:
|
||||
capacity:
|
||||
storage: 100Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-storage
|
||||
local:
|
||||
path: /mnt/local-ssd/loki
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
||||
---
|
||||
# Tempo PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: tempo-data-pv
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-storage
|
||||
local:
|
||||
path: /mnt/local-ssd/tempo
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
||||
---
|
||||
# Grafana PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: grafana-data-pv
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-storage
|
||||
local:
|
||||
path: /mnt/local-ssd/grafana
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
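
# These are local PersistentVolumes: the host paths must exist on hetzner-2 (with the
# ownership each container expects) before the observability pods can start. A sketch,
# assuming SSH access to the node:
#   ssh hetzner-2 'sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}'
#   ssh hetzner-2 'sudo chown 65534:65534 /mnt/local-ssd/prometheus; sudo chown 10001:10001 /mnt/local-ssd/loki; sudo chown 472:472 /mnt/local-ssd/grafana'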
|
||||
|
|
@@ -0,0 +1,55 @@
|
|||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: prometheus-data
|
||||
namespace: observability
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: local-storage
|
||||
resources:
|
||||
requests:
|
||||
storage: 50Gi
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: loki-data
|
||||
namespace: observability
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: local-storage
|
||||
resources:
|
||||
requests:
|
||||
storage: 100Gi
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: tempo-data
|
||||
namespace: observability
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: local-storage
|
||||
resources:
|
||||
requests:
|
||||
storage: 50Gi
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: grafana-data
|
||||
namespace: observability
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
storageClassName: local-storage
|
||||
resources:
|
||||
requests:
|
||||
storage: 10Gi
|
||||
|
|
@@ -0,0 +1,169 @@
|
|||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: prometheus-config
|
||||
namespace: observability
|
||||
data:
|
||||
prometheus.yml: |
|
||||
global:
|
||||
scrape_interval: 15s
|
||||
evaluation_interval: 15s
|
||||
external_labels:
|
||||
cluster: 'betelgeuse-k8s'
|
||||
environment: 'production'
|
||||
|
||||
# Alerting configuration (optional - can add alertmanager later)
|
||||
alerting:
|
||||
alertmanagers:
|
||||
- static_configs:
|
||||
- targets: []
|
||||
|
||||
# Rule files
|
||||
rule_files:
|
||||
- /etc/prometheus/rules/*.yml
|
||||
|
||||
scrape_configs:
|
||||
# Scrape Prometheus itself
|
||||
- job_name: 'prometheus'
|
||||
static_configs:
|
||||
- targets: ['localhost:9090']
|
||||
|
||||
# Kubernetes API server
|
||||
- job_name: 'kubernetes-apiservers'
|
||||
kubernetes_sd_configs:
|
||||
- role: endpoints
|
||||
scheme: https
|
||||
tls_config:
|
||||
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
||||
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
|
||||
relabel_configs:
|
||||
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
|
||||
action: keep
|
||||
regex: default;kubernetes;https
|
||||
|
||||
# Kubernetes nodes
|
||||
- job_name: 'kubernetes-nodes'
|
||||
kubernetes_sd_configs:
|
||||
- role: node
|
||||
scheme: https
|
||||
tls_config:
|
||||
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
||||
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
|
||||
relabel_configs:
|
||||
- action: labelmap
|
||||
regex: __meta_kubernetes_node_label_(.+)
|
||||
- target_label: __address__
|
||||
replacement: kubernetes.default.svc:443
|
||||
- source_labels: [__meta_kubernetes_node_name]
|
||||
regex: (.+)
|
||||
target_label: __metrics_path__
|
||||
replacement: /api/v1/nodes/${1}/proxy/metrics
|
||||
|
||||
# Kubernetes nodes cadvisor
|
||||
- job_name: 'kubernetes-cadvisor'
|
||||
kubernetes_sd_configs:
|
||||
- role: node
|
||||
scheme: https
|
||||
tls_config:
|
||||
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
||||
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
|
||||
relabel_configs:
|
||||
- action: labelmap
|
||||
regex: __meta_kubernetes_node_label_(.+)
|
||||
- target_label: __address__
|
||||
replacement: kubernetes.default.svc:443
|
||||
- source_labels: [__meta_kubernetes_node_name]
|
||||
regex: (.+)
|
||||
target_label: __metrics_path__
|
||||
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
|
||||
|
||||
# Kubernetes service endpoints
|
||||
- job_name: 'kubernetes-service-endpoints'
|
||||
kubernetes_sd_configs:
|
||||
- role: endpoints
|
||||
relabel_configs:
|
||||
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
|
||||
action: keep
|
||||
regex: true
|
||||
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
|
||||
action: replace
|
||||
target_label: __scheme__
|
||||
regex: (https?)
|
||||
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
|
||||
action: replace
|
||||
target_label: __metrics_path__
|
||||
regex: (.+)
|
||||
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
|
||||
action: replace
|
||||
target_label: __address__
|
||||
regex: ([^:]+)(?::\d+)?;(\d+)
|
||||
replacement: $1:$2
|
||||
- action: labelmap
|
||||
regex: __meta_kubernetes_service_label_(.+)
|
||||
- source_labels: [__meta_kubernetes_namespace]
|
||||
action: replace
|
||||
target_label: kubernetes_namespace
|
||||
- source_labels: [__meta_kubernetes_service_name]
|
||||
action: replace
|
||||
target_label: kubernetes_name
|
||||
- source_labels: [__meta_kubernetes_pod_name]
|
||||
action: replace
|
||||
target_label: kubernetes_pod_name
|
||||
|
||||
# Kubernetes pods
|
||||
- job_name: 'kubernetes-pods'
|
||||
kubernetes_sd_configs:
|
||||
- role: pod
|
||||
relabel_configs:
|
||||
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
|
||||
action: keep
|
||||
regex: true
|
||||
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
|
||||
action: replace
|
||||
target_label: __metrics_path__
|
||||
regex: (.+)
|
||||
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
|
||||
action: replace
|
||||
regex: ([^:]+)(?::\d+)?;(\d+)
|
||||
replacement: $1:$2
|
||||
target_label: __address__
|
||||
- action: labelmap
|
||||
regex: __meta_kubernetes_pod_label_(.+)
|
||||
- source_labels: [__meta_kubernetes_namespace]
|
||||
action: replace
|
||||
target_label: kubernetes_namespace
|
||||
- source_labels: [__meta_kubernetes_pod_name]
|
||||
action: replace
|
||||
target_label: kubernetes_pod_name
|
||||
|
||||
# kube-state-metrics
|
||||
- job_name: 'kube-state-metrics'
|
||||
static_configs:
|
||||
- targets: ['kube-state-metrics.observability.svc.cluster.local:8080']
|
||||
|
||||
# node-exporter
|
||||
- job_name: 'node-exporter'
|
||||
kubernetes_sd_configs:
|
||||
- role: pod
|
||||
relabel_configs:
|
||||
- source_labels: [__meta_kubernetes_pod_label_app]
|
||||
action: keep
|
||||
regex: node-exporter
|
||||
- source_labels: [__meta_kubernetes_pod_node_name]
|
||||
action: replace
|
||||
target_label: instance
|
||||
|
||||
# Grafana Loki
|
||||
- job_name: 'loki'
|
||||
static_configs:
|
||||
- targets: ['loki.observability.svc.cluster.local:3100']
|
||||
|
||||
# Grafana Tempo
|
||||
- job_name: 'tempo'
|
||||
static_configs:
|
||||
- targets: ['tempo.observability.svc.cluster.local:3200']
|
||||
|
||||
# Grafana
|
||||
- job_name: 'grafana'
|
||||
static_configs:
|
||||
- targets: ['grafana.observability.svc.cluster.local:3000']
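
# Usage note (a sketch, outside the ConfigMap itself): the Prometheus StatefulSet starts
# the server with --web.enable-lifecycle, so after editing this ConfigMap the config can
# be re-read without a restart (allow ~1 minute for the kubelet to sync the mounted file):
#   kubectl apply -f 03-prometheus-config.yaml
#   kubectl run -n observability reload --rm -it --image=curlimages/curl --restart=Never -- \
#     curl -X POST http://prometheus.observability.svc.cluster.local:9090/-/reload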
|
||||
|
|
@@ -0,0 +1,94 @@
|
|||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: loki-config
|
||||
namespace: observability
|
||||
data:
|
||||
loki.yaml: |
|
||||
auth_enabled: false
|
||||
|
||||
server:
|
||||
http_listen_port: 3100
|
||||
grpc_listen_port: 9096
|
||||
log_level: info
|
||||
|
||||
common:
|
||||
path_prefix: /loki
|
||||
storage:
|
||||
filesystem:
|
||||
chunks_directory: /loki/chunks
|
||||
rules_directory: /loki/rules
|
||||
replication_factor: 1
|
||||
ring:
|
||||
kvstore:
|
||||
store: inmemory
|
||||
|
||||
schema_config:
|
||||
configs:
|
||||
- from: 2024-01-01
|
||||
store: tsdb
|
||||
object_store: filesystem
|
||||
schema: v13
|
||||
index:
|
||||
prefix: index_
|
||||
period: 24h
|
||||
|
||||
storage_config:
|
||||
tsdb_shipper:
|
||||
active_index_directory: /loki/tsdb-index
|
||||
cache_location: /loki/tsdb-cache
|
||||
filesystem:
|
||||
directory: /loki/chunks
|
||||
|
||||
compactor:
|
||||
working_directory: /loki/compactor
|
||||
compaction_interval: 10m
|
||||
retention_enabled: true
|
||||
retention_delete_delay: 2h
|
||||
retention_delete_worker_count: 150
|
||||
|
||||
limits_config:
|
||||
enforce_metric_name: false
|
||||
reject_old_samples: true
|
||||
reject_old_samples_max_age: 168h # 7 days
|
||||
retention_period: 168h # 7 days
|
||||
max_query_length: 721h # 30 days for queries
|
||||
max_query_parallelism: 32
|
||||
max_streams_per_user: 0
|
||||
max_global_streams_per_user: 0
|
||||
ingestion_rate_mb: 50
|
||||
ingestion_burst_size_mb: 100
|
||||
per_stream_rate_limit: 10MB
|
||||
per_stream_rate_limit_burst: 20MB
|
||||
split_queries_by_interval: 15m
|
||||
|
||||
query_range:
|
||||
align_queries_with_step: true
|
||||
cache_results: true
|
||||
results_cache:
|
||||
cache:
|
||||
embedded_cache:
|
||||
enabled: true
|
||||
max_size_mb: 500
|
||||
|
||||
frontend:
|
||||
log_queries_longer_than: 5s
|
||||
compress_responses: true
|
||||
|
||||
query_scheduler:
|
||||
max_outstanding_requests_per_tenant: 2048
|
||||
|
||||
ingester:
|
||||
chunk_idle_period: 30m
|
||||
chunk_block_size: 262144
|
||||
chunk_encoding: snappy
|
||||
chunk_retain_period: 1m
|
||||
max_chunk_age: 2h
|
||||
wal:
|
||||
enabled: true
|
||||
dir: /loki/wal
|
||||
flush_on_shutdown: true
|
||||
replay_memory_ceiling: 1GB
|
||||
|
||||
analytics:
|
||||
reporting_enabled: false
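
# Quick end-to-end check (a sketch): push one line through Loki's HTTP API, then look for
# {job="smoke"} in Grafana Explore. Timestamps are nanoseconds since the epoch.
#   NOW=$(date +%s%N)
#   kubectl run -n observability smoke --rm -it --image=curlimages/curl --restart=Never -- \
#     curl -s -X POST http://loki.observability.svc.cluster.local:3100/loki/api/v1/push \
#     -H 'Content-Type: application/json' \
#     --data "{\"streams\":[{\"stream\":{\"job\":\"smoke\"},\"values\":[[\"$NOW\",\"hello loki\"]]}]}"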
|
||||
|
|
@@ -0,0 +1,72 @@
|
|||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: tempo-config
|
||||
namespace: observability
|
||||
data:
|
||||
tempo.yaml: |
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
log_level: info
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
jaeger:
|
||||
protocols:
|
||||
thrift_http:
|
||||
endpoint: 0.0.0.0:14268
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:14250
|
||||
zipkin:
|
||||
endpoint: 0.0.0.0:9411
|
||||
otlp:
|
||||
protocols:
|
||||
http:
|
||||
endpoint: 0.0.0.0:4318
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
|
||||
ingester:
|
||||
max_block_duration: 5m
|
||||
|
||||
compactor:
|
||||
compaction:
|
||||
block_retention: 168h # 7 days
|
||||
|
||||
metrics_generator:
|
||||
registry:
|
||||
external_labels:
|
||||
source: tempo
|
||||
cluster: betelgeuse-k8s
|
||||
storage:
|
||||
path: /tmp/tempo/generator/wal
|
||||
remote_write:
|
||||
- url: http://prometheus.observability.svc.cluster.local:9090/api/v1/write
|
||||
send_exemplars: true
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: local
|
||||
wal:
|
||||
path: /tmp/tempo/wal
|
||||
local:
|
||||
path: /tmp/tempo/blocks
|
||||
pool:
|
||||
max_workers: 100
|
||||
queue_depth: 10000
|
||||
|
||||
querier:
|
||||
frontend_worker:
|
||||
frontend_address: tempo.observability.svc.cluster.local:9095
|
||||
|
||||
query_frontend:
|
||||
search:
|
||||
duration_slo: 5s
|
||||
throughput_bytes_slo: 1.073741824e+09
|
||||
trace_by_id:
|
||||
duration_slo: 5s
|
||||
|
||||
overrides:
|
||||
defaults:
|
||||
metrics_generator:
|
||||
processors: [service-graphs, span-metrics]
|
||||
|
|
@@ -0,0 +1,159 @@
|
|||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: alloy-config
|
||||
namespace: observability
|
||||
data:
|
||||
config.alloy: |
|
||||
// Logging configuration
|
||||
logging {
|
||||
level = "info"
|
||||
format = "logfmt"
|
||||
}
|
||||
|
||||
// Discover Kubernetes pods for log collection
|
||||
discovery.kubernetes "pods" {
|
||||
role = "pod"
|
||||
}
|
||||
|
||||
// Discover Kubernetes nodes
|
||||
discovery.kubernetes "nodes" {
|
||||
role = "node"
|
||||
}
|
||||
|
||||
// Relabel pods for log collection
|
||||
discovery.relabel "pod_logs" {
|
||||
targets = discovery.kubernetes.pods.targets
|
||||
|
||||
// Only scrape pods with logs
|
||||
rule {
|
||||
source_labels = ["__meta_kubernetes_pod_container_name"]
|
||||
action = "keep"
|
||||
regex = ".+"
|
||||
}
|
||||
|
||||
// Set the log path
|
||||
rule {
|
||||
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
|
||||
target_label = "__path__"
|
||||
separator = "/"
|
||||
replacement = "/var/log/pods/*$1/*.log"
|
||||
}
|
||||
|
||||
// Set namespace label
|
||||
rule {
|
||||
source_labels = ["__meta_kubernetes_namespace"]
|
||||
target_label = "namespace"
|
||||
}
|
||||
|
||||
// Set pod name label
|
||||
rule {
|
||||
source_labels = ["__meta_kubernetes_pod_name"]
|
||||
target_label = "pod"
|
||||
}
|
||||
|
||||
// Set container name label
|
||||
rule {
|
||||
source_labels = ["__meta_kubernetes_pod_container_name"]
|
||||
target_label = "container"
|
||||
}
|
||||
|
||||
// Set node name label
|
||||
rule {
|
||||
source_labels = ["__meta_kubernetes_pod_node_name"]
|
||||
target_label = "node"
|
||||
}
|
||||
|
||||
// Copy all pod labels
|
||||
rule {
|
||||
action = "labelmap"
|
||||
regex = "__meta_kubernetes_pod_label_(.+)"
|
||||
}
|
||||
}
|
||||
|
||||
// Read logs from discovered pods
|
||||
loki.source.kubernetes "pod_logs" {
|
||||
targets = discovery.relabel.pod_logs.output
|
||||
forward_to = [loki.process.pod_logs.receiver]
|
||||
}
|
||||
|
||||
// Process and enrich logs
|
||||
loki.process "pod_logs" {
|
||||
forward_to = [loki.write.local.receiver]
|
||||
|
||||
// Parse JSON logs
|
||||
stage.json {
|
||||
expressions = {
|
||||
level = "level",
|
||||
message = "message",
|
||||
timestamp = "timestamp",
|
||||
}
|
||||
}
|
||||
|
||||
// Extract log level
|
||||
stage.labels {
|
||||
values = {
|
||||
level = "",
|
||||
}
|
||||
}
|
||||
|
||||
// Add cluster label
|
||||
stage.static_labels {
|
||||
values = {
|
||||
cluster = "betelgeuse-k8s",
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Write logs to Loki
|
||||
loki.write "local" {
|
||||
endpoint {
|
||||
url = "http://loki.observability.svc.cluster.local:3100/loki/api/v1/push"
|
||||
}
|
||||
}
|
||||
|
||||
// OpenTelemetry receiver for traces
|
||||
otelcol.receiver.otlp "default" {
|
||||
grpc {
|
||||
endpoint = "0.0.0.0:4317"
|
||||
}
|
||||
|
||||
http {
|
||||
endpoint = "0.0.0.0:4318"
|
||||
}
|
||||
|
||||
output {
|
||||
traces = [otelcol.exporter.otlp.tempo.input]
|
||||
metrics = [otelcol.exporter.prometheus.metrics.input]
|
||||
}
|
||||
}
|
||||
|
||||
// Export traces to Tempo
|
||||
otelcol.exporter.otlp "tempo" {
|
||||
client {
|
||||
endpoint = "tempo.observability.svc.cluster.local:4317"
|
||||
tls {
|
||||
insecure = true
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Export OTLP metrics to Prometheus
|
||||
otelcol.exporter.prometheus "metrics" {
|
||||
forward_to = [prometheus.remote_write.local.receiver]
|
||||
}
|
||||
|
||||
// Remote write to Prometheus
|
||||
prometheus.remote_write "local" {
|
||||
endpoint {
|
||||
url = "http://prometheus.observability.svc.cluster.local:9090/api/v1/write"
|
||||
}
|
||||
}
|
||||
|
||||
// Scrape local metrics (Alloy's own metrics)
|
||||
prometheus.scrape "alloy" {
|
||||
targets = [{
|
||||
__address__ = "localhost:12345",
|
||||
}]
|
||||
forward_to = [prometheus.remote_write.local.receiver]
|
||||
}
|
||||
|
|
@@ -0,0 +1,62 @@
|
|||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: grafana-datasources
|
||||
namespace: observability
|
||||
data:
|
||||
datasources.yaml: |
|
||||
apiVersion: 1
|
||||
datasources:
|
||||
# Prometheus
|
||||
- name: Prometheus
|
||||
type: prometheus
|
||||
access: proxy
|
||||
url: http://prometheus.observability.svc.cluster.local:9090
|
||||
isDefault: true
|
||||
editable: true
|
||||
jsonData:
|
||||
timeInterval: 15s
|
||||
queryTimeout: 60s
|
||||
httpMethod: POST
|
||||
|
||||
# Loki
|
||||
- name: Loki
|
||||
type: loki
|
||||
access: proxy
|
||||
url: http://loki.observability.svc.cluster.local:3100
|
||||
editable: true
|
||||
jsonData:
|
||||
maxLines: 1000
|
||||
derivedFields:
|
||||
- datasourceUid: tempo
|
||||
matcherRegex: "traceID=(\\w+)"
|
||||
name: TraceID
|
||||
url: "$${__value.raw}"
|
||||
|
||||
# Tempo
|
||||
- name: Tempo
|
||||
type: tempo
|
||||
access: proxy
|
||||
url: http://tempo.observability.svc.cluster.local:3200
|
||||
editable: true
|
||||
uid: tempo
|
||||
jsonData:
|
||||
tracesToLogsV2:
|
||||
datasourceUid: loki
|
||||
spanStartTimeShift: -1h
|
||||
spanEndTimeShift: 1h
|
||||
filterByTraceID: true
|
||||
filterBySpanID: false
|
||||
customQuery: false
|
||||
tracesToMetrics:
|
||||
datasourceUid: prometheus
|
||||
spanStartTimeShift: -1h
|
||||
spanEndTimeShift: 1h
|
||||
serviceMap:
|
||||
datasourceUid: prometheus
|
||||
nodeGraph:
|
||||
enabled: true
|
||||
search:
|
||||
hide: false
|
||||
lokiSearch:
|
||||
datasourceUid: loki
|
||||
|
|
@@ -0,0 +1,178 @@
|
|||
---
|
||||
# Prometheus ServiceAccount
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: prometheus
|
||||
namespace: observability
|
||||
|
||||
---
|
||||
# Prometheus ClusterRole
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRole
|
||||
metadata:
|
||||
name: prometheus
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources:
|
||||
- nodes
|
||||
- nodes/proxy
|
||||
- services
|
||||
- endpoints
|
||||
- pods
|
||||
verbs: ["get", "list", "watch"]
|
||||
- apiGroups:
|
||||
- extensions
|
||||
resources:
|
||||
- ingresses
|
||||
verbs: ["get", "list", "watch"]
|
||||
- nonResourceURLs: ["/metrics"]
|
||||
verbs: ["get"]
|
||||
|
||||
---
|
||||
# Prometheus ClusterRoleBinding
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRoleBinding
|
||||
metadata:
|
||||
name: prometheus
|
||||
roleRef:
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
kind: ClusterRole
|
||||
name: prometheus
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: prometheus
|
||||
namespace: observability
|
||||
|
||||
---
|
||||
# Alloy ServiceAccount
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: alloy
|
||||
namespace: observability
|
||||
|
||||
---
|
||||
# Alloy ClusterRole
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRole
|
||||
metadata:
|
||||
name: alloy
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources:
|
||||
- nodes
|
||||
- nodes/proxy
|
||||
- services
|
||||
- endpoints
|
||||
- pods
|
||||
verbs: ["get", "list", "watch"]
|
||||
- apiGroups:
|
||||
- extensions
|
||||
resources:
|
||||
- ingresses
|
||||
verbs: ["get", "list", "watch"]
|
||||
|
||||
---
|
||||
# Alloy ClusterRoleBinding
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRoleBinding
|
||||
metadata:
|
||||
name: alloy
|
||||
roleRef:
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
kind: ClusterRole
|
||||
name: alloy
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: alloy
|
||||
namespace: observability
|
||||
|
||||
---
|
||||
# kube-state-metrics ServiceAccount
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: kube-state-metrics
|
||||
namespace: observability
|
||||
|
||||
---
|
||||
# kube-state-metrics ClusterRole
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRole
|
||||
metadata:
|
||||
name: kube-state-metrics
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources:
|
||||
- configmaps
|
||||
- secrets
|
||||
- nodes
|
||||
- pods
|
||||
- services
|
||||
- resourcequotas
|
||||
- replicationcontrollers
|
||||
- limitranges
|
||||
- persistentvolumeclaims
|
||||
- persistentvolumes
|
||||
- namespaces
|
||||
- endpoints
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["apps"]
|
||||
resources:
|
||||
- statefulsets
|
||||
- daemonsets
|
||||
- deployments
|
||||
- replicasets
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["batch"]
|
||||
resources:
|
||||
- cronjobs
|
||||
- jobs
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["autoscaling"]
|
||||
resources:
|
||||
- horizontalpodautoscalers
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["policy"]
|
||||
resources:
|
||||
- poddisruptionbudgets
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["certificates.k8s.io"]
|
||||
resources:
|
||||
- certificatesigningrequests
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["storage.k8s.io"]
|
||||
resources:
|
||||
- storageclasses
|
||||
- volumeattachments
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["admissionregistration.k8s.io"]
|
||||
resources:
|
||||
- mutatingwebhookconfigurations
|
||||
- validatingwebhookconfigurations
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["networking.k8s.io"]
|
||||
resources:
|
||||
- networkpolicies
|
||||
- ingresses
|
||||
verbs: ["list", "watch"]
|
||||
- apiGroups: ["coordination.k8s.io"]
|
||||
resources:
|
||||
- leases
|
||||
verbs: ["list", "watch"]
|
||||
|
||||
---
|
||||
# kube-state-metrics ClusterRoleBinding
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRoleBinding
|
||||
metadata:
|
||||
name: kube-state-metrics
|
||||
roleRef:
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
kind: ClusterRole
|
||||
name: kube-state-metrics
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: kube-state-metrics
|
||||
namespace: observability
|
||||
|
|
@@ -0,0 +1,90 @@
|
|||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: prometheus
|
||||
namespace: observability
|
||||
labels:
|
||||
app: prometheus
|
||||
spec:
|
||||
serviceName: prometheus
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: prometheus
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: prometheus
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "9090"
|
||||
spec:
|
||||
serviceAccountName: prometheus
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
containers:
|
||||
- name: prometheus
|
||||
image: prom/prometheus:v2.54.1
|
||||
args:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.path=/prometheus'
|
||||
- '--storage.tsdb.retention.time=7d'
|
||||
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
|
||||
- '--web.console.templates=/usr/share/prometheus/consoles'
|
||||
- '--web.enable-lifecycle'
|
||||
- '--web.enable-admin-api'
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 9090
|
||||
protocol: TCP
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /-/healthy
|
||||
port: http
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /-/ready
|
||||
port: http
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
resources:
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 2Gi
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 4Gi
|
||||
volumeMounts:
|
||||
- name: prometheus-config
|
||||
mountPath: /etc/prometheus
|
||||
- name: prometheus-data
|
||||
mountPath: /prometheus
|
||||
volumes:
|
||||
- name: prometheus-config
|
||||
configMap:
|
||||
name: prometheus-config
|
||||
- name: prometheus-data
|
||||
persistentVolumeClaim:
|
||||
claimName: prometheus-data
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: prometheus
|
||||
namespace: observability
|
||||
labels:
|
||||
app: prometheus
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 9090
|
||||
targetPort: http
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
app: prometheus
|
||||
|
|
@@ -0,0 +1,96 @@
|
|||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: loki
|
||||
namespace: observability
|
||||
labels:
|
||||
app: loki
|
||||
spec:
|
||||
serviceName: loki
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: loki
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: loki
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "3100"
|
||||
spec:
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
securityContext:
|
||||
fsGroup: 10001
|
||||
runAsGroup: 10001
|
||||
runAsNonRoot: true
|
||||
runAsUser: 10001
|
||||
containers:
|
||||
- name: loki
|
||||
image: grafana/loki:3.2.1
|
||||
args:
|
||||
- '-config.file=/etc/loki/loki.yaml'
|
||||
- '-target=all'
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 3100
|
||||
protocol: TCP
|
||||
- name: grpc
|
||||
containerPort: 9096
|
||||
protocol: TCP
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /ready
|
||||
port: http
|
||||
initialDelaySeconds: 45
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /ready
|
||||
port: http
|
||||
initialDelaySeconds: 45
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
resources:
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 2Gi
|
||||
volumeMounts:
|
||||
- name: loki-config
|
||||
mountPath: /etc/loki
|
||||
- name: loki-data
|
||||
mountPath: /loki
|
||||
volumes:
|
||||
- name: loki-config
|
||||
configMap:
|
||||
name: loki-config
|
||||
- name: loki-data
|
||||
persistentVolumeClaim:
|
||||
claimName: loki-data
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: loki
|
||||
namespace: observability
|
||||
labels:
|
||||
app: loki
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 3100
|
||||
targetPort: http
|
||||
protocol: TCP
|
||||
name: http
|
||||
- port: 9096
|
||||
targetPort: grpc
|
||||
protocol: TCP
|
||||
name: grpc
|
||||
selector:
|
||||
app: loki
|
||||
|
|
@@ -0,0 +1,118 @@
|
|||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: tempo
|
||||
namespace: observability
|
||||
labels:
|
||||
app: tempo
|
||||
spec:
|
||||
serviceName: tempo
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: tempo
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: tempo
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "3200"
|
||||
spec:
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
containers:
|
||||
- name: tempo
|
||||
image: grafana/tempo:2.6.1
|
||||
args:
|
||||
- '-config.file=/etc/tempo/tempo.yaml'
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 3200
|
||||
protocol: TCP
|
||||
- name: otlp-grpc
|
||||
containerPort: 4317
|
||||
protocol: TCP
|
||||
- name: otlp-http
|
||||
containerPort: 4318
|
||||
protocol: TCP
|
||||
- name: jaeger-grpc
|
||||
containerPort: 14250
|
||||
protocol: TCP
|
||||
- name: jaeger-http
|
||||
containerPort: 14268
|
||||
protocol: TCP
|
||||
- name: zipkin
|
||||
containerPort: 9411
|
||||
protocol: TCP
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /ready
|
||||
port: http
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /ready
|
||||
port: http
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
resources:
|
||||
requests:
|
||||
cpu: 500m
|
||||
memory: 1Gi
|
||||
limits:
|
||||
cpu: 2000m
|
||||
memory: 2Gi
|
||||
volumeMounts:
|
||||
- name: tempo-config
|
||||
mountPath: /etc/tempo
|
||||
- name: tempo-data
|
||||
mountPath: /tmp/tempo
|
||||
volumes:
|
||||
- name: tempo-config
|
||||
configMap:
|
||||
name: tempo-config
|
||||
- name: tempo-data
|
||||
persistentVolumeClaim:
|
||||
claimName: tempo-data
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: tempo
|
||||
namespace: observability
|
||||
labels:
|
||||
app: tempo
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 3200
|
||||
targetPort: http
|
||||
protocol: TCP
|
||||
name: http
|
||||
- port: 4317
|
||||
targetPort: otlp-grpc
|
||||
protocol: TCP
|
||||
name: otlp-grpc
|
||||
- port: 4318
|
||||
targetPort: otlp-http
|
||||
protocol: TCP
|
||||
name: otlp-http
|
||||
- port: 14250
|
||||
targetPort: jaeger-grpc
|
||||
protocol: TCP
|
||||
name: jaeger-grpc
|
||||
- port: 14268
|
||||
targetPort: jaeger-http
|
||||
protocol: TCP
|
||||
name: jaeger-http
|
||||
- port: 9411
|
||||
targetPort: zipkin
|
||||
protocol: TCP
|
||||
name: zipkin
|
||||
selector:
|
||||
app: tempo
|
||||
|
|
@@ -0,0 +1,97 @@
|
|||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: grafana
|
||||
namespace: observability
|
||||
labels:
|
||||
app: grafana
|
||||
spec:
|
||||
serviceName: grafana
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: grafana
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: grafana
|
||||
spec:
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
securityContext:
|
||||
fsGroup: 472
|
||||
runAsGroup: 472
|
||||
runAsUser: 472
|
||||
containers:
|
||||
- name: grafana
|
||||
image: grafana/grafana:11.4.0
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 3000
|
||||
protocol: TCP
|
||||
env:
|
||||
- name: GF_SECURITY_ADMIN_USER
|
||||
value: admin
|
||||
- name: GF_SECURITY_ADMIN_PASSWORD
|
||||
value: admin # Change this in production!
|
||||
- name: GF_INSTALL_PLUGINS
|
||||
value: ""
|
||||
- name: GF_FEATURE_TOGGLES_ENABLE
|
||||
value: "traceqlEditor,correlations"
|
||||
- name: GF_AUTH_ANONYMOUS_ENABLED
|
||||
value: "false"
|
||||
- name: GF_ANALYTICS_REPORTING_ENABLED
|
||||
value: "false"
|
||||
- name: GF_ANALYTICS_CHECK_FOR_UPDATES
|
||||
value: "false"
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /api/health
|
||||
port: http
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /api/health
|
||||
port: http
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
resources:
|
||||
requests:
|
||||
cpu: 250m
|
||||
memory: 512Mi
|
||||
limits:
|
||||
cpu: 1000m
|
||||
memory: 1Gi
|
||||
volumeMounts:
|
||||
- name: grafana-data
|
||||
mountPath: /var/lib/grafana
|
||||
- name: grafana-datasources
|
||||
mountPath: /etc/grafana/provisioning/datasources
|
||||
volumes:
|
||||
- name: grafana-data
|
||||
persistentVolumeClaim:
|
||||
claimName: grafana-data
|
||||
- name: grafana-datasources
|
||||
configMap:
|
||||
name: grafana-datasources
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: grafana
|
||||
namespace: observability
|
||||
labels:
|
||||
app: grafana
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 3000
|
||||
targetPort: http
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
app: grafana
|
||||
|
|
@@ -0,0 +1,107 @@
|
|||
apiVersion: apps/v1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: alloy
|
||||
namespace: observability
|
||||
labels:
|
||||
app: alloy
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
app: alloy
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: alloy
|
||||
spec:
|
||||
serviceAccountName: alloy
|
||||
hostNetwork: true
|
||||
hostPID: true
|
||||
dnsPolicy: ClusterFirstWithHostNet
|
||||
containers:
|
||||
- name: alloy
|
||||
image: grafana/alloy:v1.5.1
|
||||
args:
|
||||
- run
|
||||
- /etc/alloy/config.alloy
|
||||
- --storage.path=/var/lib/alloy
|
||||
- --server.http.listen-addr=0.0.0.0:12345
|
||||
ports:
|
||||
- name: http-metrics
|
||||
containerPort: 12345
|
||||
protocol: TCP
|
||||
- name: otlp-grpc
|
||||
containerPort: 4317
|
||||
protocol: TCP
|
||||
- name: otlp-http
|
||||
containerPort: 4318
|
||||
protocol: TCP
|
||||
env:
|
||||
- name: HOSTNAME
|
||||
valueFrom:
|
||||
fieldRef:
|
||||
fieldPath: spec.nodeName
|
||||
securityContext:
|
||||
privileged: true
|
||||
runAsUser: 0
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
memory: 256Mi
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
volumeMounts:
|
||||
- name: config
|
||||
mountPath: /etc/alloy
|
||||
- name: varlog
|
||||
mountPath: /var/log
|
||||
readOnly: true
|
||||
- name: varlibdockercontainers
|
||||
mountPath: /var/lib/docker/containers
|
||||
readOnly: true
|
||||
- name: etcmachineid
|
||||
mountPath: /etc/machine-id
|
||||
readOnly: true
|
||||
tolerations:
|
||||
- effect: NoSchedule
|
||||
operator: Exists
|
||||
volumes:
|
||||
- name: config
|
||||
configMap:
|
||||
name: alloy-config
|
||||
- name: varlog
|
||||
hostPath:
|
||||
path: /var/log
|
||||
- name: varlibdockercontainers
|
||||
hostPath:
|
||||
path: /var/lib/docker/containers
|
||||
- name: etcmachineid
|
||||
hostPath:
|
||||
path: /etc/machine-id
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: alloy
|
||||
namespace: observability
|
||||
labels:
|
||||
app: alloy
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 12345
|
||||
targetPort: http-metrics
|
||||
protocol: TCP
|
||||
name: http-metrics
|
||||
- port: 4317
|
||||
targetPort: otlp-grpc
|
||||
protocol: TCP
|
||||
name: otlp-grpc
|
||||
- port: 4318
|
||||
targetPort: otlp-http
|
||||
protocol: TCP
|
||||
name: otlp-http
|
||||
selector:
|
||||
app: alloy
|
||||
|
|
@@ -0,0 +1,71 @@
|
|||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: kube-state-metrics
|
||||
namespace: observability
|
||||
labels:
|
||||
app: kube-state-metrics
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: kube-state-metrics
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: kube-state-metrics
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
spec:
|
||||
serviceAccountName: kube-state-metrics
|
||||
containers:
|
||||
- name: kube-state-metrics
|
||||
image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
|
||||
ports:
|
||||
- name: http-metrics
|
||||
containerPort: 8080
|
||||
- name: telemetry
|
||||
containerPort: 8081
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /healthz
|
||||
port: 8080
|
||||
initialDelaySeconds: 5
|
||||
timeoutSeconds: 5
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /
|
||||
port: 8080
|
||||
initialDelaySeconds: 5
|
||||
timeoutSeconds: 5
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
memory: 128Mi
|
||||
limits:
|
||||
cpu: 200m
|
||||
memory: 256Mi
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: kube-state-metrics
|
||||
namespace: observability
|
||||
labels:
|
||||
app: kube-state-metrics
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- name: http-metrics
|
||||
port: 8080
|
||||
targetPort: http-metrics
|
||||
- name: telemetry
|
||||
port: 8081
|
||||
targetPort: telemetry
|
||||
selector:
|
||||
app: kube-state-metrics
|
||||
|
|
@@ -0,0 +1,85 @@
|
|||
apiVersion: apps/v1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: node-exporter
|
||||
namespace: observability
|
||||
labels:
|
||||
app: node-exporter
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
app: node-exporter
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: node-exporter
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "9100"
|
||||
spec:
|
||||
hostNetwork: true
|
||||
hostPID: true
|
||||
containers:
|
||||
- name: node-exporter
|
||||
image: prom/node-exporter:v1.8.2
|
||||
args:
|
||||
- --path.procfs=/host/proc
|
||||
- --path.sysfs=/host/sys
|
||||
- --path.rootfs=/host/root
|
||||
- --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
|
||||
ports:
|
||||
- name: metrics
|
||||
containerPort: 9100
|
||||
protocol: TCP
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
memory: 128Mi
|
||||
limits:
|
||||
cpu: 200m
|
||||
memory: 256Mi
|
||||
volumeMounts:
|
||||
- name: proc
|
||||
mountPath: /host/proc
|
||||
readOnly: true
|
||||
- name: sys
|
||||
mountPath: /host/sys
|
||||
readOnly: true
|
||||
- name: root
|
||||
mountPath: /host/root
|
||||
mountPropagation: HostToContainer
|
||||
readOnly: true
|
||||
tolerations:
|
||||
- effect: NoSchedule
|
||||
operator: Exists
|
||||
volumes:
|
||||
- name: proc
|
||||
hostPath:
|
||||
path: /proc
|
||||
- name: sys
|
||||
hostPath:
|
||||
path: /sys
|
||||
- name: root
|
||||
hostPath:
|
||||
path: /
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: node-exporter
|
||||
namespace: observability
|
||||
labels:
|
||||
app: node-exporter
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "9100"
|
||||
spec:
|
||||
type: ClusterIP
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: metrics
|
||||
port: 9100
|
||||
targetPort: metrics
|
||||
selector:
|
||||
app: node-exporter
|
||||
|
|
@@ -0,0 +1,26 @@
|
|||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: grafana-ingress
|
||||
namespace: observability
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: "letsencrypt-prod"
|
||||
nginx.ingress.kubernetes.io/ssl-redirect: "true"
|
||||
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls:
|
||||
- hosts:
|
||||
- grafana.betelgeusebytes.io
|
||||
secretName: grafana-tls
|
||||
rules:
|
||||
- host: grafana.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: grafana
|
||||
port:
|
||||
number: 3000
|
||||
|
|
@@ -0,0 +1,90 @@
|
|||
---
|
||||
# Optional: Prometheus Ingress (for direct access to Prometheus UI)
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: prometheus-ingress
|
||||
namespace: observability
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: "letsencrypt-prod"
|
||||
nginx.ingress.kubernetes.io/ssl-redirect: "true"
|
||||
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
|
||||
# Optional: Add basic auth for security
|
||||
# nginx.ingress.kubernetes.io/auth-type: basic
|
||||
# nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
|
||||
# nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required'
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls:
|
||||
- hosts:
|
||||
- prometheus.betelgeusebytes.io
|
||||
secretName: prometheus-tls
|
||||
rules:
|
||||
- host: prometheus.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: prometheus
|
||||
port:
|
||||
number: 9090
|
||||
|
||||
---
|
||||
# Optional: Loki Ingress (for direct API access)
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: loki-ingress
|
||||
namespace: observability
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: "letsencrypt-prod"
|
||||
nginx.ingress.kubernetes.io/ssl-redirect: "true"
|
||||
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls:
|
||||
- hosts:
|
||||
- loki.betelgeusebytes.io
|
||||
secretName: loki-tls
|
||||
rules:
|
||||
- host: loki.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: loki
|
||||
port:
|
||||
number: 3100
|
||||
|
||||
---
|
||||
# Optional: Tempo Ingress (for direct API access)
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: tempo-ingress
|
||||
namespace: observability
|
||||
annotations:
|
||||
cert-manager.io/cluster-issuer: "letsencrypt-prod"
|
||||
nginx.ingress.kubernetes.io/ssl-redirect: "true"
|
||||
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls:
|
||||
- hosts:
|
||||
- tempo.betelgeusebytes.io
|
||||
secretName: tempo-tls
|
||||
rules:
|
||||
- host: tempo.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend:
|
||||
service:
|
||||
name: tempo
|
||||
port:
|
||||
number: 3200
|
||||
|
|
@@ -0,0 +1,359 @@
|
|||
# Observability Stack Deployment Checklist
|
||||
|
||||
Use this checklist to ensure a smooth deployment of the observability stack.
|
||||
|
||||
## Pre-Deployment
|
||||
|
||||
### Check for Existing Monitoring Stack
|
||||
- [ ] Check if you have existing monitoring components:
|
||||
```bash
|
||||
# Check for monitoring namespaces
|
||||
kubectl get namespaces | grep -E "(monitoring|prometheus|grafana|loki|tempo)"
|
||||
|
||||
# Check for monitoring pods in common namespaces
|
||||
kubectl get pods -n monitoring 2>/dev/null || true
|
||||
kubectl get pods -n prometheus 2>/dev/null || true
|
||||
kubectl get pods -n grafana 2>/dev/null || true
|
||||
kubectl get pods -A | grep -E "(prometheus|grafana|loki|tempo|fluent-bit|vector)"
|
||||
|
||||
# Check for Helm releases
|
||||
helm list -A | grep -E "(prometheus|grafana|loki|tempo)"
|
||||
```
|
||||
|
||||
- [ ] If existing monitoring is found, remove it first:
|
||||
```bash
|
||||
./remove-old-monitoring.sh
|
||||
```
|
||||
|
||||
**OR** run the deployment script which will prompt you:
|
||||
```bash
|
||||
./deploy.sh # Will ask if you want to clean up first
|
||||
```
|
||||
|
||||
### Prerequisites
|
||||
- [ ] Kubernetes cluster is running
|
||||
- [ ] NGINX Ingress Controller is installed
|
||||
- [ ] cert-manager is installed with Let's Encrypt ClusterIssuer
|
||||
- [ ] DNS record `grafana.betelgeusebytes.io` points to cluster IP
|
||||
- [ ] Node is labeled `kubernetes.io/hostname=hetzner-2`
|
||||
- [ ] kubectl is configured and working
|
||||
|
||||
### Verify Prerequisites
|
||||
```bash
|
||||
# Check cluster
|
||||
kubectl cluster-info
|
||||
|
||||
# Check NGINX Ingress
|
||||
kubectl get pods -n ingress-nginx
|
||||
|
||||
# Check cert-manager
|
||||
kubectl get pods -n cert-manager
|
||||
|
||||
# Check node label
|
||||
kubectl get nodes --show-labels | grep hetzner-2
|
||||
|
||||
# Check DNS (from external machine)
|
||||
dig grafana.betelgeusebytes.io
|
||||
```
|
||||
|
||||
## Deployment Steps
|
||||
|
||||
### Step 1: Prepare Storage
|
||||
- [ ] SSH into hetzner-2 node
|
||||
- [ ] Create directories:
|
||||
```bash
|
||||
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
|
||||
```
|
||||
- [ ] Set correct permissions:
|
||||
```bash
|
||||
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
|
||||
sudo chown -R 10001:10001 /mnt/local-ssd/loki
|
||||
sudo chown -R root:root /mnt/local-ssd/tempo
|
||||
sudo chown -R 472:472 /mnt/local-ssd/grafana
|
||||
```
|
||||
- [ ] Verify permissions:
|
||||
```bash
|
||||
ls -la /mnt/local-ssd/
|
||||
```
|
||||
|
||||
### Step 2: Review Configuration
|
||||
- [ ] Review `03-prometheus-config.yaml` - verify scrape targets
|
||||
- [ ] Review `04-loki-config.yaml` - verify retention (7 days)
|
||||
- [ ] Review `05-tempo-config.yaml` - verify retention (7 days)
|
||||
- [ ] Review `06-alloy-config.yaml` - verify endpoints
|
||||
- [ ] Review `20-grafana-ingress.yaml` - verify domain name
|
||||
|
||||
### Step 3: Deploy the Stack
|
||||
- [ ] Navigate to observability-stack directory
|
||||
```bash
|
||||
cd /path/to/observability-stack
|
||||
```
|
||||
- [ ] Make the deployment scripts executable (skip if already done):
|
||||
```bash
|
||||
chmod +x *.sh
|
||||
```
|
||||
- [ ] Run deployment script:
|
||||
```bash
|
||||
./deploy.sh
|
||||
```
|
||||
OR deploy manually:
|
||||
```bash
|
||||
kubectl apply -f 00-namespace.yaml
|
||||
kubectl apply -f 01-persistent-volumes.yaml
|
||||
kubectl apply -f 02-persistent-volume-claims.yaml
|
||||
kubectl apply -f 03-prometheus-config.yaml
|
||||
kubectl apply -f 04-loki-config.yaml
|
||||
kubectl apply -f 05-tempo-config.yaml
|
||||
kubectl apply -f 06-alloy-config.yaml
|
||||
kubectl apply -f 07-grafana-datasources.yaml
|
||||
kubectl apply -f 08-rbac.yaml
|
||||
kubectl apply -f 10-prometheus.yaml
|
||||
kubectl apply -f 11-loki.yaml
|
||||
kubectl apply -f 12-tempo.yaml
|
||||
kubectl apply -f 13-grafana.yaml
|
||||
kubectl apply -f 14-alloy.yaml
|
||||
kubectl apply -f 15-kube-state-metrics.yaml
|
||||
kubectl apply -f 16-node-exporter.yaml
|
||||
kubectl apply -f 20-grafana-ingress.yaml
|
||||
```
|
||||
|
||||
### Step 4: Verify Deployment
|
||||
- [ ] Run status check:
|
||||
```bash
|
||||
./status.sh
|
||||
```
|
||||
- [ ] Check all PersistentVolumes are Bound:
|
||||
```bash
|
||||
kubectl get pv
|
||||
```
|
||||
- [ ] Check all PersistentVolumeClaims are Bound:
|
||||
```bash
|
||||
kubectl get pvc -n observability
|
||||
```
|
||||
- [ ] Check all pods are Running:
|
||||
```bash
|
||||
kubectl get pods -n observability
|
||||
```
|
||||
Expected pods:
|
||||
- [x] prometheus-0
|
||||
- [x] loki-0
|
||||
- [x] tempo-0
|
||||
- [x] grafana-0
|
||||
- [x] alloy-xxxxx (one per node)
|
||||
- [x] kube-state-metrics-xxxxx
|
||||
- [x] node-exporter-xxxxx (one per node)
|
||||
|
||||
- [ ] Check services are created:
|
||||
```bash
|
||||
kubectl get svc -n observability
|
||||
```
|
||||
- [ ] Check ingress is created:
|
||||
```bash
|
||||
kubectl get ingress -n observability
|
||||
```
|
||||
- [ ] Verify TLS certificate is issued:
|
||||
```bash
|
||||
kubectl get certificate -n observability
|
||||
kubectl describe certificate grafana-tls -n observability
|
||||
```
|
||||
|
||||
### Step 5: Test Connectivity
|
||||
- [ ] Test Prometheus endpoint:
|
||||
```bash
|
||||
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
|
||||
curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
|
||||
```
|
||||
- [ ] Test Loki endpoint:
|
||||
```bash
|
||||
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
|
||||
curl http://loki.observability.svc.cluster.local:3100/ready
|
||||
```
|
||||
- [ ] Test Tempo endpoint:
|
||||
```bash
|
||||
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
|
||||
curl http://tempo.observability.svc.cluster.local:3200/ready
|
||||
```
|
||||
- [ ] Test Grafana endpoint:
|
||||
```bash
|
||||
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
|
||||
curl http://grafana.observability.svc.cluster.local:3000/api/health
|
||||
```
|
||||
|
||||
## Post-Deployment Configuration
|
||||
|
||||
### Step 6: Access Grafana
|
||||
- [ ] Open browser to: https://grafana.betelgeusebytes.io
|
||||
- [ ] Login with default credentials:
|
||||
- Username: `admin`
|
||||
- Password: `admin`
|
||||
- [ ] **CRITICAL**: Change admin password immediately
|
||||
- [ ] Verify datasources are configured:
|
||||
- Go to Configuration → Data Sources
|
||||
- Should see: Prometheus (default), Loki, Tempo
|
||||
- Click "Test" on each datasource
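If you prefer to change the admin password from the terminal, Grafana's bundled CLI can do it (a sketch; the pod name `grafana-0` matches this stack, and the placeholder password is yours to choose):

```bash
kubectl exec -it grafana-0 -n observability -- \
  grafana-cli admin reset-admin-password '<new-strong-password>'
```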
|
||||
|
||||
### Step 7: Verify Data Collection
|
||||
- [ ] Check Prometheus has targets:
|
||||
- In Grafana, Explore → Prometheus
|
||||
- Query: `up`
|
||||
- Should see multiple targets with value=1
|
||||
- [ ] Check Loki is receiving logs:
|
||||
- In Grafana, Explore → Loki
|
||||
- Query: `{namespace="observability"}`
|
||||
- Should see logs from observability stack
|
||||
- [ ] Check kube-state-metrics:
|
||||
- In Grafana, Explore → Prometheus
|
||||
- Query: `kube_pod_status_phase`
|
||||
- Should see pod status metrics
|
||||
|
||||
### Step 8: Import Dashboards (Optional)
|
||||
- [ ] Import Kubernetes cluster dashboard:
|
||||
- Dashboards → Import → ID: 315
|
||||
- [ ] Import Node Exporter dashboard:
|
||||
- Dashboards → Import → ID: 1860
|
||||
- [ ] Import Loki dashboard:
|
||||
- Dashboards → Import → ID: 13639
|
||||
|
||||
### Step 9: Test with Demo App (Optional)
|
||||
- [ ] Deploy demo application:
|
||||
```bash
|
||||
kubectl apply -f demo-app.yaml
|
||||
```
|
||||
- [ ] Wait for pod to be ready:
|
||||
```bash
|
||||
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
|
||||
```
|
||||
- [ ] Test the endpoints:
|
||||
```bash
|
||||
kubectl port-forward -n observability svc/demo-app 8080:8080
|
||||
# In another terminal:
|
||||
curl http://localhost:8080/
|
||||
curl http://localhost:8080/items
|
||||
curl http://localhost:8080/slow
|
||||
curl http://localhost:8080/error
|
||||
```
|
||||
- [ ] Verify in Grafana:
|
||||
- Logs: `{app="demo-app"}`
|
||||
- Metrics: `flask_http_request_total`
|
||||
- Traces: Search for "demo-app" service in Tempo
|
||||
|
||||
## Monitoring and Maintenance
|
||||
|
||||
### Daily Checks
|
||||
- [ ] Check pod status: `kubectl get pods -n observability`
|
||||
- [ ] Check resource usage: `kubectl top pods -n observability`
|
||||
- [ ] Check disk usage on hetzner-2: `df -h /mnt/local-ssd/`
|
||||
|
||||
### Weekly Checks
|
||||
- [ ] Review Grafana for any alerts or anomalies
|
||||
- [ ] Verify TLS certificate is valid
|
||||
- [ ] Check logs for any errors:
|
||||
```bash
|
||||
kubectl logs -n observability -l app=prometheus --tail=100
|
||||
kubectl logs -n observability -l app=loki --tail=100
|
||||
kubectl logs -n observability -l app=tempo --tail=100
|
||||
kubectl logs -n observability -l app=grafana --tail=100
|
||||
```
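For the TLS validity check above, one option is to inspect both the cert-manager Certificate object and the certificate actually served (the openssl call assumes the domain is reachable from where you run it):

```bash
kubectl get certificate -n observability
echo | openssl s_client -connect grafana.betelgeusebytes.io:443 \
  -servername grafana.betelgeusebytes.io 2>/dev/null | openssl x509 -noout -dates
```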
|
||||
|
||||
### Monthly Checks
|
||||
- [ ] Review retention policies (default is 7 days; adjust if storage allows)
|
||||
- [ ] Check storage growth trends
|
||||
- [ ] Review and update dashboards
|
||||
- [ ] Backup Grafana dashboards and configs
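For the storage-growth check, the same disk commands used elsewhere in this stack's documentation apply (run them on the hetzner-2 node):

```bash
df -h /mnt/local-ssd/
du -sh /mnt/local-ssd/*
```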
|
||||
|
||||
## Troubleshooting Guide
|
||||
|
||||
### Pod Won't Start
|
||||
1. Check events: `kubectl describe pod <pod-name> -n observability`
|
||||
2. Check logs: `kubectl logs <pod-name> -n observability`
|
||||
3. Check storage: `kubectl get pv` and `kubectl get pvc -n observability`
|
||||
4. Verify node has space: SSH to hetzner-2 and run `df -h`
|
||||
|
||||
### No Logs Appearing
|
||||
1. Check Alloy pods: `kubectl get pods -n observability -l app=alloy`
|
||||
2. Check Alloy logs: `kubectl logs -n observability -l app=alloy`
|
||||
3. Check Loki is running: `kubectl get pods -n observability -l app=loki`
|
||||
4. Test the Loki endpoint from inside the cluster (see the example below)
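For step 4, the same throwaway curl pod used in Step 5 of this checklist works:

```bash
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://loki.observability.svc.cluster.local:3100/ready
```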
|
||||
|
||||
### No Metrics Appearing
|
||||
1. Check Prometheus targets: Port-forward and visit http://localhost:9090/targets
|
||||
2. Check service discovery: Look for "kubernetes-*" targets
|
||||
3. Verify RBAC: `kubectl get clusterrolebinding prometheus`
|
||||
4. Check kube-state-metrics: `kubectl get pods -n observability -l app=kube-state-metrics`
|
||||
|
||||
### Grafana Can't Connect to Datasources
|
||||
1. Test from Grafana pod:
|
||||
```bash
|
||||
kubectl exec -it grafana-0 -n observability -- wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
|
||||
```
|
||||
2. Check datasource configuration in Grafana UI
|
||||
3. Verify services exist: `kubectl get svc -n observability`
|
||||
|
||||
### High Resource Usage
|
||||
1. Check actual usage: `kubectl top pods -n observability`
|
||||
2. Check node capacity: `kubectl top nodes`
|
||||
3. Consider reducing retention periods
|
||||
4. Review and adjust resource limits
|
||||
|
||||
## Rollback Procedure
|
||||
|
||||
If something goes wrong:
|
||||
|
||||
1. Remove the deployment:
|
||||
```bash
|
||||
./cleanup.sh
|
||||
```
|
||||
|
||||
2. Fix the issue in configuration files
|
||||
|
||||
3. Redeploy:
|
||||
```bash
|
||||
./deploy.sh
|
||||
```
|
||||
|
||||
## Success Criteria
|
||||
|
||||
When every item below can be checked off, the deployment was successful:
|
||||
|
||||
- [x] All pods are in Running state
|
||||
- [x] All PVCs are Bound
|
||||
- [x] Grafana is accessible at https://grafana.betelgeusebytes.io
|
||||
- [x] All three datasources (Prometheus, Loki, Tempo) test successfully
|
||||
- [x] Prometheus shows targets as "up"
|
||||
- [x] Loki shows logs from observability namespace
|
||||
- [x] TLS certificate is valid and auto-renewing
|
||||
- [x] Admin password has been changed
|
||||
- [x] Resource usage is within acceptable limits
|
||||
|
||||
## Documentation References
|
||||
|
||||
- **README.md**: Comprehensive documentation
|
||||
- **QUICKREF.md**: Quick reference for common operations
|
||||
- **demo-app.yaml**: Example instrumented application
|
||||
- **deploy.sh**: Automated deployment script
|
||||
- **cleanup.sh**: Removal script
|
||||
- **status.sh**: Status checking script
|
||||
|
||||
## Next Steps After Deployment
|
||||
|
||||
1. Import useful dashboards from Grafana.com
|
||||
2. Configure alerts (requires Alertmanager - not included)
|
||||
3. Instrument your applications to send logs/metrics/traces
|
||||
4. Create custom dashboards for your specific needs
|
||||
5. Set up backup procedures for Grafana dashboards
|
||||
6. Document your team's observability practices
|
||||
|
||||
## Notes
|
||||
|
||||
- Default retention: 7 days for all components
|
||||
- Default resources are optimized for a single-node cluster
|
||||
- Scale up resources if monitoring high-traffic applications
|
||||
- Always backup before making configuration changes
|
||||
- Test changes in a non-production environment first
|
||||
|
||||
---
|
||||
|
||||
**Deployment Date**: _______________
|
||||
**Deployed By**: _______________
|
||||
**Grafana Version**: 11.4.0
|
||||
**Stack Version**: January 2025
|
||||
|
|
@ -0,0 +1,146 @@
|
|||
# DNS Configuration Guide
|
||||
|
||||
## Required DNS Records
|
||||
|
||||
### Minimum Setup (Recommended)
|
||||
|
||||
Only **one** DNS record is required for basic operation:
|
||||
|
||||
```
|
||||
grafana.betelgeusebytes.io A/CNAME <your-cluster-ip>
|
||||
```
|
||||
|
||||
This gives you access to the complete observability stack through Grafana's unified interface.
|
||||
|
||||
## Optional DNS Records
|
||||
|
||||
If you want direct access to individual components, add these DNS records:
|
||||
|
||||
```
|
||||
prometheus.betelgeusebytes.io A/CNAME <your-cluster-ip>
|
||||
loki.betelgeusebytes.io A/CNAME <your-cluster-ip>
|
||||
tempo.betelgeusebytes.io A/CNAME <your-cluster-ip>
|
||||
```
|
||||
|
||||
Then deploy the optional ingresses:
|
||||
```bash
|
||||
kubectl apply -f 21-optional-ingresses.yaml
|
||||
```
|
||||
|
||||
## DNS Record Types
|
||||
|
||||
**Option 1: A Record (Direct IP)**
|
||||
```
|
||||
Type: A
|
||||
Name: grafana.betelgeusebytes.io
|
||||
Value: 1.2.3.4 (your cluster's public IP)
|
||||
TTL: 300
|
||||
```
|
||||
|
||||
**Option 2: CNAME (Alias to another domain)**
|
||||
```
|
||||
Type: CNAME
|
||||
Name: grafana.betelgeusebytes.io
|
||||
Value: your-server.example.com
|
||||
TTL: 300
|
||||
```
|
||||
|
||||
## Access URLs Summary
|
||||
|
||||
### After DNS Setup
|
||||
|
||||
| Service | URL | Purpose | DNS Required? |
|
||||
|---------|-----|---------|---------------|
|
||||
| **Grafana** | https://grafana.betelgeusebytes.io | Main dashboard (logs/metrics/traces) | ✅ Yes |
|
||||
| **Prometheus** | https://prometheus.betelgeusebytes.io | Metrics UI (optional) | ⚠️ Optional |
|
||||
| **Loki** | https://loki.betelgeusebytes.io | Logs API (optional) | ⚠️ Optional |
|
||||
| **Tempo** | https://tempo.betelgeusebytes.io | Traces API (optional) | ⚠️ Optional |
|
||||
|
||||
### Internal (No DNS Needed)
|
||||
|
||||
These services are accessible from within your cluster only:
|
||||
|
||||
```
|
||||
# Metrics
|
||||
http://prometheus.observability.svc.cluster.local:9090
|
||||
|
||||
# Logs
|
||||
http://loki.observability.svc.cluster.local:3100
|
||||
|
||||
# Traces (OTLP endpoints for your apps)
|
||||
http://tempo.observability.svc.cluster.local:4317 # gRPC
|
||||
http://tempo.observability.svc.cluster.local:4318 # HTTP
|
||||
|
||||
# Grafana (internal)
|
||||
http://grafana.observability.svc.cluster.local:3000
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
After setting up DNS, verify it's working:
|
||||
|
||||
```bash
|
||||
# Check DNS resolution
|
||||
dig grafana.betelgeusebytes.io
|
||||
nslookup grafana.betelgeusebytes.io
|
||||
|
||||
# Should return your cluster IP
|
||||
|
||||
# Test HTTPS access
|
||||
curl -I https://grafana.betelgeusebytes.io
|
||||
# Should return 200 OK or 302 redirect
|
||||
```
|
||||
|
||||
## TLS Certificate
|
||||
|
||||
Let's Encrypt will automatically issue certificates for:
|
||||
- grafana.betelgeusebytes.io (required)
|
||||
- prometheus.betelgeusebytes.io (if optional ingress deployed)
|
||||
- loki.betelgeusebytes.io (if optional ingress deployed)
|
||||
- tempo.betelgeusebytes.io (if optional ingress deployed)
|
||||
|
||||
Check certificate status:
|
||||
```bash
|
||||
kubectl get certificate -n observability
|
||||
kubectl describe certificate grafana-tls -n observability
|
||||
```
|
||||
|
||||
## Recommendation
|
||||
|
||||
**For most users:** Just configure `grafana.betelgeusebytes.io`
|
||||
|
||||
Why?
|
||||
- Single DNS record to manage
|
||||
- Grafana provides unified access to all components
|
||||
- Simpler certificate management
|
||||
- All functionality available through one interface
|
||||
|
||||
**For advanced users:** Add optional DNS records if you need:
|
||||
- Direct Prometheus UI access for debugging
|
||||
- External log/trace ingestion
|
||||
- API integrations
|
||||
- Programmatic queries outside Grafana
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**DNS not resolving:**
|
||||
- Check DNS propagation: https://dnschecker.org/
|
||||
- Wait 5-15 minutes for DNS to propagate
|
||||
- Verify your DNS provider settings
|
||||
|
||||
**Certificate not issued:**
|
||||
```bash
|
||||
# Check cert-manager
|
||||
kubectl get pods -n cert-manager
|
||||
|
||||
# Check certificate request
|
||||
kubectl describe certificate grafana-tls -n observability
|
||||
|
||||
# Check challenges
|
||||
kubectl get challenges -n observability
|
||||
```
|
||||
|
||||
**403/404 errors:**
|
||||
- Verify ingress is created: `kubectl get ingress -n observability`
|
||||
- Check NGINX ingress controller: `kubectl get pods -n ingress-nginx`
|
||||
- Check ingress logs: `kubectl logs -n ingress-nginx <nginx-pod>`
|
||||
|
|
@ -0,0 +1,572 @@
|
|||
# Access URLs & Monitoring New Applications Guide
|
||||
|
||||
## 🌐 Access URLs
|
||||
|
||||
### Required (Already Configured)
|
||||
|
||||
**Grafana - Main Dashboard**
|
||||
- **URL**: https://grafana.betelgeusebytes.io
|
||||
- **DNS Required**: Yes - `grafana.betelgeusebytes.io` → your cluster IP
|
||||
- **Login**: admin / admin (change on first login!)
|
||||
- **Purpose**: Unified interface for logs, metrics, and traces
|
||||
- **Ingress**: Already included in deployment (20-grafana-ingress.yaml)
|
||||
|
||||
### Optional (Direct Component Access)
|
||||
|
||||
You can optionally expose these components directly:
|
||||
|
||||
**Prometheus - Metrics UI**
|
||||
- **URL**: https://prometheus.betelgeusebytes.io
|
||||
- **DNS Required**: Yes - `prometheus.betelgeusebytes.io` → your cluster IP
|
||||
- **Purpose**: Direct access to Prometheus UI, query metrics, check targets
|
||||
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
|
||||
- **Use Case**: Debugging metric collection, advanced PromQL queries
|
||||
|
||||
**Loki - Logs API**
|
||||
- **URL**: https://loki.betelgeusebytes.io
|
||||
- **DNS Required**: Yes - `loki.betelgeusebytes.io` → your cluster IP
|
||||
- **Purpose**: Direct API access for log queries
|
||||
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
|
||||
- **Use Case**: External log forwarding, API integration
|
||||
|
||||
**Tempo - Traces API**
|
||||
- **URL**: https://tempo.betelgeusebytes.io
|
||||
- **DNS Required**: Yes - `tempo.betelgeusebytes.io` → your cluster IP
|
||||
- **Purpose**: Direct API access for trace queries
|
||||
- **Deploy**: `kubectl apply -f 21-optional-ingresses.yaml`
|
||||
- **Use Case**: External trace ingestion, API integration
|
||||
|
||||
### Internal Only (No DNS Required)
|
||||
|
||||
These are ClusterIP services accessible only from within the cluster:
|
||||
|
||||
```
|
||||
http://prometheus.observability.svc.cluster.local:9090
|
||||
http://loki.observability.svc.cluster.local:3100
|
||||
http://tempo.observability.svc.cluster.local:3200
|
||||
http://tempo.observability.svc.cluster.local:4317 # OTLP gRPC
|
||||
http://tempo.observability.svc.cluster.local:4318 # OTLP HTTP
|
||||
```
|
||||
|
||||
## 🎯 Recommendation
|
||||
|
||||
**For most users**: Just use Grafana (grafana.betelgeusebytes.io)
|
||||
- Grafana provides unified access to all components
|
||||
- No need to expose Prometheus, Loki, or Tempo directly
|
||||
- Simpler DNS configuration (only one subdomain)
|
||||
|
||||
**For power users**: Add optional ingresses
|
||||
- Direct Prometheus access is useful for debugging
|
||||
- Helps verify targets and scrape configs
|
||||
- Deploy with: `kubectl apply -f 21-optional-ingresses.yaml`
|
||||
|
||||
## 📊 Monitoring New Applications
|
||||
|
||||
### Automatic: Kubernetes Logs
|
||||
|
||||
**All pod logs are automatically collected!** No configuration needed.
|
||||
|
||||
Alloy runs as a DaemonSet and automatically:
|
||||
1. Discovers all pods in the cluster
|
||||
2. Reads logs from `/var/log/pods/`
|
||||
3. Sends them to Loki with labels:
|
||||
- `namespace`
|
||||
- `pod`
|
||||
- `container`
|
||||
- `node`
|
||||
- All pod labels
|
||||
|
||||
**View in Grafana:**
|
||||
```logql
|
||||
# All logs from your app
|
||||
{namespace="your-namespace", pod=~"your-app.*"}
|
||||
|
||||
# Error logs only
|
||||
{namespace="your-namespace"} |= "error"
|
||||
|
||||
# JSON logs parsed
|
||||
{namespace="your-namespace"} | json | level="error"
|
||||
```
|
||||
|
||||
**Best Practice for Logs:**
|
||||
Emit structured JSON logs from your application:
|
||||
|
||||
```python
|
||||
import json
|
||||
import logging
|
||||
|
||||
# Python example
|
||||
logging.basicConfig(
|
||||
format='%(message)s',
|
||||
level=logging.INFO
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Log as JSON
|
||||
logger.info(json.dumps({
|
||||
"level": "info",
|
||||
"message": "User login successful",
|
||||
"user_id": "123",
|
||||
"ip": "1.2.3.4",
|
||||
"duration_ms": 42
|
||||
}))
|
||||
```
|
||||
|
||||
### Manual: Application Metrics
|
||||
|
||||
#### Step 1: Expose Metrics Endpoint
|
||||
|
||||
Your application needs to expose metrics at `/metrics` in Prometheus format.
|
||||
|
||||
**Python (Flask) Example:**
|
||||
```python
|
||||
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics
|
||||
|
||||
app = Flask(__name__)
|
||||
metrics = PrometheusMetrics(app)
|
||||
|
||||
# Now /metrics endpoint is available
|
||||
# Automatic metrics: request count, duration, etc.
|
||||
```
|
||||
|
||||
**Python (FastAPI) Example:**
|
||||
```python
|
||||
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
|
||||
|
||||
app = FastAPI()
|
||||
Instrumentator().instrument(app).expose(app)
|
||||
|
||||
# /metrics endpoint is now available
|
||||
```
|
||||
|
||||
**Go Example:**
|
||||
```go
|
||||
import (
|
||||
"github.com/prometheus/client_golang/prometheus/promhttp"
|
||||
"net/http"
|
||||
)
|
||||
|
||||
http.Handle("/metrics", promhttp.Handler())
|
||||
```
|
||||
|
||||
**Node.js Example:**
|
||||
```javascript
|
||||
const express = require('express');  // assumes an Express app, matching the handler below
const promClient = require('prom-client');

const app = express();
|
||||
|
||||
// Create default metrics
|
||||
const register = new promClient.Registry();
|
||||
promClient.collectDefaultMetrics({ register });
|
||||
|
||||
// Expose /metrics endpoint
|
||||
app.get('/metrics', async (req, res) => {
|
||||
res.set('Content-Type', register.contentType);
|
||||
res.end(await register.metrics());
|
||||
});
|
||||
```
|
||||
|
||||
#### Step 2: Add Prometheus Annotations to Your Deployment
|
||||
|
||||
Add these annotations to your pod template:
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: my-namespace
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
annotations:
|
||||
prometheus.io/scrape: "true" # Enable scraping
|
||||
prometheus.io/port: "8080" # Port where metrics are exposed
|
||||
prometheus.io/path: "/metrics" # Path to metrics (optional, /metrics is default)
|
||||
spec:
|
||||
containers:
|
||||
- name: my-app
|
||||
image: my-app:latest
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 8080
|
||||
```
|
||||
|
||||
#### Step 3: Verify Metrics Collection
|
||||
|
||||
**Check in Prometheus:**
|
||||
1. Access Prometheus UI (if exposed): https://prometheus.betelgeusebytes.io
|
||||
2. Go to Status → Targets
|
||||
3. Look for your pod under "kubernetes-pods"
|
||||
4. Should show as "UP"
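If the optional Prometheus ingress is not deployed, a port-forward gives the same view:

```bash
kubectl port-forward -n observability svc/prometheus 9090:9090
# then open http://localhost:9090/targets in a browser
```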
|
||||
|
||||
**Or via Grafana:**
|
||||
1. Go to Explore → Prometheus
|
||||
2. Query: `up{pod=~"my-app.*"}`
|
||||
3. Should return value=1
|
||||
|
||||
**Query your metrics:**
|
||||
```promql
|
||||
# Request rate
|
||||
rate(http_requests_total{namespace="my-namespace"}[5m])
|
||||
|
||||
# Request duration 95th percentile
|
||||
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
|
||||
|
||||
# Error rate
|
||||
rate(http_requests_total{namespace="my-namespace", status=~"5.."}[5m])
|
||||
```
|
||||
|
||||
### Manual: Application Traces
|
||||
|
||||
#### Step 1: Add OpenTelemetry to Your Application
|
||||
|
||||
**Python Example:**
|
||||
```python
|
||||
from flask import Flask
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
|
||||
from opentelemetry.instrumentation.flask import FlaskInstrumentor
|
||||
from opentelemetry.sdk.resources import Resource
|
||||
|
||||
# Configure resource
|
||||
resource = Resource.create({"service.name": "my-app"})
|
||||
|
||||
# Setup tracer
|
||||
trace_provider = TracerProvider(resource=resource)
|
||||
trace_provider.add_span_processor(
|
||||
BatchSpanProcessor(
|
||||
OTLPSpanExporter(
|
||||
endpoint="http://tempo.observability.svc.cluster.local:4317",
|
||||
insecure=True
|
||||
)
|
||||
)
|
||||
)
|
||||
trace.set_tracer_provider(trace_provider)
|
||||
|
||||
# Auto-instrument Flask
|
||||
app = Flask(__name__)
|
||||
FlaskInstrumentor().instrument_app(app)
|
||||
|
||||
# Manual spans
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
@app.route('/api/data')
|
||||
def get_data():
|
||||
with tracer.start_as_current_span("fetch_data") as span:
|
||||
# Your code here
|
||||
span.set_attribute("rows", 100)
|
||||
return {"data": "..."}
|
||||
```
|
||||
|
||||
**Install dependencies:**
|
||||
```bash
|
||||
pip install opentelemetry-api opentelemetry-sdk \
|
||||
opentelemetry-instrumentation-flask \
|
||||
opentelemetry-exporter-otlp-proto-grpc
|
||||
```
|
||||
|
||||
**Go Example:**
|
||||
```go
|
||||
import (
|
||||
"go.opentelemetry.io/otel"
|
||||
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
|
||||
"go.opentelemetry.io/otel/sdk/trace"
|
||||
)
|
||||
|
||||
exporter, _ := otlptracegrpc.New(
|
||||
context.Background(),
|
||||
otlptracegrpc.WithEndpoint("tempo.observability.svc.cluster.local:4317"),
|
||||
otlptracegrpc.WithInsecure(),
|
||||
)
|
||||
|
||||
tp := trace.NewTracerProvider(
|
||||
trace.WithBatcher(exporter),
|
||||
)
|
||||
otel.SetTracerProvider(tp)
|
||||
```
|
||||
|
||||
**Node.js Example:**
|
||||
```javascript
|
||||
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
|
||||
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
|
||||
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
|
||||
|
||||
const provider = new NodeTracerProvider();
|
||||
const exporter = new OTLPTraceExporter({
|
||||
url: 'http://tempo.observability.svc.cluster.local:4317'
|
||||
});
|
||||
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
|
||||
provider.register();
|
||||
```
|
||||
|
||||
#### Step 2: Add Trace IDs to Logs (Optional but Recommended)
|
||||
|
||||
This enables clicking from logs to traces in Grafana!
|
||||
|
||||
**Python Example:**
|
||||
```python
|
||||
import json
|
||||
from opentelemetry import trace
|
||||
|
||||
def log_with_trace(message):
|
||||
span = trace.get_current_span()
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
log_entry = {
|
||||
"message": message,
|
||||
"trace_id": trace_id,
|
||||
"level": "info"
|
||||
}
|
||||
print(json.dumps(log_entry))
|
||||
```
|
||||
|
||||
#### Step 3: Verify Traces
|
||||
|
||||
**In Grafana:**
|
||||
1. Go to Explore → Tempo
|
||||
2. Search for service: "my-app"
|
||||
3. Click on a trace to view details
|
||||
4. Click "Logs for this span" to see correlated logs
|
||||
|
||||
## 📋 Complete Example: Monitoring a New App
|
||||
|
||||
Here's a complete deployment with all monitoring configured:
|
||||
|
||||
```yaml
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: my-app-config
|
||||
namespace: my-namespace
|
||||
data:
|
||||
app.py: |
|
||||
from flask import Flask
|
||||
import logging
|
||||
import json
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
|
||||
from opentelemetry.instrumentation.flask import FlaskInstrumentor
|
||||
from opentelemetry.sdk.resources import Resource
|
||||
from prometheus_flask_exporter import PrometheusMetrics
|
||||
|
||||
# Setup logging
|
||||
logging.basicConfig(level=logging.INFO, format='%(message)s')
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Setup tracing
|
||||
resource = Resource.create({"service.name": "my-app"})
|
||||
trace_provider = TracerProvider(resource=resource)
|
||||
trace_provider.add_span_processor(
|
||||
BatchSpanProcessor(
|
||||
OTLPSpanExporter(
|
||||
endpoint="http://tempo.observability.svc.cluster.local:4317",
|
||||
insecure=True
|
||||
)
|
||||
)
|
||||
)
|
||||
trace.set_tracer_provider(trace_provider)
|
||||
|
||||
app = Flask(__name__)
|
||||
|
||||
# Setup metrics
|
||||
metrics = PrometheusMetrics(app)
|
||||
|
||||
# Auto-instrument with traces
|
||||
FlaskInstrumentor().instrument_app(app)
|
||||
|
||||
@app.route('/')
|
||||
def index():
|
||||
span = trace.get_current_span()
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
logger.info(json.dumps({
|
||||
"level": "info",
|
||||
"message": "Request received",
|
||||
"trace_id": trace_id,
|
||||
"endpoint": "/"
|
||||
}))
|
||||
|
||||
return {"status": "ok", "trace_id": trace_id}
|
||||
|
||||
if __name__ == '__main__':
|
||||
app.run(host='0.0.0.0', port=8080)
|
||||
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: my-namespace
|
||||
labels:
|
||||
app: my-app
|
||||
spec:
|
||||
replicas: 2
|
||||
selector:
|
||||
matchLabels:
|
||||
app: my-app
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: my-app
|
||||
annotations:
|
||||
# Enable Prometheus scraping
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
prometheus.io/path: "/metrics"
|
||||
spec:
|
||||
containers:
|
||||
- name: my-app
|
||||
image: python:3.11-slim
|
||||
command:
|
||||
- /bin/bash
|
||||
- -c
|
||||
- |
|
||||
pip install flask opentelemetry-api opentelemetry-sdk \
|
||||
opentelemetry-instrumentation-flask \
|
||||
opentelemetry-exporter-otlp-proto-grpc \
|
||||
prometheus-flask-exporter && \
|
||||
python /app/app.py
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 8080
|
||||
volumeMounts:
|
||||
- name: app-code
|
||||
mountPath: /app
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
memory: 256Mi
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
volumes:
|
||||
- name: app-code
|
||||
configMap:
|
||||
name: my-app-config
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: my-app
|
||||
namespace: my-namespace
|
||||
labels:
|
||||
app: my-app
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 8080
|
||||
targetPort: http
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
app: my-app
|
||||
```
|
||||
|
||||
## 🔍 Verification Checklist
|
||||
|
||||
After deploying a new app with monitoring:
|
||||
|
||||
### Logs ✓ (Automatic)
|
||||
```bash
|
||||
# Check logs appear in Grafana
|
||||
# Explore → Loki → {namespace="my-namespace", pod=~"my-app.*"}
|
||||
```
|
||||
|
||||
### Metrics ✓ (If configured)
|
||||
```bash
|
||||
# Check Prometheus is scraping
|
||||
# Explore → Prometheus → up{pod=~"my-app.*"}
|
||||
# Should return 1
|
||||
|
||||
# Check your custom metrics
|
||||
# Explore → Prometheus → flask_http_request_total{namespace="my-namespace"}
|
||||
```
|
||||
|
||||
### Traces ✓ (If configured)
|
||||
```bash
|
||||
# Check traces appear in Tempo
|
||||
# Explore → Tempo → Search for service "my-app"
|
||||
# Should see traces
|
||||
|
||||
# Verify log-trace correlation
|
||||
# Click on a log line with trace_id → should jump to trace
|
||||
```
|
||||
|
||||
## 🎓 Quick Start for Common Frameworks
|
||||
|
||||
### Python Flask/FastAPI
|
||||
```bash
|
||||
pip install opentelemetry-distro opentelemetry-exporter-otlp prometheus-flask-exporter
|
||||
opentelemetry-bootstrap -a install
|
||||
```
|
||||
|
||||
```bash
|
||||
# Set environment variables in your deployment:
|
||||
OTEL_SERVICE_NAME=my-app
|
||||
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.observability.svc.cluster.local:4317
|
||||
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
|
||||
|
||||
# Then run with auto-instrumentation:
|
||||
opentelemetry-instrument python app.py
|
||||
```
|
||||
|
||||
### Go
|
||||
```bash
|
||||
go get go.opentelemetry.io/otel
|
||||
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
|
||||
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
|
||||
```
|
||||
|
||||
### Node.js
|
||||
```bash
|
||||
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
|
||||
@opentelemetry/exporter-trace-otlp-grpc prom-client
|
||||
```
|
||||
|
||||
## 📚 Summary
|
||||
|
||||
| Component | Automatic? | Configuration Needed |
|
||||
|-----------|-----------|---------------------|
|
||||
| **Logs** | ✅ Yes | None - just deploy your app |
|
||||
| **Metrics** | ❌ No | Add /metrics endpoint + annotations |
|
||||
| **Traces** | ❌ No | Add OpenTelemetry SDK + configure endpoint |
|
||||
|
||||
**Recommended Approach:**
|
||||
1. **Start simple**: Deploy app, logs work automatically
|
||||
2. **Add metrics**: Expose /metrics, add annotations
|
||||
3. **Add traces**: Instrument with OpenTelemetry
|
||||
4. **Correlate**: Add trace IDs to logs for full observability
|
||||
|
||||
## 🔗 Useful Links
|
||||
|
||||
- OpenTelemetry Python: https://opentelemetry.io/docs/instrumentation/python/
|
||||
- OpenTelemetry Go: https://opentelemetry.io/docs/instrumentation/go/
|
||||
- OpenTelemetry Node.js: https://opentelemetry.io/docs/instrumentation/js/
|
||||
- Prometheus Client Libraries: https://prometheus.io/docs/instrumenting/clientlibs/
|
||||
- Grafana Docs: https://grafana.com/docs/
|
||||
|
||||
## 🆘 Troubleshooting
|
||||
|
||||
**Logs not appearing:**
|
||||
- Check Alloy is running: `kubectl get pods -n observability -l app=alloy`
|
||||
- Check pod logs are being written to stdout/stderr
|
||||
- View in real-time: `kubectl logs -f <pod-name> -n <namespace>`
|
||||
|
||||
**Metrics not being scraped:**
|
||||
- Verify annotations are present: `kubectl get pod <pod> -o yaml | grep prometheus`
|
||||
- Check /metrics endpoint: `kubectl port-forward pod/<pod> 8080:8080` then `curl localhost:8080/metrics`
|
||||
- Check Prometheus targets: https://prometheus.betelgeusebytes.io/targets
|
||||
|
||||
**Traces not appearing:**
|
||||
- Verify endpoint: `tempo.observability.svc.cluster.local:4317`
|
||||
- Check Tempo logs: `kubectl logs -n observability tempo-0`
|
||||
- Verify OTLP exporter is configured correctly in your app
|
||||
- Check network policies allow traffic to observability namespace
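A quick reachability check of the OTLP gRPC port from inside the cluster (a sketch using a throwaway busybox pod):

```bash
kubectl run -it --rm otlp-test --image=busybox --restart=Never -- \
  nc -zv tempo.observability.svc.cluster.local 4317
```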
|
||||
|
|
@ -0,0 +1,398 @@
|
|||
# Observability Stack Quick Reference
|
||||
|
||||
## Before You Start
|
||||
|
||||
### Remove Old Monitoring Stack
|
||||
If you have existing monitoring components, remove them first:
|
||||
```bash
|
||||
./remove-old-monitoring.sh
|
||||
```
|
||||
|
||||
This will safely remove:
|
||||
- Prometheus, Grafana, Loki, Tempo deployments
|
||||
- Fluent Bit, Vector, or other log collectors
|
||||
- Helm releases
|
||||
- ConfigMaps, PVCs, RBAC resources
|
||||
- Prometheus Operator CRDs
|
||||
|
||||
## Quick Access
|
||||
|
||||
- **Grafana UI**: https://grafana.betelgeusebytes.io
|
||||
- **Default Login**: admin / admin (change immediately!)
|
||||
|
||||
## Essential Commands
|
||||
|
||||
### Check Status
|
||||
```bash
|
||||
# Quick status check
|
||||
./status.sh
|
||||
|
||||
# View all pods
|
||||
kubectl get pods -n observability -o wide
|
||||
|
||||
# Check specific component
|
||||
kubectl get pods -n observability -l app=prometheus
|
||||
kubectl get pods -n observability -l app=loki
|
||||
kubectl get pods -n observability -l app=tempo
|
||||
kubectl get pods -n observability -l app=grafana
|
||||
|
||||
# Check storage
|
||||
kubectl get pv
|
||||
kubectl get pvc -n observability
|
||||
```
|
||||
|
||||
### View Logs
|
||||
```bash
|
||||
# Grafana
|
||||
kubectl logs -n observability -l app=grafana -f
|
||||
|
||||
# Prometheus
|
||||
kubectl logs -n observability -l app=prometheus -f
|
||||
|
||||
# Loki
|
||||
kubectl logs -n observability -l app=loki -f
|
||||
|
||||
# Tempo
|
||||
kubectl logs -n observability -l app=tempo -f
|
||||
|
||||
# Alloy (log collector)
|
||||
kubectl logs -n observability -l app=alloy -f
|
||||
```
|
||||
|
||||
### Restart Components
|
||||
```bash
|
||||
# Restart Prometheus
|
||||
kubectl rollout restart statefulset/prometheus -n observability
|
||||
|
||||
# Restart Loki
|
||||
kubectl rollout restart statefulset/loki -n observability
|
||||
|
||||
# Restart Tempo
|
||||
kubectl rollout restart statefulset/tempo -n observability
|
||||
|
||||
# Restart Grafana
|
||||
kubectl rollout restart statefulset/grafana -n observability
|
||||
|
||||
# Restart Alloy
|
||||
kubectl rollout restart daemonset/alloy -n observability
|
||||
```
|
||||
|
||||
### Update Configurations
|
||||
```bash
|
||||
# Edit Prometheus config
|
||||
kubectl edit configmap prometheus-config -n observability
|
||||
kubectl rollout restart statefulset/prometheus -n observability
|
||||
|
||||
# Edit Loki config
|
||||
kubectl edit configmap loki-config -n observability
|
||||
kubectl rollout restart statefulset/loki -n observability
|
||||
|
||||
# Edit Tempo config
|
||||
kubectl edit configmap tempo-config -n observability
|
||||
kubectl rollout restart statefulset/tempo -n observability
|
||||
|
||||
# Edit Alloy config
|
||||
kubectl edit configmap alloy-config -n observability
|
||||
kubectl rollout restart daemonset/alloy -n observability
|
||||
|
||||
# Edit Grafana datasources
|
||||
kubectl edit configmap grafana-datasources -n observability
|
||||
kubectl rollout restart statefulset/grafana -n observability
|
||||
```
|
||||
|
||||
## Common LogQL Queries (Loki)
|
||||
|
||||
### Basic Queries
|
||||
```logql
|
||||
# All logs from observability namespace
|
||||
{namespace="observability"}
|
||||
|
||||
# Logs from specific app
|
||||
{namespace="observability", app="prometheus"}
|
||||
|
||||
# Filter by log level
|
||||
{namespace="default"} |= "error"
|
||||
{namespace="default"} | json | level="error"
|
||||
|
||||
# Exclude certain logs
|
||||
{namespace="default"} != "health check"
|
||||
|
||||
# Multiple filters
|
||||
{namespace="default"} |= "error" != "ignore"
|
||||
```
|
||||
|
||||
### Advanced Queries
|
||||
```logql
|
||||
# Rate of errors
|
||||
rate({namespace="default"} |= "error" [5m])
|
||||
|
||||
# Count logs by level
|
||||
sum by (level) (count_over_time({namespace="default"} | json [5m]))
|
||||
|
||||
# Top 10 error messages
|
||||
topk(10, sum by (message) (
  count_over_time({namespace="default"} | json | level="error" [5m])
))
|
||||
```
|
||||
|
||||
## Common PromQL Queries (Prometheus)
|
||||
|
||||
### Cluster Health
|
||||
```promql
|
||||
# All targets up/down
|
||||
up
|
||||
|
||||
# Pods by phase
|
||||
kube_pod_status_phase{namespace="observability"}
|
||||
|
||||
# Node memory available
|
||||
node_memory_MemAvailable_bytes
|
||||
|
||||
# Node CPU usage
|
||||
rate(node_cpu_seconds_total{mode="user"}[5m])
|
||||
```
|
||||
|
||||
### Container Metrics
|
||||
```promql
|
||||
# CPU usage by container
|
||||
rate(container_cpu_usage_seconds_total[5m])
|
||||
|
||||
# Memory usage by container
|
||||
container_memory_usage_bytes
|
||||
|
||||
# Network traffic
|
||||
rate(container_network_transmit_bytes_total[5m])
|
||||
rate(container_network_receive_bytes_total[5m])
|
||||
```
|
||||
|
||||
### Application Metrics
|
||||
```promql
|
||||
# HTTP request rate
|
||||
rate(http_requests_total[5m])
|
||||
|
||||
# Request duration
|
||||
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
|
||||
|
||||
# Error rate
|
||||
rate(http_requests_total{status=~"5.."}[5m])
|
||||
```
|
||||
|
||||
## Trace Search (Tempo)
|
||||
|
||||
In Grafana Explore with Tempo datasource:
|
||||
|
||||
- **Search by service**: Select from dropdown
|
||||
- **Search by duration**: "> 1s", "< 100ms"
|
||||
- **Search by tag**: `http.status_code=500`
|
||||
- **TraceQL**: `{span.http.method="POST" && span.http.status_code>=400}`
|
||||
|
||||
## Correlations
|
||||
|
||||
### From Logs to Traces
|
||||
1. View logs in Loki
|
||||
2. Click on a log line with a trace ID
|
||||
3. Click the "Tempo" link
|
||||
4. Trace opens in Tempo
|
||||
|
||||
### From Traces to Logs
|
||||
1. View trace in Tempo
|
||||
2. Click on a span
|
||||
3. Click "Logs for this span"
|
||||
4. Related logs appear
|
||||
|
||||
### From Traces to Metrics
|
||||
1. View trace in Tempo
|
||||
2. Service graph shows metrics
|
||||
3. Click service to see metrics
|
||||
|
||||
## Demo Application
|
||||
|
||||
Deploy the demo app to test the stack:
|
||||
|
||||
```bash
|
||||
kubectl apply -f demo-app.yaml
|
||||
|
||||
# Wait for it to start
|
||||
kubectl wait --for=condition=ready pod -l app=demo-app -n observability --timeout=300s
|
||||
|
||||
# Test it
|
||||
kubectl port-forward -n observability svc/demo-app 8080:8080
|
||||
|
||||
# In another terminal
|
||||
curl http://localhost:8080/
|
||||
curl http://localhost:8080/items
|
||||
curl http://localhost:8080/item/0
|
||||
curl http://localhost:8080/slow
|
||||
curl http://localhost:8080/error
|
||||
```
|
||||
|
||||
Now view in Grafana:
|
||||
- **Logs**: Search `{app="demo-app"}` in Loki
|
||||
- **Traces**: Search "demo-app" service in Tempo
|
||||
- **Metrics**: Query `flask_http_request_total` in Prometheus
|
||||
|
||||
## Storage Management
|
||||
|
||||
### Check Disk Usage
|
||||
```bash
|
||||
# On hetzner-2 node
|
||||
df -h /mnt/local-ssd/
|
||||
|
||||
# Detailed usage
|
||||
du -sh /mnt/local-ssd/*
|
||||
```
|
||||
|
||||
### Cleanup Old Data
|
||||
Data is automatically deleted after 7 days. To manually adjust retention:
|
||||
|
||||
**Prometheus** (in 03-prometheus-config.yaml):
|
||||
```yaml
|
||||
args:
|
||||
- '--storage.tsdb.retention.time=7d'
|
||||
```
|
||||
|
||||
**Loki** (in 04-loki-config.yaml):
|
||||
```yaml
|
||||
limits_config:
|
||||
retention_period: 168h # 7 days
|
||||
```
|
||||
|
||||
**Tempo** (in 05-tempo-config.yaml):
|
||||
```yaml
|
||||
compactor:
|
||||
compaction:
|
||||
block_retention: 168h # 7 days
|
||||
```
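After changing any of these values, re-apply the edited manifest and restart the matching component so it picks up the new retention, following the same pattern as "Update Configurations" above:

```bash
# Example for Loki; substitute the Prometheus or Tempo files as needed
kubectl apply -f 04-loki-config.yaml
kubectl rollout restart statefulset/loki -n observability
```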
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### No Logs Appearing
|
||||
```bash
|
||||
# Check Alloy is running
|
||||
kubectl get pods -n observability -l app=alloy
|
||||
|
||||
# Check Alloy logs
|
||||
kubectl logs -n observability -l app=alloy
|
||||
|
||||
# Check Loki
|
||||
kubectl logs -n observability -l app=loki
|
||||
|
||||
# Test Loki endpoint
|
||||
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
|
||||
curl http://loki.observability.svc.cluster.local:3100/ready
|
||||
```
|
||||
|
||||
### No Traces Appearing
|
||||
```bash
|
||||
# Check Tempo is running
|
||||
kubectl get pods -n observability -l app=tempo
|
||||
|
||||
# Check Tempo logs
|
||||
kubectl logs -n observability -l app=tempo
|
||||
|
||||
# Test Tempo endpoint
|
||||
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
|
||||
curl http://tempo.observability.svc.cluster.local:3200/ready
|
||||
|
||||
# Verify your app sends to correct endpoint
|
||||
# Should be: tempo.observability.svc.cluster.local:4317 (gRPC)
|
||||
# or: tempo.observability.svc.cluster.local:4318 (HTTP)
|
||||
```
|
||||
|
||||
### Grafana Can't Connect to Datasources
|
||||
```bash
|
||||
# Check all services are running
|
||||
kubectl get svc -n observability
|
||||
|
||||
# Test from Grafana pod
|
||||
kubectl exec -it -n observability grafana-0 -- \
|
||||
wget -O- http://prometheus.observability.svc.cluster.local:9090/-/healthy
|
||||
|
||||
kubectl exec -it -n observability grafana-0 -- \
|
||||
wget -O- http://loki.observability.svc.cluster.local:3100/ready
|
||||
|
||||
kubectl exec -it -n observability grafana-0 -- \
|
||||
wget -O- http://tempo.observability.svc.cluster.local:3200/ready
|
||||
```
|
||||
|
||||
### High Resource Usage
|
||||
```bash
|
||||
# Check resource usage
|
||||
kubectl top pods -n observability
|
||||
kubectl top nodes
|
||||
|
||||
# Scale down if needed (for testing)
|
||||
kubectl scale statefulset/prometheus -n observability --replicas=0
|
||||
kubectl scale statefulset/loki -n observability --replicas=0
|
||||
```
|
||||
|
||||
## Backup and Restore
|
||||
|
||||
### Backup Grafana Dashboards
|
||||
```bash
|
||||
# Export all dashboards via API
|
||||
kubectl port-forward -n observability svc/grafana 3000:3000
|
||||
|
||||
# In another terminal
|
||||
curl -H "Authorization: Bearer <API_KEY>" \
|
||||
http://localhost:3000/api/search?type=dash-db | jq
|
||||
```
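The search call above only lists dashboards; to actually save one, fetch it by UID (the UID appears in the search output):

```bash
curl -s -H "Authorization: Bearer <API_KEY>" \
  http://localhost:3000/api/dashboards/uid/<dashboard-uid> > dashboard-backup.json
```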
|
||||
|
||||
### Backup Configurations
|
||||
```bash
|
||||
# Backup all ConfigMaps
|
||||
kubectl get configmap -n observability -o yaml > configmaps-backup.yaml
|
||||
|
||||
# Backup specific config
|
||||
kubectl get configmap prometheus-config -n observability -o yaml > prometheus-config-backup.yaml
|
||||
```
|
||||
|
||||
## Useful Dashboards in Grafana
|
||||
|
||||
After login, import these dashboard IDs:
|
||||
|
||||
- **315**: Kubernetes cluster monitoring
|
||||
- **7249**: Kubernetes cluster
|
||||
- **13639**: Loki dashboard
|
||||
- **12611**: Tempo dashboard
|
||||
- **3662**: Prometheus 2.0 stats
|
||||
- **1860**: Node Exporter Full
|
||||
|
||||
Go to: Dashboards → Import → Enter ID → Load
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
### For Higher Load
|
||||
Increase resources in respective YAML files:
|
||||
|
||||
```yaml
|
||||
resources:
|
||||
requests:
|
||||
cpu: 1000m # from 500m
|
||||
memory: 4Gi # from 2Gi
|
||||
limits:
|
||||
cpu: 4000m # from 2000m
|
||||
memory: 8Gi # from 4Gi
|
||||
```
|
||||
|
||||
### For Lower Resource Usage
|
||||
- Increase scrape intervals (scrape less often) in the Prometheus config
|
||||
- Reduce log retention periods
|
||||
- Reduce trace sampling rate
|
||||
|
||||
## Security Checklist
|
||||
|
||||
- [ ] Change Grafana admin password
|
||||
- [ ] Review RBAC permissions
|
||||
- [ ] Enable audit logging
|
||||
- [ ] Consider adding NetworkPolicies
|
||||
- [ ] Review ingress TLS configuration
|
||||
- [ ] Backup configurations regularly
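For the NetworkPolicy item, a first step is simply seeing what is currently in place:

```bash
kubectl get networkpolicy -n observability
```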
|
||||
|
||||
## Getting Help
|
||||
|
||||
1. Check component logs first
|
||||
2. Review configurations
|
||||
3. Test network connectivity
|
||||
4. Check resource availability
|
||||
5. Review Grafana datasource settings
|
||||
|
|
@ -0,0 +1,385 @@
|
|||
# State-of-the-Art Observability Stack for Kubernetes
|
||||
|
||||
This deployment provides a comprehensive, production-ready observability solution using the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) with unified collection through Grafana Alloy.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
### Core Components
|
||||
|
||||
1. **Grafana** (v11.4.0) - Unified visualization platform
|
||||
- Pre-configured datasources for Prometheus, Loki, and Tempo
|
||||
- Automatic correlation between logs, metrics, and traces
|
||||
- Modern UI with TraceQL editor support
|
||||
|
||||
2. **Prometheus** (v2.54.1) - Metrics collection and storage
|
||||
- 7-day retention
|
||||
- Comprehensive Kubernetes service discovery
|
||||
- Scrapes: API server, nodes, cadvisor, pods, services
|
||||
|
||||
3. **Grafana Loki** (v3.2.1) - Log aggregation
|
||||
- 7-day retention with compaction
|
||||
- TSDB index for efficient queries
|
||||
- Automatic correlation with traces
|
||||
|
||||
4. **Grafana Tempo** (v2.6.1) - Distributed tracing
|
||||
- 7-day retention
|
||||
- Multiple protocol support: OTLP, Jaeger, Zipkin
|
||||
- Metrics generation from traces
|
||||
- Automatic correlation with logs and metrics
|
||||
|
||||
5. **Grafana Alloy** (v1.5.1) - Unified observability agent
|
||||
- Replaces Promtail, Vector, Fluent Bit
|
||||
- Collects logs from all pods
|
||||
- OTLP receiver for traces
|
||||
- Runs as DaemonSet on all nodes
|
||||
|
||||
6. **kube-state-metrics** (v2.13.0) - Kubernetes object metrics
|
||||
- Deployment, Pod, Service, Node metrics
|
||||
- Essential for cluster monitoring
|
||||
|
||||
7. **node-exporter** (v1.8.2) - Node-level system metrics
|
||||
- CPU, memory, disk, network metrics
|
||||
- Runs on all nodes via DaemonSet
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Unified Observability**: Logs, metrics, and traces in one platform
|
||||
- **Automatic Correlation**: Click from logs to traces to metrics seamlessly
|
||||
- **7-Day Retention**: Optimized for single-node cluster
|
||||
- **Local SSD Storage**: Fast, persistent storage on hetzner-2 node
|
||||
- **OTLP Support**: Modern OpenTelemetry protocol support
|
||||
- **TLS Enabled**: Secure access via NGINX Ingress with Let's Encrypt
|
||||
- **Low Resource Footprint**: Optimized for single-node deployment
|
||||
|
||||
## Storage Layout
|
||||
|
||||
All data stored on local SSD at `/mnt/local-ssd/`:
|
||||
|
||||
```
|
||||
/mnt/local-ssd/
|
||||
├── prometheus/ (50Gi) - Metrics data
|
||||
├── loki/ (100Gi) - Log data
|
||||
├── tempo/ (50Gi) - Trace data
|
||||
└── grafana/ (10Gi) - Dashboards and settings
|
||||
```
|
||||
|
||||
## Deployment Instructions
|
||||
|
||||
### Prerequisites
|
||||
|
||||
1. Kubernetes cluster with NGINX Ingress Controller
|
||||
2. cert-manager installed with Let's Encrypt issuer
|
||||
3. DNS record: `grafana.betelgeusebytes.io` → your cluster IP
|
||||
4. Node labeled: `kubernetes.io/hostname=hetzner-2`
|
||||
|
||||
### Step 0: Remove Existing Monitoring (If Applicable)
|
||||
|
||||
If you have an existing monitoring stack (Prometheus, Grafana, Loki, Fluent Bit, etc.), remove it first to avoid conflicts:
|
||||
|
||||
```bash
|
||||
./remove-old-monitoring.sh
|
||||
```
|
||||
|
||||
This interactive script will help you safely remove:
|
||||
- Existing Prometheus/Grafana/Loki/Tempo deployments
|
||||
- Helm releases for monitoring components
|
||||
- Fluent Bit, Vector, or other log collectors
|
||||
- Related ConfigMaps, PVCs, and RBAC resources
|
||||
- Prometheus Operator CRDs (if applicable)
|
||||
|
||||
**Note**: The main deployment script (`deploy.sh`) will also prompt you to run cleanup if needed.
|
||||
|
||||
### Step 1: Prepare Storage Directories
|
||||
|
||||
SSH into the hetzner-2 node and create directories:
|
||||
|
||||
```bash
|
||||
sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}
|
||||
sudo chown -R 65534:65534 /mnt/local-ssd/prometheus
|
||||
sudo chown -R 10001:10001 /mnt/local-ssd/loki
|
||||
sudo chown -R root:root /mnt/local-ssd/tempo
|
||||
sudo chown -R 472:472 /mnt/local-ssd/grafana
|
||||
```
|
||||
|
||||
### Step 2: Deploy the Stack
|
||||
|
||||
```bash
|
||||
chmod +x deploy.sh
|
||||
./deploy.sh
|
||||
```
|
||||
|
||||
Or deploy manually:
|
||||
|
||||
```bash
|
||||
kubectl apply -f 00-namespace.yaml
|
||||
kubectl apply -f 01-persistent-volumes.yaml
|
||||
kubectl apply -f 02-persistent-volume-claims.yaml
|
||||
kubectl apply -f 03-prometheus-config.yaml
|
||||
kubectl apply -f 04-loki-config.yaml
|
||||
kubectl apply -f 05-tempo-config.yaml
|
||||
kubectl apply -f 06-alloy-config.yaml
|
||||
kubectl apply -f 07-grafana-datasources.yaml
|
||||
kubectl apply -f 08-rbac.yaml
|
||||
kubectl apply -f 10-prometheus.yaml
|
||||
kubectl apply -f 11-loki.yaml
|
||||
kubectl apply -f 12-tempo.yaml
|
||||
kubectl apply -f 13-grafana.yaml
|
||||
kubectl apply -f 14-alloy.yaml
|
||||
kubectl apply -f 15-kube-state-metrics.yaml
|
||||
kubectl apply -f 16-node-exporter.yaml
|
||||
kubectl apply -f 20-grafana-ingress.yaml
|
||||
```
|
||||
|
||||
### Step 3: Verify Deployment
|
||||
|
||||
```bash
|
||||
kubectl get pods -n observability
|
||||
kubectl get pv
|
||||
kubectl get pvc -n observability
|
||||
```
|
||||
|
||||
All pods should be in `Running` state:
|
||||
- grafana-0
|
||||
- loki-0
|
||||
- prometheus-0
|
||||
- tempo-0
|
||||
- alloy-xxxxx (one per node)
|
||||
- kube-state-metrics-xxxxx
|
||||
- node-exporter-xxxxx (one per node)
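Optionally, wait for the whole namespace to become ready in one command (five-minute timeout):

```bash
kubectl wait --for=condition=ready pod --all -n observability --timeout=300s
```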
|
||||
|
||||
### Step 4: Access Grafana
|
||||
|
||||
1. Open: https://grafana.betelgeusebytes.io
|
||||
2. Login with default credentials:
|
||||
- Username: `admin`
|
||||
- Password: `admin`
|
||||
3. **IMPORTANT**: Change the password on first login!
|
||||
|
||||
## Using the Stack
|
||||
|
||||
### Exploring Logs (Loki)
|
||||
|
||||
1. In Grafana, go to **Explore**
|
||||
2. Select **Loki** datasource
|
||||
3. Example queries:
|
||||
```
|
||||
{namespace="observability"}
|
||||
{namespace="observability", app="prometheus"}
|
||||
{namespace="default"} |= "error"
|
||||
{pod="my-app-xxx"} | json | level="error"
|
||||
```
|
||||
|
||||
### Exploring Metrics (Prometheus)
|
||||
|
||||
1. In Grafana, go to **Explore**
|
||||
2. Select **Prometheus** datasource
|
||||
3. Example queries:
|
||||
```
|
||||
up
|
||||
node_memory_MemAvailable_bytes
|
||||
rate(container_cpu_usage_seconds_total[5m])
|
||||
kube_pod_status_phase{namespace="observability"}
|
||||
```
|
||||
|
||||
### Exploring Traces (Tempo)
|
||||
|
||||
1. In Grafana, go to **Explore**
|
||||
2. Select **Tempo** datasource
|
||||
3. Search by:
|
||||
- Service name
|
||||
- Duration
|
||||
- Tags
|
||||
4. Click on a trace to see detailed span timeline
|
||||
|
||||
### Correlations
|
||||
|
||||
The stack automatically correlates:
|
||||
- **Logs → Traces**: Click traceID in logs to view trace
|
||||
- **Traces → Logs**: Click on trace to see related logs
|
||||
- **Traces → Metrics**: Tempo generates metrics from traces
|
||||
|
||||
### Instrumenting Your Applications
|
||||
|
||||
#### For Logs
|
||||
Logs are automatically collected from all pods by Alloy. Emit structured JSON logs:
|
||||
|
||||
```json
|
||||
{"level":"info","message":"Request processed","duration_ms":42}
|
||||
```
|
||||
|
||||
#### For Traces
|
||||
Send traces to Tempo using OTLP:
|
||||
|
||||
```python
|
||||
# Python with OpenTelemetry
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
|
||||
|
||||
provider = TracerProvider()
|
||||
provider.add_span_processor(
|
||||
BatchSpanProcessor(
|
||||
OTLPSpanExporter(endpoint="http://tempo.observability.svc.cluster.local:4317")
|
||||
)
|
||||
)
|
||||
trace.set_tracer_provider(provider)
|
||||
```
|
||||
|
||||
#### For Metrics
|
||||
Expose metrics in Prometheus format and add annotations to your pod:
|
||||
|
||||
```yaml
|
||||
metadata:
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
prometheus.io/path: "/metrics"
|
||||
```
|
||||
|
||||
## Monitoring Endpoints
|
||||
|
||||
Internal service endpoints:
|
||||
|
||||
- **Prometheus**: `http://prometheus.observability.svc.cluster.local:9090`
|
||||
- **Loki**: `http://loki.observability.svc.cluster.local:3100`
|
||||
- **Tempo**:
|
||||
- HTTP: `http://tempo.observability.svc.cluster.local:3200`
|
||||
- OTLP gRPC: `tempo.observability.svc.cluster.local:4317`
|
||||
- OTLP HTTP: `tempo.observability.svc.cluster.local:4318`
|
||||
- **Grafana**: `http://grafana.observability.svc.cluster.local:3000`
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Check Pod Status
|
||||
```bash
|
||||
kubectl get pods -n observability
|
||||
kubectl describe pod <pod-name> -n observability
|
||||
```
|
||||
|
||||
### View Logs
|
||||
```bash
|
||||
kubectl logs -n observability -l app=grafana
|
||||
kubectl logs -n observability -l app=prometheus
|
||||
kubectl logs -n observability -l app=loki
|
||||
kubectl logs -n observability -l app=tempo
|
||||
kubectl logs -n observability -l app=alloy
|
||||
```
|
||||
|
||||
### Check Storage
|
||||
```bash
|
||||
kubectl get pv
|
||||
kubectl get pvc -n observability
|
||||
```
|
||||
|
||||
### Test Connectivity
|
||||
```bash
|
||||
# From inside cluster
|
||||
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
|
||||
curl http://prometheus.observability.svc.cluster.local:9090/-/healthy
|
||||
```
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Pods stuck in Pending**
|
||||
- Check if storage directories exist on hetzner-2
|
||||
- Verify PV/PVC bindings: `kubectl describe pvc -n observability`
|
||||
|
||||
**Loki won't start**
|
||||
- Check permissions on `/mnt/local-ssd/loki` (should be 10001:10001)
|
||||
- View logs: `kubectl logs -n observability loki-0`
|
||||
|
||||
**No logs appearing**
|
||||
- Check Alloy pods are running: `kubectl get pods -n observability -l app=alloy`
|
||||
- View Alloy logs: `kubectl logs -n observability -l app=alloy`
|
||||
|
||||
**Grafana can't reach datasources**
|
||||
- Verify services: `kubectl get svc -n observability`
|
||||
- Check datasource URLs in Grafana UI
|
||||
|
||||
## Updating Configuration
|
||||
|
||||
### Update Prometheus Scrape Config
|
||||
```bash
|
||||
kubectl edit configmap prometheus-config -n observability
|
||||
kubectl rollout restart statefulset/prometheus -n observability
|
||||
```
|
||||
|
||||
### Update Loki Retention
|
||||
```bash
|
||||
kubectl edit configmap loki-config -n observability
|
||||
kubectl rollout restart statefulset/loki -n observability
|
||||
```
|
||||
|
||||
### Update Alloy Collection Rules
|
||||
```bash
|
||||
kubectl edit configmap alloy-config -n observability
|
||||
kubectl rollout restart daemonset/alloy -n observability
|
||||
```
|
||||
|
||||
## Resource Usage
|
||||
|
||||
Expected resource consumption:
|
||||
|
||||
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|
||||
|-----------|-------------|-----------|----------------|--------------|
|
||||
| Prometheus | 500m | 2000m | 2Gi | 4Gi |
|
||||
| Loki | 500m | 2000m | 1Gi | 2Gi |
|
||||
| Tempo | 500m | 2000m | 1Gi | 2Gi |
|
||||
| Grafana | 250m | 1000m | 512Mi | 1Gi |
|
||||
| Alloy (per node) | 100m | 500m | 256Mi | 512Mi |
|
||||
| kube-state-metrics | 100m | 200m | 128Mi | 256Mi |
|
||||
| node-exporter (per node) | 100m | 200m | 128Mi | 256Mi |
|
||||
|
||||
**Total (single node)**: ~2.1 CPU cores and ~5Gi memory requested (up to ~10Gi at the memory limits)
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **Change default Grafana password** immediately after deployment
|
||||
2. Consider adding authentication for internal services if exposed
|
||||
3. Review and restrict RBAC permissions as needed
|
||||
4. Enable audit logging in Loki for sensitive namespaces
|
||||
5. Consider adding NetworkPolicies to restrict traffic
|
||||
|
||||
## Documentation
|
||||
|
||||
This deployment includes comprehensive guides:
|
||||
|
||||
- **README.md**: Complete deployment and configuration guide (this file)
|
||||
- **MONITORING-GUIDE.md**: URLs, access, and how to monitor new applications
|
||||
- **DEPLOYMENT-CHECKLIST.md**: Step-by-step deployment checklist
|
||||
- **QUICKREF.md**: Quick reference for daily operations
|
||||
- **demo-app.yaml**: Example fully instrumented application
|
||||
- **deploy.sh**: Automated deployment script
|
||||
- **status.sh**: Health check script
|
||||
- **cleanup.sh**: Complete stack removal
|
||||
- **remove-old-monitoring.sh**: Remove existing monitoring before deployment
|
||||
- **21-optional-ingresses.yaml**: Optional external access to Prometheus/Loki/Tempo
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
- Add Alertmanager for alerting
|
||||
- Configure Grafana SMTP for email notifications
|
||||
- Add custom dashboards for your applications
|
||||
- Implement Grafana RBAC for team access
|
||||
- Consider Mimir for long-term metrics storage
|
||||
- Add backup/restore procedures
|
||||
|
||||
## Support
|
||||
|
||||
For issues or questions:
|
||||
1. Check pod logs first
|
||||
2. Review Grafana datasource configuration
|
||||
3. Verify network connectivity between components
|
||||
4. Check storage and resource availability
|
||||
|
||||
## Version Information
|
||||
|
||||
- Grafana: 11.4.0
|
||||
- Prometheus: 2.54.1
|
||||
- Loki: 3.2.1
|
||||
- Tempo: 2.6.1
|
||||
- Alloy: 1.5.1
|
||||
- kube-state-metrics: 2.13.0
|
||||
- node-exporter: 1.8.2
|
||||
|
||||
Last updated: January 2025
|
||||
|
|
@ -0,0 +1,62 @@
|
|||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "=================================================="
|
||||
echo "Removing Observability Stack from Kubernetes"
|
||||
echo "=================================================="
|
||||
echo ""
|
||||
|
||||
RED='\033[0;31m'
|
||||
YELLOW='\033[1;33m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
echo -e "${RED}WARNING: This will delete all observability data!${NC}"
|
||||
echo ""
|
||||
read -p "Are you sure you want to continue? (yes/no): " confirm
|
||||
|
||||
if [ "$confirm" != "yes" ]; then
|
||||
echo "Cleanup cancelled."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo -e "${YELLOW}Removing Ingress...${NC}"
|
||||
kubectl delete -f 20-grafana-ingress.yaml --ignore-not-found
|
||||
|
||||
echo -e "${YELLOW}Removing Deployments and DaemonSets...${NC}"
|
||||
kubectl delete -f 16-node-exporter.yaml --ignore-not-found
|
||||
kubectl delete -f 15-kube-state-metrics.yaml --ignore-not-found
|
||||
kubectl delete -f 14-alloy.yaml --ignore-not-found
|
||||
kubectl delete -f 13-grafana.yaml --ignore-not-found
|
||||
kubectl delete -f 12-tempo.yaml --ignore-not-found
|
||||
kubectl delete -f 11-loki.yaml --ignore-not-found
|
||||
kubectl delete -f 10-prometheus.yaml --ignore-not-found
|
||||
|
||||
echo -e "${YELLOW}Removing RBAC...${NC}"
|
||||
kubectl delete -f 08-rbac.yaml --ignore-not-found
|
||||
|
||||
echo -e "${YELLOW}Removing ConfigMaps...${NC}"
|
||||
kubectl delete -f 07-grafana-datasources.yaml --ignore-not-found
|
||||
kubectl delete -f 06-alloy-config.yaml --ignore-not-found
|
||||
kubectl delete -f 05-tempo-config.yaml --ignore-not-found
|
||||
kubectl delete -f 04-loki-config.yaml --ignore-not-found
|
||||
kubectl delete -f 03-prometheus-config.yaml --ignore-not-found
|
||||
|
||||
echo -e "${YELLOW}Removing PVCs...${NC}"
|
||||
kubectl delete -f 02-persistent-volume-claims.yaml --ignore-not-found
|
||||
|
||||
echo -e "${YELLOW}Removing PVs...${NC}"
|
||||
kubectl delete -f 01-persistent-volumes.yaml --ignore-not-found
|
||||
|
||||
echo -e "${YELLOW}Removing Namespace...${NC}"
|
||||
kubectl delete -f 00-namespace.yaml --ignore-not-found
|
||||
|
||||
echo ""
|
||||
echo -e "${RED}=================================================="
|
||||
echo "Cleanup Complete!"
|
||||
echo "==================================================${NC}"
|
||||
echo ""
|
||||
echo "Data directories on hetzner-2 node are preserved."
|
||||
echo "To remove them, run on the node:"
|
||||
echo " sudo rm -rf /mnt/local-ssd/{prometheus,loki,tempo,grafana}"
|
||||
echo ""
|
||||
|
|
@ -0,0 +1,253 @@
|
|||
---
|
||||
# Example instrumented application to test the observability stack
|
||||
# This is a simple Python Flask app with OpenTelemetry instrumentation
|
||||
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: demo-app
|
||||
namespace: observability
|
||||
data:
|
||||
app.py: |
|
||||
from flask import Flask, jsonify
|
||||
import logging
|
||||
import json
|
||||
import time
|
||||
import random
|
||||
|
||||
# OpenTelemetry imports
|
||||
from opentelemetry import trace, metrics
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
|
||||
from opentelemetry.sdk.metrics import MeterProvider
|
||||
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
|
||||
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
|
||||
from opentelemetry.instrumentation.flask import FlaskInstrumentor
|
||||
from opentelemetry.sdk.resources import Resource
|
||||
from prometheus_flask_exporter import PrometheusMetrics
|
||||
|
||||
# Configure structured logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(message)s'
|
||||
)
|
||||
|
||||
class JSONFormatter(logging.Formatter):
|
||||
def format(self, record):
|
||||
log_obj = {
|
||||
'timestamp': self.formatTime(record, self.datefmt),
|
||||
'level': record.levelname,
|
||||
'message': record.getMessage(),
|
||||
'logger': record.name,
|
||||
}
|
||||
if hasattr(record, 'trace_id'):
|
||||
log_obj['trace_id'] = record.trace_id
|
||||
log_obj['span_id'] = record.span_id
|
||||
return json.dumps(log_obj)
|
||||
|
||||
handler = logging.StreamHandler()
|
||||
handler.setFormatter(JSONFormatter())
|
||||
logger = logging.getLogger(__name__)
|
||||
logger.addHandler(handler)
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
# Configure OpenTelemetry
|
||||
resource = Resource.create({"service.name": "demo-app"})
|
||||
|
||||
# Tracing
|
||||
trace_provider = TracerProvider(resource=resource)
|
||||
trace_provider.add_span_processor(
|
||||
BatchSpanProcessor(
|
||||
OTLPSpanExporter(
|
||||
endpoint="http://tempo.observability.svc.cluster.local:4317",
|
||||
insecure=True
|
||||
)
|
||||
)
|
||||
)
|
||||
trace.set_tracer_provider(trace_provider)
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
# Create Flask app
|
||||
app = Flask(__name__)
|
||||
|
||||
# Prometheus metrics
|
||||
metrics = PrometheusMetrics(app)
|
||||
|
||||
# Auto-instrument Flask
|
||||
FlaskInstrumentor().instrument_app(app)
|
||||
|
||||
# Sample data
|
||||
ITEMS = ["apple", "banana", "orange", "grape", "mango"]
|
||||
|
||||
@app.route('/')
|
||||
def index():
|
||||
span = trace.get_current_span()
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
logger.info("Index page accessed", extra={
|
||||
'trace_id': trace_id,
|
||||
'endpoint': '/'
|
||||
})
|
||||
|
||||
return jsonify({
|
||||
'service': 'demo-app',
|
||||
'status': 'healthy',
|
||||
'trace_id': trace_id
|
||||
})
|
||||
|
||||
@app.route('/items')
|
||||
def get_items():
|
||||
with tracer.start_as_current_span("fetch_items") as span:
|
||||
# Simulate database query
|
||||
time.sleep(random.uniform(0.01, 0.1))
|
||||
|
||||
span.set_attribute("items.count", len(ITEMS))
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
logger.info("Items fetched", extra={
|
||||
'trace_id': trace_id,
|
||||
'count': len(ITEMS)
|
||||
})
|
||||
|
||||
return jsonify({
|
||||
'items': ITEMS,
|
||||
'count': len(ITEMS),
|
||||
'trace_id': trace_id
|
||||
})
|
||||
|
||||
@app.route('/item/<int:item_id>')
|
||||
def get_item(item_id):
|
||||
with tracer.start_as_current_span("fetch_item") as span:
|
||||
span.set_attribute("item.id", item_id)
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
# Simulate processing
|
||||
time.sleep(random.uniform(0.01, 0.05))
|
||||
|
||||
if item_id < 0 or item_id >= len(ITEMS):
|
||||
logger.warning("Item not found", extra={
|
||||
'trace_id': trace_id,
|
||||
'item_id': item_id
|
||||
})
|
||||
return jsonify({'error': 'Item not found', 'trace_id': trace_id}), 404
|
||||
|
||||
item = ITEMS[item_id]
|
||||
logger.info("Item fetched", extra={
|
||||
'trace_id': trace_id,
|
||||
'item_id': item_id,
|
||||
'item': item
|
||||
})
|
||||
|
||||
return jsonify({
|
||||
'id': item_id,
|
||||
'name': item,
|
||||
'trace_id': trace_id
|
||||
})
|
||||
|
||||
@app.route('/slow')
|
||||
def slow_endpoint():
|
||||
with tracer.start_as_current_span("slow_operation") as span:
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
logger.info("Slow operation started", extra={'trace_id': trace_id})
|
||||
|
||||
# Simulate slow operation
|
||||
time.sleep(random.uniform(1, 3))
|
||||
|
||||
logger.info("Slow operation completed", extra={'trace_id': trace_id})
|
||||
|
||||
return jsonify({
|
||||
'message': 'Operation completed',
|
||||
'trace_id': trace_id
|
||||
})
|
||||
|
||||
@app.route('/error')
|
||||
def error_endpoint():
|
||||
with tracer.start_as_current_span("error_operation") as span:
|
||||
trace_id = format(span.get_span_context().trace_id, '032x')
|
||||
|
||||
logger.error("Intentional error triggered", extra={'trace_id': trace_id})
|
||||
span.set_attribute("error", True)
|
||||
|
||||
return jsonify({
|
||||
'error': 'This is an intentional error',
|
||||
'trace_id': trace_id
|
||||
}), 500
|
||||
|
||||
if __name__ == '__main__':
|
||||
app.run(host='0.0.0.0', port=8080)
|
||||
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: demo-app
|
||||
namespace: observability
|
||||
labels:
|
||||
app: demo-app
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: demo-app
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: demo-app
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
prometheus.io/path: "/metrics"
|
||||
spec:
|
||||
containers:
|
||||
- name: demo-app
|
||||
image: python:3.11-slim
|
||||
command:
|
||||
- /bin/bash
|
||||
- -c
|
||||
- |
|
||||
pip install flask opentelemetry-api opentelemetry-sdk \
|
||||
opentelemetry-instrumentation-flask \
|
||||
opentelemetry-exporter-otlp-proto-grpc \
|
||||
prometheus-flask-exporter && \
|
||||
python /app/app.py
|
||||
ports:
|
||||
- name: http
|
||||
containerPort: 8080
|
||||
volumeMounts:
|
||||
- name: app-code
|
||||
mountPath: /app
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
memory: 256Mi
|
||||
limits:
|
||||
cpu: 500m
|
||||
memory: 512Mi
|
||||
volumes:
|
||||
- name: app-code
|
||||
configMap:
|
||||
name: demo-app
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: demo-app
|
||||
namespace: observability
|
||||
labels:
|
||||
app: demo-app
|
||||
annotations:
|
||||
prometheus.io/scrape: "true"
|
||||
prometheus.io/port: "8080"
|
||||
prometheus.io/path: "/metrics"
|
||||
spec:
|
||||
type: ClusterIP
|
||||
ports:
|
||||
- port: 8080
|
||||
targetPort: http
|
||||
protocol: TCP
|
||||
name: http
|
||||
selector:
|
||||
app: demo-app
|
||||
|
|
@ -0,0 +1,114 @@
|
|||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "=================================================="
|
||||
echo "Deploying Observability Stack to Kubernetes"
|
||||
echo "=================================================="
|
||||
echo ""
|
||||
|
||||
# Colors for output
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
RED='\033[0;31m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
echo -e "${YELLOW}Pre-deployment Check: Existing Monitoring Stack${NC}"
|
||||
echo ""
|
||||
echo "If you have an existing monitoring/prometheus/grafana deployment,"
|
||||
echo "you should remove it first to avoid conflicts."
|
||||
echo ""
|
||||
read -p "Do you want to run the cleanup script now? (yes/no): " run_cleanup
|
||||
if [ "$run_cleanup" = "yes" ]; then
|
||||
if [ -f "./remove-old-monitoring.sh" ]; then
|
||||
echo "Running cleanup script..."
|
||||
./remove-old-monitoring.sh
|
||||
echo ""
|
||||
echo "Cleanup complete. Continuing with deployment..."
|
||||
echo ""
|
||||
else
|
||||
echo -e "${RED}Error: remove-old-monitoring.sh not found${NC}"
|
||||
echo "Please run it manually before deploying."
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
|
||||
echo -e "${YELLOW}Step 1: Creating storage directories on node...${NC}"
|
||||
echo "Please run this on the hetzner-2 node:"
|
||||
echo " sudo mkdir -p /mnt/local-ssd/{prometheus,loki,tempo,grafana}"
|
||||
echo " sudo chown -R 65534:65534 /mnt/local-ssd/prometheus"
|
||||
echo " sudo chown -R 10001:10001 /mnt/local-ssd/loki"
|
||||
echo " sudo chown -R root:root /mnt/local-ssd/tempo"
|
||||
echo " sudo chown -R 472:472 /mnt/local-ssd/grafana"
|
||||
echo ""
|
||||
read -p "Press Enter once directories are created..."
|
||||
|
||||
echo -e "${GREEN}Step 2: Creating namespace...${NC}"
|
||||
kubectl apply -f 00-namespace.yaml
|
||||
|
||||
echo -e "${GREEN}Step 3: Creating PersistentVolumes...${NC}"
|
||||
kubectl apply -f 01-persistent-volumes.yaml
|
||||
|
||||
echo -e "${GREEN}Step 4: Creating PersistentVolumeClaims...${NC}"
|
||||
kubectl apply -f 02-persistent-volume-claims.yaml
|
||||
|
||||
echo -e "${GREEN}Step 5: Creating ConfigMaps...${NC}"
|
||||
kubectl apply -f 03-prometheus-config.yaml
|
||||
kubectl apply -f 04-loki-config.yaml
|
||||
kubectl apply -f 05-tempo-config.yaml
|
||||
kubectl apply -f 06-alloy-config.yaml
|
||||
kubectl apply -f 07-grafana-datasources.yaml
|
||||
|
||||
echo -e "${GREEN}Step 6: Creating RBAC resources...${NC}"
|
||||
kubectl apply -f 08-rbac.yaml
|
||||
|
||||
echo -e "${GREEN}Step 7: Deploying Prometheus...${NC}"
|
||||
kubectl apply -f 10-prometheus.yaml
|
||||
|
||||
echo -e "${GREEN}Step 8: Deploying Loki...${NC}"
|
||||
kubectl apply -f 11-loki.yaml
|
||||
|
||||
echo -e "${GREEN}Step 9: Deploying Tempo...${NC}"
|
||||
kubectl apply -f 12-tempo.yaml
|
||||
|
||||
echo -e "${GREEN}Step 10: Deploying Grafana...${NC}"
|
||||
kubectl apply -f 13-grafana.yaml
|
||||
|
||||
echo -e "${GREEN}Step 11: Deploying Grafana Alloy...${NC}"
|
||||
kubectl apply -f 14-alloy.yaml
|
||||
|
||||
echo -e "${GREEN}Step 12: Deploying kube-state-metrics...${NC}"
|
||||
kubectl apply -f 15-kube-state-metrics.yaml
|
||||
|
||||
echo -e "${GREEN}Step 13: Deploying node-exporter...${NC}"
|
||||
kubectl apply -f 16-node-exporter.yaml
|
||||
|
||||
echo -e "${GREEN}Step 14: Creating Grafana Ingress...${NC}"
|
||||
kubectl apply -f 20-grafana-ingress.yaml
|
||||
|
||||
echo ""
|
||||
echo -e "${GREEN}=================================================="
|
||||
echo "Deployment Complete!"
|
||||
echo "==================================================${NC}"
|
||||
echo ""
|
||||
echo "Waiting for pods to be ready..."
|
||||
kubectl wait --for=condition=ready pod -l app=prometheus -n observability --timeout=300s
|
||||
kubectl wait --for=condition=ready pod -l app=loki -n observability --timeout=300s
|
||||
kubectl wait --for=condition=ready pod -l app=tempo -n observability --timeout=300s
|
||||
kubectl wait --for=condition=ready pod -l app=grafana -n observability --timeout=300s
|
||||
|
||||
echo ""
|
||||
echo -e "${GREEN}All pods are ready!${NC}"
|
||||
echo ""
|
||||
echo "Access Grafana at: https://grafana.betelgeusebytes.io"
|
||||
echo "Default credentials: admin / admin"
|
||||
echo ""
|
||||
echo "To check status:"
|
||||
echo " kubectl get pods -n observability"
|
||||
echo ""
|
||||
echo "To view logs:"
|
||||
echo " kubectl logs -n observability -l app=grafana"
|
||||
echo " kubectl logs -n observability -l app=prometheus"
|
||||
echo " kubectl logs -n observability -l app=loki"
|
||||
echo " kubectl logs -n observability -l app=tempo"
|
||||
echo ""
|
||||
|
|
@ -0,0 +1,319 @@
|
|||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "=========================================================="
|
||||
echo "Removing Existing Monitoring Stack"
|
||||
echo "=========================================================="
|
||||
echo ""
|
||||
|
||||
RED='\033[0;31m'
|
||||
YELLOW='\033[1;33m'
|
||||
GREEN='\033[0;32m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
echo -e "${YELLOW}This script will remove common monitoring deployments including:${NC}"
|
||||
echo " - Prometheus (standalone or operator)"
|
||||
echo " - Grafana"
|
||||
echo " - Fluent Bit"
|
||||
echo " - Vector"
|
||||
echo " - Loki"
|
||||
echo " - Tempo"
|
||||
echo " - Node exporters"
|
||||
echo " - kube-state-metrics"
|
||||
echo " - Any monitoring/prometheus/grafana namespaces"
|
||||
echo ""
|
||||
echo -e "${RED}WARNING: This will delete all existing monitoring data!${NC}"
|
||||
echo ""
|
||||
read -p "Are you sure you want to continue? (yes/no): " confirm
|
||||
|
||||
if [ "$confirm" != "yes" ]; then
|
||||
echo "Cleanup cancelled."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 1: Checking for existing monitoring namespaces...${NC}"
|
||||
|
||||
# Common namespace names for monitoring
|
||||
NAMESPACES=("monitoring" "prometheus" "grafana" "loki" "tempo" "logging")
|
||||
|
||||
for ns in "${NAMESPACES[@]}"; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
echo -e "${GREEN}Found namespace: $ns${NC}"
|
||||
|
||||
# Show what's in the namespace
|
||||
echo " Resources in $ns:"
|
||||
kubectl get all -n "$ns" 2>/dev/null | head -20 || true
|
||||
echo ""
|
||||
|
||||
read -p " Delete namespace '$ns'? (yes/no): " delete_ns
|
||||
if [ "$delete_ns" = "yes" ]; then
|
||||
echo " Deleting namespace $ns..."
|
||||
kubectl delete namespace "$ns" --timeout=120s || {
|
||||
echo -e "${YELLOW} Warning: Namespace deletion timed out, forcing...${NC}"
|
||||
kubectl delete namespace "$ns" --grace-period=0 --force &
|
||||
}
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 2: Removing common monitoring Helm releases...${NC}"
|
||||
|
||||
# Check if helm is available
|
||||
if command -v helm &> /dev/null; then
|
||||
echo "Checking for Helm releases..."
|
||||
|
||||
# Common Helm release names
|
||||
RELEASES=("prometheus" "grafana" "loki" "tempo" "fluent-bit" "prometheus-operator" "kube-prometheus-stack")
|
||||
|
||||
for release in "${RELEASES[@]}"; do
|
||||
# Check all namespaces for the release
|
||||
if helm list -A | grep -q "$release"; then
|
||||
ns=$(helm list -A | grep "$release" | awk '{print $2}')
|
||||
echo -e "${GREEN}Found Helm release: $release in namespace $ns${NC}"
|
||||
read -p " Uninstall Helm release '$release'? (yes/no): " uninstall
|
||||
if [ "$uninstall" = "yes" ]; then
|
||||
echo " Uninstalling $release..."
|
||||
helm uninstall "$release" -n "$ns" || echo -e "${YELLOW} Warning: Failed to uninstall $release${NC}"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
else
|
||||
echo "Helm not found, skipping Helm releases check"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 3: Removing standalone monitoring components...${NC}"
|
||||
|
||||
# Remove common DaemonSets in kube-system or default
|
||||
echo "Checking for monitoring DaemonSets..."
|
||||
for ns in kube-system default; do
|
||||
if kubectl get daemonset -n "$ns" 2>/dev/null | grep -q "node-exporter\|fluent-bit\|fluentd\|vector"; then
|
||||
echo -e "${GREEN}Found monitoring DaemonSets in $ns${NC}"
|
||||
kubectl get daemonset -n "$ns" | grep -E "node-exporter|fluent-bit|fluentd|vector"
|
||||
read -p " Delete these DaemonSets? (yes/no): " delete_ds
|
||||
if [ "$delete_ds" = "yes" ]; then
|
||||
kubectl delete daemonset -n "$ns" -l app=node-exporter --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" -l app=fluent-bit --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" -l app=fluentd --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" -l app=vector --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" node-exporter --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" fluent-bit --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" fluentd --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" vector --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
# Remove common Deployments
|
||||
echo ""
|
||||
echo "Checking for monitoring Deployments..."
|
||||
for ns in kube-system default; do
|
||||
if kubectl get deployment -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|kube-state-metrics\|loki\|tempo"; then
|
||||
echo -e "${GREEN}Found monitoring Deployments in $ns${NC}"
|
||||
kubectl get deployment -n "$ns" | grep -E "prometheus|grafana|kube-state-metrics|loki|tempo"
|
||||
read -p " Delete these Deployments? (yes/no): " delete_deploy
|
||||
if [ "$delete_deploy" = "yes" ]; then
|
||||
kubectl delete deployment -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=kube-state-metrics --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=tempo --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" prometheus --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" grafana --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" kube-state-metrics --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
# Remove common StatefulSets
|
||||
echo ""
|
||||
echo "Checking for monitoring StatefulSets..."
|
||||
for ns in kube-system default; do
|
||||
if kubectl get statefulset -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo"; then
|
||||
echo -e "${GREEN}Found monitoring StatefulSets in $ns${NC}"
|
||||
kubectl get statefulset -n "$ns" | grep -E "prometheus|grafana|loki|tempo"
|
||||
read -p " Delete these StatefulSets? (yes/no): " delete_sts
|
||||
if [ "$delete_sts" = "yes" ]; then
|
||||
kubectl delete statefulset -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" -l app=tempo --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" prometheus --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" grafana --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" loki --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" tempo --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 4: Removing monitoring ConfigMaps...${NC}"
|
||||
|
||||
# Ask before removing ConfigMaps (they might contain important configs)
|
||||
echo "Checking for monitoring ConfigMaps..."
|
||||
for ns in kube-system default monitoring prometheus grafana; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
if kubectl get configmap -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo\|fluent"; then
|
||||
echo -e "${GREEN}Found monitoring ConfigMaps in $ns${NC}"
|
||||
kubectl get configmap -n "$ns" | grep -E "prometheus|grafana|loki|tempo|fluent"
|
||||
read -p " Delete these ConfigMaps? (yes/no): " delete_cm
|
||||
if [ "$delete_cm" = "yes" ]; then
|
||||
kubectl delete configmap -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete configmap -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete configmap -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete configmap -n "$ns" -l app=fluent-bit --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 5: Removing ClusterRoles and ClusterRoleBindings...${NC}"
|
||||
|
||||
# Remove monitoring-related RBAC
|
||||
echo "Checking for monitoring ClusterRoles..."
|
||||
if kubectl get clusterrole 2>/dev/null | grep -q "prometheus\|grafana\|kube-state-metrics\|fluent-bit\|node-exporter"; then
|
||||
echo -e "${GREEN}Found monitoring ClusterRoles${NC}"
|
||||
kubectl get clusterrole | grep -E "prometheus|grafana|kube-state-metrics|fluent-bit|node-exporter"
|
||||
read -p " Delete these ClusterRoles? (yes/no): " delete_cr
|
||||
if [ "$delete_cr" = "yes" ]; then
|
||||
kubectl delete clusterrole prometheus --ignore-not-found
|
||||
kubectl delete clusterrole grafana --ignore-not-found
|
||||
kubectl delete clusterrole kube-state-metrics --ignore-not-found
|
||||
kubectl delete clusterrole fluent-bit --ignore-not-found
|
||||
kubectl delete clusterrole node-exporter --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "Checking for monitoring ClusterRoleBindings..."
|
||||
if kubectl get clusterrolebinding 2>/dev/null | grep -q "prometheus\|grafana\|kube-state-metrics\|fluent-bit\|node-exporter"; then
|
||||
echo -e "${GREEN}Found monitoring ClusterRoleBindings${NC}"
|
||||
kubectl get clusterrolebinding | grep -E "prometheus|grafana|kube-state-metrics|fluent-bit|node-exporter"
|
||||
read -p " Delete these ClusterRoleBindings? (yes/no): " delete_crb
|
||||
if [ "$delete_crb" = "yes" ]; then
|
||||
kubectl delete clusterrolebinding prometheus --ignore-not-found
|
||||
kubectl delete clusterrolebinding grafana --ignore-not-found
|
||||
kubectl delete clusterrolebinding kube-state-metrics --ignore-not-found
|
||||
kubectl delete clusterrolebinding fluent-bit --ignore-not-found
|
||||
kubectl delete clusterrolebinding node-exporter --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 6: Removing PVCs and PVs...${NC}"
|
||||
|
||||
# Check for monitoring PVCs
|
||||
echo "Checking for monitoring PersistentVolumeClaims..."
|
||||
for ns in kube-system default monitoring prometheus grafana; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
if kubectl get pvc -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo"; then
|
||||
echo -e "${GREEN}Found monitoring PVCs in $ns${NC}"
|
||||
kubectl get pvc -n "$ns" | grep -E "prometheus|grafana|loki|tempo"
|
||||
echo -e "${RED} WARNING: Deleting PVCs will delete all stored data!${NC}"
|
||||
read -p " Delete these PVCs? (yes/no): " delete_pvc
|
||||
if [ "$delete_pvc" = "yes" ]; then
|
||||
kubectl delete pvc -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete pvc -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete pvc -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete pvc -n "$ns" -l app=tempo --ignore-not-found
|
||||
# Also try by name patterns
|
||||
kubectl get pvc -n "$ns" -o name | grep -E "prometheus|grafana|loki|tempo" | xargs -r kubectl delete -n "$ns" || true
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
# Check for monitoring PVs
|
||||
echo ""
|
||||
echo "Checking for monitoring PersistentVolumes..."
|
||||
if kubectl get pv 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo\|monitoring"; then
|
||||
echo -e "${GREEN}Found monitoring PVs${NC}"
|
||||
kubectl get pv | grep -E "prometheus|grafana|loki|tempo|monitoring"
|
||||
echo -e "${RED} WARNING: Deleting PVs may delete data on disk!${NC}"
|
||||
read -p " Delete these PVs? (yes/no): " delete_pv
|
||||
if [ "$delete_pv" = "yes" ]; then
|
||||
kubectl get pv -o name | grep -E "prometheus|grafana|loki|tempo|monitoring" | xargs -r kubectl delete || true
|
||||
fi
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 7: Checking for monitoring Ingresses...${NC}"
|
||||
|
||||
for ns in kube-system default monitoring prometheus grafana; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
if kubectl get ingress -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki"; then
|
||||
echo -e "${GREEN}Found monitoring Ingresses in $ns${NC}"
|
||||
kubectl get ingress -n "$ns" | grep -E "prometheus|grafana|loki"
|
||||
read -p " Delete these Ingresses? (yes/no): " delete_ing
|
||||
if [ "$delete_ing" = "yes" ]; then
|
||||
kubectl delete ingress -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete ingress -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete ingress -n "$ns" prometheus-ingress --ignore-not-found
|
||||
kubectl delete ingress -n "$ns" grafana-ingress --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 8: Checking for Prometheus Operator CRDs...${NC}"
|
||||
|
||||
# Check for Prometheus Operator CRDs
|
||||
if kubectl get crd 2>/dev/null | grep -q "monitoring.coreos.com"; then
|
||||
echo -e "${GREEN}Found Prometheus Operator CRDs${NC}"
|
||||
kubectl get crd | grep "monitoring.coreos.com"
|
||||
echo ""
|
||||
echo -e "${RED}WARNING: Deleting these CRDs will remove ALL Prometheus Operator resources cluster-wide!${NC}"
|
||||
read -p " Delete Prometheus Operator CRDs? (yes/no): " delete_crd
|
||||
if [ "$delete_crd" = "yes" ]; then
|
||||
kubectl delete crd prometheuses.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd prometheusrules.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd servicemonitors.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd podmonitors.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd alertmanagers.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd probes.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd thanosrulers.monitoring.coreos.com --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 9: Optional - Clean up data directories on nodes...${NC}"
|
||||
echo ""
|
||||
echo "You may have monitoring data stored on your nodes at:"
|
||||
echo " - /mnt/local-ssd/prometheus"
|
||||
echo " - /mnt/local-ssd/grafana"
|
||||
echo " - /mnt/local-ssd/loki"
|
||||
echo " - /mnt/local-ssd/tempo"
|
||||
echo " - /var/lib/prometheus"
|
||||
echo " - /var/lib/grafana"
|
||||
echo ""
|
||||
echo "To remove these, SSH to each node and run:"
|
||||
echo " sudo rm -rf /mnt/local-ssd/{prometheus,grafana,loki,tempo}"
|
||||
echo " sudo rm -rf /var/lib/{prometheus,grafana,loki,tempo}"
|
||||
echo ""
|
||||
read -p "Have you cleaned up the data directories? (yes to continue, no to skip): " cleanup_dirs
|
||||
|
||||
echo ""
|
||||
echo -e "${GREEN}=========================================================="
|
||||
echo "Existing Monitoring Stack Cleanup Complete!"
|
||||
echo "==========================================================${NC}"
|
||||
echo ""
|
||||
echo "Summary of actions taken:"
|
||||
echo " - Removed monitoring namespaces (if confirmed)"
|
||||
echo " - Uninstalled Helm releases (if found and confirmed)"
|
||||
echo " - Removed standalone monitoring components"
|
||||
echo " - Removed monitoring ConfigMaps"
|
||||
echo " - Removed RBAC resources"
|
||||
echo " - Removed PVCs and PVs (if confirmed)"
|
||||
echo " - Removed Ingresses"
|
||||
echo " - Removed Prometheus Operator CRDs (if confirmed)"
|
||||
echo ""
|
||||
echo -e "${YELLOW}Next Steps:${NC}"
|
||||
echo "1. Verify cleanup: kubectl get all -A | grep -E 'prometheus|grafana|loki|tempo|monitoring'"
|
||||
echo "2. Clean up node data directories (see above)"
|
||||
echo "3. Deploy new observability stack: ./deploy.sh"
|
||||
echo ""
|
||||
|
|
@ -0,0 +1,115 @@
|
|||
#!/bin/bash
|
||||
|
||||
GREEN='\033[0;32m'
|
||||
RED='\033[0;31m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
echo -e "${BLUE}=================================================="
|
||||
echo "Observability Stack Status Check"
|
||||
echo "==================================================${NC}"
|
||||
echo ""
|
||||
|
||||
# Check namespace
|
||||
echo -e "${YELLOW}Checking namespace...${NC}"
|
||||
if kubectl get namespace observability &> /dev/null; then
|
||||
echo -e "${GREEN}✓ Namespace 'observability' exists${NC}"
|
||||
else
|
||||
echo -e "${RED}✗ Namespace 'observability' not found${NC}"
|
||||
exit 1
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# Check PVs
|
||||
echo -e "${YELLOW}Checking PersistentVolumes...${NC}"
|
||||
pvs=$(kubectl get pv 2>/dev/null | grep -E "(prometheus|loki|tempo|grafana)-data-pv" | wc -l)
|
||||
if [ "$pvs" -eq 4 ]; then
|
||||
echo -e "${GREEN}✓ All 4 PersistentVolumes found${NC}"
|
||||
kubectl get pv | grep -E "(prometheus|loki|tempo|grafana)-data-pv"
|
||||
else
|
||||
echo -e "${RED}✗ Expected 4 PVs, found $pvs${NC}"
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# Check PVCs
|
||||
echo -e "${YELLOW}Checking PersistentVolumeClaims...${NC}"
|
||||
pvcs=$(kubectl get pvc -n observability 2>/dev/null | grep -v NAME | wc -l)
|
||||
if [ "$pvcs" -eq 4 ]; then
|
||||
echo -e "${GREEN}✓ All 4 PersistentVolumeClaims found${NC}"
|
||||
kubectl get pvc -n observability
|
||||
else
|
||||
echo -e "${RED}✗ Expected 4 PVCs, found $pvcs${NC}"
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# Check Pods
|
||||
echo -e "${YELLOW}Checking Pods...${NC}"
|
||||
kubectl get pods -n observability -o wide
|
||||
echo ""
|
||||
|
||||
# Count running pods
|
||||
total_pods=$(kubectl get pods -n observability --no-headers 2>/dev/null | wc -l)
|
||||
running_pods=$(kubectl get pods -n observability --field-selector=status.phase=Running --no-headers 2>/dev/null | wc -l)
|
||||
|
||||
if [ "$total_pods" -eq 0 ]; then
|
||||
echo -e "${RED}✗ No pods found in observability namespace${NC}"
|
||||
else
|
||||
if [ "$running_pods" -eq "$total_pods" ]; then
|
||||
echo -e "${GREEN}✓ All $total_pods pods are running${NC}"
|
||||
else
|
||||
echo -e "${YELLOW}⚠ $running_pods/$total_pods pods are running${NC}"
|
||||
fi
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# Check Services
|
||||
echo -e "${YELLOW}Checking Services...${NC}"
|
||||
kubectl get svc -n observability
|
||||
echo ""
|
||||
|
||||
# Check Ingress
|
||||
echo -e "${YELLOW}Checking Ingress...${NC}"
|
||||
if kubectl get ingress -n observability grafana-ingress &> /dev/null; then
|
||||
echo -e "${GREEN}✓ Grafana Ingress found${NC}"
|
||||
kubectl get ingress -n observability grafana-ingress
|
||||
else
|
||||
echo -e "${RED}✗ Grafana Ingress not found${NC}"
|
||||
fi
|
||||
echo ""
|
||||
|
||||
# Check ConfigMaps
|
||||
echo -e "${YELLOW}Checking ConfigMaps...${NC}"
|
||||
configmaps=$(kubectl get configmap -n observability 2>/dev/null | grep -v NAME | wc -l)
|
||||
echo "Found $configmaps ConfigMaps:"
|
||||
kubectl get configmap -n observability --no-headers | awk '{print " - " $1}'
|
||||
echo ""
|
||||
|
||||
# Test endpoints
|
||||
echo -e "${YELLOW}Testing service endpoints...${NC}"
|
||||
|
||||
check_endpoint() {
|
||||
local name=$1
|
||||
local url=$2
|
||||
|
||||
if kubectl run --rm -i "test-$RANDOM" --image=curlimages/curl --restart=Never -- \
curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url" 2>/dev/null | grep -q "200\|302\|401"; then
|
||||
echo -e "${GREEN}✓ $name is responding${NC}"
|
||||
else
|
||||
echo -e "${RED}✗ $name is not responding${NC}"
|
||||
fi
|
||||
}
|
||||
|
||||
check_endpoint "Prometheus" "http://prometheus.observability.svc.cluster.local:9090/-/healthy"
|
||||
check_endpoint "Loki" "http://loki.observability.svc.cluster.local:3100/ready"
|
||||
check_endpoint "Tempo" "http://tempo.observability.svc.cluster.local:3200/ready"
|
||||
check_endpoint "Grafana" "http://grafana.observability.svc.cluster.local:3000/api/health"
|
||||
|
||||
echo ""
|
||||
echo -e "${BLUE}=================================================="
|
||||
echo "Status Check Complete"
|
||||
echo "==================================================${NC}"
|
||||
echo ""
|
||||
echo "Access Grafana at: https://grafana.betelgeusebytes.io"
|
||||
echo "Default credentials: admin / admin"
|
||||
echo ""
|
||||
|
|
@ -0,0 +1,46 @@
|
|||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata: { name: fluent-bit, namespace: observability }
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRole
|
||||
metadata: { name: fluent-bit-read }
|
||||
rules:
|
||||
- apiGroups: [""]
|
||||
resources: ["pods", "namespaces"]
|
||||
verbs: ["get", "list", "watch"]
|
||||
---
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
kind: ClusterRoleBinding
|
||||
metadata: { name: fluent-bit-read }
|
||||
roleRef:
|
||||
apiGroup: rbac.authorization.k8s.io
|
||||
kind: ClusterRole
|
||||
name: fluent-bit-read
|
||||
subjects:
|
||||
- kind: ServiceAccount
|
||||
name: fluent-bit
|
||||
namespace: observability
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: DaemonSet
|
||||
metadata: { name: fluent-bit, namespace: observability }
|
||||
spec:
|
||||
selector: { matchLabels: { app: fluent-bit } }
|
||||
template:
|
||||
metadata: { labels: { app: fluent-bit } }
|
||||
spec:
|
||||
serviceAccountName: fluent-bit
|
||||
containers:
|
||||
- name: fluent-bit
|
||||
image: cr.fluentbit.io/fluent/fluent-bit:2.2.2
|
||||
volumeMounts:
|
||||
- { name: varlog, mountPath: /var/log }
|
||||
- { name: containers, mountPath: /var/lib/docker/containers, readOnly: true }
|
||||
env:
|
||||
- { name: FLUENT_ELASTICSEARCH_HOST, value: elasticsearch.elastic.svc.cluster.local }
|
||||
- { name: FLUENT_ELASTICSEARCH_PORT, value: "9200" }
|
||||
args: ["-i","tail","-p","path=/var/log/containers/*.log","-F","kubernetes","-o","es","-p","host=${FLUENT_ELASTICSEARCH_HOST}","-p","port=${FLUENT_ELASTICSEARCH_PORT}","-p","logstash_format=On","-p","logstash_prefix=k8s-logs"]
|
||||
volumes:
|
||||
- { name: varlog, hostPath: { path: /var/log } }
|
||||
- { name: containers, hostPath: { path: /var/lib/docker/containers, type: DirectoryOrCreate } }
|
||||
|
|
@ -0,0 +1,73 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: otel-collector, namespace: observability }
|
||||
spec:
|
||||
selector: { app: otel-collector }
|
||||
ports:
|
||||
- { name: otlp-http, port: 4318, targetPort: 4318 }
|
||||
- { name: otlp-grpc, port: 4317, targetPort: 4317 }
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: otel-collector, namespace: observability }
|
||||
spec:
|
||||
replicas: 2
|
||||
selector: { matchLabels: { app: otel-collector } }
|
||||
template:
|
||||
metadata: { labels: { app: otel-collector } }
|
||||
spec:
|
||||
nodeSelector: { node: hetzner-2 }
|
||||
containers:
|
||||
- name: otel-collector
|
||||
image: otel/opentelemetry-collector-contrib:0.102.0
|
||||
args: ["--config=/etc/otel/config.yaml"]
|
||||
ports:
|
||||
- { containerPort: 4318 }
|
||||
- { containerPort: 4317 }
|
||||
volumeMounts:
|
||||
- { name: cfg, mountPath: /etc/otel }
|
||||
volumes:
|
||||
- { name: cfg, configMap: { name: otel-config } }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata: { name: otel-config, namespace: observability }
|
||||
data:
|
||||
config.yaml: |
|
||||
receivers:
|
||||
otlp:
|
||||
protocols: { http: {}, grpc: {} }
|
||||
processors: { batch: {} }
|
||||
exporters:
|
||||
logging: {}
|
||||
elasticsearch:
|
||||
endpoints: ["http://elasticsearch.elastic.svc.cluster.local:9200"]
|
||||
logs_index: "k8s-logs"
|
||||
service:
|
||||
pipelines:
|
||||
logs: { receivers: [otlp], processors: [batch], exporters: [elasticsearch, logging] }
|
||||
traces: { receivers: [otlp], processors: [batch], exporters: [logging] }
|
||||
metrics: { receivers: [otlp], processors: [batch], exporters: [logging] }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: otlp
|
||||
namespace: observability
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["otlp.betelgeusebytes.io"], secretName: otlp-tls }]
|
||||
rules:
|
||||
- host: otlp.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /v1/traces
|
||||
pathType: Prefix
|
||||
backend: { service: { name: otel-collector, port: { number: 4318 } } }
|
||||
- path: /v1/metrics
|
||||
pathType: Prefix
|
||||
backend: { service: { name: otel-collector, port: { number: 4318 } } }
|
||||
- path: /v1/logs
|
||||
pathType: Prefix
|
||||
backend: { service: { name: otel-collector, port: { number: 4318 } } }
|
||||
|
|
@ -0,0 +1,217 @@
|
|||
# k8s/postgres/pg-init-sql-configmap.yaml
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: pg-init-sql
|
||||
namespace: db
|
||||
data:
|
||||
00_extensions.sql: |
|
||||
\connect gitea
|
||||
CREATE EXTENSION IF NOT EXISTS postgis;
|
||||
CREATE EXTENSION IF NOT EXISTS postgis_topology;
|
||||
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
|
||||
CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
||||
CREATE EXTENSION IF NOT EXISTS hstore;
|
||||
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
|
||||
CREATE EXTENSION IF NOT EXISTS citext;
|
||||
CREATE EXTENSION IF NOT EXISTS unaccent;
|
||||
CREATE EXTENSION IF NOT EXISTS pgcrypto;
|
||||
DO $$ BEGIN
|
||||
CREATE EXTENSION IF NOT EXISTS plpython3u;
|
||||
EXCEPTION WHEN undefined_file THEN
|
||||
RAISE NOTICE 'plpython3u not available in this image';
|
||||
END $$;
|
||||
01_tune.sql: |
|
||||
ALTER SYSTEM SET shared_buffers = '1GB';
|
||||
ALTER SYSTEM SET work_mem = '32MB';
|
||||
ALTER SYSTEM SET maintenance_work_mem = '512MB';
|
||||
ALTER SYSTEM SET max_connections = 200;
|
||||
SELECT pg_reload_conf();
|
||||
---
|
||||
# k8s/postgres/pg-conf.yaml
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: pg-conf
|
||||
namespace: db
|
||||
data:
|
||||
pg_hba.conf: |
|
||||
# Local connections
|
||||
local all all trust
|
||||
host all all 127.0.0.1/32 trust
|
||||
host all all ::1/128 trust
|
||||
# TLS-only access from ANY external IP (harden as needed)
|
||||
hostssl all all 0.0.0.0/0 md5
|
||||
hostssl all all ::/0 md5
|
||||
---
|
||||
# k8s/postgres/pg-secret.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: pg18-secret
|
||||
namespace: db
|
||||
type: Opaque
|
||||
stringData:
|
||||
POSTGRES_PASSWORD: "pa$$word"
|
||||
---
|
||||
# k8s/postgres/pg-certificate.yaml
|
||||
apiVersion: cert-manager.io/v1
|
||||
kind: Certificate
|
||||
metadata:
|
||||
name: pg-tls
|
||||
namespace: db
|
||||
spec:
|
||||
secretName: pg-tls
|
||||
dnsNames:
|
||||
- pg.betelgeusebytes.io
|
||||
issuerRef:
|
||||
kind: ClusterIssuer
|
||||
name: letsencrypt-prod
|
||||
---
|
||||
# k8s/postgres/postgres-svc.yaml
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: postgres
|
||||
namespace: db
|
||||
spec:
|
||||
selector:
|
||||
app: postgres
|
||||
ports:
|
||||
- name: postgres
|
||||
port: 5432
|
||||
targetPort: 5432
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: postgres-hl
|
||||
namespace: db
|
||||
spec:
|
||||
clusterIP: None
|
||||
selector:
|
||||
app: postgres
|
||||
ports:
|
||||
- name: postgres
|
||||
port: 5432
|
||||
targetPort: 5432
|
||||
---
|
||||
# k8s/postgres/postgres.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: postgres
|
||||
namespace: db
|
||||
spec:
|
||||
serviceName: postgres-hl
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: postgres
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: postgres
|
||||
spec:
|
||||
securityContext:
|
||||
runAsUser: 999
|
||||
runAsGroup: 999
|
||||
fsGroup: 999
|
||||
fsGroupChangePolicy: "Always"
|
||||
initContainers:
|
||||
- name: install-certs
|
||||
image: busybox:1.36
|
||||
command:
|
||||
- sh
|
||||
- -c
|
||||
- |
|
||||
cp /in/tls.crt /out/server.crt
|
||||
cp /in/tls.key /out/server.key
|
||||
chown 999:999 /out/* || true
|
||||
chmod 600 /out/server.key
|
||||
securityContext:
|
||||
runAsUser: 0
|
||||
volumeMounts:
|
||||
- { name: pg-tls, mountPath: /in, readOnly: true }
|
||||
- { name: pg-certs, mountPath: /out }
|
||||
containers:
|
||||
- name: postgres
|
||||
image: axxs/postgres:18-postgis-vector
|
||||
imagePullPolicy: IfNotPresent
|
||||
args:
|
||||
- -c
|
||||
- ssl=on
|
||||
- -c
|
||||
- ssl_cert_file=/certs/server.crt
|
||||
- -c
|
||||
- ssl_key_file=/certs/server.key
|
||||
- -c
|
||||
- hba_file=/etc/postgresql-custom/pg_hba.conf
|
||||
env:
|
||||
- name: POSTGRES_USER
|
||||
value: "app"
|
||||
- name: POSTGRES_DB
|
||||
value: "gitea"
|
||||
- name: POSTGRES_PASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: pg18-secret
|
||||
key: POSTGRES_PASSWORD
|
||||
- name: TZ
|
||||
value: "Europe/Paris"
|
||||
ports:
|
||||
- name: postgres
|
||||
containerPort: 5432
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /var/lib/postgresql } # PG18 expects parent, creates /var/lib/postgresql/18/main
|
||||
- { name: init, mountPath: /docker-entrypoint-initdb.d, readOnly: true }
|
||||
- { name: pg-certs, mountPath: /certs }
|
||||
- { name: pg-conf, mountPath: /etc/postgresql-custom }
|
||||
readinessProbe:
|
||||
exec: { command: ["sh","-c","pg_isready -U \"$POSTGRES_USER\" -d \"$POSTGRES_DB\" -h 127.0.0.1"] }
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 6
|
||||
livenessProbe:
|
||||
exec: { command: ["sh","-c","pg_isready -U \"$POSTGRES_USER\" -d \"$POSTGRES_DB\" -h 127.0.0.1"] }
|
||||
initialDelaySeconds: 20
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 6
|
||||
resources:
|
||||
requests: { cpu: "250m", memory: "512Mi" }
|
||||
limits: { cpu: "1", memory: "2Gi" }
|
||||
volumes:
|
||||
- name: init
|
||||
configMap:
|
||||
name: pg-init-sql
|
||||
defaultMode: 0444
|
||||
- name: pg-tls
|
||||
secret:
|
||||
secretName: pg-tls
|
||||
- name: pg-certs
|
||||
emptyDir: {}
|
||||
- name: pg-conf
|
||||
configMap:
|
||||
name: pg-conf
|
||||
defaultMode: 0444
|
||||
volumeClaimTemplates:
|
||||
- metadata:
|
||||
name: data
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources:
|
||||
requests:
|
||||
storage: 80Gi
|
||||
|
||||
|
||||
# kubectl -n ingress-nginx create configmap tcp-services \
|
||||
# --from-literal="5432=db/postgres:5432" \
|
||||
# -o yaml --dry-run=client | kubectl apply -f -
|
||||
# kubectl -n ingress-nginx patch deploy ingress-nginx-controller \
|
||||
# --type='json' -p='[
|
||||
# {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--tcp-services-configmap=$(POD_NAMESPACE)/tcp-services"}
|
||||
# ]'
|
||||
# # controller must listen on hostPort:5432 (we already patched earlier)
|
||||
|
|
@ -0,0 +1,275 @@
|
|||
---
|
||||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: db
|
||||
---
|
||||
# Password secret (replace with your own or generate one)
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: pg18-secret
|
||||
namespace: db
|
||||
type: Opaque
|
||||
stringData:
|
||||
POSTGRES_PASSWORD: "pa$$word"
|
||||
---
|
||||
# Init SQL: keeps your original name and keeps enabling PostGIS + vector
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: pg-init-sql
|
||||
namespace: db
|
||||
data:
|
||||
00_extensions.sql: |
|
||||
-- enable common extensions in the default DB and template1 so future DBs inherit them
|
||||
\connect gitea
|
||||
CREATE EXTENSION IF NOT EXISTS postgis;
|
||||
CREATE EXTENSION IF NOT EXISTS vector;
|
||||
CREATE COLLATION IF NOT EXISTS arabic (provider = icu, locale = 'ar', deterministic = false);
|
||||
CREATE EXTENSION IF NOT EXISTS tablefunc;
|
||||
-- postpone pg_stat_statements CREATE to postStart (needs preload)
|
||||
CREATE EXTENSION IF NOT EXISTS postgis_topology;
|
||||
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
|
||||
CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
||||
CREATE EXTENSION IF NOT EXISTS hstore;
|
||||
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
|
||||
CREATE EXTENSION IF NOT EXISTS citext;
|
||||
CREATE EXTENSION IF NOT EXISTS unaccent;
|
||||
CREATE EXTENSION IF NOT EXISTS pgcrypto;
|
||||
|
||||
-- PL/Python (available in your image)
|
||||
DO $$ BEGIN
|
||||
CREATE EXTENSION IF NOT EXISTS plpython3u;
|
||||
EXCEPTION WHEN undefined_file THEN
|
||||
RAISE NOTICE 'plpython3u not available in this image';
|
||||
END $$;
|
||||
|
||||
-- Also on template1 for new DBs (heavier, but intentional)
|
||||
\connect template1
|
||||
CREATE EXTENSION IF NOT EXISTS postgis;
|
||||
CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
||||
CREATE EXTENSION IF NOT EXISTS hstore;
|
||||
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
|
||||
CREATE EXTENSION IF NOT EXISTS citext;
|
||||
CREATE EXTENSION IF NOT EXISTS unaccent;
|
||||
CREATE EXTENSION IF NOT EXISTS pgcrypto;
|
||||
|
||||
-- Arabic-friendly ICU collation, non-deterministic for case/diacritics
|
||||
DO $$
|
||||
BEGIN
|
||||
PERFORM 1 FROM pg_collation WHERE collname='arabic';
|
||||
IF NOT FOUND THEN
|
||||
CREATE COLLATION arabic (provider = icu, locale = 'ar', deterministic = false);
|
||||
END IF;
|
||||
END$$;
|
||||
|
||||
01_tune.sql: |
|
||||
-- Enable pg_stat_statements on the next server start.
-- Note: ALTER SYSTEM cannot be executed from inside a DO block, and the nested $$
-- quoting used previously would terminate the block early, so set it directly.
-- This overwrites any existing shared_preload_libraries value.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
|
||||
|
||||
-- Optional tuning (adjust to your limits)
|
||||
ALTER SYSTEM SET shared_buffers = '1GB';
|
||||
ALTER SYSTEM SET work_mem = '32MB';
|
||||
ALTER SYSTEM SET maintenance_work_mem = '512MB';
|
||||
ALTER SYSTEM SET max_connections = 200;
|
||||
|
||||
-- Reload applies some settings immediately; others need restart (OK after init completes)
|
||||
SELECT pg_reload_conf();
|
||||
ALTER SYSTEM SET pg_stat_statements.max = 10000;
|
||||
ALTER SYSTEM SET pg_stat_statements.track = 'all';
|
||||
ALTER SYSTEM SET pg_stat_statements.save = on;
|
||||
pg_hba.conf: |
|
||||
# Allow loopback
|
||||
local all all trust
|
||||
host all all 127.0.0.1/32 trust
|
||||
host all all ::1/128 trust
|
||||
# Allow TLS connections from your IP(s) only
|
||||
hostssl all all YOUR_PUBLIC_IP/32 md5
|
||||
# (Optional) Add more CIDRs or a private network range here:
|
||||
# hostssl all all 10.0.0.0/8 md5
|
||||
---
|
||||
# Headless service required by StatefulSet for stable network IDs
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: postgres-hl
|
||||
namespace: db
|
||||
spec:
|
||||
clusterIP: None
|
||||
selector:
|
||||
app: postgres
|
||||
ports:
|
||||
- name: postgres
|
||||
port: 5432
|
||||
targetPort: 5432
|
||||
---
|
||||
# Regular ClusterIP service for clients (keeps your original name)
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: postgres
|
||||
namespace: db
|
||||
spec:
|
||||
selector:
|
||||
app: postgres
|
||||
ports:
|
||||
- name: postgres
|
||||
port: 5432
|
||||
targetPort: 5432
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: StatefulSet
|
||||
metadata:
|
||||
name: postgres
|
||||
namespace: db
|
||||
spec:
|
||||
serviceName: postgres-hl
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: postgres
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: postgres
|
||||
spec:
|
||||
securityContext:
|
||||
runAsUser: 999
|
||||
runAsGroup: 999
|
||||
fsGroup: 999
|
||||
fsGroupChangePolicy: "Always"
|
||||
initContainers:
|
||||
# Copy cert-manager certs to a writable path with correct perms for Postgres
|
||||
- name: install-certs
|
||||
image: busybox:1.36
|
||||
command:
|
||||
- sh
|
||||
- -c
|
||||
- |
|
||||
cp /in/tls.crt /out/server.crt
|
||||
cp /in/tls.key /out/server.key
|
||||
cp /in/ca.crt /out/ca.crt || true
|
||||
chown 999:999 /out/* || true
|
||||
chmod 600 /out/server.key
|
||||
securityContext:
|
||||
runAsUser: 0
|
||||
volumeMounts:
|
||||
- { name: pg-tls, mountPath: /in, readOnly: true }
|
||||
- { name: pg-certs, mountPath: /out }
|
||||
containers:
|
||||
- name: postgres
|
||||
image: axxs/postgres:18-postgis-vector
|
||||
imagePullPolicy: IfNotPresent
|
||||
args:
|
||||
- -c
|
||||
- ssl=on
|
||||
- -c
|
||||
- ssl_cert_file=/certs/server.crt
|
||||
- -c
|
||||
- ssl_key_file=/certs/server.key
|
||||
- -c
|
||||
- ssl_ca_file=/certs/ca.crt
|
||||
- -c
|
||||
- hba_file=/etc/postgresql-custom/pg_hba.conf
|
||||
lifecycle:
|
||||
postStart:
|
||||
exec:
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- |
|
||||
set -e
|
||||
# Wait until server accepts connections
|
||||
for i in $(seq 1 30); do
|
||||
pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" && break
|
||||
sleep 1
|
||||
done
|
||||
psql -v ON_ERROR_STOP=1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
|
||||
env:
|
||||
- name: POSTGRES_USER
|
||||
value: "app"
|
||||
- name: POSTGRES_DB
|
||||
value: "gitea" # matches your \connect gitea
|
||||
- name: POSTGRES_PASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: pg18-secret
|
||||
key: POSTGRES_PASSWORD
|
||||
- name: TZ
|
||||
value: "Europe/Paris"
|
||||
ports:
|
||||
- name: postgres
|
||||
containerPort: 5432
|
||||
volumeMounts:
|
||||
# ✅ PG 18 requires this parent path; it will create /var/lib/postgresql/18/main
|
||||
- name: data
|
||||
mountPath: /var/lib/postgresql
|
||||
# your init scripts ConfigMap
|
||||
- name: init
|
||||
mountPath: /docker-entrypoint-initdb.d
|
||||
readOnly: true
|
||||
- name: pg-certs
|
||||
mountPath: /certs
|
||||
# pg_hba.conf
|
||||
- name: pg-conf
|
||||
mountPath: /etc/postgresql-custom
|
||||
readinessProbe:
|
||||
exec:
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB" -h 127.0.0.1
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 6
|
||||
livenessProbe:
|
||||
exec:
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB" -h 127.0.0.1
|
||||
initialDelaySeconds: 20
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 6
|
||||
resources:
|
||||
requests:
|
||||
cpu: "250m"
|
||||
memory: "512Mi"
|
||||
limits:
|
||||
cpu: "1"
|
||||
memory: "2Gi"
|
||||
volumes:
|
||||
- name: init
|
||||
configMap:
|
||||
name: pg-init-sql
|
||||
defaultMode: 0444
|
||||
- name: pg-tls
|
||||
secret:
|
||||
secretName: pg-tls
|
||||
- name: pg-certs
|
||||
emptyDir: {}
|
||||
- name: pg-conf
|
||||
configMap:
|
||||
name: pg-conf
|
||||
defaultMode: 0444
|
||||
volumeClaimTemplates:
|
||||
- metadata:
|
||||
name: data
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
resources:
|
||||
requests:
|
||||
storage: 10Gi
|
||||
# storageClassName: <your-storageclass> # optionally pin this
|
||||
|
|
@@ -0,0 +1,122 @@
apiVersion: v1
kind: Service
metadata: { name: postgres, namespace: db }
spec:
  ports: [{ port: 5432, targetPort: 5432 }]
  selector: { app: postgres }
---
apiVersion: v1
kind: ConfigMap
metadata: { name: pg-init-sql, namespace: db }
data:
  00_extensions.sql: |
    -- enable common extensions in the default DB and template1 so future DBs inherit them
    \connect gitea
    CREATE EXTENSION IF NOT EXISTS postgis;
    -- pgvector (optional; only succeeds if the image ships the extension)
    DO $$ BEGIN
      CREATE EXTENSION IF NOT EXISTS vector;
    EXCEPTION WHEN undefined_file THEN
      RAISE NOTICE 'pgvector not available in this image';
    END $$;
    CREATE COLLATION IF NOT EXISTS arabic (provider = icu, locale = 'ar', deterministic = false);
    CREATE EXTENSION IF NOT EXISTS tablefunc;
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    CREATE EXTENSION IF NOT EXISTS postgis_topology;
    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE EXTENSION IF NOT EXISTS hstore;
    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
    CREATE EXTENSION IF NOT EXISTS citext;
    CREATE EXTENSION IF NOT EXISTS unaccent;
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    -- PL/Python (optional; requires an image with plpython3u; the postgis image has it)
    DO $$ BEGIN
      CREATE EXTENSION IF NOT EXISTS plpython3u;
    EXCEPTION WHEN undefined_file THEN
      RAISE NOTICE 'plpython3u not available in this image';
    END $$;

    -- Also on template1 for new DBs:
    \connect template1
    CREATE EXTENSION IF NOT EXISTS postgis;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE EXTENSION IF NOT EXISTS hstore;
    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
    CREATE EXTENSION IF NOT EXISTS citext;
    CREATE EXTENSION IF NOT EXISTS unaccent;
    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    -- Arabic-friendly ICU collation (PostgreSQL >= 13)
    -- A non-deterministic collation helps with proper case/diacritics comparisons
    DO $$
    BEGIN
      PERFORM 1 FROM pg_collation WHERE collname='arabic';
      IF NOT FOUND THEN
        CREATE COLLATION arabic (provider = icu, locale = 'ar', deterministic = false);
      END IF;
    END$$;

    -- Example: ensure the gitea DB uses UTF8; Arabic text search often needs unaccent + custom dictionaries.
    -- You can create additional DBs with: CREATE DATABASE mydb TEMPLATE template1 ENCODING 'UTF8';

  01_tune.sql: |
    -- small safe defaults; adjust later
    ALTER SYSTEM SET shared_buffers = '1GB';
    ALTER SYSTEM SET work_mem = '32MB';
    ALTER SYSTEM SET maintenance_work_mem = '512MB';
    ALTER SYSTEM SET max_connections = 200;
    SELECT pg_reload_conf();
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: postgres, namespace: db }
spec:
  serviceName: postgres
  replicas: 1
  selector: { matchLabels: { app: postgres } }
  template:
    metadata: { labels: { app: postgres } }
    spec:
      nodeSelector:
        node: hetzner-2
      securityContext:
        fsGroup: 999 # Debian postgres user/group in the postgis image
        fsGroupChangePolicy: OnRootMismatch
      initContainers:
        - name: fix-perms
          image: busybox:1.36
          command: ["sh", "-c", "chown -R 999:999 /var/lib/postgresql/data || true"]
          securityContext: { runAsUser: 0 }
          volumeMounts: [{ name: data, mountPath: /var/lib/postgresql/data }]
      containers:
        - name: postgres
          # PostGIS-enabled Postgres 16; the init SQL and comments above assume the postgis image
          # (the plain postgres image does not ship postgis or plpython3u)
          image: postgis/postgis:16-3.4
          env:
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: { name: postgres-auth, key: POSTGRES_PASSWORD } }
            - { name: POSTGRES_USER, value: gitea }
            - { name: POSTGRES_DB, value: gitea }
            - name: POSTGRES_INITDB_ARGS
              value: "--encoding=UTF8 --locale=C.UTF-8"
          ports: [{ containerPort: 5432 }]
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }
            - { name: init, mountPath: /docker-entrypoint-initdb.d }
      # Mount the init scripts
      volumes:
        - name: init
          configMap:
            name: pg-init-sql
            defaultMode: 0444
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-ssd-hetzner
        resources: { requests: { storage: 80Gi } }
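# Verification sketch (not part of the original manifest): once postgres-0 is Running,
# the extensions created by the init scripts can be listed with, for example:
#   kubectl -n db exec -it postgres-0 -- psql -U gitea -d gitea -c '\dx'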
@@ -0,0 +1,7 @@
apiVersion: v1
kind: Secret
metadata: { name: postgres-auth, namespace: db }
type: Opaque
stringData:
  POSTGRES_PASSWORD: "PG-ADM1N"
  GITEA_DB_PASSWORD: "G1TEA"
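# Alternative sketch (not part of the original manifest): the same Secret can be created
# out of band instead of committing plaintext values, e.g.:
#   kubectl -n db create secret generic postgres-auth \
#     --from-literal=POSTGRES_PASSWORD='...' \
#     --from-literal=GITEA_DB_PASSWORD='...'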
@@ -0,0 +1,13 @@
apiVersion: v1
kind: ConfigMap
metadata: { name: prometheus-config, namespace: monitoring }
data:
  prometheus.yml: |
    global: { scrape_interval: 15s }
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs: [ { role: pod } ]
        relabel_configs:
          - action: keep
            source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            regex: 'true'
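# Example (illustrative, not part of the original file): a pod opts in to scraping via the
# annotation matched by the relabel rule above, i.e. in its pod template:
#   metadata:
#     annotations:
#       prometheus.io/scrape: "true"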
@@ -0,0 +1,55 @@
apiVersion: v1
kind: Service
metadata: { name: prometheus, namespace: monitoring }
spec:
  ports: [{ port: 9090, targetPort: 9090 }]
  selector: { app: prometheus }
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: prometheus, namespace: monitoring }
spec:
  serviceName: prometheus
  replicas: 1
  selector: { matchLabels: { app: prometheus } }
  template:
    metadata: { labels: { app: prometheus } }
    spec:
      nodeSelector: { node: hetzner-2 }
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args: ["--config.file=/etc/prometheus/prometheus.yml", "--storage.tsdb.path=/prometheus"]
          ports: [{ containerPort: 9090 }]
          volumeMounts:
            - { name: data, mountPath: /prometheus }
            - { name: config, mountPath: /etc/prometheus }
      volumes:
        - { name: config, configMap: { name: prometheus-config } }
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-ssd-hetzner
        resources: { requests: { storage: 50Gi } }
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth-prometheus
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  ingressClassName: nginx
  tls: [{ hosts: ["prometheus.betelgeusebytes.io"], secretName: prometheus-tls }]
  rules:
    - host: prometheus.betelgeusebytes.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: prometheus, port: { number: 9090 } } }
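# Note (sketch, not part of this commit): kubernetes_sd_configs with role: pod also needs RBAC
# that lets the Prometheus pod get/list/watch pods, e.g. a ClusterRole bound to the
# ServiceAccount the StatefulSet runs under:
#   kubectl create clusterrole prometheus --verb=get,list,watch --resource=pods
#   kubectl create clusterrolebinding prometheus --clusterrole=prometheus \
#     --serviceaccount=monitoring:default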
@@ -0,0 +1,21 @@
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-redis
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd-hetzner
  local:
    path: /mnt/local-ssd/redis
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - hetzner-2
@@ -0,0 +1,40 @@
apiVersion: v1
kind: Service
metadata: { name: redis, namespace: db }
spec:
  ports: [{ port: 6379, targetPort: 6379 }]
  selector: { app: redis }
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: redis, namespace: db }
spec:
  serviceName: redis
  replicas: 1
  selector: { matchLabels: { app: redis } }
  template:
    metadata: { labels: { app: redis } }
    spec:
      nodeSelector: { node: hetzner-2 }
      containers:
        - name: redis
          image: redis:7
          args: ["--requirepass", "$(REDIS_PASSWORD)"]
          env:
            - name: REDIS_PASSWORD
              valueFrom: { secretKeyRef: { name: redis-auth, key: REDIS_PASSWORD } }
          ports: [{ containerPort: 6379 }]
          volumeMounts:
            - { name: data, mountPath: /data }
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-ssd-hetzner
        resources: { requests: { storage: 10Gi } }
---
apiVersion: v1
kind: Secret
metadata: { name: redis-auth, namespace: db }
type: Opaque
stringData: { REDIS_PASSWORD: "RED1S" }
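# Quick connectivity check (sketch): the password env var is already set inside the pod:
#   kubectl -n db exec -it redis-0 -- sh -c 'redis-cli -a "$REDIS_PASSWORD" ping'   # expect PONG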
@ -0,0 +1,319 @@
|
|||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
echo "=========================================================="
|
||||
echo "Removing Existing Monitoring Stack"
|
||||
echo "=========================================================="
|
||||
echo ""
|
||||
|
||||
RED='\033[0;31m'
|
||||
YELLOW='\033[1;33m'
|
||||
GREEN='\033[0;32m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
echo -e "${YELLOW}This script will remove common monitoring deployments including:${NC}"
|
||||
echo " - Prometheus (standalone or operator)"
|
||||
echo " - Grafana"
|
||||
echo " - Fluent Bit"
|
||||
echo " - Vector"
|
||||
echo " - Loki"
|
||||
echo " - Tempo"
|
||||
echo " - Node exporters"
|
||||
echo " - kube-state-metrics"
|
||||
echo " - Any monitoring/prometheus/grafana namespaces"
|
||||
echo ""
|
||||
echo -e "${RED}WARNING: This will delete all existing monitoring data!${NC}"
|
||||
echo ""
|
||||
read -p "Are you sure you want to continue? (yes/no): " confirm
|
||||
|
||||
if [ "$confirm" != "yes" ]; then
|
||||
echo "Cleanup cancelled."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 1: Checking for existing monitoring namespaces...${NC}"
|
||||
|
||||
# Common namespace names for monitoring
|
||||
NAMESPACES=("monitoring" "prometheus" "grafana" "loki" "tempo" "logging")
|
||||
|
||||
for ns in "${NAMESPACES[@]}"; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
echo -e "${GREEN}Found namespace: $ns${NC}"
|
||||
|
||||
# Show what's in the namespace
|
||||
echo " Resources in $ns:"
|
||||
kubectl get all -n "$ns" 2>/dev/null | head -20 || true
|
||||
echo ""
|
||||
|
||||
read -p " Delete namespace '$ns'? (yes/no): " delete_ns
|
||||
if [ "$delete_ns" = "yes" ]; then
|
||||
echo " Deleting namespace $ns..."
|
||||
kubectl delete namespace "$ns" --timeout=120s || {
|
||||
echo -e "${YELLOW} Warning: Namespace deletion timed out, forcing...${NC}"
|
||||
kubectl delete namespace "$ns" --grace-period=0 --force &
|
||||
}
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 2: Removing common monitoring Helm releases...${NC}"
|
||||
|
||||
# Check if helm is available
|
||||
if command -v helm &> /dev/null; then
|
||||
echo "Checking for Helm releases..."
|
||||
|
||||
# Common Helm release names
|
||||
RELEASES=("prometheus" "grafana" "loki" "tempo" "fluent-bit" "prometheus-operator" "kube-prometheus-stack")
|
||||
|
||||
for release in "${RELEASES[@]}"; do
|
||||
# Check all namespaces for the release
|
||||
if helm list -A | grep -q "$release"; then
|
||||
ns=$(helm list -A | grep "$release" | awk '{print $2}')
|
||||
echo -e "${GREEN}Found Helm release: $release in namespace $ns${NC}"
|
||||
read -p " Uninstall Helm release '$release'? (yes/no): " uninstall
|
||||
if [ "$uninstall" = "yes" ]; then
|
||||
echo " Uninstalling $release..."
|
||||
helm uninstall "$release" -n "$ns" || echo -e "${YELLOW} Warning: Failed to uninstall $release${NC}"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
else
|
||||
echo "Helm not found, skipping Helm releases check"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 3: Removing standalone monitoring components...${NC}"
|
||||
|
||||
# Remove common DaemonSets in kube-system or default
|
||||
echo "Checking for monitoring DaemonSets..."
|
||||
for ns in kube-system default; do
|
||||
if kubectl get daemonset -n "$ns" 2>/dev/null | grep -q "node-exporter\|fluent-bit\|fluentd\|vector"; then
|
||||
echo -e "${GREEN}Found monitoring DaemonSets in $ns${NC}"
|
||||
kubectl get daemonset -n "$ns" | grep -E "node-exporter|fluent-bit|fluentd|vector"
|
||||
read -p " Delete these DaemonSets? (yes/no): " delete_ds
|
||||
if [ "$delete_ds" = "yes" ]; then
|
||||
kubectl delete daemonset -n "$ns" -l app=node-exporter --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" -l app=fluent-bit --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" -l app=fluentd --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" -l app=vector --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" node-exporter --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" fluent-bit --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" fluentd --ignore-not-found
|
||||
kubectl delete daemonset -n "$ns" vector --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
# Remove common Deployments
|
||||
echo ""
|
||||
echo "Checking for monitoring Deployments..."
|
||||
for ns in kube-system default; do
|
||||
if kubectl get deployment -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|kube-state-metrics\|loki\|tempo"; then
|
||||
echo -e "${GREEN}Found monitoring Deployments in $ns${NC}"
|
||||
kubectl get deployment -n "$ns" | grep -E "prometheus|grafana|kube-state-metrics|loki|tempo"
|
||||
read -p " Delete these Deployments? (yes/no): " delete_deploy
|
||||
if [ "$delete_deploy" = "yes" ]; then
|
||||
kubectl delete deployment -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=kube-state-metrics --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" -l app=tempo --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" prometheus --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" grafana --ignore-not-found
|
||||
kubectl delete deployment -n "$ns" kube-state-metrics --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
# Remove common StatefulSets
|
||||
echo ""
|
||||
echo "Checking for monitoring StatefulSets..."
|
||||
for ns in kube-system default; do
|
||||
if kubectl get statefulset -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo"; then
|
||||
echo -e "${GREEN}Found monitoring StatefulSets in $ns${NC}"
|
||||
kubectl get statefulset -n "$ns" | grep -E "prometheus|grafana|loki|tempo"
|
||||
read -p " Delete these StatefulSets? (yes/no): " delete_sts
|
||||
if [ "$delete_sts" = "yes" ]; then
|
||||
kubectl delete statefulset -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" -l app=tempo --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" prometheus --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" grafana --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" loki --ignore-not-found
|
||||
kubectl delete statefulset -n "$ns" tempo --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 4: Removing monitoring ConfigMaps...${NC}"
|
||||
|
||||
# Ask before removing ConfigMaps (they might contain important configs)
|
||||
echo "Checking for monitoring ConfigMaps..."
|
||||
for ns in kube-system default monitoring prometheus grafana; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
if kubectl get configmap -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo\|fluent"; then
|
||||
echo -e "${GREEN}Found monitoring ConfigMaps in $ns${NC}"
|
||||
kubectl get configmap -n "$ns" | grep -E "prometheus|grafana|loki|tempo|fluent"
|
||||
read -p " Delete these ConfigMaps? (yes/no): " delete_cm
|
||||
if [ "$delete_cm" = "yes" ]; then
|
||||
kubectl delete configmap -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete configmap -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete configmap -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete configmap -n "$ns" -l app=fluent-bit --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 5: Removing ClusterRoles and ClusterRoleBindings...${NC}"
|
||||
|
||||
# Remove monitoring-related RBAC
|
||||
echo "Checking for monitoring ClusterRoles..."
|
||||
if kubectl get clusterrole 2>/dev/null | grep -q "prometheus\|grafana\|kube-state-metrics\|fluent-bit\|node-exporter"; then
|
||||
echo -e "${GREEN}Found monitoring ClusterRoles${NC}"
|
||||
kubectl get clusterrole | grep -E "prometheus|grafana|kube-state-metrics|fluent-bit|node-exporter"
|
||||
read -p " Delete these ClusterRoles? (yes/no): " delete_cr
|
||||
if [ "$delete_cr" = "yes" ]; then
|
||||
kubectl delete clusterrole prometheus --ignore-not-found
|
||||
kubectl delete clusterrole grafana --ignore-not-found
|
||||
kubectl delete clusterrole kube-state-metrics --ignore-not-found
|
||||
kubectl delete clusterrole fluent-bit --ignore-not-found
|
||||
kubectl delete clusterrole node-exporter --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "Checking for monitoring ClusterRoleBindings..."
|
||||
if kubectl get clusterrolebinding 2>/dev/null | grep -q "prometheus\|grafana\|kube-state-metrics\|fluent-bit\|node-exporter"; then
|
||||
echo -e "${GREEN}Found monitoring ClusterRoleBindings${NC}"
|
||||
kubectl get clusterrolebinding | grep -E "prometheus|grafana|kube-state-metrics|fluent-bit|node-exporter"
|
||||
read -p " Delete these ClusterRoleBindings? (yes/no): " delete_crb
|
||||
if [ "$delete_crb" = "yes" ]; then
|
||||
kubectl delete clusterrolebinding prometheus --ignore-not-found
|
||||
kubectl delete clusterrolebinding grafana --ignore-not-found
|
||||
kubectl delete clusterrolebinding kube-state-metrics --ignore-not-found
|
||||
kubectl delete clusterrolebinding fluent-bit --ignore-not-found
|
||||
kubectl delete clusterrolebinding node-exporter --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 6: Removing PVCs and PVs...${NC}"
|
||||
|
||||
# Check for monitoring PVCs
|
||||
echo "Checking for monitoring PersistentVolumeClaims..."
|
||||
for ns in kube-system default monitoring prometheus grafana; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
if kubectl get pvc -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo"; then
|
||||
echo -e "${GREEN}Found monitoring PVCs in $ns${NC}"
|
||||
kubectl get pvc -n "$ns" | grep -E "prometheus|grafana|loki|tempo"
|
||||
echo -e "${RED} WARNING: Deleting PVCs will delete all stored data!${NC}"
|
||||
read -p " Delete these PVCs? (yes/no): " delete_pvc
|
||||
if [ "$delete_pvc" = "yes" ]; then
|
||||
kubectl delete pvc -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete pvc -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete pvc -n "$ns" -l app=loki --ignore-not-found
|
||||
kubectl delete pvc -n "$ns" -l app=tempo --ignore-not-found
|
||||
# Also try by name patterns
|
||||
kubectl get pvc -n "$ns" -o name | grep -E "prometheus|grafana|loki|tempo" | xargs -r kubectl delete -n "$ns" || true
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
# Check for monitoring PVs
|
||||
echo ""
|
||||
echo "Checking for monitoring PersistentVolumes..."
|
||||
if kubectl get pv 2>/dev/null | grep -q "prometheus\|grafana\|loki\|tempo\|monitoring"; then
|
||||
echo -e "${GREEN}Found monitoring PVs${NC}"
|
||||
kubectl get pv | grep -E "prometheus|grafana|loki|tempo|monitoring"
|
||||
echo -e "${RED} WARNING: Deleting PVs may delete data on disk!${NC}"
|
||||
read -p " Delete these PVs? (yes/no): " delete_pv
|
||||
if [ "$delete_pv" = "yes" ]; then
|
||||
kubectl get pv -o name | grep -E "prometheus|grafana|loki|tempo|monitoring" | xargs -r kubectl delete || true
|
||||
fi
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 7: Checking for monitoring Ingresses...${NC}"
|
||||
|
||||
for ns in kube-system default monitoring prometheus grafana; do
|
||||
if kubectl get namespace "$ns" &> /dev/null; then
|
||||
if kubectl get ingress -n "$ns" 2>/dev/null | grep -q "prometheus\|grafana\|loki"; then
|
||||
echo -e "${GREEN}Found monitoring Ingresses in $ns${NC}"
|
||||
kubectl get ingress -n "$ns" | grep -E "prometheus|grafana|loki"
|
||||
read -p " Delete these Ingresses? (yes/no): " delete_ing
|
||||
if [ "$delete_ing" = "yes" ]; then
|
||||
kubectl delete ingress -n "$ns" -l app=prometheus --ignore-not-found
|
||||
kubectl delete ingress -n "$ns" -l app=grafana --ignore-not-found
|
||||
kubectl delete ingress -n "$ns" prometheus-ingress --ignore-not-found
|
||||
kubectl delete ingress -n "$ns" grafana-ingress --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 8: Checking for Prometheus Operator CRDs...${NC}"
|
||||
|
||||
# Check for Prometheus Operator CRDs
|
||||
if kubectl get crd 2>/dev/null | grep -q "monitoring.coreos.com"; then
|
||||
echo -e "${GREEN}Found Prometheus Operator CRDs${NC}"
|
||||
kubectl get crd | grep "monitoring.coreos.com"
|
||||
echo ""
|
||||
echo -e "${RED}WARNING: Deleting these CRDs will remove ALL Prometheus Operator resources cluster-wide!${NC}"
|
||||
read -p " Delete Prometheus Operator CRDs? (yes/no): " delete_crd
|
||||
if [ "$delete_crd" = "yes" ]; then
|
||||
kubectl delete crd prometheuses.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd prometheusrules.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd servicemonitors.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd podmonitors.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd alertmanagers.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd probes.monitoring.coreos.com --ignore-not-found
|
||||
kubectl delete crd thanosrulers.monitoring.coreos.com --ignore-not-found
|
||||
fi
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo -e "${YELLOW}Step 9: Optional - Clean up data directories on nodes...${NC}"
|
||||
echo ""
|
||||
echo "You may have monitoring data stored on your nodes at:"
|
||||
echo " - /mnt/local-ssd/prometheus"
|
||||
echo " - /mnt/local-ssd/grafana"
|
||||
echo " - /mnt/local-ssd/loki"
|
||||
echo " - /mnt/local-ssd/tempo"
|
||||
echo " - /var/lib/prometheus"
|
||||
echo " - /var/lib/grafana"
|
||||
echo ""
|
||||
echo "To remove these, SSH to each node and run:"
|
||||
echo " sudo rm -rf /mnt/local-ssd/{prometheus,grafana,loki,tempo}"
|
||||
echo " sudo rm -rf /var/lib/{prometheus,grafana,loki,tempo}"
|
||||
echo ""
|
||||
read -p "Have you cleaned up the data directories? (yes to continue, no to skip): " cleanup_dirs
|
||||
|
||||
echo ""
|
||||
echo -e "${GREEN}=========================================================="
|
||||
echo "Existing Monitoring Stack Cleanup Complete!"
|
||||
echo "==========================================================${NC}"
|
||||
echo ""
|
||||
echo "Summary of actions taken:"
|
||||
echo " - Removed monitoring namespaces (if confirmed)"
|
||||
echo " - Uninstalled Helm releases (if found and confirmed)"
|
||||
echo " - Removed standalone monitoring components"
|
||||
echo " - Removed monitoring ConfigMaps"
|
||||
echo " - Removed RBAC resources"
|
||||
echo " - Removed PVCs and PVs (if confirmed)"
|
||||
echo " - Removed Ingresses"
|
||||
echo " - Removed Prometheus Operator CRDs (if confirmed)"
|
||||
echo ""
|
||||
echo -e "${YELLOW}Next Steps:${NC}"
|
||||
echo "1. Verify cleanup: kubectl get all -A | grep -E 'prometheus|grafana|loki|tempo|monitoring'"
|
||||
echo "2. Clean up node data directories (see above)"
|
||||
echo "3. Deploy new observability stack: ./deploy.sh"
|
||||
echo ""
|
||||
|
|
@ -0,0 +1,98 @@
|
|||
# PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-auth
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/auth
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
# k8s/auth/keycloak/secret.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: keycloak-admin, namespace: db }
|
||||
type: Opaque
|
||||
stringData: { KEYCLOAK_ADMIN: "admin", KEYCLOAK_ADMIN_PASSWORD: "admin" }
|
||||
|
||||
---
|
||||
# k8s/auth/keycloak/pvc.yaml
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata: { name: keycloak-data, namespace: db }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 10Gi } }
|
||||
|
||||
---
|
||||
# k8s/auth/keycloak/deploy.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: keycloak, namespace: db }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: keycloak } }
|
||||
template:
|
||||
metadata: { labels: { app: keycloak } }
|
||||
spec:
|
||||
# Ensure the PV is owned by the Keycloak UID/GID
|
||||
securityContext:
|
||||
fsGroup: 1000
|
||||
initContainers:
|
||||
- name: fix-permissions
|
||||
image: busybox
|
||||
command: ['sh', '-c', 'chown -R 1000:1000 /opt/keycloak/data && chmod -R 755 /opt/keycloak/data']
|
||||
volumeMounts:
|
||||
- name: data
|
||||
mountPath: /opt/keycloak/data
|
||||
containers:
|
||||
- name: keycloak
|
||||
image: quay.io/keycloak/keycloak:latest
|
||||
args: ["start","--http-enabled=true","--proxy-headers=xforwarded","--hostname-strict=false"]
|
||||
env:
|
||||
- { name: KEYCLOAK_ADMIN, valueFrom: { secretKeyRef: { name: keycloak-admin, key: KEYCLOAK_ADMIN } } }
|
||||
- { name: KEYCLOAK_ADMIN_PASSWORD, valueFrom: { secretKeyRef: { name: keycloak-admin, key: KEYCLOAK_ADMIN_PASSWORD } } }
|
||||
ports: [{ containerPort: 8080 }]
|
||||
volumeMounts: [{ name: data, mountPath: /opt/keycloak/data }]
|
||||
securityContext:
|
||||
runAsUser: 1000
|
||||
runAsGroup: 1000
|
||||
volumes:
|
||||
- name: data
|
||||
persistentVolumeClaim: { claimName: keycloak-data }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: keycloak, namespace: db }
|
||||
spec: { selector: { app: keycloak }, ports: [ { port: 80, targetPort: 8080 } ] }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: keycloak
|
||||
namespace: db
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["auth.betelgeusebytes.io"], secretName: keycloak-tls }]
|
||||
rules:
|
||||
- host: auth.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: keycloak, port: { number: 80 } } }
|
||||
|
|
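# Local smoke test sketch (assumes kubectl access to the cluster; not part of the original file):
#   kubectl -n db port-forward deploy/keycloak 8080:8080
#   curl -I http://localhost:8080/   # Keycloak answers once startup completes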
@ -0,0 +1,175 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-postgres
|
||||
spec:
|
||||
capacity:
|
||||
storage: 80Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/postgres
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-elasticsearch
|
||||
spec:
|
||||
capacity:
|
||||
storage: 300Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/elasticsearch
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-gitea
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/gitea
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-jupyter
|
||||
spec:
|
||||
capacity:
|
||||
storage: 20Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/jupyter
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-kafka
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/kafka
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-zookeeper-data
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/zookeeper-data
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-zookeeper-log
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/zookeeper-log
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-prometheus
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/prometheus
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
|
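# These local PVs bind only if the backing directories already exist on hetzner-2
# (sketch; adjust the SSH user/host to your setup):
#   ssh root@hetzner-2 'mkdir -p /mnt/local-ssd/{postgres,elasticsearch,gitea,jupyter,kafka,zookeeper-data,zookeeper-log,prometheus}'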
@@ -0,0 +1,6 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd-hetzner
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
@@ -0,0 +1,37 @@
# k8s/ai/tei/deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: tei, namespace: ml }
spec:
  replicas: 1
  selector: { matchLabels: { app: tei } }
  template:
    metadata: { labels: { app: tei } }
    spec:
      containers:
        - name: tei
          image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
          env: [{ name: MODEL_ID, value: "mixedbread-ai/mxbai-embed-large-v1" }]
          ports: [{ containerPort: 80 }]
---
apiVersion: v1
kind: Service
metadata: { name: tei, namespace: ml }
spec: { selector: { app: tei }, ports: [ { port: 80, targetPort: 80 } ] }
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tei
  namespace: ml
  annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
spec:
  ingressClassName: nginx
  tls: [{ hosts: ["embeddings.betelgeusebytes.io"], secretName: tei-tls }]
  rules:
    - host: embeddings.betelgeusebytes.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: tei, port: { number: 80 } } }
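# Request sketch against TEI's /embed route (hostname from the Ingress above):
#   curl -s https://embeddings.betelgeusebytes.io/embed \
#     -H 'Content-Type: application/json' \
#     -d '{"inputs": "hello world"}'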
@ -0,0 +1,541 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: trading
|
||||
labels:
|
||||
name: trading
|
||||
environment: production
|
||||
---
|
||||
# OPTIONAL: Use this if you want to persist IB Gateway settings/logs
|
||||
# across pod restarts. For most use cases, this is NOT needed since
|
||||
# IB Gateway is mostly stateless and credentials are in Secrets.
|
||||
#
|
||||
# Only create this PV/PVC if you need to persist:
|
||||
# - TWS session data
|
||||
# - Custom workspace layouts
|
||||
# - Historical API usage logs
|
||||
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: ib-gateway-data
|
||||
labels:
|
||||
type: local
|
||||
app: ib-gateway
|
||||
spec:
|
||||
capacity:
|
||||
storage: 5Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-storage
|
||||
local:
|
||||
path: /mnt/local-ssd/ib-gateway # Adjust to your local SSD path
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: ib-gateway-data
|
||||
namespace: trading
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
resources:
|
||||
requests:
|
||||
storage: 5Gi
|
||||
storageClassName: local-storage
|
||||
selector:
|
||||
matchLabels:
|
||||
app: ib-gateway
|
||||
|
||||
# To use this PVC, add to Deployment volumeMounts:
|
||||
# - name: data
|
||||
# mountPath: /root/Jts
|
||||
# And to volumes:
|
||||
# - name: data
|
||||
# persistentVolumeClaim:
|
||||
# claimName: ib-gateway-data
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: ib-credentials
|
||||
namespace: trading
|
||||
type: Opaque
|
||||
stringData:
|
||||
# IMPORTANT: Replace these with your actual IB credentials
|
||||
# For paper trading, use your paper trading account
|
||||
username: "saladin85"
|
||||
password: "3Lcd@05041985"
|
||||
# Trading mode: "paper" or "live"
|
||||
trading-mode: "paper"
|
||||
|
||||
# IB Gateway config (jts.ini equivalent)
|
||||
# This enables headless mode and configures ports
|
||||
ibgateway.conf: |
|
||||
[IBGateway]
|
||||
TradingMode=paper
|
||||
ApiOnly=true
|
||||
ReadOnlyApi=false
|
||||
TrustedIPs=127.0.0.1
|
||||
|
||||
[IBGatewayAPI]
|
||||
ApiPortNumber=4002
|
||||
|
||||
[Logon]
|
||||
UseRemoteSettings=no
|
||||
Locale=en
|
||||
ColorPaletteName=dark
|
||||
|
||||
[Display]
|
||||
ShowSplashScreen=no
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: ib-gateway-config
|
||||
namespace: trading
|
||||
data:
|
||||
# Startup script to configure IB Gateway for headless operation
|
||||
startup.sh: |
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
echo "Starting IB Gateway in headless mode..."
|
||||
echo "Trading Mode: ${TRADING_MODE}"
|
||||
echo "Port: ${TWS_PORT}"
|
||||
|
||||
# Configure based on trading mode
|
||||
if [ "${TRADING_MODE}" == "live" ]; then
|
||||
export TWS_PORT=4001
|
||||
echo "⚠️ LIVE TRADING MODE - USE WITH CAUTION ⚠️"
|
||||
else
|
||||
export TWS_PORT=4002
|
||||
echo "📝 Paper Trading Mode (Safe)"
|
||||
fi
|
||||
# IMPORTANT: use the env vars provided by the Deployment
|
||||
export IB_USERNAME="${TWS_USERID}"
|
||||
export IB_PASSWORD="${TWS_PASSWORD}"
|
||||
|
||||
# Start IB Gateway
|
||||
exec /opt/ibgateway/ibgateway-latest-standalone-linux-x64.sh \
|
||||
--tws-path=/root/Jts \
|
||||
--tws-settings-path=/root \
|
||||
--user="${IB_USERNAME}" \
|
||||
--pw="${IB_PASSWORD}" \
|
||||
--mode="${TRADING_MODE}" \
|
||||
--port="${TWS_PORT}"
|
||||
|
||||
# Health check script
|
||||
healthcheck.sh: |
|
||||
#!/bin/bash
|
||||
# Check if TWS API port is listening
|
||||
# PORT=${TWS_PORT:-4002}
|
||||
# nc -z localhost $PORT
|
||||
# exit $?
|
||||
#!/bin/sh
|
||||
# Pure-python TCP check (no nc required)
|
||||
PORT="${TWS_PORT:-4002}"
|
||||
python - <<'PY'
|
||||
import os, socket, sys
|
||||
port = int(os.environ.get("TWS_PORT", os.environ.get("PORT", "4002")))
|
||||
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||
s.settimeout(2)
|
||||
try:
|
||||
s.connect(("127.0.0.1", port))
|
||||
sys.exit(0)
|
||||
except Exception:
|
||||
sys.exit(1)
|
||||
finally:
|
||||
s.close()
|
||||
PY
|
||||
---
|
||||
# apiVersion: apps/v1
|
||||
# kind: Deployment
|
||||
# metadata:
|
||||
# name: ib-gateway
|
||||
# namespace: trading
|
||||
# labels:
|
||||
# app: ib-gateway
|
||||
# component: trading-infrastructure
|
||||
# spec:
|
||||
# replicas: 1 # IB Gateway should only have 1 instance per account
|
||||
# strategy:
|
||||
# type: Recreate # Avoid multiple simultaneous logins
|
||||
# selector:
|
||||
# matchLabels:
|
||||
# app: ib-gateway
|
||||
# template:
|
||||
# metadata:
|
||||
# labels:
|
||||
# app: ib-gateway
|
||||
# annotations:
|
||||
# prometheus.io/scrape: "false" # No metrics endpoint by default
|
||||
# spec:
|
||||
# # Pin to hetzner-2 (matches your existing pattern)
|
||||
# nodeSelector:
|
||||
# kubernetes.io/hostname: hetzner-2
|
||||
|
||||
# # Security context
|
||||
# securityContext:
|
||||
# runAsNonRoot: false # IB Gateway requires root for VNC (even if unused)
|
||||
# fsGroup: 1000
|
||||
|
||||
# containers:
|
||||
# - name: ib-gateway
|
||||
# # Using community-maintained IB Gateway image
|
||||
# # Alternative: waytrade/ib-gateway:latest
|
||||
# image: ghcr.io/gnzsnz/ib-gateway:stable
|
||||
# imagePullPolicy: IfNotPresent
|
||||
|
||||
# env:
|
||||
# - name: TWS_USERID
|
||||
# valueFrom:
|
||||
# secretKeyRef:
|
||||
# name: ib-credentials
|
||||
# key: username
|
||||
# - name: TWS_PASSWORD
|
||||
# valueFrom:
|
||||
# secretKeyRef:
|
||||
# name: ib-credentials
|
||||
# key: password
|
||||
# - name: TRADING_MODE
|
||||
# valueFrom:
|
||||
# secretKeyRef:
|
||||
# name: ib-credentials
|
||||
# key: trading-mode
|
||||
# - name: TWS_PORT
|
||||
# value: "4002" # Default to paper trading
|
||||
# - name: READ_ONLY_API
|
||||
# value: "no"
|
||||
|
||||
# # Ports
|
||||
# ports:
|
||||
# - name: paper-trading
|
||||
# containerPort: 4002
|
||||
# protocol: TCP
|
||||
# - name: live-trading
|
||||
# containerPort: 4001
|
||||
# protocol: TCP
|
||||
# - name: vnc
|
||||
# containerPort: 5900
|
||||
# protocol: TCP # VNC (not exposed externally)
|
||||
|
||||
# # Resource limits
|
||||
# resources:
|
||||
# requests:
|
||||
# memory: "1Gi"
|
||||
# cpu: "500m"
|
||||
# limits:
|
||||
# memory: "2Gi"
|
||||
# cpu: "1000m"
|
||||
|
||||
# # Liveness probe (check if API port is responsive)
|
||||
# startupProbe:
|
||||
# tcpSocket:
|
||||
# port: 4002
|
||||
# initialDelaySeconds: 60 # Wait 60s before first check
|
||||
# periodSeconds: 10 # Check every 10s
|
||||
# timeoutSeconds: 5
|
||||
# failureThreshold: 18 # 60s + (10s * 18) = 240s total startup time
|
||||
|
||||
# livenessProbe:
|
||||
# tcpSocket:
|
||||
# port: 4002
|
||||
# initialDelaySeconds: 0 # IB Gateway takes time to start
|
||||
# periodSeconds: 60
|
||||
# timeoutSeconds: 5
|
||||
# failureThreshold: 3
|
||||
|
||||
# # Readiness probe
|
||||
# readinessProbe:
|
||||
# tcpSocket:
|
||||
# port: 4002
|
||||
# initialDelaySeconds: 0
|
||||
# periodSeconds: 10
|
||||
# timeoutSeconds: 5
|
||||
# failureThreshold: 2
|
||||
|
||||
# # Volume mounts for config
|
||||
# volumeMounts:
|
||||
# - name: ib-config
|
||||
# mountPath: /root/Jts/jts.ini
|
||||
# subPath: ibgateway.conf
|
||||
# - name: startup-script
|
||||
# mountPath: /startup.sh
|
||||
# subPath: startup.sh
|
||||
# - name: data
|
||||
# mountPath: /root/Jts
|
||||
|
||||
# # Logging to stdout (Fluent Bit will collect)
|
||||
# # IB Gateway logs go to /root/Jts/log by default
|
||||
# lifecycle:
|
||||
# postStart:
|
||||
# exec:
|
||||
# command:
|
||||
# - /bin/sh
|
||||
# - -c
|
||||
# - |
|
||||
# mkdir -p /root/Jts/log
|
||||
# ln -sf /dev/stdout /root/Jts/log/ibgateway.log || true
|
||||
|
||||
# volumes:
|
||||
# - name: ib-config
|
||||
# secret:
|
||||
# secretName: ib-credentials
|
||||
# defaultMode: 0644
|
||||
# - name: startup-script
|
||||
# configMap:
|
||||
# name: ib-gateway-config
|
||||
# defaultMode: 0755
|
||||
# - name: data
|
||||
# persistentVolumeClaim:
|
||||
# claimName: ib-gateway-data
|
||||
|
||||
# # Restart policy
|
||||
# restartPolicy: Always
|
||||
|
||||
# # DNS policy for internal cluster resolution
|
||||
# dnsPolicy: ClusterFirst
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: ib-gateway
|
||||
namespace: trading
|
||||
labels:
|
||||
app: ib-gateway
|
||||
component: trading-infrastructure
|
||||
spec:
|
||||
replicas: 1
|
||||
strategy:
|
||||
type: Recreate
|
||||
selector:
|
||||
matchLabels:
|
||||
app: ib-gateway
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: ib-gateway
|
||||
annotations:
|
||||
prometheus.io/scrape: "false"
|
||||
spec:
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
|
||||
securityContext:
|
||||
runAsNonRoot: false
|
||||
fsGroup: 1000
|
||||
|
||||
# Seed writable jts.ini into the PVC once
|
||||
initContainers:
|
||||
- name: seed-jts-config
|
||||
image: busybox:1.36
|
||||
command:
|
||||
- sh
|
||||
- -c
|
||||
- |
|
||||
set -e
|
||||
mkdir -p /data
|
||||
if [ ! -f /data/jts.ini ]; then
|
||||
echo "Seeding jts.ini into PVC"
|
||||
cp /config/ibgateway.conf /data/jts.ini
|
||||
chmod 644 /data/jts.ini
|
||||
else
|
||||
echo "jts.ini already exists in PVC"
|
||||
fi
|
||||
volumeMounts:
|
||||
- name: ib-config
|
||||
mountPath: /config
|
||||
readOnly: true
|
||||
- name: data
|
||||
mountPath: /data
|
||||
|
||||
containers:
|
||||
# ------------------------------------------------------------------
|
||||
# IB Gateway
|
||||
# ------------------------------------------------------------------
|
||||
- name: ib-gateway
|
||||
image: ghcr.io/gnzsnz/ib-gateway:stable
|
||||
imagePullPolicy: IfNotPresent
|
||||
|
||||
env:
|
||||
- name: TWS_USERID
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ib-credentials
|
||||
key: username
|
||||
- name: TWS_PASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ib-credentials
|
||||
key: password
|
||||
- name: TRADING_MODE
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ib-credentials
|
||||
key: trading-mode
|
||||
- name: TWS_PORT
|
||||
value: "4002"
|
||||
- name: READ_ONLY_API
|
||||
value: "no"
|
||||
|
||||
ports:
|
||||
- name: ib-api-local
|
||||
containerPort: 4002
|
||||
protocol: TCP
|
||||
- name: live-trading
|
||||
containerPort: 4001
|
||||
protocol: TCP
|
||||
- name: vnc
|
||||
containerPort: 5900
|
||||
protocol: TCP
|
||||
|
||||
resources:
|
||||
requests:
|
||||
memory: "1Gi"
|
||||
cpu: "500m"
|
||||
limits:
|
||||
memory: "2Gi"
|
||||
cpu: "1000m"
|
||||
|
||||
# IMPORTANT: Probes should check the local IB port (4002)
|
||||
startupProbe:
|
||||
tcpSocket:
|
||||
port: 4002
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 18
|
||||
|
||||
livenessProbe:
|
||||
tcpSocket:
|
||||
port: 4002
|
||||
periodSeconds: 60
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 3
|
||||
|
||||
readinessProbe:
|
||||
tcpSocket:
|
||||
port: 4002
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 2
|
||||
|
||||
volumeMounts:
|
||||
- name: data
|
||||
mountPath: /root/Jts
|
||||
|
||||
lifecycle:
|
||||
postStart:
|
||||
exec:
|
||||
command:
|
||||
- sh
|
||||
- -c
|
||||
- |
|
||||
mkdir -p /root/Jts/log
|
||||
ln -sf /dev/stdout /root/Jts/log/ibgateway.log || true
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Sidecar TCP proxy: accepts cluster traffic, forwards to localhost:4002
|
||||
# ------------------------------------------------------------------
|
||||
- name: ib-api-proxy
|
||||
image: alpine/socat:1.8.0.0
|
||||
imagePullPolicy: IfNotPresent
|
||||
args:
|
||||
- "-d"
|
||||
- "-d"
|
||||
- "TCP-LISTEN:4003,fork,reuseaddr"
|
||||
- "TCP:127.0.0.1:4002"
|
||||
ports:
|
||||
- name: ib-api
|
||||
containerPort: 4003
|
||||
protocol: TCP
|
||||
resources:
|
||||
requests:
|
||||
memory: "32Mi"
|
||||
cpu: "10m"
|
||||
limits:
|
||||
memory: "128Mi"
|
||||
cpu: "100m"
|
||||
# basic probe: is proxy listening
|
||||
readinessProbe:
|
||||
tcpSocket:
|
||||
port: 4003
|
||||
periodSeconds: 5
|
||||
timeoutSeconds: 2
|
||||
failureThreshold: 3
|
||||
|
||||
volumes:
|
||||
- name: ib-config
|
||||
secret:
|
||||
secretName: ib-credentials
|
||||
defaultMode: 0644
|
||||
|
||||
- name: data
|
||||
persistentVolumeClaim:
|
||||
claimName: ib-gateway-data
|
||||
|
||||
restartPolicy: Always
|
||||
dnsPolicy: ClusterFirst
|
||||
|
||||
|
||||
---
|
||||
# apiVersion: v1
|
||||
# kind: Service
|
||||
# metadata:
|
||||
# name: ib-gateway
|
||||
# namespace: trading
|
||||
# labels:
|
||||
# app: ib-gateway
|
||||
# spec:
|
||||
# type: ClusterIP # Internal-only, not exposed publicly
|
||||
# clusterIP: None # Headless service (optional, remove if you want a stable ClusterIP)
|
||||
# selector:
|
||||
# app: ib-gateway
|
||||
# ports:
|
||||
# - name: paper-trading
|
||||
# port: 4002
|
||||
# targetPort: 4002
|
||||
# protocol: TCP
|
||||
# - name: live-trading
|
||||
# port: 4001
|
||||
# targetPort: 4001
|
||||
# protocol: TCP
|
||||
# sessionAffinity: ClientIP # Stick to same pod (important for stateful TWS sessions)
|
||||
# sessionAffinityConfig:
|
||||
# clientIP:
|
||||
# timeoutSeconds: 3600 # 1 hour session stickiness
|
||||
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: ib-gateway
|
||||
namespace: trading
|
||||
labels:
|
||||
app: ib-gateway
|
||||
spec:
|
||||
type: ClusterIP
|
||||
selector:
|
||||
app: ib-gateway
|
||||
ports:
|
||||
- name: paper-trading
|
||||
port: 4002
|
||||
targetPort: 4003 # <-- proxy sidecar, not the gateway directly
|
||||
protocol: TCP
|
||||
- name: live-trading
|
||||
port: 4001
|
||||
targetPort: 4001
|
||||
protocol: TCP
|
||||
sessionAffinity: ClientIP
|
||||
sessionAffinityConfig:
|
||||
clientIP:
|
||||
timeoutSeconds: 3600
|
||||
|
|
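# In-cluster reachability sketch (hypothetical test pod; the Service above forwards port 4002
# to the socat sidecar on 4003):
#   kubectl -n trading run ib-test --rm -it --image=busybox:1.36 --restart=Never -- \
#     sh -c 'nc -z ib-gateway.trading.svc.cluster.local 4002 && echo reachable'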
@ -0,0 +1,169 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: trading
|
||||
labels:
|
||||
name: trading
|
||||
environment: production
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: ib-credentials
|
||||
namespace: trading
|
||||
type: Opaque
|
||||
stringData:
|
||||
# Rotate your creds (you pasted them earlier).
|
||||
username: "saladin85"
|
||||
password: "3Lcd@05041985"
|
||||
trading-mode: "paper"
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: ib-gateway
|
||||
namespace: trading
|
||||
labels:
|
||||
app: ib-gateway
|
||||
component: trading-infrastructure
|
||||
spec:
|
||||
replicas: 1
|
||||
strategy:
|
||||
type: Recreate
|
||||
selector:
|
||||
matchLabels:
|
||||
app: ib-gateway
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: ib-gateway
|
||||
annotations:
|
||||
prometheus.io/scrape: "false"
|
||||
spec:
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: hetzner-2
|
||||
|
||||
# Keep your original security context
|
||||
securityContext:
|
||||
runAsNonRoot: false
|
||||
fsGroup: 1000
|
||||
|
||||
containers:
|
||||
- name: ib-gateway
|
||||
image: ghcr.io/gnzsnz/ib-gateway:stable
|
||||
imagePullPolicy: IfNotPresent
|
||||
|
||||
# IMPORTANT: use env vars this image expects
|
||||
env:
|
||||
- name: TWS_USERID
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ib-credentials
|
||||
key: username
|
||||
- name: TWS_PASSWORD
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ib-credentials
|
||||
key: password
|
||||
- name: TRADING_MODE
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ib-credentials
|
||||
key: trading-mode
|
||||
- name: READ_ONLY_API
|
||||
value: "no"
|
||||
|
||||
# These two match what your log shows the image uses
|
||||
- name: API_PORT
|
||||
value: "4002"
|
||||
- name: SOCAT_PORT
|
||||
value: "4004"
|
||||
|
||||
# optional but nice
|
||||
- name: TIME_ZONE
|
||||
value: "Etc/UTC"
|
||||
- name: TWOFA_TIMEOUT_ACTION
|
||||
value: "exit"
|
||||
|
||||
ports:
|
||||
# IB API ports (inside container / localhost use)
|
||||
- name: api-paper
|
||||
containerPort: 4002
|
||||
protocol: TCP
|
||||
- name: api-live
|
||||
containerPort: 4001
|
||||
protocol: TCP
|
||||
|
||||
# socat relay port for non-localhost clients (what we expose via Service)
|
||||
- name: api-socat
|
||||
containerPort: 4004
|
||||
protocol: TCP
|
||||
|
||||
# optional UI/VNC
|
||||
- name: vnc
|
||||
containerPort: 5900
|
||||
protocol: TCP
|
||||
|
||||
resources:
|
||||
requests:
|
||||
memory: "1Gi"
|
||||
cpu: "500m"
|
||||
limits:
|
||||
memory: "2Gi"
|
||||
cpu: "1000m"
|
||||
|
||||
# Probe the socat port (represents remote connectivity)
|
||||
startupProbe:
|
||||
tcpSocket:
|
||||
port: 4004
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 18
|
||||
|
||||
readinessProbe:
|
||||
tcpSocket:
|
||||
port: 4004
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 2
|
||||
|
||||
livenessProbe:
|
||||
tcpSocket:
|
||||
port: 4004
|
||||
periodSeconds: 60
|
||||
timeoutSeconds: 5
|
||||
failureThreshold: 3
|
||||
|
||||
restartPolicy: Always
|
||||
dnsPolicy: ClusterFirst
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: ib-gateway
|
||||
namespace: trading
|
||||
labels:
|
||||
app: ib-gateway
|
||||
spec:
|
||||
type: ClusterIP
|
||||
selector:
|
||||
app: ib-gateway
|
||||
ports:
|
||||
# Clients connect to 4002, but we forward to SOCAT_PORT=4004
|
||||
- name: paper-trading
|
||||
port: 4002
|
||||
targetPort: 4004
|
||||
protocol: TCP
|
||||
|
||||
# If you truly need live, you should relay live via another socat port too.
|
||||
# For now keep it direct (or remove it entirely for safety).
|
||||
- name: live-trading
|
||||
port: 4001
|
||||
targetPort: 4001
|
||||
protocol: TCP
|
||||
|
||||
sessionAffinity: ClientIP
|
||||
sessionAffinityConfig:
|
||||
clientIP:
|
||||
timeoutSeconds: 3600
|
||||
|
|
@ -0,0 +1,80 @@
|
|||
# k8s/vec/qdrant/pvc.yaml
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata: { name: qdrant-data, namespace: db}
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 20Gi } }
|
||||
|
||||
---
|
||||
# k8s/vec/qdrant/deploy.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: qdrant, namespace: db}
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: qdrant } }
|
||||
template:
|
||||
metadata: { labels: { app: qdrant } }
|
||||
spec:
|
||||
containers:
|
||||
- name: qdrant
|
||||
image: qdrant/qdrant:latest
|
||||
ports:
|
||||
- { containerPort: 6333 } # HTTP + Web UI
|
||||
- { containerPort: 6334 } # gRPC
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /qdrant/storage }
|
||||
volumes:
|
||||
- name: data
|
||||
persistentVolumeClaim: { claimName: qdrant-data }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: qdrant, namespace: db}
|
||||
spec:
|
||||
selector: { app: qdrant }
|
||||
ports:
|
||||
- { name: http, port: 80, targetPort: 6333 }
|
||||
- { name: grpc, port: 6334, targetPort: 6334 }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: qdrant
|
||||
namespace: db
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["vector.betelgeusebytes.io"], secretName: qdrant-tls }]
|
||||
rules:
|
||||
- host: vector.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: qdrant, port: { number: 80 } } }
|
||||
---
|
||||
# PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-qdrant
|
||||
spec:
|
||||
capacity:
|
||||
storage: 20Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/qdrant
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
|
|
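# Quick check sketch (Qdrant REST API; hostname from the Ingress above):
#   curl -s https://vector.betelgeusebytes.io/collections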
@ -0,0 +1,142 @@
|
|||
# PV
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: pv-vllm
|
||||
spec:
|
||||
capacity:
|
||||
storage: 50Gi
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
storageClassName: local-ssd-hetzner
|
||||
local:
|
||||
path: /mnt/local-ssd/vllm
|
||||
nodeAffinity:
|
||||
required:
|
||||
nodeSelectorTerms:
|
||||
- matchExpressions:
|
||||
- key: kubernetes.io/hostname
|
||||
operator: In
|
||||
values:
|
||||
- hetzner-2
|
||||
---
|
||||
# k8s/ai/vllm/secret.yaml
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata: { name: vllm-auth, namespace: ml }
|
||||
type: Opaque
|
||||
stringData: { API_KEY: "replace_me" }
|
||||
|
||||
---
|
||||
# k8s/ai/ollama/deploy.yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata: { name: ollama, namespace: ml }
|
||||
spec:
|
||||
replicas: 1
|
||||
selector: { matchLabels: { app: ollama } }
|
||||
template:
|
||||
metadata: { labels: { app: ollama } }
|
||||
spec:
|
||||
securityContext:
|
||||
runAsUser: 0 # needed so the init can write into /root/.ollama
|
||||
initContainers:
|
||||
- name: warm-models
|
||||
image: ollama/ollama:latest
|
||||
command: ["/bin/sh","-c"]
|
||||
args:
|
||||
- |
|
||||
ollama serve & # start a temp daemon
|
||||
sleep 2
|
||||
# pull one or more small, quantized models for CPU
|
||||
ollama pull qwen2.5:3b-instruct-q4_K_M || true
|
||||
ollama pull llama3.2:3b-instruct-q4_K_M || true
|
||||
pkill ollama || true
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /root/.ollama }
|
||||
containers:
|
||||
- name: ollama
|
||||
image: ollama/ollama:latest
|
||||
env:
|
||||
- { name: OLLAMA_ORIGINS, value: "*" } # CORS if you call from browser
|
||||
ports:
|
||||
- { containerPort: 11434 }
|
||||
volumeMounts:
|
||||
- { name: data, mountPath: /root/.ollama }
|
||||
resources:
|
||||
requests: { cpu: "2", memory: "4Gi" }
|
||||
limits: { cpu: "4", memory: "8Gi" }
|
||||
volumes:
|
||||
- name: data
|
||||
persistentVolumeClaim: { claimName: ollama-data }
|
||||
|
||||
---
|
||||
# k8s/ai/ollama/svc-ing.yaml
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata: { name: ollama, namespace: ml }
|
||||
spec:
|
||||
selector: { app: ollama }
|
||||
ports: [ { name: http, port: 80, targetPort: 11434 } ]
|
||||
|
||||
# ---
|
||||
# # old k8s/ai/vllm/deploy.yaml
|
||||
# apiVersion: apps/v1
|
||||
# kind: Deployment
|
||||
# metadata: { name: vllm, namespace: ml }
|
||||
# spec:
|
||||
# replicas: 1
|
||||
# selector: { matchLabels: { app: vllm } }
|
||||
# template:
|
||||
# metadata: { labels: { app: vllm } }
|
||||
# spec:
|
||||
# containers:
|
||||
# - name: vllm
|
||||
# image: vllm/vllm-openai:latest
|
||||
# args: ["--model","Qwen/Qwen2.5-7B-Instruct","--max-model-len","8192","--port","8000","--host","0.0.0.0"]
|
||||
# env:
|
||||
# - name: VLLM_API_KEY
|
||||
# valueFrom: { secretKeyRef: { name: vllm-auth, key: API_KEY } }
|
||||
# ports: [{ containerPort: 8000 }]
|
||||
# resources:
|
||||
# limits:
|
||||
# nvidia.com/gpu: 1
|
||||
# requests:
|
||||
# nvidia.com/gpu: 1
|
||||
# volumeMounts:
|
||||
# - { name: cache, mountPath: /root/.cache/huggingface }
|
||||
# volumes:
|
||||
# - name: cache
|
||||
# persistentVolumeClaim: { claimName: vllm-cache-pvc }
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata: { name: ollama-data, namespace: ml }
|
||||
spec:
|
||||
accessModes: ["ReadWriteOnce"]
|
||||
storageClassName: local-ssd-hetzner
|
||||
resources: { requests: { storage: 50Gi } }
|
||||
# ---
|
||||
#old k8s/ai/vllm/svc-ing.yaml
|
||||
# apiVersion: v1
|
||||
# kind: Service
|
||||
# metadata: { name: vllm, namespace: ml }
|
||||
# spec: { selector: { app: vllm }, ports: [ { port: 80, targetPort: 8000 } ] }
|
||||
---
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: Ingress
|
||||
metadata:
|
||||
name: vllm
|
||||
namespace: ml
|
||||
annotations: { cert-manager.io/cluster-issuer: letsencrypt-prod }
|
||||
spec:
|
||||
ingressClassName: nginx
|
||||
tls: [{ hosts: ["llm.betelgeusebytes.io"], secretName: vllm-tls }]
|
||||
rules:
|
||||
- host: llm.betelgeusebytes.io
|
||||
http:
|
||||
paths:
|
||||
- path: /
|
||||
pathType: Prefix
|
||||
backend: { service: { name: ollama, port: { number: 80 } } }  # the vllm Service is commented out above; route to the ollama Service instead
|
||||
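# Request sketch against the Ollama backend (model name from the warm-models init container above):
#   curl -s https://llm.betelgeusebytes.io/api/generate \
#     -d '{"model": "qwen2.5:3b-instruct-q4_K_M", "prompt": "hello", "stream": false}'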