Add usage documentation and MLOps diagrams for model training and deployment

2026-01-28 11:53:39 +01:00 · 6463d4a140
parent 15b11cb180
commit 6463d4a140
2 changed files with 244 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -12,6 +12,7 @@ designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.
 - [Observability](OBSERVABILITY.md)
 - [Roadmap & Next Steps](ROADMAP.md)
 - [Future Projects & Use Cases](FUTURE-PROJECTS.md)
+- [Usage & Diagrams](USAGE.md)

 ## 🎯 Current Focus

--- a/USAGE.md
+++ b/USAGE.md
@@ -0,0 +1,243 @@
+## MLOps Loop Diagram (Label → Train → Registry → Deploy)
+```mermaid
+
+flowchart TB
+  LS["Label Studio
+(label.betelgeusebytes.io)"] -->|export tasks/labels| S3["MinIO S3
+(minio.betelgeusebytes.io)"]
+  S3 -->|dataset version| ARGO["Argo Workflows
+(argo.betelgeusebytes.io)"]
+  ARGO -->|train/eval| TR["Training Job
+(PyTorch/Transformers)"]
+  TR -->|metrics, params| MLF["MLflow
+(mlflow.betelgeusebytes.io)"]
+  TR -->|model artifacts| S3
+  MLF -->|register model| REG["Model Registry"]
+  ARGO -->|promote model tag| REG
+  REG -->|deploy image / config| ARGOCD["Argo CD
+(GitOps)"]
+  ARGOCD -->|rollout| SVC["NER/RE Services
+(custom, later)"]
+  SVC -->|inference| ORCH["Orchestrator API
+(hadith-api...)"]
+  ORCH -->|observability| OBS["Grafana LGTM
+(grafana...)"]
+```
+
+
+## Isnād Extraction Pipeline Diagram (Your Actual Deployed Stack)
+This diagram shows ***how a hadith text becomes an isnād chain***, how the result is stored, and how the ***Neo4j graph*** is built, using your endpoints: LLM (CPU), TEI, Qdrant, Postgres, Neo4j, MinIO, Argo.
+```mermaid
+flowchart TB
+  H["Hadith Text Input<br/>(UI/API)"] --> ORCH["Orchestrator API<br/>(hadith-api...)"]
+  ORCH -->|optional: auth| KC["Keycloak<br/>(auth...)"]
+  ORCH -->|normalize/clean| PRE["Preprocess<br/>(arabic cleanup, tokens)"]
+  PRE -->|retrieve examples| TEI["TEI Embeddings<br/>(embeddings...)"]
+  TEI --> QD["Qdrant<br/>(vector...)"]
+  QD -->|top-k similar hadiths + patterns| CTX["Context Pack<br/>(examples, schema)"]
+
+  ORCH -->|prompt+schema+ctx| LLM["LLM CPU<br/>(llm...)"]
+  LLM -->|JSON: chain nodes + links| JSON["Parsed Isnād JSON<br/>(raw extraction)"]
+  ORCH -->|validate + dedupe| RES["Resolve Entities<br/>(name variants, kunya)"]
+  RES --> PG["PostgreSQL<br/>canonical people, aliases"]
+  RES -->|canonical IDs| CAN["Canonical Chain<br/>(person_id sequence)"]
+
+  CAN -->|write nodes/edges| N4["Neo4j<br/>(neo4j...)"]
+  ORCH -->|store provenance| S3["MinIO<br/>(minio...)"]
+  ORCH -->|optional: embed matn| TEI --> QD
+  ORCH -->|return result| OUT["Response<br/>chain + matn + provenance"]
+
+  N4 -->|graph queries| OUT
+  PG -->|metadata| OUT
+```
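
The "Resolve Entities" and "Canonical Chain" steps in the diagram above could look roughly like the following. This is a minimal sketch, not the project's implementation: the `ALIASES` table, the `person-00x` IDs, and the `{"name": ...}` node shape are all hypothetical, and the in-memory dict stands in for the PostgreSQL lookup.

```python
import re
import unicodedata

# Hypothetical alias table; in the diagram this lookup lives in PostgreSQL.
ALIASES = {
    "abu hurayra": "person-001",
    "abu hurayrah": "person-001",
    "malik": "person-002",
    "malik ibn anas": "person-002",
}


def normalize(name: str) -> str:
    """Lowercase, strip combining marks and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def resolve_chain(raw_nodes):
    """Map extracted narrator names to canonical IDs, preserving chain order.

    Unknown names resolve to None so they can be flagged for review
    instead of being silently dropped.
    """
    chain = []
    for node in raw_nodes:
        person_id = ALIASES.get(normalize(node["name"]))
        # Dedupe: skip a known narrator repeated back to back.
        if chain and person_id is not None and chain[-1]["person_id"] == person_id:
            continue
        chain.append({"surface": node["name"], "person_id": person_id})
    return chain
```

Calling `resolve_chain` on the LLM's parsed nodes yields the `person_id` sequence that the diagram passes on to Neo4j; entries with `person_id` set to `None` are candidates for manual canonicalization.
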
+## Training a Model/Algorithm to Extract Isnād and Build the Neo4j Graph
+This diagram covers ***end-to-end training, deployment, and ingestion***, following the path:
+Label Studio → MinIO → Argo Workflows → MLflow → NER/RE service → Orchestrator → Postgres/Neo4j/Qdrant.
+```mermaid
+flowchart TB
+  %% Data creation
+  TXT[Raw Hadith Corpora] --> INGEST["Ingest/ETL<br/>(Argo Workflow)"]
+  INGEST --> S3["MinIO S3<br/>(versioned datasets)"]
+
+  %% Annotation
+  S3 -->|sampling| LS["Label Studio<br/>(label...)"]
+  LS -->|"annotated spans<br/>(narrators, connectors)"| S3
+
+  %% Training
+  S3 --> ARGO["Argo Workflows<br/>(train pipeline)"]
+  ARGO --> TR["Train NER/RE<br/>(or rules+CRF)<br/>CPU-friendly"]
+  TR --> MLF["MLflow<br/>(metrics + registry)"]
+  TR -->|model artifacts| S3
+
+  %% Deployment of extractor
+  MLF -->|promote| REG[Model Version]
+  REG --> DEPLOY["Deploy extractor svc<br/>(custom later)"]
+  DEPLOY --> EXT["Isnād Extractor API<br/>(NER + RE)"]
+  EXT -->|"entities+relations"| ORCH[Orchestrator API]
+
+  %% Graph building + storage
+  ORCH --> RES["Canonicalization<br/>(alias merge)"]
+  RES --> PG[("PostgreSQL<br/>people, aliases, provenance")]
+  ORCH --> N4["Neo4j<br/>(isnad graph)"]
+  ORCH --> TEI[TEI embeddings] --> QD[Qdrant vectors]
+  ORCH --> S3B["MinIO<br/>artifacts/provenance"]
+
+  %% Monitoring
+  ORCH --> OBS["Grafana LGTM<br/>(metrics/logs/traces)"]
+  EXT --> OBS
+  ARGO --> OBS
+```
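
Between the annotation and training stages above, the annotated spans have to become token-level training labels. The sketch below shows one way to do that with whitespace tokens and BIO tags. It assumes the commonly seen Label Studio JSON export shape (`annotations[0].result[*].value` with `start`, `end`, `labels`); the `NARRATOR` and `CONNECTOR` label names are hypothetical, and a real pipeline would likely use the model's own tokenizer instead of whitespace splitting.

```python
import re


def label_studio_spans(task):
    """Pull (start, end, label) triples from one exported task (assumed shape)."""
    results = task["annotations"][0]["result"]
    return [
        (r["value"]["start"], r["value"]["end"], r["value"]["labels"][0])
        for r in results
        if r.get("type") == "labels"
    ]


def spans_to_bio(text, spans):
    """Convert character spans (end exclusive) to whitespace-token BIO tags."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token is tagged only if it lies fully inside an annotated span.
            if tok_start >= start and tok_end <= end:
                tag = ("B-" if tok_start == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags
```

The resulting `(tokens, tags)` pairs are the kind of dataset a NER training job in the Argo pipeline could consume; relation labels for the RE step would need a separate conversion.
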
+
+
+## Postgres ER Diagram for Canonicalization & Provenance
+This is a practical relational layer that fits your stack: ***Orchestrator ↔ Postgres*** for identity resolution, provenance, and auditability.
+```mermaid
+erDiagram
+  PERSON ||--o{ PERSON_ALIAS : has
+  PERSON ||--o{ BIO_SOURCE : described_by
+  DOCUMENT ||--o{ MENTION : contains
+  PERSON ||--o{ MENTION : referenced_as
+  DOCUMENT ||--o{ HADITH : has
+  HADITH ||--o{ ISNAD_CHAIN : has
+  ISNAD_CHAIN ||--o{ ISNAD_LINK : contains
+  PERSON ||--o{ ISNAD_LINK : narrator
+  EXTRACTION_RUN ||--o{ ISNAD_CHAIN : produced
+  EXTRACTION_RUN ||--o{ MENTION : produced
+  SOURCE ||--o{ DOCUMENT : provides
+
+  PERSON {
+    uuid id PK
+    text canonical_name
+    text kunya
+    text nisba
+    text era
+    text notes
+    timestamptz created_at
+  }
+
+  PERSON_ALIAS {
+    uuid id PK
+    uuid person_id FK
+    text alias_text
+    text alias_type  "kunya|ism|nisba|laqab|spelling"
+    float confidence
+  }
+
+  SOURCE {
+    uuid id PK
+    text name
+    text type "book|website|manuscript"
+    text ref
+  }
+
+  DOCUMENT {
+    uuid id PK
+    uuid source_id FK
+    text doc_type "hadith|bio|other"
+    text lang
+    text title
+    text raw_text
+    timestamptz created_at
+  }
+
+  HADITH {
+    uuid id PK
+    uuid document_id FK
+    text matn_text
+    text collection
+    text hadith_no
+  }
+
+  MENTION {
+    uuid id PK
+    uuid document_id FK
+    uuid person_id FK
+    int start_char
+    int end_char
+    text surface_text
+    text role_hint "narrator|teacher|student|unknown"
+    float confidence
+  }
+
+  EXTRACTION_RUN {
+    uuid id PK
+    uuid document_id FK
+    text method "llm|ner_re|rules"
+    text model_version
+    json params
+    json raw_output
+    timestamptz created_at
+  }
+
+  ISNAD_CHAIN {
+    uuid id PK
+    uuid hadith_id FK
+    uuid run_id FK
+    text chain_text
+    float confidence
+  }
+
+  ISNAD_LINK {
+    uuid id PK
+    uuid chain_id FK
+    int seq_no
+    uuid narrator_person_id FK
+    uuid from_person_id FK
+    uuid to_person_id FK
+    text rel_type "narrated_from|heard_from|teacher_of"
+    float confidence
+  }
+
+  BIO_SOURCE {
+    uuid id PK
+    uuid person_id FK
+    uuid document_id FK
+    text ref
+    float reliability
+  }
+```
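
To make the identity-resolution part of the ER model above concrete, here is a minimal sketch of the `PERSON` / `PERSON_ALIAS` pair and an alias lookup. It uses an in-memory SQLite database purely so the example is self-contained; the target database in this document is PostgreSQL, where `id` would be a native `uuid`. The helper names `add_person` and `resolve_alias` are hypothetical.

```python
import sqlite3
import uuid

# SQLite stands in for PostgreSQL here; UUIDs are stored as text.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        id TEXT PRIMARY KEY,
        canonical_name TEXT NOT NULL
    );
    CREATE TABLE person_alias (
        id TEXT PRIMARY KEY,
        person_id TEXT NOT NULL REFERENCES person(id),
        alias_text TEXT NOT NULL,
        alias_type TEXT,
        confidence REAL
    );
    """
)


def add_person(canonical_name, aliases):
    """Insert a person plus (alias_text, alias_type, confidence) rows."""
    person_id = str(uuid.uuid4())
    conn.execute("INSERT INTO person VALUES (?, ?)", (person_id, canonical_name))
    conn.executemany(
        "INSERT INTO person_alias VALUES (?, ?, ?, ?, ?)",
        [(str(uuid.uuid4()), person_id, text, kind, conf) for text, kind, conf in aliases],
    )
    return person_id


def resolve_alias(alias_text):
    """Return (person_id, canonical_name, confidence) for the best match, or None."""
    return conn.execute(
        """
        SELECT p.id, p.canonical_name, a.confidence
        FROM person_alias AS a
        JOIN person AS p ON p.id = a.person_id
        WHERE a.alias_text = ?
        ORDER BY a.confidence DESC
        LIMIT 1
        """,
        (alias_text,),
    ).fetchone()
```

Ordering by `confidence` means that when the same surface form is attached to several people, the lookup returns the most trusted mapping first; ambiguous cases could instead return all rows for manual review.
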
+## Neo4j Graph Model Draft (Labels + Relationship Types)
+This is a **graph-first** view of what you’ll store in Neo4j, aligned with your workflow:
+
+- Extract chain → canonicalize in Postgres → write graph edges
+
+- Keep provenance and source references so every chain can be traced back to its source text (***scholar-grade***)
+```mermaid
+flowchart LR
+  %% Node labels
+  P1(("Person<br/>:Person"))
+  P2(("Person<br/>:Person"))
+  P3(("Person<br/>:Person"))
+
+  H(("Hadith<br/>:Hadith"))
+  C(("Chain<br/>:IsnadChain"))
+  M(("Matn<br/>:Matn"))
+  S(("Source<br/>:Source"))
+  D(("Doc<br/>:Document"))
+
+  %% Core isnad representation
+  H -->|HAS_CHAIN| C
+  H -->|HAS_MATN| M
+
+  C -->|HAS_LINK seq| L1["Link<br/>:IsnadLink"]
+  C -->|HAS_LINK seq| L2["Link<br/>:IsnadLink"]
+
+  L1 -->|NARRATOR| P1
+  L1 -->|NARRATED_FROM| P2
+  L2 -->|NARRATOR| P2
+  L2 -->|NARRATED_FROM| P3
+
+  %% Optional direct edges (derived)
+  P1 -->|NARRATED_FROM| P2
+  P2 -->|NARRATED_FROM| P3
+
+  %% Family / biography relations (separate but connected)
+  P1 -->|FATHER_OF| P2
+  P2 -->|STUDENT_OF| P3
+
+  %% Provenance
+  H -->|CITED_IN| D
+  D -->|FROM_SOURCE| S
+  C -->|EXTRACTED_BY| E["Run<br/>:ExtractionRun"]
+  P1 -->|MENTIONED_IN| D
+```
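
The core isnād part of the graph model above can be written to Neo4j with idempotent `MERGE` statements. The sketch below only builds parameterized Cypher strings and does not talk to a database. The function name `chain_to_cypher`, the `id` / `chain_id` / `seq` property names, and the choice to key each `:IsnadLink` on `(chain_id, seq)` are assumptions for illustration, not part of the diagram.

```python
def chain_to_cypher(hadith_id, chain_id, person_ids):
    """Build (query, params) pairs for one canonical chain.

    person_ids is ordered so that each person NARRATED_FROM the next one
    in the list, matching the P1 -> P2 -> P3 direction in the diagram.
    """
    statements = [(
        "MERGE (h:Hadith {id: $hadith_id}) "
        "MERGE (c:IsnadChain {id: $chain_id}) "
        "MERGE (h)-[:HAS_CHAIN]->(c)",
        {"hadith_id": hadith_id, "chain_id": chain_id},
    )]
    pairs = zip(person_ids, person_ids[1:])
    for seq, (narrator, source) in enumerate(pairs, start=1):
        statements.append((
            "MATCH (c:IsnadChain {id: $chain_id}) "
            "MERGE (a:Person {id: $narrator}) "
            "MERGE (b:Person {id: $source}) "
            "MERGE (l:IsnadLink {chain_id: $chain_id, seq: $seq}) "
            "MERGE (c)-[:HAS_LINK {seq: $seq}]->(l) "
            "MERGE (l)-[:NARRATOR]->(a) "
            "MERGE (l)-[:NARRATED_FROM]->(b) "
            # Derived direct edge, marked optional in the diagram.
            "MERGE (a)-[:NARRATED_FROM]->(b)",
            {"chain_id": chain_id, "narrator": narrator, "source": source, "seq": seq},
        ))
    return statements
```

Each `(query, params)` pair can be passed to a Neo4j session's `run(query, params)` call. Because every statement uses `MERGE`, re-ingesting the same chain should not create duplicate nodes or edges.
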