From 6463d4a1406f36d0b8261fd83553aff08252ac8b Mon Sep 17 00:00:00 2001
From: salah
Date: Wed, 28 Jan 2026 11:53:39 +0100
Subject: [PATCH] Add usage documentation and MLOps diagrams for model training
 and deployment

---
 README.md |   1 +
 USAGE.md  | 243 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 244 insertions(+)
 create mode 100644 USAGE.md

diff --git a/README.md b/README.md
index c916850..2e0efbb 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,7 @@ designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.
 - [Observability](OBSERVABILITY.md)
 - [Roadmap & Next Steps](ROADMAP.md)
 - [Future Projects & Use Cases](FUTURE-PROJECTS.md)
+- [Usage & Graphs](USAGE.md)

 ## 🎯 Current Focus

diff --git a/USAGE.md b/USAGE.md
new file mode 100644
index 0000000..186926e
--- /dev/null
+++ b/USAGE.md
@@ -0,0 +1,243 @@

## MLOps Loop Diagram (Label → Train → Registry → Deploy)

```mermaid
flowchart TB
    LS["Label Studio<br/>(label.betelgeusebytes.io)"] -->|export tasks/labels| S3["MinIO S3<br/>(minio.betelgeusebytes.io)"]
    S3 -->|dataset version| ARGO["Argo Workflows<br/>(argo.betelgeusebytes.io)"]
    ARGO -->|train/eval| TR["Training Job<br/>(PyTorch/Transformers)"]
    TR -->|metrics, params| MLF["MLflow<br/>(mlflow.betelgeusebytes.io)"]
    TR -->|model artifacts| S3
    MLF -->|register model| REG["Model Registry"]
    ARGO -->|promote model tag| REG
    REG -->|deploy image / config| ARGOCD["Argo CD<br/>(GitOps)"]
    ARGOCD -->|rollout| SVC["NER/RE Services<br/>(custom, later)"]
    SVC -->|inference| ORCH["Orchestrator API<br/>(hadith-api...)"]
    ORCH -->|observability| OBS["Grafana LGTM<br/>(grafana...)"]
```

## Isnād Extraction Pipeline Diagram (Your Deployed Stack)

This shows ***how a hadith text becomes a sanad chain***, how it is stored, and how the ***Neo4j graph*** is built, using your endpoints: LLM (CPU), TEI, Qdrant, Postgres, Neo4j, MinIO, Argo.

```mermaid
flowchart TB
    H["Hadith Text Input<br/>(UI/API)"] --> ORCH["Orchestrator API<br/>(hadith-api...)"]
    ORCH -->|optional: auth| KC["Keycloak<br/>(auth...)"]
    ORCH -->|normalize/clean| PRE["Preprocess<br/>(arabic cleanup, tokens)"]
    PRE -->|retrieve examples| TEI["TEI Embeddings<br/>(embeddings...)"]
    TEI --> QD["Qdrant<br/>(vector...)"]
    QD -->|top-k similar hadiths + patterns| CTX["Context Pack<br/>(examples, schema)"]

    ORCH -->|prompt+schema+ctx| LLM["LLM CPU<br/>(llm...)"]
    LLM -->|JSON: chain nodes + links| JSON["Parsed Isnād JSON<br/>(raw extraction)"]
    ORCH -->|validate + dedupe| RES["Resolve Entities<br/>(name variants, kunya)"]
    RES --> PG["PostgreSQL<br/>(canonical people, aliases)"]
    RES -->|canonical IDs| CAN["Canonical Chain<br/>(person_id sequence)"]

    CAN -->|write nodes/edges| N4["Neo4j<br/>(neo4j...)"]
    ORCH -->|store provenance| S3["MinIO<br/>(minio...)"]
    ORCH -->|optional: embed matn| TEI
    ORCH -->|return result| OUT["Response<br/>(chain + matn + provenance)"]

    N4 -->|graph queries| OUT
    PG -->|metadata| OUT
```

## Training a Model/Algorithm to Extract Isnād and Build the Neo4j Graph

This diagram covers ***end-to-end training + deployment + ingestion***, including:
Label Studio → MinIO → Argo Workflows → MLflow → NER/RE service → Orchestrator → Postgres/Neo4j/Qdrant.

```mermaid
flowchart TB
    %% Data creation
    TXT["Raw Hadith Corpora"] --> INGEST["Ingest/ETL<br/>(Argo Workflow)"]
    INGEST --> S3["MinIO S3<br/>(versioned datasets)"]

    %% Annotation
    S3 -->|sampling| LS["Label Studio<br/>(label...)"]
    LS -->|"annotated spans<br/>(narrators, connectors)"| S3

    %% Training
    S3 --> ARGO["Argo Workflows<br/>(train pipeline)"]
    ARGO --> TR["Train NER/RE<br/>(or rules+CRF)<br/>CPU-friendly"]
    TR --> MLF["MLflow<br/>(metrics + registry)"]
    TR -->|model artifacts| S3

    %% Deployment of extractor
    MLF -->|promote| REG["Model Version"]
    REG --> DEPLOY["Deploy extractor svc<br/>(custom, later)"]
    DEPLOY --> EXT["Isnād Extractor API<br/>(NER + RE)"]
    EXT -->|"entities + relations"| ORCH["Orchestrator API"]

    %% Graph building + storage
    ORCH --> RES["Canonicalization<br/>(alias merge)"]
    RES --> PG[("PostgreSQL<br/>people, aliases, provenance")]
    ORCH --> N4["Neo4j<br/>(isnad graph)"]
    ORCH --> TEI["TEI embeddings"] --> QD["Qdrant vectors"]
    ORCH --> S3B["MinIO<br/>artifacts/provenance"]

    %% Monitoring
    ORCH --> OBS["Grafana LGTM<br/>(metrics/logs/traces)"]
    EXT --> OBS
    ARGO --> OBS
```

## Postgres ER Diagram for Canonicalization & Provenance

This is a practical relational layer that fits your stack: ***Orchestrator ↔ Postgres*** for identity resolution, provenance, and auditability.
+```mermaid +erDiagram + PERSON ||--o{ PERSON_ALIAS : has + PERSON ||--o{ BIO_SOURCE : described_by + DOCUMENT ||--o{ MENTION : contains + PERSON ||--o{ MENTION : referenced_as + DOCUMENT ||--o{ HADITH : has + HADITH ||--o{ ISNAD_CHAIN : has + ISNAD_CHAIN ||--o{ ISNAD_LINK : contains + PERSON ||--o{ ISNAD_LINK : narrator + EXTRACTION_RUN ||--o{ ISNAD_CHAIN : produced + EXTRACTION_RUN ||--o{ MENTION : produced + SOURCE ||--o{ DOCUMENT : provides + + PERSON { + uuid id PK + text canonical_name + text kunya + text nisba + text era + text notes + timestamptz created_at + } + + PERSON_ALIAS { + uuid id PK + uuid person_id FK + text alias_text + text alias_type "kunya|ism|nisba|laqab|spelling" + float confidence + } + + SOURCE { + uuid id PK + text name + text type "book|website|manuscript" + text ref + } + + DOCUMENT { + uuid id PK + uuid source_id FK + text doc_type "hadith|bio|other" + text lang + text title + text raw_text + timestamptz created_at + } + + HADITH { + uuid id PK + uuid document_id FK + text matn_text + text collection + text hadith_no + } + + MENTION { + uuid id PK + uuid document_id FK + uuid person_id FK + int start_char + int end_char + text surface_text + text role_hint "narrator|teacher|student|unknown" + float confidence + } + + EXTRACTION_RUN { + uuid id PK + uuid document_id FK + text method "llm|ner_re|rules" + text model_version + json params + json raw_output + timestamptz created_at + } + + ISNAD_CHAIN { + uuid id PK + uuid hadith_id FK + uuid run_id FK + text chain_text + float confidence + } + + ISNAD_LINK { + uuid id PK + uuid chain_id FK + int seq_no + uuid narrator_person_id FK + uuid from_person_id FK + uuid to_person_id FK + text rel_type "narrated_from|heard_from|teacher_of" + float confidence + } + + BIO_SOURCE { + uuid id PK + uuid person_id FK + uuid document_id FK + text ref + float reliability + } +``` +## Neo4j Graph Model Draft (Labels + Relationship Types) +This is a **graph-first** view of what you’ll store in Neo4j, 
aligned with your workflow:

- Extract chain → canonicalize in Postgres → write graph edges
- Keep provenance and source references so it’s ***scholar-grade***

```mermaid
flowchart LR
    %% Node labels
    P1(("Person<br/>:Person"))
    P2(("Person<br/>:Person"))
    P3(("Person<br/>:Person"))

    H(("Hadith<br/>:Hadith"))
    C(("Chain<br/>:IsnadChain"))
    M(("Matn<br/>:Matn"))
    S(("Source<br/>:Source"))
    D(("Doc<br/>:Document"))

    %% Core isnad representation
    H -->|HAS_CHAIN| C
    H -->|HAS_MATN| M

    C -->|HAS_LINK seq| L1["Link<br/>:IsnadLink"]
    C -->|HAS_LINK seq| L2["Link<br/>:IsnadLink"]

    L1 -->|NARRATOR| P1
    L1 -->|NARRATED_FROM| P2
    L2 -->|NARRATOR| P2
    L2 -->|NARRATED_FROM| P3

    %% Optional direct edges (derived)
    P1 -->|NARRATED_FROM| P2
    P2 -->|NARRATED_FROM| P3

    %% Family / biography relations (separate but connected)
    P1 -->|FATHER_OF| P2
    P2 -->|STUDENT_OF| P3

    %% Provenance
    H -->|CITED_IN| D
    D -->|FROM_SOURCE| S
    C -->|EXTRACTED_BY| E["Run<br/>:ExtractionRun"]
    P1 -->|MENTIONED_IN| D
```
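The graph-write step above can be sketched as a pure function that turns a canonical chain into parameterized Cypher statements matching this draft model (`:Person`, `:IsnadChain`, `:IsnadLink`, `NARRATOR`/`NARRATED_FROM`). The ids and statement shapes are illustrative; in production the pairs would be executed via the Neo4j driver, but no database is needed to build them:

```python
# Sketch: turn a canonical narrator chain (ordered person_ids) into
# (cypher, params) pairs for the model drafted above. Each consecutive
# pair (narrator, teacher) becomes one :IsnadLink plus a derived direct
# NARRATED_FROM edge between the two :Person nodes.

def chain_to_cypher(chain_id: str, person_ids: list[str]):
    stmts = [
        ("MERGE (c:IsnadChain {id: $chain_id})", {"chain_id": chain_id}),
    ]
    for seq, (narrator, teacher) in enumerate(zip(person_ids, person_ids[1:])):
        stmts.append((
            "MATCH (c:IsnadChain {id: $chain_id}) "
            "MERGE (n:Person {id: $narrator}) "
            "MERGE (t:Person {id: $teacher}) "
            "MERGE (c)-[:HAS_LINK {seq: $seq}]->(l:IsnadLink {id: $link_id}) "
            "MERGE (l)-[:NARRATOR]->(n) "
            "MERGE (l)-[:NARRATED_FROM]->(t) "
            "MERGE (n)-[:NARRATED_FROM]->(t)",
            {"chain_id": chain_id, "narrator": narrator, "teacher": teacher,
             "seq": seq, "link_id": f"{chain_id}:{seq}"},
        ))
    return stmts

stmts = chain_to_cypher("c1", ["p1", "p2", "p3"])
print(len(stmts))  # 1 chain MERGE + 2 link statements -> 3
```

Using `MERGE` keeps the writes idempotent, so re-running an extraction on the same hadith does not duplicate people or links; provenance (`:ExtractionRun`, `:Source`) would be attached in the same transaction.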