## MLOps Loop Diagram (Label → Train → Registry → Deploy)

```mermaid
flowchart TB
    LS["Label Studio (label.betelgeusebytes.io)"] -->|export tasks/labels| S3["MinIO S3 (minio.betelgeusebytes.io)"]
    S3 -->|dataset version| ARGO["Argo Workflows (argo.betelgeusebytes.io)"]
    ARGO -->|train/eval| TR["Training Job (PyTorch/Transformers)"]
    TR -->|metrics, params| MLF["MLflow (mlflow.betelgeusebytes.io)"]
    TR -->|model artifacts| S3
    MLF -->|register model| REG["Model Registry"]
    ARGO -->|promote model tag| REG
    REG -->|deploy image / config| ARGOCD["Argo CD (GitOps)"]
    ARGOCD -->|rollout| SVC["NER/RE Services (custom, later)"]
    SVC -->|inference| ORCH["Orchestrator API (hadith-api...)"]
    ORCH -->|observability| OBS["Grafana LGTM (grafana...)"]
```

## Isnād Extraction Pipeline Diagram (Your actual deployed stack)

This shows ***how a hadith text becomes a sanad chain***, how it is stored, and how the ***Neo4j graph*** is built, using your deployed endpoints: LLM (CPU), TEI, Qdrant, Postgres, Neo4j, MinIO, Argo.

```mermaid
flowchart TB
    H["Hadith Text Input\n(UI/API)"] --> ORCH["Orchestrator API\n(hadith-api...)"]
    ORCH -->|optional: auth| KC["Keycloak\n(auth...)"]
    ORCH -->|normalize/clean| PRE["Preprocess\n(arabic cleanup, tokens)"]
    PRE -->|retrieve examples| TEI["TEI Embeddings\n(embeddings...)"]
    TEI --> QD["Qdrant\n(vector...)"]
    QD -->|top-k similar hadiths + patterns| CTX["Context Pack\n(examples, schema)"]
    ORCH -->|prompt+schema+ctx| LLM["LLM CPU\n(llm...)"]
    LLM -->|JSON: chain nodes + links| JSON["Parsed Isnād JSON\n(raw extraction)"]
    ORCH -->|validate + dedupe| RES["Resolve Entities\n(name variants, kunya)"]
    RES --> PG["PostgreSQL\ncanonical people, aliases"]
    RES -->|canonical IDs| CAN["Canonical Chain\n(person_id sequence)"]
    CAN -->|write nodes/edges| N4["Neo4j\n(neo4j...)"]
    ORCH -->|store provenance| S3["MinIO\n(minio...)"]
    ORCH -->|optional: embed matn| TEI --> QD
    ORCH -->|return result| OUT["Response\nchain + matn + provenance"]
    N4 -->|graph queries| OUT
    PG -->|metadata| OUT
```

## Training a Model/Algorithm to Extract Isnād and Build the Neo4j Graph

This diagram covers ***end-to-end training + deployment + ingestion***, including: Label Studio → MinIO → Argo Workflows → MLflow → NER/RE service → Orchestrator → Postgres/Neo4j/Qdrant.

```mermaid
flowchart TB
    %% Data creation
    TXT[Raw Hadith Corpora] --> INGEST["Ingest/ETL\n(Argo Workflow)"]
    INGEST --> S3["MinIO S3\n(versioned datasets)"]

    %% Annotation
    S3 -->|sampling| LS["Label Studio\n(label...)"]
    LS -->|"annotated spans\n(narrators, connectors)"| S3

    %% Training
    S3 --> ARGO["Argo Workflows\n(train pipeline)"]
    ARGO --> TR["Train NER/RE\n(or rules+CRF)\nCPU-friendly"]
    TR --> MLF["MLflow\n(metrics + registry)"]
    TR -->|model artifacts| S3

    %% Deployment of extractor
    MLF -->|promote| REG[Model Version]
    REG --> DEPLOY["Deploy extractor svc\n(custom later)"]
    DEPLOY --> EXT["Isnād Extractor API\n(NER + RE)"]
    EXT -->|"entities+relations"| ORCH[Orchestrator API]

    %% Graph building + storage
    ORCH --> RES["Canonicalization\n(alias merge)"]
    RES --> PG[("PostgreSQL\npeople, aliases, provenance")]
    ORCH --> N4["Neo4j\n(isnad graph)"]
    ORCH --> TEI[TEI embeddings] --> QD[Qdrant vectors]
    ORCH --> S3B["MinIO\nartifacts/provenance"]

    %% Monitoring
    ORCH --> OBS["Grafana LGTM\n(metrics/logs/traces)"]
    EXT --> OBS
    ARGO --> OBS
```

## Postgres ER Diagram for Canonicalization & Provenance

This is a practical relational layer that fits your stack: ***Orchestrator ↔ Postgres*** for identity resolution, provenance, and auditability.
```mermaid
erDiagram
    PERSON ||--o{ PERSON_ALIAS : has
    PERSON ||--o{ BIO_SOURCE : described_by
    DOCUMENT ||--o{ MENTION : contains
    PERSON ||--o{ MENTION : referenced_as
    DOCUMENT ||--o{ HADITH : has
    HADITH ||--o{ ISNAD_CHAIN : has
    ISNAD_CHAIN ||--o{ ISNAD_LINK : contains
    PERSON ||--o{ ISNAD_LINK : narrator
    EXTRACTION_RUN ||--o{ ISNAD_CHAIN : produced
    EXTRACTION_RUN ||--o{ MENTION : produced
    SOURCE ||--o{ DOCUMENT : provides

    PERSON {
        uuid id PK
        text canonical_name
        text kunya
        text nisba
        text era
        text notes
        timestamptz created_at
    }
    PERSON_ALIAS {
        uuid id PK
        uuid person_id FK
        text alias_text
        text alias_type "kunya|ism|nisba|laqab|spelling"
        float confidence
    }
    SOURCE {
        uuid id PK
        text name
        text type "book|website|manuscript"
        text ref
    }
    DOCUMENT {
        uuid id PK
        uuid source_id FK
        text doc_type "hadith|bio|other"
        text lang
        text title
        text raw_text
        timestamptz created_at
    }
    HADITH {
        uuid id PK
        uuid document_id FK
        text matn_text
        text collection
        text hadith_no
    }
    MENTION {
        uuid id PK
        uuid document_id FK
        uuid person_id FK
        int start_char
        int end_char
        text surface_text
        text role_hint "narrator|teacher|student|unknown"
        float confidence
    }
    EXTRACTION_RUN {
        uuid id PK
        uuid document_id FK
        text method "llm|ner_re|rules"
        text model_version
        json params
        json raw_output
        timestamptz created_at
    }
    ISNAD_CHAIN {
        uuid id PK
        uuid hadith_id FK
        uuid run_id FK
        text chain_text
        float confidence
    }
    ISNAD_LINK {
        uuid id PK
        uuid chain_id FK
        int seq_no
        uuid narrator_person_id FK
        uuid from_person_id FK
        uuid to_person_id FK
        text rel_type "narrated_from|heard_from|teacher_of"
        float confidence
    }
    BIO_SOURCE {
        uuid id PK
        uuid person_id FK
        uuid document_id FK
        text ref
        float reliability
    }
```

## Neo4j Graph Model Draft (Labels + Relationship Types)

This is a **graph-first** view of what you’ll store in Neo4j, aligned with your workflow:

- Extract chain → canonicalize in Postgres → write graph edges
- Keep provenance and source references so it’s ***scholar-grade***

```mermaid
flowchart LR
    %% Node labels
    P1(("Person\n:Person"))
    P2(("Person\n:Person"))
    P3(("Person\n:Person"))
    H(("Hadith\n:Hadith"))
    C(("Chain\n:IsnadChain"))
    M(("Matn\n:Matn"))
    S(("Source\n:Source"))
    D(("Doc\n:Document"))

    %% Core isnad representation
    H -->|HAS_CHAIN| C
    H -->|HAS_MATN| M
    C -->|HAS_LINK seq| L1["Link\n:IsnadLink"]
    C -->|HAS_LINK seq| L2["Link\n:IsnadLink"]
    L1 -->|NARRATOR| P1
    L1 -->|NARRATED_FROM| P2
    L2 -->|NARRATOR| P2
    L2 -->|NARRATED_FROM| P3

    %% Optional direct edges (derived)
    P1 -->|NARRATED_FROM| P2
    P2 -->|NARRATED_FROM| P3

    %% Family / biography relations (separate but connected)
    P1 -->|FATHER_OF| P2
    P2 -->|STUDENT_OF| P3

    %% Provenance
    H -->|CITED_IN| D
    D -->|FROM_SOURCE| S
    C -->|EXTRACTED_BY| E["Run\n:ExtractionRun"]
    P1 -->|MENTIONED_IN| D
```
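Writing this model from the Orchestrator can be done with idempotent Cypher `MERGE` statements, one per chain link, so re-ingesting the same hadith never duplicates nodes or edges. Below is a minimal sketch that only *generates* parameterized (query, params) pairs; the function and parameter names are illustrative, not the deployed Orchestrator's code.

```python
# Build idempotent Cypher for one canonical chain, following the
# :Hadith / :IsnadChain / :IsnadLink / :Person model drafted above.
def chain_to_statements(
    hadith_id: str, chain_id: str, person_ids: list[str]
) -> list[tuple[str, dict]]:
    """Return (cypher, params) pairs: one for the hadith/chain anchor,
    then one per adjacent narrator pair in transmission order."""
    stmts: list[tuple[str, dict]] = [(
        "MERGE (h:Hadith {id: $hadith}) "
        "MERGE (c:IsnadChain {id: $chain}) "
        "MERGE (h)-[:HAS_CHAIN]->(c)",
        {"hadith": hadith_id, "chain": chain_id},
    )]
    # person_ids is the canonical sequence: each narrator heard from the next.
    for seq, (narrator, source) in enumerate(zip(person_ids, person_ids[1:])):
        stmts.append((
            "MATCH (c:IsnadChain {id: $chain}) "
            "MERGE (a:Person {id: $narrator}) "
            "MERGE (b:Person {id: $source}) "
            "MERGE (l:IsnadLink {chain_id: $chain, seq_no: $seq}) "
            "MERGE (c)-[:HAS_LINK]->(l) "
            "MERGE (l)-[:NARRATOR]->(a) "
            "MERGE (l)-[:NARRATED_FROM]->(b)",
            {"chain": chain_id, "narrator": narrator,
             "source": source, "seq": seq},
        ))
    return stmts

statements = chain_to_statements("h1", "c1", ["p1", "p2", "p3"])
```

Each pair would then be executed with the Neo4j Python driver, e.g. `session.run(query, params)` inside one transaction per hadith, keeping the graph write atomic with the Postgres provenance write.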