MLOps Loop Diagram (Label → Train → Registry → Deploy)


flowchart TB
  LS["Label Studio<br/>(label.betelgeusebytes.io)"] -->|export tasks/labels| S3["MinIO S3<br/>(minio.betelgeusebytes.io)"]
  S3 -->|dataset version| ARGO["Argo Workflows<br/>(argo.betelgeusebytes.io)"]
  ARGO -->|train/eval| TR["Training Job<br/>(PyTorch/Transformers)"]
  TR -->|metrics, params| MLF["MLflow<br/>(mlflow.betelgeusebytes.io)"]
  TR -->|model artifacts| S3
  MLF -->|register model| REG["Model Registry"]
  ARGO -->|promote model tag| REG
  REG -->|deploy image / config| ARGOCD["Argo CD<br/>(GitOps)"]
  ARGOCD -->|rollout| SVC["NER/RE Services<br/>(custom, later)"]
  SVC -->|inference| ORCH["Orchestrator API<br/>(hadith-api...)"]
  ORCH -->|observability| OBS["Grafana LGTM<br/>(grafana...)"]

Isnād Extraction Pipeline Diagram (Your Actual Deployed Stack)

This shows how a hadith text becomes a sanad chain, how it is stored, and how the Neo4j graph is built — using your endpoints: LLM (CPU), TEI, Qdrant, Postgres, Neo4j, MinIO, Argo.

flowchart TB
  H["Hadith Text Input<br/>(UI/API)"] --> ORCH["Orchestrator API<br/>(hadith-api...);"]
  ORCH -->|optional: auth| KC["Keycloak<br/>(auth...)"]
  ORCH -->|normalize/clean| PRE["Preprocess<br/>(arabic cleanup, tokens)"]
  PRE -->|retrieve examples| TEI["TEI Embeddings<br/>(embeddings...)"]
  TEI --> QD["Qdrant<br/>(vector...)"]
  QD -->|top-k similar hadiths + patterns| CTX["Context Pack<br/>(examples, schema)"]

  ORCH -->|prompt+schema+ctx| LLM["LLM CPU<br/>(llm...)"]
  LLM -->|JSON: chain nodes + links| JSON["Parsed Isnād JSON<br/>(raw extraction)"]
  ORCH -->|validate + dedupe| RES["Resolve Entities<br/>(name variants, kunya)"]
  RES --> PG["PostgreSQL<br/>canonical people, aliases"]
  RES -->|canonical IDs| CAN["Canonical Chain<br/>(person_id sequence)"]

  CAN -->|write nodes/edges| N4["Neo4j<br/>(neo4j...)"]
  ORCH -->|store provenance| S3["MinIO<br/>(minio...)"]
  ORCH -->|optional: embed matn| TEI --> QD
  ORCH -->|return result| OUT["Response<br/>chain + matn + provenance"]

  N4 -->|graph queries| OUT
  PG -->|metadata| OUT
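
A rough Python sketch of one orchestrator request, following the diagram top to bottom. The endpoint paths, the Qdrant collection name, the payload field, and the assumption of an OpenAI-compatible LLM API are illustrative, not confirmed details of the deployment.

```python
# Sketch: retrieve similar hadiths as context, then ask the LLM for a structured chain.
import json
import requests

TEI_URL = "https://embeddings.betelgeusebytes.io"  # TEI /embed endpoint (assumed host)
QDRANT_URL = "https://vector.betelgeusebytes.io"   # Qdrant REST API (assumed host)
LLM_URL = "https://llm.betelgeusebytes.io"         # assumed OpenAI-compatible server

def extract_isnad(hadith_text: str) -> dict:
    # 1) Embed the cleaned text with TEI.
    vec = requests.post(f"{TEI_URL}/embed", json={"inputs": [hadith_text]}).json()[0]

    # 2) Retrieve top-k similar hadiths from Qdrant as few-shot examples.
    hits = requests.post(
        f"{QDRANT_URL}/collections/hadith/points/search",  # 'hadith' collection is hypothetical
        json={"vector": vec, "limit": 5, "with_payload": True},
    ).json()["result"]
    examples = [h["payload"].get("isnad_json") for h in hits if h.get("payload")]  # hypothetical field

    # 3) Ask the LLM for JSON matching the chain schema; validation/dedup happens downstream.
    prompt = (
        "Extract the isnād as JSON with fields 'narrators' (ordered) and 'links'.\n"
        f"Examples: {json.dumps(examples, ensure_ascii=False)}\n"
        f"Hadith: {hadith_text}"
    )
    resp = requests.post(
        f"{LLM_URL}/v1/chat/completions",
        json={"model": "local", "messages": [{"role": "user", "content": prompt}]},
    ).json()
    return json.loads(resp["choices"][0]["message"]["content"])
```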

Training a Model/Algorithm to Extract Isnād and Build the Neo4j Graph

This diagram covers end-to-end training + deployment + ingestion, including: Label Studio → MinIO → Argo Workflows → MLflow → NER/RE service → Orchestrator → Postgres/Neo4j/Qdrant.

flowchart TB
  %% Data creation
  TXT[Raw Hadith Corpora] --> INGEST["Ingest/ETL\n(Argo Workflow)"]
  INGEST --> S3["MinIO S3\n(versioned datasets)"]

  %% Annotation
  S3 -->|sampling| LS["Label Studio\n(label...)"]
  LS -->|"annotated spans\n(narrators, connectors)"| S3

  %% Training
  S3 --> ARGO["Argo Workflows\n(train pipeline)"]
  ARGO --> TR["Train NER/RE\n(or rules+CRF)\nCPU-friendly"]
  TR --> MLF["MLflow\n(metrics + registry)"]
  TR -->|model artifacts| S3

  %% Deployment of extractor
  MLF -->|promote| REG[Model Version]
  REG --> DEPLOY["Deploy extractor svc\n(custom later)"]
  DEPLOY --> EXT["Isnād Extractor API\n(NER + RE)"]
  EXT -->|"entities+relations"| ORCH[Orchestrator API]

  %% Graph building + storage
  ORCH --> RES["Canonicalization\n(alias merge)"]
  RES --> PG[("PostgreSQL\npeople, aliases, provenance")]
  ORCH --> N4["Neo4j\n(isnad graph)"]
  ORCH --> TEI[TEI embeddings] --> QD[Qdrant vectors]
  ORCH --> S3B["MinIO\nartifacts/provenance"]

  %% Monitoring
  ORCH --> OBS["Grafana LGTM\n(metrics/logs/traces)"]
  EXT --> OBS
  ARGO --> OBS
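
A sketch of the data-loading half of the training step: pull a versioned Label Studio export from MinIO and flatten it into (text, spans) pairs for NER training. The bucket, object key, and environment variables are placeholders; the parsing assumes the standard Label Studio JSON export format.

```python
# Sketch: load an annotated export from MinIO (S3 API) for the NER/RE training step.
import json
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.betelgeusebytes.io",     # assumed S3 endpoint
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],    # injected by the Argo workflow
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)

# Hypothetical bucket/key layout for a versioned dataset.
obj = s3.get_object(Bucket="datasets", Key="isnad/v1/labelstudio-export.json")
tasks = json.loads(obj["Body"].read())

examples = []
for task in tasks:
    text = task["data"]["text"]
    spans = [
        (r["value"]["start"], r["value"]["end"], r["value"]["labels"][0])
        for ann in task.get("annotations", [])
        for r in ann.get("result", [])
        if r.get("type") == "labels"
    ]
    examples.append({"text": text, "spans": spans})  # feed into the NER/CRF trainer

print(f"loaded {len(examples)} annotated hadiths")
```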

Postgres ER Diagram for Canonicalization & Provenance

This is a practical relational layer that fits your stack: Orchestrator ↔ Postgres for identity resolution, provenance, and auditability.

erDiagram
  PERSON ||--o{ PERSON_ALIAS : has
  PERSON ||--o{ BIO_SOURCE : described_by
  DOCUMENT ||--o{ MENTION : contains
  PERSON ||--o{ MENTION : referenced_as
  DOCUMENT ||--o{ HADITH : has
  HADITH ||--o{ ISNAD_CHAIN : has
  ISNAD_CHAIN ||--o{ ISNAD_LINK : contains
  PERSON ||--o{ ISNAD_LINK : narrator
  EXTRACTION_RUN ||--o{ ISNAD_CHAIN : produced
  EXTRACTION_RUN ||--o{ MENTION : produced
  SOURCE ||--o{ DOCUMENT : provides

  PERSON {
    uuid id PK
    text canonical_name
    text kunya
    text nisba
    text era
    text notes
    timestamptz created_at
  }

  PERSON_ALIAS {
    uuid id PK
    uuid person_id FK
    text alias_text
    text alias_type  "kunya|ism|nisba|laqab|spelling"
    float confidence
  }

  SOURCE {
    uuid id PK
    text name
    text type "book|website|manuscript"
    text ref
  }

  DOCUMENT {
    uuid id PK
    uuid source_id FK
    text doc_type "hadith|bio|other"
    text lang
    text title
    text raw_text
    timestamptz created_at
  }

  HADITH {
    uuid id PK
    uuid document_id FK
    text matn_text
    text collection
    text hadith_no
  }

  MENTION {
    uuid id PK
    uuid document_id FK
    uuid person_id FK
    int start_char
    int end_char
    text surface_text
    text role_hint "narrator|teacher|student|unknown"
    float confidence
  }

  EXTRACTION_RUN {
    uuid id PK
    uuid document_id FK
    text method "llm|ner_re|rules"
    text model_version
    json params
    json raw_output
    timestamptz created_at
  }

  ISNAD_CHAIN {
    uuid id PK
    uuid hadith_id FK
    uuid run_id FK
    text chain_text
    float confidence
  }

  ISNAD_LINK {
    uuid id PK
    uuid chain_id FK
    int seq_no
    uuid narrator_person_id FK
    uuid from_person_id FK
    uuid to_person_id FK
    text rel_type "narrated_from|heard_from|teacher_of"
    float confidence
  }

  BIO_SOURCE {
    uuid id PK
    uuid person_id FK
    uuid document_id FK
    text ref
    float reliability
  }
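
A minimal sketch of identity resolution against the PERSON / PERSON_ALIAS tables, assuming they exist as lowercase Postgres tables with the columns above (and defaults for created_at). The DSN and the exact-match lookup strategy are simplifications.

```python
# Sketch: resolve a narrator mention to a canonical person_id, creating one if unseen.
import os
import uuid
import psycopg  # psycopg 3

DSN = os.environ["PG_DSN"]  # e.g. postgresql://user:pass@postgres:5432/hadith

def resolve_person(conn: psycopg.Connection, surface_text: str) -> uuid.UUID:
    row = conn.execute(
        "SELECT person_id FROM person_alias WHERE alias_text = %s "
        "ORDER BY confidence DESC LIMIT 1",
        (surface_text,),
    ).fetchone()
    if row:
        return row[0]

    person_id = uuid.uuid4()
    conn.execute(
        "INSERT INTO person (id, canonical_name) VALUES (%s, %s)",
        (person_id, surface_text),
    )
    conn.execute(
        "INSERT INTO person_alias (id, person_id, alias_text, alias_type, confidence) "
        "VALUES (%s, %s, %s, 'spelling', 1.0)",
        (uuid.uuid4(), person_id, surface_text),
    )
    return person_id

with psycopg.connect(DSN) as conn:
    pid = resolve_person(conn, "أبو هريرة")
    conn.commit()
```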

Neo4j Graph Model Draft (Labels + Relationship Types)

This is a graph-first view of what you'll store in Neo4j, aligned with your workflow:

- Extract chain → canonicalize in Postgres → write graph edges
- Keep provenance and source references so it's scholar-grade

flowchart LR
  %% Node labels
  P1(("Person\n:Person"))
  P2(("Person\n:Person"))
  P3(("Person\n:Person"))

  H(("Hadith\n:Hadith"))
  C(("Chain\n:IsnadChain"))
  M(("Matn\n:Matn"))
  S(("Source\n:Source"))
  D(("Doc\n:Document"))

  %% Core isnad representation
  H -->|HAS_CHAIN| C
  H -->|HAS_MATN| M

  C -->|"HAS_LINK (seq 1)"| L1["Link\n:IsnadLink"]
  C -->|"HAS_LINK (seq 2)"| L2["Link\n:IsnadLink"]

  L1 -->|NARRATOR| P1
  L1 -->|NARRATED_FROM| P2
  L2 -->|NARRATOR| P2
  L2 -->|NARRATED_FROM| P3

  %% Optional direct edges (derived)
  P1 -->|NARRATED_FROM| P2
  P2 -->|NARRATED_FROM| P3

  %% Family / biography relations (separate but connected)
  P1 -->|FATHER_OF| P2
  P2 -->|STUDENT_OF| P3

  %% Provenance
  H -->|CITED_IN| D
  D -->|FROM_SOURCE| S
  C -->|EXTRACTED_BY| E["Run\n:ExtractionRun"]
  P1 -->|MENTIONED_IN| D
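
A sketch of writing one canonicalized link into this model with the official neo4j Python driver. The bolt URI, credentials, and property names are assumptions; the labels and relationship types follow the diagram, including the derived person-to-person NARRATED_FROM edge.

```python
# Sketch: persist one isnād link (Hadith -> Chain -> Link -> Persons) into Neo4j.
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j://neo4j.betelgeusebytes.io:7687",  # assumed bolt endpoint
    auth=("neo4j", os.environ["NEO4J_PASSWORD"]),
)

MERGE_LINK = """
MERGE (h:Hadith {id: $hadith_id})
MERGE (c:IsnadChain {id: $chain_id})
MERGE (h)-[:HAS_CHAIN]->(c)
MERGE (a:Person {id: $narrator_id})
MERGE (b:Person {id: $source_id})
MERGE (c)-[:HAS_LINK {seq: $seq}]->(l:IsnadLink {id: $link_id})
MERGE (l)-[:NARRATOR]->(a)
MERGE (l)-[:NARRATED_FROM]->(b)
MERGE (a)-[:NARRATED_FROM]->(b)  // derived person-to-person edge
"""

def write_chain(hadith_id: str, chain_id: str, links: list[dict]) -> None:
    with driver.session() as session:
        for link in links:  # each link: {"link_id", "seq", "narrator_id", "source_id"}
            session.run(MERGE_LINK, hadith_id=hadith_id, chain_id=chain_id, **link)

# Example call with canonical IDs produced by the Postgres step (illustrative values).
write_chain("h-001", "c-001", [
    {"link_id": "l-1", "seq": 1, "narrator_id": "p-0001", "source_id": "p-0002"},
])
driver.close()
```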