## MLOps Loop Diagram (Label → Train → Registry → Deploy)
```mermaid
flowchart TB
LS["Label Studio
(label.betelgeusebytes.io)"] -->|export tasks/labels| S3["MinIO S3
(minio.betelgeusebytes.io)"]
S3 -->|dataset version| ARGO["Argo Workflows
(argo.betelgeusebytes.io)"]
ARGO -->|train/eval| TR["Training Job
(PyTorch/Transformers)"]
TR -->|metrics, params| MLF["MLflow
(mlflow.betelgeusebytes.io)"]
TR -->|model artifacts| S3
MLF -->|register model| REG["Model Registry"]
ARGO -->|promote model tag| REG
REG -->|deploy image / config| ARGOCD["Argo CD
(GitOps)"]
ARGOCD -->|rollout| SVC["NER/RE Services
(custom, later)"]
SVC -->|inference| ORCH["Orchestrator API
(hadith-api...)"]
ORCH -->|observability| OBS["Grafana LGTM
(grafana...)"]
```
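A minimal sketch of the MLflow side of this loop, as the training job might do it. The tracking URI matches the endpoint in the diagram, but the experiment name, registry name, and param/metric values below are illustrative assumptions, not values from this repo:

```python
# Hedged sketch: log a training run and register the model so Argo can later
# "promote model tag" (see diagram). Names marked as hypothetical are assumptions.
import mlflow

mlflow.set_tracking_uri("https://mlflow.betelgeusebytes.io")  # endpoint from the diagram
mlflow.set_experiment("isnad-ner")                            # hypothetical experiment name

with mlflow.start_run() as run:
    # Params/metrics would come from the real training loop; these are placeholders.
    mlflow.log_params({"epochs": 3, "lr": 3e-5})
    mlflow.log_metric("f1", 0.87)  # placeholder value

    # Assumes the loop above also logged model artifacts under "model",
    # e.g. via mlflow.pytorch.log_model(model, "model"); they are mirrored to MinIO.
    mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/model",
        name="isnad-extractor",  # hypothetical registry name
    )
```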
## Isnād Extraction Pipeline Diagram (Your actual deployed stack)
This shows ***how a hadith text becomes a sanad chain***, how it is stored, and how the ***Neo4j graph*** is built — using your endpoints: LLM (CPU), TEI, Qdrant, Postgres, Neo4j, MinIO, Argo.
```mermaid
flowchart TB
H["Hadith Text Input<br/>(UI/API)"] --> ORCH["Orchestrator API<br/>(hadith-api...);"]
ORCH -->|optional: auth| KC["Keycloak<br/>(auth...)"]
ORCH -->|normalize/clean| PRE["Preprocess<br/>(arabic cleanup, tokens)"]
PRE -->|retrieve examples| TEI["TEI Embeddings<br/>(embeddings...)"]
TEI --> QD["Qdrant<br/>(vector...)"]
QD -->|top-k similar hadiths + patterns| CTX["Context Pack<br/>(examples, schema)"]
ORCH -->|prompt+schema+ctx| LLM["LLM CPU<br/>(llm...)"]
LLM -->|JSON: chain nodes + links| JSON["Parsed Isnād JSON<br/>(raw extraction)"]
ORCH -->|validate + dedupe| RES["Resolve Entities<br/>(name variants, kunya)"]
RES --> PG["PostgreSQL<br/>canonical people, aliases"]
RES -->|canonical IDs| CAN["Canonical Chain<br/>(person_id sequence)"]
CAN -->|write nodes/edges| N4["Neo4j<br/>(neo4j...)"]
ORCH -->|store provenance| S3["MinIO<br/>(minio...)"]
ORCH -->|optional: embed matn| TEI --> QD
ORCH -->|return result| OUT["Response<br/>chain + matn + provenance"]
N4 -->|graph queries| OUT
PG -->|metadata| OUT
```
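As a rough sketch of what the Orchestrator does between these boxes: embed the hadith via TEI, pull top-k similar hadiths from Qdrant as context, then ask the LLM for structured JSON. The endpoint paths, the `hadith` collection name, the `chain_text` payload field, and the OpenAI-compatible LLM route are assumptions here, not confirmed details of the deployment:

```python
# Hedged sketch of the extraction step (retrieve examples -> prompt LLM -> parse JSON).
import json
import requests

TEI_URL = "https://embeddings.betelgeusebytes.io"   # assumed TEI endpoint
QDRANT_URL = "https://vector.betelgeusebytes.io"    # assumed Qdrant endpoint
LLM_URL = "https://llm.betelgeusebytes.io"          # assumed LLM endpoint

def extract_isnad(hadith_text: str) -> dict:
    # 1. Embed the text (TEI exposes POST /embed with {"inputs": [...]}).
    vec = requests.post(f"{TEI_URL}/embed", json={"inputs": [hadith_text]}).json()[0]

    # 2. Retrieve top-k similar hadiths as few-shot context (hypothetical collection/payload).
    hits = requests.post(
        f"{QDRANT_URL}/collections/hadith/points/search",
        json={"vector": vec, "limit": 5, "with_payload": True},
    ).json()["result"]
    examples = [h["payload"].get("chain_text", "") for h in hits]

    # 3. Ask the LLM for structured output (assumes an OpenAI-compatible chat route).
    prompt = (
        "Return the isnad chain as JSON with keys 'narrators' and 'links'.\n"
        "Similar examples:\n" + "\n".join(examples) + "\n\nHadith:\n" + hadith_text
    )
    resp = requests.post(
        f"{LLM_URL}/v1/chat/completions",
        json={"model": "local", "messages": [{"role": "user", "content": prompt}]},
    ).json()

    # 4. Raw extraction only; validation, dedupe, and canonicalization happen downstream.
    return json.loads(resp["choices"][0]["message"]["content"])
```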
## Training a Model/Algorithm to Extract Isnād and Build the Neo4j Graph
This diagram covers ***end-to-end training + deployment + ingestion***, including:
Label Studio → MinIO → Argo Workflows → MLflow → NER/RE service → Orchestrator → Postgres/Neo4j/Qdrant.
```mermaid
flowchart TB
%% Data creation
TXT["Raw Hadith Corpora"] --> INGEST["Ingest/ETL<br/>(Argo Workflow)"]
INGEST --> S3["MinIO S3<br/>(versioned datasets)"]
%% Annotation
S3 -->|sampling| LS["Label Studio<br/>(label...)"]
LS -->|"annotated spans<br/>(narrators, connectors)"| S3
%% Training
S3 --> ARGO["Argo Workflows<br/>(train pipeline)"]
ARGO --> TR["Train NER/RE<br/>(or rules+CRF)<br/>CPU-friendly"]
TR --> MLF["MLflow<br/>(metrics + registry)"]
TR -->|model artifacts| S3
%% Deployment of extractor
MLF -->|promote| REG["Model Version"]
REG --> DEPLOY["Deploy extractor svc<br/>(custom later)"]
DEPLOY --> EXT["Isnād Extractor API<br/>(NER + RE)"]
EXT -->|"entities+relations"| ORCH["Orchestrator API"]
%% Graph building + storage
ORCH --> RES["Canonicalization<br/>(alias merge)"]
RES --> PG[("PostgreSQL<br/>people, aliases, provenance")]
ORCH --> N4["Neo4j<br/>(isnad graph)"]
ORCH --> TEI["TEI embeddings"] --> QD["Qdrant vectors"]
ORCH --> S3B["MinIO<br/>artifacts/provenance"]
%% Monitoring
ORCH --> OBS["Grafana LGTM<br/>(metrics/logs/traces)"]
EXT --> OBS
ARGO --> OBS
```
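A hedged sketch of the training step in this pipeline: pull the Label Studio export from MinIO (S3-compatible, so boto3 works), set up a CPU-trainable token classifier, and log the run to MLflow. The bucket/key names, env-var credentials, label scheme, and the AraBERT base model are illustrative assumptions; the actual pipeline may use rules+CRF instead, as the diagram notes:

```python
import json
import os

import boto3
import mlflow
from transformers import AutoModelForTokenClassification, AutoTokenizer

# MinIO speaks the S3 API; endpoint, bucket, and key below are assumptions.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.betelgeusebytes.io",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
export = s3.get_object(Bucket="datasets", Key="isnad/v1/labelstudio-export.json")
tasks = json.loads(export["Body"].read())

# Span labels mirroring the annotation scheme in the diagram (narrators, connectors).
labels = ["O", "B-NARRATOR", "I-NARRATOR", "B-CONNECTOR", "I-CONNECTOR"]

# AraBERT is one CPU-trainable choice (an assumption, not a repo decision).
base = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=len(labels))

mlflow.set_tracking_uri("https://mlflow.betelgeusebytes.io")
with mlflow.start_run():
    mlflow.log_params({"base_model": base, "num_tasks": len(tasks), "num_labels": len(labels)})
    # ... tokenize tasks into (input_ids, label_ids), run a standard fine-tuning loop,
    # then log eval metrics with mlflow.log_metric and push model artifacts back to MinIO ...
```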
## Postgres ER Diagram for Canonicalization & Provenance
This is a practical relational layer that fits your stack: ***Orchestrator ↔ Postgres*** for identity resolution, provenance, and auditability.
```mermaid
erDiagram
PERSON ||--o{ PERSON_ALIAS : has
PERSON ||--o{ BIO_SOURCE : described_by
DOCUMENT ||--o{ MENTION : contains
PERSON ||--o{ MENTION : referenced_as
DOCUMENT ||--o{ HADITH : has
HADITH ||--o{ ISNAD_CHAIN : has
ISNAD_CHAIN ||--o{ ISNAD_LINK : contains
PERSON ||--o{ ISNAD_LINK : narrator
EXTRACTION_RUN ||--o{ ISNAD_CHAIN : produced
EXTRACTION_RUN ||--o{ MENTION : produced
SOURCE ||--o{ DOCUMENT : provides
PERSON {
uuid id PK
text canonical_name
text kunya
text nisba
text era
text notes
timestamptz created_at
}
PERSON_ALIAS {
uuid id PK
uuid person_id FK
text alias_text
text alias_type "kunya|ism|nisba|laqab|spelling"
float confidence
}
SOURCE {
uuid id PK
text name
text type "book|website|manuscript"
text ref
}
DOCUMENT {
uuid id PK
uuid source_id FK
text doc_type "hadith|bio|other"
text lang
text title
text raw_text
timestamptz created_at
}
HADITH {
uuid id PK
uuid document_id FK
text matn_text
text collection
text hadith_no
}
MENTION {
uuid id PK
uuid document_id FK
uuid person_id FK
int start_char
int end_char
text surface_text
text role_hint "narrator|teacher|student|unknown"
float confidence
}
EXTRACTION_RUN {
uuid id PK
uuid document_id FK
text method "llm|ner_re|rules"
text model_version
json params
json raw_output
timestamptz created_at
}
ISNAD_CHAIN {
uuid id PK
uuid hadith_id FK
uuid run_id FK
text chain_text
float confidence
}
ISNAD_LINK {
uuid id PK
uuid chain_id FK
int seq_no
uuid narrator_person_id FK
uuid from_person_id FK
uuid to_person_id FK
text rel_type "narrated_from|heard_from|teacher_of"
float confidence
}
BIO_SOURCE {
uuid id PK
uuid person_id FK
uuid document_id FK
text ref
float reliability
}
```
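A minimal sketch of how the Orchestrator could use this schema for identity resolution: look the surface form up in `PERSON_ALIAS`, and fall back to creating a new `PERSON` plus alias. Table and column names follow the ER diagram above; the DSN and the assumption that `created_at` has a default are mine:

```python
# Hedged sketch of alias-based canonicalization against the Postgres layer above.
import uuid

import psycopg2

conn = psycopg2.connect("dbname=hadith user=orchestrator host=postgres")  # assumed DSN

def resolve_person(surface_text: str, alias_type: str = "spelling") -> str:
    """Return the canonical person_id for a narrator name, creating one if unknown."""
    with conn, conn.cursor() as cur:
        # Known alias -> reuse the canonical person.
        cur.execute(
            "SELECT person_id FROM person_alias WHERE alias_text = %s",
            (surface_text,),
        )
        row = cur.fetchone()
        if row:
            return row[0]

        # Unknown name -> create a person and record the surface form as its first alias.
        person_id = str(uuid.uuid4())
        cur.execute(
            "INSERT INTO person (id, canonical_name) VALUES (%s, %s)",
            (person_id, surface_text),
        )
        cur.execute(
            "INSERT INTO person_alias (id, person_id, alias_text, alias_type, confidence) "
            "VALUES (%s, %s, %s, %s, %s)",
            (str(uuid.uuid4()), person_id, surface_text, alias_type, 1.0),
        )
        return person_id
```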
## Neo4j Graph Model Draft (Labels + Relationship Types)
This is a **graph-first** view of what you'll store in Neo4j, aligned with your workflow:
- Extract chain → canonicalize in Postgres → write graph edges
- Keep provenance and source references so it's ***scholar-grade***
```mermaid
flowchart LR
%% Node labels
P1(("Person\n:Person"))
P2(("Person\n:Person"))
P3(("Person\n:Person"))
H(("Hadith\n:Hadith"))
C(("Chain\n:IsnadChain"))
M(("Matn\n:Matn"))
S(("Source\n:Source"))
D(("Doc\n:Document"))
%% Core isnad representation
H -->|HAS_CHAIN| C
H -->|HAS_MATN| M
C -->|HAS_LINK seq| L1["Link\n:IsnadLink"]
C -->|HAS_LINK seq| L2["Link\n:IsnadLink"]
L1 -->|NARRATOR| P1
L1 -->|NARRATED_FROM| P2
L2 -->|NARRATOR| P2
L2 -->|NARRATED_FROM| P3
%% Optional direct edges (derived)
P1 -->|NARRATED_FROM| P2
P2 -->|NARRATED_FROM| P3
%% Family / biography relations (separate but connected)
P1 -->|FATHER_OF| P2
P2 -->|STUDENT_OF| P3
%% Provenance
H -->|CITED_IN| D
D -->|FROM_SOURCE| S
C -->|EXTRACTED_BY| E["Run\n:ExtractionRun"]
P1 -->|MENTIONED_IN| D
```
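And a hedged sketch of the graph write that follows canonicalization, using the official neo4j Python driver: MERGE the Hadith/Chain/Person nodes by their canonical Postgres ids, then create the link nodes and the derived NARRATED_FROM edges from the draft model. The bolt URI and credentials are placeholders:

```python
# Hedged sketch: persist one extracted chain as Hadith -> IsnadChain -> IsnadLink -> Person.
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://neo4j.betelgeusebytes.io:7687", auth=("neo4j", "secret")  # assumed URI/credentials
)

def write_chain(hadith_id: str, chain_id: str, person_ids: list[str]) -> None:
    """person_ids is the canonical narrator sequence, most recent narrator first."""
    with driver.session() as session:
        # Anchor the chain on its hadith.
        session.run(
            "MERGE (h:Hadith {id: $hadith_id}) "
            "MERGE (c:IsnadChain {id: $chain_id}) "
            "MERGE (h)-[:HAS_CHAIN]->(c)",
            hadith_id=hadith_id, chain_id=chain_id,
        )
        # One IsnadLink per adjacent narrator pair, plus the derived direct edge.
        for seq, (narrator, source) in enumerate(zip(person_ids, person_ids[1:])):
            session.run(
                "MATCH (c:IsnadChain {id: $chain_id}) "
                "MERGE (a:Person {id: $narrator}) "
                "MERGE (b:Person {id: $source}) "
                "MERGE (c)-[:HAS_LINK {seq: $seq}]->(l:IsnadLink {chain_id: $chain_id, seq: $seq}) "
                "MERGE (l)-[:NARRATOR]->(a) "
                "MERGE (l)-[:NARRATED_FROM]->(b) "
                "MERGE (a)-[:NARRATED_FROM]->(b)",
                chain_id=chain_id, narrator=narrator, source=source, seq=seq,
            )
```

Keeping both the per-link nodes and the derived `NARRATED_FROM` edges matches the diagram: links carry sequence and provenance, while the direct edges make chain-of-transmission queries cheap.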