Add usage documentation and MLOps diagrams for model training and deployment
parent 15b11cb180
commit 6463d4a140
@@ -12,6 +12,7 @@ designed to power an **Islamic Hadith Scholar AI** and future AI/data projects.
- [Observability](OBSERVABILITY.md)
- [Roadmap & Next Steps](ROADMAP.md)
- [Future Projects & Use Cases](FUTURE-PROJECTS.md)
- [USAGE and Graphs](USAGE.md)

## 🎯 Current Focus

@@ -0,0 +1,243 @@
## MLOps Loop Diagram (Label → Train → Registry → Deploy)

```mermaid
flowchart TB
  LS["Label Studio<br/>(label.betelgeusebytes.io)"] -->|export tasks/labels| S3["MinIO S3<br/>(minio.betelgeusebytes.io)"]
  S3 -->|dataset version| ARGO["Argo Workflows<br/>(argo.betelgeusebytes.io)"]
  ARGO -->|train/eval| TR["Training Job<br/>(PyTorch/Transformers)"]
  TR -->|metrics, params| MLF["MLflow<br/>(mlflow.betelgeusebytes.io)"]
  TR -->|model artifacts| S3
  MLF -->|register model| REG["Model Registry"]
  ARGO -->|promote model tag| REG
  REG -->|deploy image / config| ARGOCD["Argo CD<br/>(GitOps)"]
  ARGOCD -->|rollout| SVC["NER/RE Services<br/>(custom, later)"]
  SVC -->|inference| ORCH["Orchestrator API<br/>(hadith-api...)"]
  ORCH -->|observability| OBS["Grafana LGTM<br/>(grafana...)"]
```

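The "promote model tag" step in the loop above implies some gate between evaluation and the registry. The sketch below is one way such a gate could look in plain Python; `ModelVersion`, `should_promote`, the `f1` metric name, and the `min_gain` threshold are all hypothetical names chosen for illustration and are not taken from the repository or from MLflow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record of one trained model and its evaluation metrics."""
    name: str
    version: int
    metrics: dict = field(default_factory=dict)


def should_promote(candidate: ModelVersion,
                   production: Optional[ModelVersion],
                   metric: str = "f1",
                   min_gain: float = 0.01) -> bool:
    """Return True when the candidate beats production by at least min_gain.

    With no production model yet, any candidate that reports the metric
    is promoted.
    """
    if metric not in candidate.metrics:
        raise KeyError(f"candidate has no '{metric}' metric")
    if production is None:
        return True
    # A production model without the metric is treated as scoring 0.0.
    baseline = production.metrics.get(metric, 0.0)
    return candidate.metrics[metric] - baseline >= min_gain


prod = ModelVersion("isnad-ner", 3, {"f1": 0.81})
cand = ModelVersion("isnad-ner", 4, {"f1": 0.84})
print(should_promote(cand, prod))
```

In a real pipeline the metrics would presumably come from the tracking server and the decision would drive the registry tag; here both are left out so the gate logic can be read on its own.
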
## Isnād Extraction Pipeline Diagram (Your actual deployed stack)

This shows ***how a hadith text becomes a sanad chain***, how it is stored, and how the ***Neo4j graph*** is built, using your endpoints: LLM (CPU), TEI, Qdrant, Postgres, Neo4j, MinIO, Argo.

```mermaid
flowchart TB
  H["Hadith Text Input<br/>(UI/API)"] --> ORCH["Orchestrator API<br/>(hadith-api...)"]
  ORCH -->|optional: auth| KC["Keycloak<br/>(auth...)"]
  ORCH -->|normalize/clean| PRE["Preprocess<br/>(arabic cleanup, tokens)"]
  PRE -->|retrieve examples| TEI["TEI Embeddings<br/>(embeddings...)"]
  TEI --> QD["Qdrant<br/>(vector...)"]
  QD -->|top-k similar hadiths + patterns| CTX["Context Pack<br/>(examples, schema)"]

  ORCH -->|prompt+schema+ctx| LLM["LLM CPU<br/>(llm...)"]
  LLM -->|JSON: chain nodes + links| JSON["Parsed Isnād JSON<br/>(raw extraction)"]
  ORCH -->|validate + dedupe| RES["Resolve Entities<br/>(name variants, kunya)"]
  RES --> PG["PostgreSQL<br/>canonical people, aliases"]
  RES -->|canonical IDs| CAN["Canonical Chain<br/>(person_id sequence)"]

  CAN -->|write nodes/edges| N4["Neo4j<br/>(neo4j...)"]
  ORCH -->|store provenance| S3["MinIO<br/>(minio...)"]
  ORCH -->|optional: embed matn| TEI --> QD
  ORCH -->|return result| OUT["Response<br/>chain + matn + provenance"]

  N4 -->|graph queries| OUT
  PG -->|metadata| OUT
```

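The "normalize/clean", "validate + dedupe", and "canonical IDs" steps above can be sketched as a small pure-Python function. This is a sketch under stated assumptions: the JSON shape (`{"nodes": [{"name": ...}]}`), the in-memory alias dictionary standing in for the Postgres alias table, and the function names `normalize_arabic` and `resolve_chain` are all hypothetical.

```python
import json
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, carries no meaning


def normalize_arabic(text):
    """Drop combining marks (tashkeel) and tatweel, then collapse whitespace."""
    no_marks = "".join(
        ch for ch in text if unicodedata.category(ch) != "Mn"
    )
    return " ".join(no_marks.replace(TATWEEL, "").split())


def resolve_chain(raw_json, aliases):
    """Turn raw extraction JSON into an ordered list of canonical person IDs.

    raw_json: string in the assumed shape {"nodes": [{"name": "..."}, ...]}
    aliases:  dict mapping a normalized name variant to a canonical person ID
    """
    data = json.loads(raw_json)
    nodes = data.get("nodes")
    if not isinstance(nodes, list) or not nodes:
        raise ValueError("expected a non-empty 'nodes' list")

    chain = []
    for node in nodes:
        name = normalize_arabic(node["name"])
        person_id = aliases.get(name)
        if person_id is None:
            # Unknown narrators are surfaced rather than silently skipped.
            raise LookupError(f"unresolved narrator: {name}")
        chain.append(person_id)
    return chain
```

Calling `resolve_chain(raw, aliases)` with a vocalized name in `raw` and the unvocalized form as a key in `aliases` should return the matching IDs in chain order; an unknown name raises `LookupError` so the orchestrator can route it to manual review or a fuzzy matcher.
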
## Training a Model/Algorithm to Extract Isnād and Build the Neo4j Graph

This diagram covers ***end-to-end training + deployment + ingestion***, including:
Label Studio → MinIO → Argo Workflows → MLflow → NER/RE service → Orchestrator → Postgres/Neo4j/Qdrant.

```mermaid
flowchart TB
  %% Data creation
  TXT[Raw Hadith Corpora] --> INGEST["Ingest/ETL\n(Argo Workflow)"]
  INGEST --> S3["MinIO S3\n(versioned datasets)"]

  %% Annotation
  S3 -->|sampling| LS["Label Studio\n(label...)"]
  LS -->|"annotated spans\n(narrators, connectors)"| S3

  %% Training
  S3 --> ARGO["Argo Workflows\n(train pipeline)"]
  ARGO --> TR["Train NER/RE\n(or rules+CRF)\nCPU-friendly"]
  TR --> MLF["MLflow\n(metrics + registry)"]
  TR -->|model artifacts| S3

  %% Deployment of extractor
  MLF -->|promote| REG[Model Version]
  REG --> DEPLOY["Deploy extractor svc\n(custom later)"]
  DEPLOY --> EXT["Isnād Extractor API\n(NER + RE)"]
  EXT -->|"entities+relations"| ORCH[Orchestrator API]

  %% Graph building + storage
  ORCH --> RES["Canonicalization\n(alias merge)"]
  RES --> PG[("PostgreSQL\npeople, aliases, provenance")]
  ORCH --> N4["Neo4j\n(isnad graph)"]
  ORCH --> TEI[TEI embeddings] --> QD[Qdrant vectors]
  ORCH --> S3B["MinIO\nartifacts/provenance"]

  %% Monitoring
  ORCH --> OBS["Grafana LGTM\n(metrics/logs/traces)"]
  EXT --> OBS
  ARGO --> OBS
```

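Between the "annotated spans" exported from labeling and the "Train NER/RE" step, character-offset spans usually have to be converted into per-token tags. The following is a minimal sketch of that conversion using the common BIO scheme; the `(start, end, label)` span tuple, the whitespace tokenizer, and the `NARR` label are assumptions made for illustration and do not reflect the project's actual export format.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-offset spans into BIO tags over whitespace tokens.

    text:  the raw string that was annotated
    spans: list of (start, end, label) with end exclusive; spans are assumed
           to begin on a token boundary and not to overlap
    Returns (tokens, tags), two lists of equal length.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for span_start, span_end, label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B-" if tok_start == span_start else "I-"
                tag = prefix + label
                break
        tags.append(tag)
    return [tok for tok, _, _ in tokens], tags


text = "haddathana Malik an Nafi an Ibn Umar"
start = text.index("Ibn")
tokens, tags = spans_to_bio(text, [(start, start + len("Ibn Umar"), "NARR")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A transliterated string is used here only to keep the offsets easy to follow; the same function works on Arabic text because it operates on character offsets, not on any particular script.
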
## Postgres ER Diagram for Canonicalization & Provenance

This is a practical relational layer that fits your stack: ***Orchestrator ↔ Postgres*** for identity resolution, provenance, and auditability.

```mermaid
erDiagram
  PERSON ||--o{ PERSON_ALIAS : has
  PERSON ||--o{ BIO_SOURCE : described_by
  DOCUMENT ||--o{ MENTION : contains
  PERSON ||--o{ MENTION : referenced_as
  DOCUMENT ||--o{ HADITH : has
  HADITH ||--o{ ISNAD_CHAIN : has
  ISNAD_CHAIN ||--o{ ISNAD_LINK : contains
  PERSON ||--o{ ISNAD_LINK : narrator
  EXTRACTION_RUN ||--o{ ISNAD_CHAIN : produced
  EXTRACTION_RUN ||--o{ MENTION : produced
  SOURCE ||--o{ DOCUMENT : provides

  PERSON {
    uuid id PK
    text canonical_name
    text kunya
    text nisba
    text era
    text notes
    timestamptz created_at
  }

  PERSON_ALIAS {
    uuid id PK
    uuid person_id FK
    text alias_text
    text alias_type "kunya|ism|nisba|laqab|spelling"
    float confidence
  }

  SOURCE {
    uuid id PK
    text name
    text type "book|website|manuscript"
    text ref
  }

  DOCUMENT {
    uuid id PK
    uuid source_id FK
    text doc_type "hadith|bio|other"
    text lang
    text title
    text raw_text
    timestamptz created_at
  }

  HADITH {
    uuid id PK
    uuid document_id FK
    text matn_text
    text collection
    text hadith_no
  }

  MENTION {
    uuid id PK
    uuid document_id FK
    uuid person_id FK
    int start_char
    int end_char
    text surface_text
    text role_hint "narrator|teacher|student|unknown"
    float confidence
  }

  EXTRACTION_RUN {
    uuid id PK
    uuid document_id FK
    text method "llm|ner_re|rules"
    text model_version
    json params
    json raw_output
    timestamptz created_at
  }

  ISNAD_CHAIN {
    uuid id PK
    uuid hadith_id FK
    uuid run_id FK
    text chain_text
    float confidence
  }

  ISNAD_LINK {
    uuid id PK
    uuid chain_id FK
    int seq_no
    uuid narrator_person_id FK
    uuid from_person_id FK
    uuid to_person_id FK
    text rel_type "narrated_from|heard_from|teacher_of"
    float confidence
  }

  BIO_SOURCE {
    uuid id PK
    uuid person_id FK
    uuid document_id FK
    text ref
    float reliability
  }
```

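To make the identity-resolution part of the ER diagram concrete, the sketch below builds only the `PERSON` and `PERSON_ALIAS` tables and resolves a surface form to a canonical person. It uses Python's built-in `sqlite3` as a stand-in for Postgres, so UUIDs are stored as text; the `NOT NULL` and `CHECK` constraints and the helper names `open_db`, `add_person`, and `resolve_alias` are assumptions, not part of the documented schema.

```python
import sqlite3
import uuid

# Subset of the ER diagram, adapted to SQLite types (assumption: TEXT for uuid).
SCHEMA = """
CREATE TABLE person (
    id TEXT PRIMARY KEY,
    canonical_name TEXT NOT NULL
);
CREATE TABLE person_alias (
    id TEXT PRIMARY KEY,
    person_id TEXT NOT NULL REFERENCES person(id),
    alias_text TEXT NOT NULL,
    alias_type TEXT CHECK (alias_type IN
        ('kunya', 'ism', 'nisba', 'laqab', 'spelling')),
    confidence REAL
);
"""


def open_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_person(conn, canonical_name, aliases):
    """Insert a person plus (alias_text, alias_type, confidence) rows."""
    person_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO person (id, canonical_name) VALUES (?, ?)",
        (person_id, canonical_name),
    )
    conn.executemany(
        "INSERT INTO person_alias (id, person_id, alias_text, alias_type, confidence)"
        " VALUES (?, ?, ?, ?, ?)",
        [(str(uuid.uuid4()), person_id, text, kind, conf)
         for text, kind, conf in aliases],
    )
    return person_id


def resolve_alias(conn, surface_text):
    """Return the person_id of the highest-confidence exact alias match, or None."""
    row = conn.execute(
        "SELECT person_id FROM person_alias WHERE alias_text = ?"
        " ORDER BY confidence DESC LIMIT 1",
        (surface_text,),
    ).fetchone()
    return row[0] if row else None
```

Exact-match lookup is the simplest possible resolver; the `confidence` column leaves room for ranking when the same alias text points at more than one person.
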
## Neo4j Graph Model Draft (Labels + Relationship Types)

This is a **graph-first** view of what you’ll store in Neo4j, aligned with your workflow:

- Extract chain → canonicalize in Postgres → write graph edges

- Keep provenance and source references so it’s ***scholar-grade***

```mermaid
flowchart LR
  %% Node labels
  P1(("Person\n:Person"))
  P2(("Person\n:Person"))
  P3(("Person\n:Person"))

  H(("Hadith\n:Hadith"))
  C(("Chain\n:IsnadChain"))
  M(("Matn\n:Matn"))
  S(("Source\n:Source"))
  D(("Doc\n:Document"))

  %% Core isnad representation
  H -->|HAS_CHAIN| C
  H -->|HAS_MATN| M

  C -->|HAS_LINK seq| L1["Link\n:IsnadLink"]
  C -->|HAS_LINK seq| L2["Link\n:IsnadLink"]

  L1 -->|NARRATOR| P1
  L1 -->|NARRATED_FROM| P2
  L2 -->|NARRATOR| P2
  L2 -->|NARRATED_FROM| P3

  %% Optional direct edges (derived)
  P1 -->|NARRATED_FROM| P2
  P2 -->|NARRATED_FROM| P3

  %% Family / biography relations (separate but connected)
  P1 -->|FATHER_OF| P2
  P2 -->|STUDENT_OF| P3

  %% Provenance
  H -->|CITED_IN| D
  D -->|FROM_SOURCE| S
  C -->|EXTRACTED_BY| E["Run\n:ExtractionRun"]
  P1 -->|MENTIONED_IN| D
```

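The "write graph edges" step can be sketched as a function that turns a canonical chain (an ordered list of person IDs) into parameterized Cypher statements matching the labels and relationship types in the diagram. Nothing here talks to a database: the function only returns `(query, params)` pairs that a driver session could run. The `id` property on every node, the `chain_id:seq` link ID format, and the name `chain_to_cypher` are assumptions for illustration.

```python
HEAD_QUERY = """
MERGE (h:Hadith {id: $hadith_id})
MERGE (c:IsnadChain {id: $chain_id})
MERGE (h)-[:HAS_CHAIN]->(c)
"""

LINK_QUERY = """
MATCH (c:IsnadChain {id: $chain_id})
MERGE (a:Person {id: $narrator_id})
MERGE (b:Person {id: $source_id})
MERGE (l:IsnadLink {id: $link_id})
MERGE (c)-[:HAS_LINK {seq: $seq}]->(l)
MERGE (l)-[:NARRATOR]->(a)
MERGE (l)-[:NARRATED_FROM]->(b)
MERGE (a)-[:NARRATED_FROM]->(b)
"""


def chain_to_cypher(hadith_id, chain_id, person_ids):
    """Build (query, params) pairs for one isnad chain.

    person_ids is ordered from the latest narrator back toward the source,
    so each consecutive pair (a, b) means "a narrated from b".
    """
    if len(person_ids) < 2:
        raise ValueError("a chain needs at least two narrators")

    statements = [(HEAD_QUERY, {"hadith_id": hadith_id, "chain_id": chain_id})]
    pairs = zip(person_ids, person_ids[1:])
    for seq, (narrator_id, source_id) in enumerate(pairs, start=1):
        statements.append((LINK_QUERY, {
            "chain_id": chain_id,
            "link_id": f"{chain_id}:{seq}",  # hypothetical link ID format
            "seq": seq,
            "narrator_id": narrator_id,
            "source_id": source_id,
        }))
    return statements


for query, params in chain_to_cypher("h1", "c1", ["p1", "p2", "p3"]):
    print(params)
```

Using `MERGE` throughout keeps the write idempotent, so re-ingesting the same chain should not duplicate nodes or edges; the last `MERGE` in `LINK_QUERY` produces the optional derived `Person` to `Person` edge shown in the diagram.
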