# 🚀 HadithAPI.com Deployment - Quick Start
## What You Got
Three comprehensive guides:
1. **PHASE_2_IMPLEMENTATION_GUIDE.md** - Original guide with PostgreSQL schema
2. **HADITHAPI_INTEGRATION_GUIDE.md** - Complete HadithAPI.com implementation
3. **This summary** - Quick deployment steps
## 📦 Complete Package Structure
The HadithAPI guide includes everything you need:
### Production-Ready Code
- **hadithapi_client.py** - Full API client with pagination and rate limiting
- **main_hadithapi.py** - Complete ingestion service
- **settings.py** - Configuration with your API key
- **Dockerfile** - Container image
- **Argo Workflows** - Kubernetes automation
- **Test scripts** - Validation and troubleshooting
### Key Features
- ✅ Automatic pagination handling
- ✅ Rate limiting (30 req/min)
- ✅ Error handling and retries
- ✅ Progress tracking
- ✅ Structured logging
- ✅ Multi-language support (Arabic, English, Urdu)
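To see what the pagination and rate limiting look like in practice, here is a minimal shell sketch (requires `curl` and `jq`). The endpoint shape and the `hadiths.data` / `hadiths.last_page` response fields are assumptions based on HadithAPI.com's paginated JSON; the authoritative client is in Section 2.1 of the integration guide.
```bash
# Sketch only: paginated fetch at ~30 req/min.
# Endpoint and response fields are assumptions; see Section 2.1 for the real client.
API_KEY="YOUR_API_KEY"
BOOK="sahih-bukhari"
page=1
while :; do
  resp=$(curl -s "https://hadithapi.com/api/hadiths?apiKey=${API_KEY}&book=${BOOK}&page=${page}")
  echo "$resp" | jq -r '.hadiths.data[].hadithNumber'   # process one page
  last=$(echo "$resp" | jq -r '.hadiths.last_page')     # assumed pagination field
  [ "$last" = "null" ] && break                         # stop if the shape differs
  [ "$page" -ge "$last" ] && break
  page=$((page + 1))
  sleep 2                                               # ~30 requests/min
done
```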
## 🎯 5-Minute Quick Start
### 1. Database Setup (2 min)
```bash
# Use schema from PHASE_2_IMPLEMENTATION_GUIDE.md Section 1
kubectl -n db exec -it postgres-0 -- psql -U app -d gitea
# Copy all SQL from Section 1.2 through 1.6
# This creates hadith_db with complete schema
```
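As a quick sanity check before moving on, you can list the tables (assumes the guide's `hadith_db` database and `hadith_ingest` role):
```bash
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c '\dt'
# Expect tables such as collections, hadiths, and ingestion_jobs
```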
### 2. Create Project Structure (1 min)
```bash
mkdir -p hadith-ingestion/{config,src/{api_clients,processors,database,utils},argo/workflows}
cd hadith-ingestion/
# Copy code from HADITHAPI_INTEGRATION_GUIDE.md:
# - Section 2.1 → src/api_clients/hadithapi_client.py
# - Section 4.1 → src/main_hadithapi.py
# - Section 5.1 → config/settings.py
# - Section 6.1 → Dockerfile
# - Section 6.4 → argo/workflows/ingest-hadithapi.yaml
# Also copy from PHASE_2_IMPLEMENTATION_GUIDE.md:
# - Section 3.4 → src/api_clients/base_client.py
# - Section 3.6 → src/processors/text_cleaner.py
# - Section 3.7 → src/database/repository.py
```
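The packages need empty `__init__.py` files (see the checklist below); one way to create them all at once:
```bash
touch config/__init__.py src/__init__.py \
      src/api_clients/__init__.py src/processors/__init__.py \
      src/database/__init__.py src/utils/__init__.py
```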
### 3. Build & Deploy (2 min)
```bash
# Build image
docker build -t hadith-ingestion:latest .
# Create secrets
kubectl -n argo create secret generic hadith-db-secret \
--from-literal=password='YOUR_PASSWORD'
kubectl -n argo create secret generic hadithapi-secret \
--from-literal=api-key='$2y$10$nTJnyX3WUDoGmjKrKqSmbecANVsQWKyffmtp9fxmsQwR15DEv4mK'
# Test with 10 hadiths
argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
--parameter book-slug=sahih-bukhari \
--parameter limit=10 \
--watch
```
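Once the test workflow finishes, a quick way to confirm the 10 hadiths landed:
```bash
argo logs -n argo @latest                 # logs of the most recent workflow
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db \
  -c "SELECT COUNT(*) FROM hadiths;"      # expect 10 after the test run
```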
## 📊 Expected Results
### Available Collections
| Book | Hadiths | Time |
|------|---------|------|
| Sahih Bukhari | ~7,500 | 2-3h |
| Sahih Muslim | ~7,000 | 2-3h |
| Sunan Abu Dawood | ~5,000 | 1-2h |
| Jami' at-Tirmidhi | ~4,000 | 1-2h |
| Sunan an-Nasa'i | ~5,700 | 2h |
| Sunan Ibn Majah | ~4,300 | 1-2h |
| **TOTAL** | **~33,500** | **10-15h** |
## 🔧 Key Differences from Sunnah.com
| Feature | HadithAPI.com | Sunnah.com |
|---------|---------------|------------|
| **API Key** | ✅ Public (provided) | ❌ Must be requested (via GitHub) |
| **Rate Limit** | Not documented (we use 30/min) | 100/min |
| **Coverage** | 6 major books | 10+ books |
| **Languages** | Arabic, English, Urdu | Arabic, English |
| **Cost** | ✅ Free | Free |
| **Stability** | Good | Excellent |
## 📝 Complete File Checklist
Create these files from the guides:
```
hadith-ingestion/
├── Dockerfile ✓ Section 6.1
├── requirements.txt ✓ Phase 2 Section 3.2
├── .env ✓ Section 5.2
├── build-hadithapi-ingestion.sh ✓ Section 6.2
├── create-secrets.sh ✓ Section 6.3
├── test-hadithapi-local.sh ✓ Section 7.1
├── test-hadithapi-k8s.sh ✓ Section 7.2
├── run-full-ingestion.sh ✓ Section 7.3
├── config/
│ ├── __init__.py (empty file)
│ └── settings.py ✓ Section 5.1
├── src/
│ ├── __init__.py (empty file)
│ ├── main_hadithapi.py ✓ Section 4.1
│ ├── api_clients/
│ │ ├── __init__.py (empty file)
│ │ ├── base_client.py ✓ Phase 2 Sec 3.4
│ │ └── hadithapi_client.py ✓ Section 2.1
│ ├── processors/
│ │ ├── __init__.py (empty file)
│ │ └── text_cleaner.py ✓ Phase 2 Sec 3.6
│ ├── database/
│ │ ├── __init__.py (empty file)
│ │ ├── connection.py (optional)
│ │ └── repository.py ✓ Phase 2 Sec 3.7
│ └── utils/
│ ├── __init__.py (empty file)
│ └── logger.py (optional)
└── argo/
└── workflows/
└── ingest-hadithapi.yaml ✓ Section 6.4
```
## 🎬 Step-by-Step Execution
### Day 1: Setup & Test (2-3 hours)
```bash
# 1. Create database schema
# 2. Set up project structure
# 3. Build Docker image
# 4. Create secrets
# 5. Run test with 10 hadiths
# 6. Verify data
```
### Day 2: Ingest Major Collections (10-15 hours)
```bash
# Ingest all 6 major collections sequentially
./run-full-ingestion.sh
# Or manually one by one:
argo submit ... --parameter book-slug=sahih-bukhari
argo submit ... --parameter book-slug=sahih-muslim
# etc...
```
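If you prefer an explicit loop over the script, a sketch of the sequential run is below. The first two slugs come from this guide; the remaining four are assumptions following HadithAPI.com's naming, so verify them before running.
```bash
# Sketch of a sequential full ingestion; the last four slugs are assumed.
for slug in sahih-bukhari sahih-muslim abu-dawood al-tirmidhi sunan-nasai ibn-e-majah; do
  argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
    --parameter book-slug="$slug" \
    --wait                            # block until this book finishes
done
```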
### Day 3: Validation & Next Steps
```bash
# 1. Verify data quality
# 2. Check statistics
# 3. Proceed to Phase 3 (ML model development)
```
## ✅ Verification Checklist
After ingestion completes:
```bash
# 1. Check total hadiths
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT COUNT(*) FROM hadiths;
"
# Expected: ~33,500
# 2. Check per collection
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT
c.name_english,
COUNT(h.id) as count
FROM collections c
LEFT JOIN hadiths h ON c.id = h.collection_id
WHERE c.abbreviation IN ('bukhari', 'muslim', 'abudawud', 'tirmidhi', 'nasai', 'ibnmajah')
GROUP BY c.name_english;
"
# 3. Check for errors
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT * FROM ingestion_jobs
WHERE status = 'failed'
ORDER BY created_at DESC;
"
```
## 🐛 Common Issues & Solutions
### Issue: Rate Limiting
```
Error: 429 Too Many Requests
Solution: The client already defaults to a conservative 30 req/min.
If you still hit limits, lower it in config/settings.py:
API_RATE_LIMIT = 20
```
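One way to apply that change and redeploy (assumes the `API_RATE_LIMIT` name from Section 5.1):
```bash
sed -i 's/API_RATE_LIMIT = 30/API_RATE_LIMIT = 20/' config/settings.py
docker build -t hadith-ingestion:latest .   # rebuild so the image picks up the change
```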
### Issue: Connection Timeout
```
Error: Connection timeout to database
Solution:
1. Check PostgreSQL is running
2. Verify credentials in secrets
3. Test connection manually
```
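The three checks above map to these commands:
```bash
kubectl -n db get pods                               # 1. is PostgreSQL running?
kubectl -n argo get secret hadith-db-secret          # 2. does the secret exist?
kubectl -n db exec -it postgres-0 -- \
  psql -U hadith_ingest -d hadith_db -c 'SELECT 1;'  # 3. manual connection test
```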
### Issue: Missing Chapters
```
Warning: chapters_fetch_failed
Solution: The script automatically falls back to fetching all hadiths
This is expected behavior and not critical
```
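Since the logs are structured JSON, you can confirm the fallback fired by grepping for the event name:
```bash
argo logs -n argo @latest | grep chapters_fetch_failed
```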
## 📚 Documentation References
All details in the comprehensive guides:
1. **PHASE_2_IMPLEMENTATION_GUIDE.md**
- PostgreSQL schema (Section 1)
- Base utilities (Section 3)
- Database repository (Section 3.7)
2. **HADITHAPI_INTEGRATION_GUIDE.md**
- API client (Section 2)
- Main ingestion service (Section 4)
- Deployment (Section 6)
- Testing (Section 7)
## 🎯 Next Phase
After Phase 2 completion:
→ **Phase 3: ML Model Development**
- Annotate sample hadiths (Label Studio)
- Train NER model
- Train relation extraction model
- Fine-tune LLM with LoRA
## 💡 Pro Tips
1. **Start Small**: Test with `--limit 10` first
2. **Monitor Progress**: Use `argo logs -n argo <workflow> -f`
3. **Check Logs**: Structured JSON logs for easy debugging
4. **Backup Data**: Back up before major operations (see the sketch below)
5. **Rate Limiting**: Be conservative to avoid blocks
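For tip 4, a minimal backup sketch (assumes the `hadith_db` database and `hadith_ingest` role):
```bash
kubectl -n db exec postgres-0 -- \
  pg_dump -U hadith_ingest hadith_db > hadith_db_$(date +%Y%m%d).sql
```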
## 🎉 Success Criteria
Phase 2 is complete when:
- ✅ Database schema created
- ✅ 33,500+ hadiths ingested
- ✅ All 6 collections present
- ✅ No critical errors
- ✅ Data validated
- ✅ Ready for embedding generation
---
**Estimated Total Time: 1-2 days**
**Difficulty: Intermediate**
**Prerequisites: Phase 1 completed (all core services running)**
Ready to start? Begin with Section 1 of PHASE_2_IMPLEMENTATION_GUIDE.md!