From 942119559ae69af8ea997dd7ce8ed17fe0c5e9e5 Mon Sep 17 00:00:00 2001
From: salahangal
Date: Fri, 14 Nov 2025 10:17:54 +0100
Subject: [PATCH] init README

---
 hadith-ingestion/README.md | 275 +++++++++++++++++++++++++++++++++++++
 1 file changed, 275 insertions(+)

diff --git a/hadith-ingestion/README.md b/hadith-ingestion/README.md
index e69de29..95ce4a3 100644
--- a/hadith-ingestion/README.md
+++ b/hadith-ingestion/README.md
@@ -0,0 +1,275 @@
+# 🚀 HadithAPI.com Deployment - Quick Start
+
+## What You Got
+
+Three comprehensive guides:
+1. **PHASE_2_IMPLEMENTATION_GUIDE.md** - Original guide with PostgreSQL schema
+2. **HADITHAPI_INTEGRATION_GUIDE.md** - Complete HadithAPI.com implementation
+3. **This summary** - Quick deployment steps
+
+## 📦 Complete Package Structure
+
+The HadithAPI guide includes everything you need:
+
+### Production-Ready Code
+✅ **hadithapi_client.py** - Full API client with pagination and rate limiting
+✅ **main_hadithapi.py** - Complete ingestion service
+✅ **settings.py** - Configuration with your API key
+✅ **Dockerfile** - Container image
+✅ **Argo Workflows** - Kubernetes automation
+✅ **Test scripts** - Validation and troubleshooting
+
+### Key Features
+- ✅ Automatic pagination handling
+- ✅ Rate limiting (30 req/min)
+- ✅ Error handling and retries
+- ✅ Progress tracking
+- ✅ Structured logging
+- ✅ Multi-language support (Arabic, English, Urdu)
+
+## 🎯 5-Minute Quick Start
+
+### 1. Database Setup (2 min)
+```bash
+# Use schema from PHASE_2_IMPLEMENTATION_GUIDE.md Section 1
+kubectl -n db exec -it postgres-0 -- psql -U app -d gitea
+
+# Copy all SQL from Section 1.2 through 1.6
+# This creates hadith_db with complete schema
+```
+
+### 2. Create Project Structure (1 min)
+```bash
+mkdir -p hadith-ingestion/{config,src/{api_clients,processors,database,utils},argo/workflows}
+cd hadith-ingestion/
+
+# Copy code from HADITHAPI_INTEGRATION_GUIDE.md:
+# - Section 2.1 → src/api_clients/hadithapi_client.py
+# - Section 4.1 → src/main_hadithapi.py
+# - Section 5.1 → config/settings.py
+# - Section 6.1 → Dockerfile
+# - Section 6.4 → argo/workflows/ingest-hadithapi.yaml
+
+# Also copy from PHASE_2_IMPLEMENTATION_GUIDE.md:
+# - Section 3.4 → src/api_clients/base_client.py
+# - Section 3.6 → src/processors/text_cleaner.py
+# - Section 3.7 → src/database/repository.py
+```
+
+### 3. Build & Deploy (2 min)
+```bash
+# Build image
+docker build -t hadith-ingestion:latest .
+
+# Create secrets
+kubectl -n argo create secret generic hadith-db-secret \
+  --from-literal=password='YOUR_PASSWORD'
+
+kubectl -n argo create secret generic hadithapi-secret \
+  --from-literal=api-key='$2y$10$nTJnyX3WUDoGmjKrKqSmbecANVsQWKyffmtp9fxmsQwR15DEv4mK'
+
+# Test with 10 hadiths
+argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
+  --parameter book-slug=sahih-bukhari \
+  --parameter limit=10 \
+  --watch
+```
+
+## 📊 Expected Results
+
+### Available Collections
+| Book | Hadiths | Time |
+|------|---------|------|
+| Sahih Bukhari | ~7,500 | 2-3h |
+| Sahih Muslim | ~7,000 | 2-3h |
+| Sunan Abu Dawood | ~5,000 | 1-2h |
+| Jami` at-Tirmidhi | ~4,000 | 1-2h |
+| Sunan an-Nasa'i | ~5,700 | 2h |
+| Sunan Ibn Majah | ~4,300 | 1-2h |
+| **TOTAL** | **~33,500** | **10-15h** |
+
+## 🔧 Key Differences from Sunnah.com
+
+| Feature | HadithAPI.com | Sunnah.com |
+|---------|---------------|------------|
+| **API Key** | ✅ Public (provided) | ❌ Requires PR |
+| **Rate Limit** | Unknown (using 30/min) | 100/min |
+| **Coverage** | 6 major books | 10+ books |
+| **Languages** | Arabic, English, Urdu | Arabic, English |
+| **Cost** | ✅ Free | Free |
+| **Stability** | Good | Excellent |
+
+## 📝 Complete File Checklist
+
+Create these files from the guides:
+
+```
+hadith-ingestion/
+├── Dockerfile                          ✓ Section 6.1
+├── requirements.txt                    ✓ Phase 2 Section 3.2
+├── .env                                ✓ Section 5.2
+├── build-hadithapi-ingestion.sh        ✓ Section 6.2
+├── create-secrets.sh                   ✓ Section 6.3
+├── test-hadithapi-local.sh             ✓ Section 7.1
+├── test-hadithapi-k8s.sh               ✓ Section 7.2
+├── run-full-ingestion.sh               ✓ Section 7.3
+├── config/
+│   ├── __init__.py                     (empty file)
+│   └── settings.py                     ✓ Section 5.1
+├── src/
+│   ├── __init__.py                     (empty file)
+│   ├── main_hadithapi.py               ✓ Section 4.1
+│   ├── api_clients/
+│   │   ├── __init__.py                 (empty file)
+│   │   ├── base_client.py              ✓ Phase 2 Sec 3.4
+│   │   └── hadithapi_client.py         ✓ Section 2.1
+│   ├── processors/
+│   │   ├── __init__.py                 (empty file)
+│   │   └── text_cleaner.py             ✓ Phase 2 Sec 3.6
+│   ├── database/
+│   │   ├── __init__.py                 (empty file)
+│   │   ├── connection.py               (optional)
+│   │   └── repository.py               ✓ Phase 2 Sec 3.7
+│   └── utils/
+│       ├── __init__.py                 (empty file)
+│       └── logger.py                   (optional)
+└── argo/
+    └── workflows/
+        └── ingest-hadithapi.yaml       ✓ Section 6.4
+```
+
+## 🎬 Step-by-Step Execution
+
+### Day 1: Setup & Test (2-3 hours)
+```bash
+# 1. Create database schema
+# 2. Set up project structure
+# 3. Build Docker image
+# 4. Create secrets
+# 5. Run test with 10 hadiths
+# 6. Verify data
+```
+
+### Day 2: Ingest Major Collections (10-15 hours)
+```bash
+# Ingest all 6 major collections sequentially
+./run-full-ingestion.sh
+
+# Or manually one by one:
+argo submit ... --parameter book-slug=sahih-bukhari
+argo submit ... --parameter book-slug=sahih-muslim
+# etc...
+```
+
+### Day 3: Validation & Next Steps
+```bash
+# 1. Verify data quality
+# 2. Check statistics
+# 3. Proceed to Phase 3 (ML model development)
+```
+
+## ✅ Verification Checklist
+
+After ingestion completes:
+
+```bash
+# 1. Check total hadiths
+kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
+SELECT COUNT(*) FROM hadiths;
+"
+# Expected: ~33,500
+
+# 2. Check per collection
+kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
+SELECT
+    c.name_english,
+    COUNT(h.id) as count
+FROM collections c
+LEFT JOIN hadiths h ON c.id = h.collection_id
+WHERE c.abbreviation IN ('bukhari', 'muslim', 'abudawud', 'tirmidhi', 'nasai', 'ibnmajah')
+GROUP BY c.name_english;
+"
+
+# 3. Check for errors
+kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
+SELECT * FROM ingestion_jobs
+WHERE status = 'failed'
+ORDER BY created_at DESC;
+"
+```
+
+## 🐛 Common Issues & Solutions
+
+### Issue: Rate Limiting
+```
+Error: 429 Too Many Requests
+Solution: Already set to conservative 30/min
+If still hitting limits, edit settings.py:
+  API_RATE_LIMIT = 20
+```
+
+### Issue: Connection Timeout
+```
+Error: Connection timeout to database
+Solution:
+1. Check PostgreSQL is running
+2. Verify credentials in secrets
+3. Test connection manually
+```
+
+### Issue: Missing Chapters
+```
+Warning: chapters_fetch_failed
+Solution: Script automatically falls back to fetching all hadiths
+This is expected and not critical
+```
+
+## 📚 Documentation References
+
+All details in the comprehensive guides:
+
+1. **PHASE_2_IMPLEMENTATION_GUIDE.md**
+   - PostgreSQL schema (Section 1)
+   - Base utilities (Section 3)
+   - Database repository (Section 3.7)
+
+2. **HADITHAPI_INTEGRATION_GUIDE.md**
+   - API client (Section 2)
+   - Main ingestion service (Section 4)
+   - Deployment (Section 6)
+   - Testing (Section 7)
+
+## 🎯 Next Phase
+
+After Phase 2 completion:
+→ **Phase 3: ML Model Development**
+  - Annotate sample hadiths (Label Studio)
+  - Train NER model
+  - Train relation extraction model
+  - Fine-tune LLM with LoRA
+
+## 💡 Pro Tips
+
+1. **Start Small**: Test with `--limit 10` first
+2. **Monitor Progress**: Use `argo logs -n argo -f`
+3. **Check Logs**: Structured JSON logs for easy debugging
+4. **Backup Data**: Before major operations
+5. **Rate Limiting**: Be conservative to avoid blocks
+
+## 🎉 Success Criteria
+
+Phase 2 is complete when:
+- ✅ Database schema created
+- ✅ 33,500+ hadiths ingested
+- ✅ All 6 collections present
+- ✅ No critical errors
+- ✅ Data validated
+- ✅ Ready for embedding generation
+
+---
+
+**Estimated Total Time: 1-2 days**
+**Difficulty: Intermediate**
+**Prerequisites: Phase 1 completed (all core services running)**
+
+Ready to start? Begin with Section 1 of PHASE_2_IMPLEMENTATION_GUIDE.md!
\ No newline at end of file
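
Note on the README's rate limit: it advertises 30 requests per minute and suggests lowering `API_RATE_LIMIT` to 20 when 429 responses appear, but the patch does not show how the client enforces that. The following is a minimal Python sketch of one way to do it, by spacing calls at a fixed minimum interval. It is not the code from HADITHAPI_INTEGRATION_GUIDE.md; the class name `MinIntervalRateLimiter`, its `wait` method, and the injectable `clock` and `sleep` parameters are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Optional


class MinIntervalRateLimiter:
    """Hypothetical helper: spaces calls at least 60 / max_per_minute seconds apart."""

    def __init__(
        self,
        max_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        # 30 requests/minute -> one request every 2.0 seconds.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request may be sent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            # Never schedule from a time earlier than the slot we waited for.
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval
        return delay
```

A client would call `limiter.wait()` immediately before each HTTP request. Under this sketch, dropping the limit from 30 to 20 requests per minute widens the spacing from 2 seconds to 3 seconds, which is the effect the README's troubleshooting entry is after.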