# 🚀 HadithAPI.com Deployment - Quick Start
## What You Got
Three comprehensive guides:
1. **PHASE_2_IMPLEMENTATION_GUIDE.md** - Original guide with PostgreSQL schema
2. **HADITHAPI_INTEGRATION_GUIDE.md** - Complete HadithAPI.com implementation
3. **This summary** - Quick deployment steps
## 📦 Complete Package Structure
The HadithAPI guide includes everything you need:
### Production-Ready Code
- **hadithapi_client.py** - Full API client with pagination and rate limiting
- **main_hadithapi.py** - Complete ingestion service
- **settings.py** - Configuration with your API key
- **Dockerfile** - Container image
- **Argo Workflows** - Kubernetes automation
- **Test scripts** - Validation and troubleshooting
### Key Features
- ✅ Automatic pagination handling
- ✅ Rate limiting (30 req/min)
- ✅ Error handling and retries
- ✅ Progress tracking
- ✅ Structured logging
- ✅ Multi-language support (Arabic, English, Urdu)
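To see what the pagination and rate limiting look like in practice, here is a minimal shell sketch (requires `curl` and `jq`). The endpoint shape and the `hadiths.data` / `hadiths.last_page` response fields are assumptions based on HadithAPI.com's paginated JSON; the authoritative client is in Section 2.1 of the integration guide.
```bash
# Sketch only: paginated fetch at ~30 req/min.
# Endpoint and response fields are assumptions; see Section 2.1 for the real client.
API_KEY="YOUR_API_KEY"
BOOK="sahih-bukhari"
page=1
while :; do
  resp=$(curl -s "https://hadithapi.com/api/hadiths?apiKey=${API_KEY}&book=${BOOK}&page=${page}")
  echo "$resp" | jq -r '.hadiths.data[].hadithNumber'   # process one page
  last=$(echo "$resp" | jq -r '.hadiths.last_page')     # assumed pagination field
  [ "$last" = "null" ] && break                         # stop if the shape differs
  [ "$page" -ge "$last" ] && break
  page=$((page + 1))
  sleep 2                                               # ~30 requests/min
done
```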
## 🎯 5-Minute Quick Start
### 1. Database Setup (2 min)
```bash
# Use schema from PHASE_2_IMPLEMENTATION_GUIDE.md Section 1
kubectl -n db exec -it postgres-0 -- psql -U app -d gitea
# Copy all SQL from Section 1.2 through 1.6
# This creates hadith_db with complete schema
```
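As a quick sanity check before moving on, you can list the tables (assumes the guide's `hadith_db` database and `hadith_ingest` role):
```bash
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c '\dt'
# Expect tables such as collections, hadiths, and ingestion_jobs
```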
### 2. Create Project Structure (1 min)
```bash
mkdir -p hadith-ingestion/{config,src/{api_clients,processors,database,utils},argo/workflows}
cd hadith-ingestion/
# Copy code from HADITHAPI_INTEGRATION_GUIDE.md:
# - Section 2.1 → src/api_clients/hadithapi_client.py
# - Section 4.1 → src/main_hadithapi.py
# - Section 5.1 → config/settings.py
# - Section 6.1 → Dockerfile
# - Section 6.4 → argo/workflows/ingest-hadithapi.yaml
# Also copy from PHASE_2_IMPLEMENTATION_GUIDE.md:
# - Section 3.4 → src/api_clients/base_client.py
# - Section 3.6 → src/processors/text_cleaner.py
# - Section 3.7 → src/database/repository.py
```
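The packages need empty `__init__.py` files (see the checklist below); one way to create them all at once:
```bash
touch config/__init__.py src/__init__.py \
      src/api_clients/__init__.py src/processors/__init__.py \
      src/database/__init__.py src/utils/__init__.py
```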
### 3. Build & Deploy (2 min)
```bash
# Build image
docker build -t hadith-ingestion:latest .
# Create secrets
kubectl -n argo create secret generic hadith-db-secret \
--from-literal=password='YOUR_PASSWORD'
kubectl -n argo create secret generic hadithapi-secret \
--from-literal=api-key='$2y$10$nTJnyX3WUDoGmjKrKqSmbecANVsQWKyffmtp9fxmsQwR15DEv4mK'
# Test with 10 hadiths
argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
--parameter book-slug=sahih-bukhari \
--parameter limit=10 \
--watch
```
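Once the test workflow finishes, a quick way to confirm the 10 hadiths landed:
```bash
argo logs -n argo @latest                 # logs of the most recent workflow
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db \
  -c "SELECT COUNT(*) FROM hadiths;"      # expect 10 after the test run
```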
## 📊 Expected Results
### Available Collections
| Book | Hadiths | Time |
|------|---------|------|
| Sahih Bukhari | ~7,500 | 2-3h |
| Sahih Muslim | ~7,000 | 2-3h |
| Sunan Abu Dawood | ~5,000 | 1-2h |
| Jami' at-Tirmidhi | ~4,000 | 1-2h |
| Sunan an-Nasa'i | ~5,700 | 2h |
| Sunan Ibn Majah | ~4,300 | 1-2h |
| **TOTAL** | **~33,500** | **10-15h** |
## 🔧 Key Differences from Sunnah.com
| Feature | HadithAPI.com | Sunnah.com |
|---------|---------------|------------|
| **API Key** | ✅ Public (provided) | ❌ Must be requested (via GitHub) |
| **Rate Limit** | Not documented (we use 30/min) | 100/min |
| **Coverage** | 6 major books | 10+ books |
| **Languages** | Arabic, English, Urdu | Arabic, English |
| **Cost** | ✅ Free | Free |
| **Stability** | Good | Excellent |
## 📝 Complete File Checklist
Create these files from the guides:
```
hadith-ingestion/
├── Dockerfile ✓ Section 6.1
├── requirements.txt ✓ Phase 2 Section 3.2
├── .env ✓ Section 5.2
├── build-hadithapi-ingestion.sh ✓ Section 6.2
├── create-secrets.sh ✓ Section 6.3
├── test-hadithapi-local.sh ✓ Section 7.1
├── test-hadithapi-k8s.sh ✓ Section 7.2
├── run-full-ingestion.sh ✓ Section 7.3
├── config/
│ ├── __init__.py (empty file)
│ └── settings.py ✓ Section 5.1
├── src/
│ ├── __init__.py (empty file)
│ ├── main_hadithapi.py ✓ Section 4.1
│ ├── api_clients/
│ │ ├── __init__.py (empty file)
│ │ ├── base_client.py ✓ Phase 2 Sec 3.4
│ │ └── hadithapi_client.py ✓ Section 2.1
│ ├── processors/
│ │ ├── __init__.py (empty file)
│ │ └── text_cleaner.py ✓ Phase 2 Sec 3.6
│ ├── database/
│ │ ├── __init__.py (empty file)
│ │ ├── connection.py (optional)
│ │ └── repository.py ✓ Phase 2 Sec 3.7
│ └── utils/
│ ├── __init__.py (empty file)
│ └── logger.py (optional)
└── argo/
└── workflows/
└── ingest-hadithapi.yaml ✓ Section 6.4
```
## 🎬 Step-by-Step Execution
### Day 1: Setup & Test (2-3 hours)
```bash
# 1. Create database schema
# 2. Set up project structure
# 3. Build Docker image
# 4. Create secrets
# 5. Run test with 10 hadiths
# 6. Verify data
```
### Day 2: Ingest Major Collections (10-15 hours)
```bash
# Ingest all 6 major collections sequentially
./run-full-ingestion.sh
# Or manually one by one:
argo submit ... --parameter book-slug=sahih-bukhari
argo submit ... --parameter book-slug=sahih-muslim
# etc...
```
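If you prefer an explicit loop over the script, a sketch of the sequential run is below. The first two slugs come from this guide; the remaining four are assumptions following HadithAPI.com's naming, so verify them before running.
```bash
# Sketch of a sequential full ingestion; the last four slugs are assumed.
for slug in sahih-bukhari sahih-muslim abu-dawood al-tirmidhi sunan-nasai ibn-e-majah; do
  argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
    --parameter book-slug="$slug" \
    --wait                            # block until this book finishes
done
```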
### Day 3: Validation & Next Steps
```bash
# 1. Verify data quality
# 2. Check statistics
# 3. Proceed to Phase 3 (ML model development)
```
## ✅ Verification Checklist
After ingestion completes:
```bash
# 1. Check total hadiths
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT COUNT(*) FROM hadiths;
"
# Expected: ~33,500
# 2. Check per collection
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT
c.name_english,
COUNT(h.id) as count
FROM collections c
LEFT JOIN hadiths h ON c.id = h.collection_id
WHERE c.abbreviation IN ('bukhari', 'muslim', 'abudawud', 'tirmidhi', 'nasai', 'ibnmajah')
GROUP BY c.name_english;
"
# 3. Check for errors
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT * FROM ingestion_jobs
WHERE status = 'failed'
ORDER BY created_at DESC;
"
```
## 🐛 Common Issues & Solutions
### Issue: Rate Limiting
```
Error: 429 Too Many Requests
Solution: The client already defaults to a conservative 30 req/min.
If you still hit limits, lower it in config/settings.py:
API_RATE_LIMIT = 20
```
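One way to apply that change and redeploy (assumes the `API_RATE_LIMIT` name from Section 5.1):
```bash
sed -i 's/API_RATE_LIMIT = 30/API_RATE_LIMIT = 20/' config/settings.py
docker build -t hadith-ingestion:latest .   # rebuild so the image picks up the change
```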
### Issue: Connection Timeout
```
Error: Connection timeout to database
Solution:
1. Check PostgreSQL is running
2. Verify credentials in secrets
3. Test connection manually
```
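The three checks above map to these commands:
```bash
kubectl -n db get pods                               # 1. is PostgreSQL running?
kubectl -n argo get secret hadith-db-secret          # 2. does the secret exist?
kubectl -n db exec -it postgres-0 -- \
  psql -U hadith_ingest -d hadith_db -c 'SELECT 1;'  # 3. manual connection test
```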
### Issue: Missing Chapters
```
Warning: chapters_fetch_failed
Solution: The script automatically falls back to fetching all hadiths
This is expected behavior and not critical
```
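Since the logs are structured JSON, you can confirm the fallback fired by grepping for the event name:
```bash
argo logs -n argo @latest | grep chapters_fetch_failed
```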
## 📚 Documentation References
All details in the comprehensive guides:
1. **PHASE_2_IMPLEMENTATION_GUIDE.md**
- PostgreSQL schema (Section 1)
- Base utilities (Section 3)
- Database repository (Section 3.7)
2. **HADITHAPI_INTEGRATION_GUIDE.md**
- API client (Section 2)
- Main ingestion service (Section 4)
- Deployment (Section 6)
- Testing (Section 7)
## 🎯 Next Phase
After Phase 2 completion:
→ **Phase 3: ML Model Development**
- Annotate sample hadiths (Label Studio)
- Train NER model
- Train relation extraction model
- Fine-tune LLM with LoRA
## 💡 Pro Tips
1. **Start Small**: Test with `--limit 10` first
2. **Monitor Progress**: Use `argo logs -n argo <workflow> -f`
3. **Check Logs**: Structured JSON logs for easy debugging
4. **Backup Data**: Back up before major operations (see the sketch below)
5. **Rate Limiting**: Be conservative to avoid blocks
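For tip 4, a minimal backup sketch (assumes the `hadith_db` database and `hadith_ingest` role):
```bash
kubectl -n db exec postgres-0 -- \
  pg_dump -U hadith_ingest hadith_db > hadith_db_$(date +%Y%m%d).sql
```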
## 🎉 Success Criteria
Phase 2 is complete when:
- ✅ Database schema created
- ✅ 33,500+ hadiths ingested
- ✅ All 6 collections present
- ✅ No critical errors
- ✅ Data validated
- ✅ Ready for embedding generation
---
**Estimated Total Time: 1-2 days**
**Difficulty: Intermediate**
**Prerequisites: Phase 1 completed (all core services running)**
Ready to start? Begin with Section 1 of PHASE_2_IMPLEMENTATION_GUIDE.md!