init README

Commit 942119559a (parent b059fcab6e)
@ -0,0 +1,275 @

# 🚀 HadithAPI.com Deployment - Quick Start

## What You Got

Two comprehensive guides plus this summary:

1. **PHASE_2_IMPLEMENTATION_GUIDE.md** - Original guide with PostgreSQL schema
2. **HADITHAPI_INTEGRATION_GUIDE.md** - Complete HadithAPI.com implementation
3. **This summary** - Quick deployment steps

## 📦 Complete Package Structure

The HadithAPI guide includes everything you need:

### Production-Ready Code
- ✅ **hadithapi_client.py** - Full API client with pagination and rate limiting
- ✅ **main_hadithapi.py** - Complete ingestion service
- ✅ **settings.py** - Configuration with your API key
- ✅ **Dockerfile** - Container image
- ✅ **Argo Workflows** - Kubernetes automation
- ✅ **Test scripts** - Validation and troubleshooting

### Key Features
- ✅ Automatic pagination handling
- ✅ Rate limiting (30 req/min)
- ✅ Error handling and retries
- ✅ Progress tracking
- ✅ Structured logging
- ✅ Multi-language support (Arabic, English, Urdu)
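
To make the rate-limiting feature concrete, here is a minimal sketch of a minimum-interval throttle. It is an illustration only, not the code from `hadithapi_client.py`: the class name `MinIntervalLimiter` and its injectable `clock` and `sleep` parameters are assumptions, and only the 30 requests per minute figure comes from this summary.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60 / per_minute seconds apart (hypothetical helper)."""

    def __init__(self, per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            waited = self._next_allowed - now
            self._sleep(waited)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return waited
```

A client built this way would call `limiter.wait()` immediately before each HTTP request. At 30 requests per minute the interval is 2 seconds.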

## 🎯 5-Minute Quick Start

### 1. Database Setup (2 min)
```bash
# Use schema from PHASE_2_IMPLEMENTATION_GUIDE.md Section 1
kubectl -n db exec -it postgres-0 -- psql -U app -d gitea

# Paste all SQL from Sections 1.2 through 1.6 into the psql prompt
# This creates hadith_db with the complete schema
```

### 2. Create Project Structure (1 min)
```bash
mkdir -p hadith-ingestion/{config,src/{api_clients,processors,database,utils},argo/workflows}
cd hadith-ingestion/

# Copy code from HADITHAPI_INTEGRATION_GUIDE.md:
# - Section 2.1 → src/api_clients/hadithapi_client.py
# - Section 4.1 → src/main_hadithapi.py
# - Section 5.1 → config/settings.py
# - Section 6.1 → Dockerfile
# - Section 6.4 → argo/workflows/ingest-hadithapi.yaml

# Also copy from PHASE_2_IMPLEMENTATION_GUIDE.md:
# - Section 3.4 → src/api_clients/base_client.py
# - Section 3.6 → src/processors/text_cleaner.py
# - Section 3.7 → src/database/repository.py
```
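
The `mkdir` command above creates the directories but not the empty `__init__.py` files that the file checklist in this summary calls for. As a rough sketch, the same layout can be scaffolded from Python. The function name `scaffold` and the two list names are assumptions made for this example. The directory names come from this summary.

```python
from pathlib import Path

# Package directories from the file checklist; each gets an empty __init__.py.
PACKAGE_DIRS = [
    "config",
    "src",
    "src/api_clients",
    "src/processors",
    "src/database",
    "src/utils",
]
# argo/workflows holds only YAML, so it gets no __init__.py.
PLAIN_DIRS = ["argo/workflows"]


def scaffold(root):
    """Create the directory layout under root and return the created directories."""
    root = Path(root)
    created = []
    for rel in PACKAGE_DIRS + PLAIN_DIRS:
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    for rel in PACKAGE_DIRS:
        (root / rel / "__init__.py").touch(exist_ok=True)
    return created
```

Calling `scaffold("hadith-ingestion")` is safe to repeat: existing directories and files are left untouched.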

### 3. Build & Deploy (2 min)
```bash
# Build image
docker build -t hadith-ingestion:latest .

# Create secrets
kubectl -n argo create secret generic hadith-db-secret \
  --from-literal=password='YOUR_PASSWORD'

kubectl -n argo create secret generic hadithapi-secret \
  --from-literal=api-key='$2y$10$nTJnyX3WUDoGmjKrKqSmbecANVsQWKyffmtp9fxmsQwR15DEv4mK'

# Test with 10 hadiths
argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
  --parameter book-slug=sahih-bukhari \
  --parameter limit=10 \
  --watch
```

## 📊 Expected Results

### Available Collections
| Book | Hadiths | Est. Time |
|------|---------|-----------|
| Sahih Bukhari | ~7,500 | 2-3h |
| Sahih Muslim | ~7,000 | 2-3h |
| Sunan Abu Dawood | ~5,000 | 1-2h |
| Jami' at-Tirmidhi | ~4,000 | 1-2h |
| Sunan an-Nasa'i | ~5,700 | 2h |
| Sunan Ibn Majah | ~4,300 | 1-2h |
| **TOTAL** | **~33,500** | **10-15h** |
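
The TOTAL row can be checked with a few lines of arithmetic. The sketch below copies the approximate counts and hour ranges from the table. The per-book ranges sum to 9-14 hours, so the 10-15h total presumably includes some slack between runs. That reading is an assumption, not something the guides state.

```python
# Approximate counts and time ranges (low hours, high hours) from the table above.
COLLECTIONS = {
    "Sahih Bukhari": (7500, 2, 3),
    "Sahih Muslim": (7000, 2, 3),
    "Sunan Abu Dawood": (5000, 1, 2),
    "Jami' at-Tirmidhi": (4000, 1, 2),
    "Sunan an-Nasa'i": (5700, 2, 2),
    "Sunan Ibn Majah": (4300, 1, 2),
}

total_hadiths = sum(count for count, _, _ in COLLECTIONS.values())
low_hours = sum(low for _, low, _ in COLLECTIONS.values())
high_hours = sum(high for _, _, high in COLLECTIONS.values())

print(total_hadiths)          # 33500
print(low_hours, high_hours)  # 9 14
```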

## 🔧 Key Differences from Sunnah.com

| Feature | HadithAPI.com | Sunnah.com |
|---------|---------------|------------|
| **API Key** | ✅ Public (provided) | ❌ Requires PR |
| **Rate Limit** | Unknown (using 30/min) | 100/min |
| **Coverage** | 6 major books | 10+ books |
| **Languages** | Arabic, English, Urdu | Arabic, English |
| **Cost** | ✅ Free | Free |
| **Stability** | Good | Excellent |

## 📝 Complete File Checklist

Create these files from the guides:

```
hadith-ingestion/
├── Dockerfile                          ✓ Section 6.1
├── requirements.txt                    ✓ Phase 2 Section 3.2
├── .env                                ✓ Section 5.2
├── build-hadithapi-ingestion.sh        ✓ Section 6.2
├── create-secrets.sh                   ✓ Section 6.3
├── test-hadithapi-local.sh             ✓ Section 7.1
├── test-hadithapi-k8s.sh               ✓ Section 7.2
├── run-full-ingestion.sh               ✓ Section 7.3
├── config/
│   ├── __init__.py                     (empty file)
│   └── settings.py                     ✓ Section 5.1
├── src/
│   ├── __init__.py                     (empty file)
│   ├── main_hadithapi.py               ✓ Section 4.1
│   ├── api_clients/
│   │   ├── __init__.py                 (empty file)
│   │   ├── base_client.py              ✓ Phase 2 Sec 3.4
│   │   └── hadithapi_client.py         ✓ Section 2.1
│   ├── processors/
│   │   ├── __init__.py                 (empty file)
│   │   └── text_cleaner.py             ✓ Phase 2 Sec 3.6
│   ├── database/
│   │   ├── __init__.py                 (empty file)
│   │   ├── connection.py               (optional)
│   │   └── repository.py               ✓ Phase 2 Sec 3.7
│   └── utils/
│       ├── __init__.py                 (empty file)
│       └── logger.py                   (optional)
└── argo/
    └── workflows/
        └── ingest-hadithapi.yaml       ✓ Section 6.4
```

## 🎬 Step-by-Step Execution

### Day 1: Setup & Test (2-3 hours)
```bash
# 1. Create database schema
# 2. Set up project structure
# 3. Build Docker image
# 4. Create secrets
# 5. Run test with 10 hadiths
# 6. Verify data
```

### Day 2: Ingest Major Collections (10-15 hours)
```bash
# Ingest all 6 major collections sequentially
./run-full-ingestion.sh

# Or manually one by one:
argo submit ... --parameter book-slug=sahih-bukhari
argo submit ... --parameter book-slug=sahih-muslim
# etc...
```
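
For the manual route, the per-book commands differ only in the `book-slug` parameter. As a hedged sketch, they can be generated rather than typed. The function name `build_submit_cmd` is hypothetical, and this is not the content of `run-full-ingestion.sh`. The workflow path, namespace, and parameter names are the ones used in the quick-start command, and only the two slugs shown in this summary are listed.

```python
def build_submit_cmd(book_slug, limit=None,
                     workflow="argo/workflows/ingest-hadithapi.yaml",
                     namespace="argo"):
    """Return the argv list for one `argo submit` run (nothing is executed here)."""
    cmd = ["argo", "submit", "-n", namespace, workflow,
           "--parameter", f"book-slug={book_slug}"]
    if limit is not None:
        cmd += ["--parameter", f"limit={limit}"]
    return cmd


# Only these two slugs appear in this summary; take the remaining ones from the guide.
for slug in ["sahih-bukhari", "sahih-muslim"]:
    print(" ".join(build_submit_cmd(slug)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to run the books one after another. Appending `--watch`, as the quick-start command does, keeps each call attached until its workflow finishes.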

### Day 3: Validation & Next Steps
```bash
# 1. Verify data quality
# 2. Check statistics
# 3. Proceed to Phase 3 (ML model development)
```

## ✅ Verification Checklist

After ingestion completes:

```bash
# 1. Check total hadiths
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT COUNT(*) FROM hadiths;
"
# Expected: ~33,500

# 2. Check per collection
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT
    c.name_english,
    COUNT(h.id) as count
FROM collections c
LEFT JOIN hadiths h ON c.id = h.collection_id
WHERE c.abbreviation IN ('bukhari', 'muslim', 'abudawud', 'tirmidhi', 'nasai', 'ibnmajah')
GROUP BY c.name_english;
"

# 3. Check for errors
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT * FROM ingestion_jobs
WHERE status = 'failed'
ORDER BY created_at DESC;
"
```
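
Once the per-collection counts are in hand, comparing them with the approximate figures from the "Available Collections" table is mechanical. The sketch below assumes the counts have been collected into a dict keyed by collection abbreviation. That shape, the function name `find_count_problems`, and the 10% tolerance are assumptions made for this example. The query above groups by `name_english`, so a small mapping step would be needed first.

```python
# Approximate expected counts from the "Available Collections" table,
# keyed by the abbreviations used in the per-collection query.
EXPECTED = {
    "bukhari": 7500,
    "muslim": 7000,
    "abudawud": 5000,
    "tirmidhi": 4000,
    "nasai": 5700,
    "ibnmajah": 4300,
}


def find_count_problems(actual, tolerance=0.10):
    """Return messages for collections that are missing or far from expected."""
    problems = []
    for abbr, expected in EXPECTED.items():
        got = actual.get(abbr, 0)
        if got == 0:
            problems.append(f"{abbr}: no hadiths found")
        elif abs(got - expected) > tolerance * expected:
            problems.append(f"{abbr}: got {got}, expected about {expected}")
    return problems
```

An empty list means every collection is present and within tolerance. Because the table figures are approximate, a flagged collection is a prompt to look closer rather than proof of a failed run.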

## 🐛 Common Issues & Solutions

### Issue: Rate Limiting
```
Error: 429 Too Many Requests
Solution: Already set to conservative 30/min
If still hitting limits, edit settings.py:
API_RATE_LIMIT = 20
```
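
Lowering the limit reduces how often a 429 occurs. Backing off and retrying handles the ones that still get through. The following is a minimal sketch of exponential backoff, not the retry logic shipped in the guide: the `RateLimited` exception, the `call_with_backoff` name, and the delay values are all assumptions.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's request function on HTTP 429 (hypothetical)."""


def call_with_backoff(request, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call request(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the waits between attempts are 2, 4, 8, and 16 seconds before the error is re-raised on the fifth failure.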

### Issue: Connection Timeout
```
Error: Connection timeout to database
Solution:
1. Check PostgreSQL is running
2. Verify credentials in secrets
3. Test connection manually
```
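
For the manual connection test, a plain TCP check separates network problems from credential problems. This sketch uses only the standard library. The function name `can_connect` is hypothetical, 5432 is PostgreSQL's default port, and the host name to pass depends on your cluster and is not given in this summary.

```python
import socket


def can_connect(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `True` result only shows that the port is reachable. It does not validate the username or password, so a failing login with a reachable port points to the secrets rather than the network.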

### Issue: Missing Chapters
```
Warning: chapters_fetch_failed
Solution: Script automatically falls back to fetching all hadiths
This is expected and not critical
```
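
The fallback described above can be pictured roughly as follows. This is a sketch of the pattern, not the script's actual code: `fetch_book` and its three callable parameters are hypothetical names, and only the `chapters_fetch_failed` warning label comes from this summary.

```python
import logging

logger = logging.getLogger("hadith_ingestion")


def fetch_book(fetch_chapters, fetch_by_chapter, fetch_all):
    """Fetch hadiths chapter by chapter; fall back to one bulk fetch on failure."""
    try:
        chapters = fetch_chapters()
    except Exception as exc:  # broad on purpose: any chapter failure triggers the fallback
        logger.warning("chapters_fetch_failed: %s", exc)
        return list(fetch_all())
    hadiths = []
    for chapter in chapters:
        hadiths.extend(fetch_by_chapter(chapter))
    return hadiths
```

Either path returns a flat list of hadiths, which is why the warning is not critical: only the chapter grouping is lost.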

## 📚 Documentation References

All details in the comprehensive guides:

1. **PHASE_2_IMPLEMENTATION_GUIDE.md**
   - PostgreSQL schema (Section 1)
   - Base utilities (Section 3)
   - Database repository (Section 3.7)

2. **HADITHAPI_INTEGRATION_GUIDE.md**
   - API client (Section 2)
   - Main ingestion service (Section 4)
   - Deployment (Section 6)
   - Testing (Section 7)

## 🎯 Next Phase

After Phase 2 completion:

→ **Phase 3: ML Model Development**
- Annotate sample hadiths (Label Studio)
- Train NER model
- Train relation extraction model
- Fine-tune LLM with LoRA

## 💡 Pro Tips

1. **Start Small**: Test with `--parameter limit=10` first, as in the quick-start command
2. **Monitor Progress**: Use `argo logs -n argo <workflow> -f`
3. **Check Logs**: Logs are structured JSON for easy debugging
4. **Back Up Data**: Take a backup before major operations
5. **Rate Limiting**: Be conservative to avoid blocks
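
To show what structured JSON logs look like in practice, here is a minimal standard-library sketch. It is not the logger from the guides: the `JsonFormatter` class, the `ingest_demo` logger name, and the convention of passing extra fields under a `fields` key are all assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Extra fields arrive via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        # ensure_ascii=False keeps Arabic and Urdu text readable in the output.
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("ingest_demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("page_fetched", extra={"fields": {"book": "sahih-bukhari", "page": 1}})
```

One JSON object per line means the output of `argo logs` can be filtered with ordinary JSON tooling instead of regular expressions.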

## 🎉 Success Criteria

Phase 2 is complete when:
- ✅ Database schema created
- ✅ ~33,500 hadiths ingested
- ✅ All 6 collections present
- ✅ No critical errors
- ✅ Data validated
- ✅ Ready for embedding generation

---

- **Estimated Total Time: 1-2 days**
- **Difficulty: Intermediate**
- **Prerequisites: Phase 1 completed (all core services running)**

Ready to start? Begin with Section 1 of PHASE_2_IMPLEMENTATION_GUIDE.md!