# 🚀 HadithAPI.com Deployment - Quick Start

## What You Got

Three comprehensive guides:

1. **PHASE_2_IMPLEMENTATION_GUIDE.md** - Original guide with the PostgreSQL schema
2. **HADITHAPI_INTEGRATION_GUIDE.md** - Complete HadithAPI.com implementation
3. **This summary** - Quick deployment steps

## 📦 Complete Package Structure

The HadithAPI guide includes everything you need:

### Production-Ready Code

✅ **hadithapi_client.py** - Full API client with pagination and rate limiting
✅ **main_hadithapi.py** - Complete ingestion service
✅ **settings.py** - Configuration with your API key
✅ **Dockerfile** - Container image
✅ **Argo Workflows** - Kubernetes automation
✅ **Test scripts** - Validation and troubleshooting

### Key Features

- ✅ Automatic pagination handling
- ✅ Rate limiting (30 req/min)
- ✅ Error handling and retries
- ✅ Progress tracking
- ✅ Structured logging
- ✅ Multi-language support (Arabic, English, Urdu)
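
The authoritative client is the one in HADITHAPI_INTEGRATION_GUIDE.md Section 2.1. As a rough illustration of how the pagination, rate-limiting, and retry pieces fit together, here is a minimal, self-contained sketch; the endpoint path, query parameters, and response shape are assumptions for illustration, not the actual HadithAPI.com contract:

```python
import time

import requests


class HadithAPIClientSketch:
    """Minimal sketch of a paginated, rate-limited API client.

    The endpoint, parameters, and response shape below are assumptions;
    the real client is src/api_clients/hadithapi_client.py (Section 2.1).
    """

    def __init__(self, api_key: str, rate_limit_per_min: int = 30):
        self.api_key = api_key
        self.min_interval = 60.0 / rate_limit_per_min  # seconds between requests
        self._last_request = 0.0
        self.session = requests.Session()

    def _throttle(self) -> None:
        # Sleep just long enough to stay under the configured requests/minute.
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

    def _get(self, url: str, params: dict) -> dict:
        # Simple retry loop with exponential backoff for transient failures.
        for attempt in range(3):
            self._throttle()
            resp = self.session.get(
                url, params={**params, "apiKey": self.api_key}, timeout=30
            )
            if resp.status_code in (429, 500, 502, 503):
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError(f"Giving up on {url} after 3 attempts")

    def iter_hadiths(self, book_slug: str, page_size: int = 25):
        """Yield hadith records page by page until the API returns no more data."""
        page = 1
        while True:
            data = self._get(
                "https://hadithapi.com/api/hadiths",  # assumed endpoint
                {"book": book_slug, "paginate": page_size, "page": page},
            )
            items = data.get("hadiths", {}).get("data", [])
            if not items:
                return
            yield from items
            page += 1
```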
## 🎯 5-Minute Quick Start

### 1. Database Setup (2 min)

```bash
# Use the schema from PHASE_2_IMPLEMENTATION_GUIDE.md Section 1
kubectl -n db exec -it postgres-0 -- psql -U app -d gitea

# Copy all SQL from Sections 1.2 through 1.6
# This creates hadith_db with the complete schema
```

### 2. Create Project Structure (1 min)

```bash
mkdir -p hadith-ingestion/{config,src/{api_clients,processors,database,utils},argo/workflows}
cd hadith-ingestion/

# Copy code from HADITHAPI_INTEGRATION_GUIDE.md:
# - Section 2.1 → src/api_clients/hadithapi_client.py
# - Section 4.1 → src/main_hadithapi.py
# - Section 5.1 → config/settings.py
# - Section 6.1 → Dockerfile
# - Section 6.4 → argo/workflows/ingest-hadithapi.yaml

# Also copy from PHASE_2_IMPLEMENTATION_GUIDE.md:
# - Section 3.4 → src/api_clients/base_client.py
# - Section 3.6 → src/processors/text_cleaner.py
# - Section 3.7 → src/database/repository.py
```

### 3. Build & Deploy (2 min)

```bash
# Build image
docker build -t hadith-ingestion:latest .

# Create secrets
kubectl -n argo create secret generic hadith-db-secret \
  --from-literal=password='YOUR_PASSWORD'

kubectl -n argo create secret generic hadithapi-secret \
  --from-literal=api-key='$2y$10$nTJnyX3WUDoGmjKrKqSmbecANVsQWKyffmtp9fxmsQwR15DEv4mK'

# Test with 10 hadiths
argo submit -n argo argo/workflows/ingest-hadithapi.yaml \
  --parameter book-slug=sahih-bukhari \
  --parameter limit=10 \
  --watch
```
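
The workflow forwards `book-slug` and `limit` to the ingestion container. A minimal sketch of how an entrypoint like src/main_hadithapi.py (Section 4.1) might accept them, assuming they arrive as command-line flags; the wiring shown in comments is hypothetical and the guide's version is authoritative:

```python
import argparse
import sys
from typing import Optional


def ingest(book_slug: str, limit: Optional[int]) -> int:
    """Placeholder for the real ingestion loop in src/main_hadithapi.py."""
    count = 0
    # The real loop would page through the API client, clean each record,
    # and upsert it via the repository, stopping once `limit` is reached:
    #
    #   for hadith in client.iter_hadiths(book_slug):
    #       repository.upsert(hadith)
    #       count += 1
    #       if limit is not None and count >= limit:
    #           break
    return count


def main(argv=None) -> int:
    # These two flags mirror the Argo workflow parameters book-slug and limit.
    parser = argparse.ArgumentParser(description="Ingest one collection from HadithAPI.com")
    parser.add_argument("--book-slug", required=True, help="e.g. sahih-bukhari")
    parser.add_argument("--limit", type=int, default=None,
                        help="stop after N hadiths (handy for the 10-hadith smoke test)")
    args = parser.parse_args(argv)

    ingested = ingest(args.book_slug, args.limit)
    print(f"book={args.book_slug} ingested={ingested}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```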
## 📊 Expected Results

### Available Collections

| Book | Hadiths | Est. Time |
|------|---------|-----------|
| Sahih Bukhari | ~7,500 | 2-3h |
| Sahih Muslim | ~7,000 | 2-3h |
| Sunan Abu Dawood | ~5,000 | 1-2h |
| Jami` at-Tirmidhi | ~4,000 | 1-2h |
| Sunan an-Nasa'i | ~5,700 | 2h |
| Sunan Ibn Majah | ~4,300 | 1-2h |
| **TOTAL** | **~33,500** | **10-15h** |

## 🔧 Key Differences from Sunnah.com

| Feature | HadithAPI.com | Sunnah.com |
|---------|---------------|------------|
| **API Key** | ✅ Public (provided) | ❌ Requires PR |
| **Rate Limit** | Unknown (using 30/min) | 100/min |
| **Coverage** | 6 major books | 10+ books |
| **Languages** | Arabic, English, Urdu | Arabic, English |
| **Cost** | ✅ Free | Free |
| **Stability** | Good | Excellent |

## 📝 Complete File Checklist

Create these files from the guides:

```
hadith-ingestion/
├── Dockerfile                        ✓ Section 6.1
├── requirements.txt                  ✓ Phase 2 Section 3.2
├── .env                              ✓ Section 5.2
├── build-hadithapi-ingestion.sh      ✓ Section 6.2
├── create-secrets.sh                 ✓ Section 6.3
├── test-hadithapi-local.sh           ✓ Section 7.1
├── test-hadithapi-k8s.sh             ✓ Section 7.2
├── run-full-ingestion.sh             ✓ Section 7.3
├── config/
│   ├── __init__.py                   (empty file)
│   └── settings.py                   ✓ Section 5.1
├── src/
│   ├── __init__.py                   (empty file)
│   ├── main_hadithapi.py             ✓ Section 4.1
│   ├── api_clients/
│   │   ├── __init__.py               (empty file)
│   │   ├── base_client.py            ✓ Phase 2 Sec 3.4
│   │   └── hadithapi_client.py       ✓ Section 2.1
│   ├── processors/
│   │   ├── __init__.py               (empty file)
│   │   └── text_cleaner.py           ✓ Phase 2 Sec 3.6
│   ├── database/
│   │   ├── __init__.py               (empty file)
│   │   ├── connection.py             (optional)
│   │   └── repository.py             ✓ Phase 2 Sec 3.7
│   └── utils/
│       ├── __init__.py               (empty file)
│       └── logger.py                 (optional)
└── argo/
    └── workflows/
        └── ingest-hadithapi.yaml     ✓ Section 6.4
```
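
config/settings.py (Section 5.1) centralizes the values the other modules read. A hedged sketch of the kind of environment-driven configuration it holds; the variable and field names below are assumptions, not the guide's exact ones:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Illustrative configuration container; field names are assumptions."""

    hadithapi_base_url: str = os.getenv("HADITHAPI_BASE_URL", "https://hadithapi.com/api")
    hadithapi_key: str = os.getenv("HADITHAPI_KEY", "")           # from the hadithapi-secret
    api_rate_limit: int = int(os.getenv("API_RATE_LIMIT", "30"))  # requests per minute
    db_host: str = os.getenv("DB_HOST", "postgres.db.svc.cluster.local")
    db_name: str = os.getenv("DB_NAME", "hadith_db")
    db_user: str = os.getenv("DB_USER", "hadith_ingest")
    db_password: str = os.getenv("DB_PASSWORD", "")               # from hadith-db-secret


settings = Settings()
```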
## 🎬 Step-by-Step Execution

### Day 1: Setup & Test (2-3 hours)

```bash
# 1. Create database schema
# 2. Set up project structure
# 3. Build Docker image
# 4. Create secrets
# 5. Run test with 10 hadiths
# 6. Verify data
```

### Day 2: Ingest Major Collections (10-15 hours)

```bash
# Ingest all 6 major collections sequentially
./run-full-ingestion.sh

# Or manually, one by one:
argo submit ... --parameter book-slug=sahih-bukhari
argo submit ... --parameter book-slug=sahih-muslim
# etc...
```

### Day 3: Validation & Next Steps

```bash
# 1. Verify data quality
# 2. Check statistics
# 3. Proceed to Phase 3 (ML model development)
```
## ✅ Verification Checklist

After ingestion completes:

```bash
# 1. Check total hadiths
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT COUNT(*) FROM hadiths;
"
# Expected: ~33,500

# 2. Check per collection
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT
  c.name_english,
  COUNT(h.id) AS count
FROM collections c
LEFT JOIN hadiths h ON c.id = h.collection_id
WHERE c.abbreviation IN ('bukhari', 'muslim', 'abudawud', 'tirmidhi', 'nasai', 'ibnmajah')
GROUP BY c.name_english;
"

# 3. Check for errors
kubectl -n db exec -it postgres-0 -- psql -U hadith_ingest -d hadith_db -c "
SELECT * FROM ingestion_jobs
WHERE status = 'failed'
ORDER BY created_at DESC;
"
```
## 🐛 Common Issues & Solutions

### Issue: Rate Limiting

```
Error: 429 Too Many Requests
Solution: The client is already set to a conservative 30 req/min.
If you still hit limits, lower the rate in settings.py:
  API_RATE_LIMIT = 20
```

### Issue: Connection Timeout

```
Error: Connection timeout to database
Solution:
1. Check that PostgreSQL is running
2. Verify the credentials in the secrets
3. Test the connection manually (see the sketch below)
```
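
For step 3 ("test the connection manually"), a quick stand-alone check, assuming psycopg2-binary is in requirements.txt; the host and credential defaults are placeholders, so take the real values from your hadith-db-secret / .env:

```python
import os

import psycopg2

# Connection details are placeholders; supply the real values via environment
# variables rather than hard-coding them.
conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "localhost"),
    dbname=os.getenv("DB_NAME", "hadith_db"),
    user=os.getenv("DB_USER", "hadith_ingest"),
    password=os.environ["DB_PASSWORD"],
    connect_timeout=5,
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```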
### Issue: Missing Chapters

```
Warning: chapters_fetch_failed
Solution: The script automatically falls back to fetching all hadiths.
This is expected and not critical.
```
## 📚 Documentation References

All details are in the comprehensive guides:

1. **PHASE_2_IMPLEMENTATION_GUIDE.md**
   - PostgreSQL schema (Section 1)
   - Base utilities (Section 3)
   - Database repository (Section 3.7)

2. **HADITHAPI_INTEGRATION_GUIDE.md**
   - API client (Section 2)
   - Main ingestion service (Section 4)
   - Deployment (Section 6)
   - Testing (Section 7)

## 🎯 Next Phase

After Phase 2 completion:

→ **Phase 3: ML Model Development**
- Annotate sample hadiths (Label Studio)
- Train NER model
- Train relation extraction model
- Fine-tune LLM with LoRA

## 💡 Pro Tips

1. **Start Small**: Test with `--limit 10` first
2. **Monitor Progress**: Use `argo logs -n argo <workflow> -f`
3. **Check Logs**: Structured JSON logs make debugging easier (see the logger sketch below)
4. **Backup Data**: Back up before major operations
5. **Rate Limiting**: Stay conservative to avoid blocks
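
src/utils/logger.py is marked optional in the checklist. If you want the structured JSON logs mentioned in tip 3 without adding dependencies, here is a stdlib-only sketch; the field names are just one reasonable choice, not what the guides prescribe:

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for easy grepping and parsing."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def get_logger(name: str = "hadith-ingestion") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on re-import
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


# Example: get_logger().info("ingestion_started")
```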
## 🎉 Success Criteria

Phase 2 is complete when:

- ✅ Database schema created
- ✅ 33,500+ hadiths ingested
- ✅ All 6 collections present
- ✅ No critical errors
- ✅ Data validated
- ✅ Ready for embedding generation

---

**Estimated Total Time: 1-2 days**
**Difficulty: Intermediate**
**Prerequisites: Phase 1 completed (all core services running)**

Ready to start? Begin with Section 1 of PHASE_2_IMPLEMENTATION_GUIDE.md!