
# OCR PDF Service - Online OCR Service

**Status**: CONCEPT
**Created**: 19/11/2025
**Moved to CONCEPT**: 30/11/2025 (from PAUSE)
**Stack**: To be defined (probably Node.js + PaddleOCR Python backend)

---

## Concept

Online OCR service for PDFs with two output modes:
1. **Raw text extraction** - PDF → extracted text
2. **PDF with embedded text** - Scanned PDF → searchable PDF (OCR text embedded in the PDF)

---

## Use Cases

### Mode 1: Text Extraction
- Upload a scanned PDF
- The service runs OCR on every page
- Returns a structured text file
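
What "structured" means here is still open. As a minimal sketch, one option is a plain text file with a marker before each page and lines sorted in reading order. The `OcrLine` type, the `pages_to_text` helper, and the page marker format are all hypothetical, and the coordinates are assumed to be pixel positions of each recognized line.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """One recognized line (hypothetical shape, not PaddleOCR's own format)."""
    text: str
    x: float  # left edge of the line's bounding box, in pixels
    y: float  # top edge of the line's bounding box, in pixels


def pages_to_text(pages: list[list[OcrLine]]) -> str:
    """Join OCR lines into one text document with a marker before each page."""
    chunks = []
    for number, lines in enumerate(pages, start=1):
        # Naive reading order: top to bottom, then left to right.
        ordered = sorted(lines, key=lambda line: (line.y, line.x))
        body = "\n".join(line.text for line in ordered)
        chunks.append(f"--- Page {number} ---\n{body}")
    return "\n\n".join(chunks) + "\n"


sample = [
    [OcrLine("second line", 10, 40), OcrLine("first line", 10, 12)],
    [OcrLine("only line", 10, 12)],
]
print(pages_to_text(sample))
```

Sorting by `(y, x)` is only adequate for single-column pages; multi-column layouts would need a real layout analysis step.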

### Mode 2: Searchable PDF
- Upload a scanned PDF (images only)
- The service runs OCR and embeds invisible text in the PDF
- Returns a PDF with a text layer (Ctrl+F works, text can be selected)
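
Placing the invisible text layer requires mapping each OCR box from image pixels to PDF page coordinates. The sketch below shows only that mapping, assuming axis-aligned boxes given as `(left, top, right, bottom)` with a top-left image origin, and a PDF page in default user space (origin at the bottom-left, units in points). The function name `pixel_box_to_pdf` is hypothetical.

```python
def pixel_box_to_pdf(box, image_size, page_size):
    """Map a pixel box to (x, y, width, height) in PDF points.

    box:        (left, top, right, bottom) in image pixels, origin top-left
    image_size: (width, height) of the rendered page image, in pixels
    page_size:  (width, height) of the PDF page, in points
    """
    left, top, right, bottom = box
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx = page_w / img_w  # horizontal scale, points per pixel
    sy = page_h / img_h  # vertical scale, points per pixel
    x = left * sx
    # Flip the vertical axis: PDF's y grows upward from the bottom edge.
    y = page_h - bottom * sy
    return (x, y, (right - left) * sx, (bottom - top) * sy)


# An A4 page (595 x 842 pt) rendered at twice its size in pixels.
print(pixel_box_to_pdf((100, 200, 500, 260), (1190, 1684), (595, 842)))
# → (50.0, 712.0, 200.0, 30.0)
```

Rotated pages, cropped page boxes, and skewed OCR quadrilaterals are not handled here and would need extra care.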

---

## Potential Stack

**OCR Backend**:
- PaddleOCR (already validated on ClassGen - 99.97% accuracy on Chinese)
- Multilingual support (CN/EN/FR/etc.)
- Python API

**Web Service**:
- Node.js + Express (REST API)
- Upload handling (multipart/form-data)
- Queue system for OCR jobs (Redis + Bull?)

**PDF Processing**:
- PDF.js or pdf-lib (PDF handling on the Node side)
- PyPDF2 or reportlab (Python - embedding the text in the PDF)

**Frontend** (optional):
- Simple upload form
- Progress tracking
- Result download

---

## Proposed Architecture

```
┌─────────────┐
│   Client    │
│  (Browser)  │
└──────┬──────┘
       │ Upload PDF
       ▼
┌─────────────────┐
│   Node.js API   │
│   (Express)     │
└────────┬────────┘
         │ Enqueue job
         ▼
┌─────────────────┐
│   Job Queue     │
│   (Redis/Bull)  │
└────────┬────────┘
         │ Process
         ▼
┌─────────────────┐
│  Python Worker  │
│  (PaddleOCR)    │
└────────┬────────┘
         │ OCR Result
         ▼
┌─────────────────┐
│  PDF Generator  │
│  (PyPDF2/etc)   │
└────────┬────────┘
         │ Output PDF
         ▼
┌─────────────────┐
│  Storage/CDN    │
│  (Download)     │
└─────────────────┘
```
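
The enqueue → process → result flow in the diagram can be prototyped in a single process before committing to Redis/Bull. This is a minimal sketch using only Python's standard library: `fake_ocr`, `enqueue`, and the in-memory `jobs` dictionary are hypothetical stand-ins, not the eventual API or worker.

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}            # job_id -> status, input, result
pending: queue.Queue = queue.Queue()  # stands in for the Redis-backed queue


def fake_ocr(pdf_bytes: bytes) -> str:
    """Placeholder for the real OCR step (hypothetical)."""
    return f"{len(pdf_bytes)} bytes processed"


def enqueue(pdf_bytes: bytes) -> str:
    """Register a job, push it on the queue, and return its id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "input": pdf_bytes, "result": None}
    pending.put(job_id)
    return job_id


def worker() -> None:
    """Consume jobs until a None sentinel arrives."""
    while True:
        job_id = pending.get()
        if job_id is None:
            break
        job = jobs[job_id]
        job["status"] = "processing"
        try:
            job["result"] = fake_ocr(job["input"])
            job["status"] = "done"
        except Exception as exc:  # record the failure instead of crashing
            job["status"] = "failed"
            job["result"] = str(exc)
        finally:
            pending.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
job_id = enqueue(b"%PDF-1.7 fake")
pending.join()  # wait until the queued job has been handled
print(jobs[job_id]["status"], jobs[job_id]["result"])
pending.put(None)  # stop the worker
```

The status values (`queued`, `processing`, `done`, `failed`) are the kind of state a progress-tracking endpoint could expose to the client.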

---

## MVP Features

### Core
- [x] PDF upload (max size?)
- [x] Automatic language detection
- [x] OCR via PaddleOCR
- [x] Raw text export (.txt)
- [x] Searchable PDF export
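
Since the maximum upload size is still undecided, the check below takes it as a parameter. It is a rough sketch of upload validation: `validate_upload` and the 5 MB default are assumptions, and the `%PDF-` test is a deliberately strict check on the first bytes of the file.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed default; the real limit is undecided


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the upload is empty, too large, or not a PDF."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(f"file too large: {len(data)} > {max_bytes} bytes")
    # PDF files begin with a "%PDF-" header followed by the version number.
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF file (missing %PDF- header)")


for candidate in (b"%PDF-1.7\n", b"GIF89a", b""):
    try:
        validate_upload(candidate)
        print("accepted")
    except ValueError as exc:
        print("rejected:", exc)
```

Checking the header rather than the file extension or the declared content type avoids trusting client-supplied metadata.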

### Nice-to-Have
- [ ] Batch processing (multiple PDFs)
- [ ] Image support (JPG, PNG) in addition to PDFs
- [ ] Manual choice of OCR language
- [ ] Preview before download
- [ ] API key for programmatic use
- [ ] Webhook to notify when a job finishes

---

## Differentiation vs. Competition

**Competitors**:
- Adobe Acrobat (paid, heavyweight)
- Online OCR services (usage limits, privacy?)
- Google Drive OCR (format limitations)

**Our advantage**:
- **Free** (or freemium)
- **Open source** (if you want)
- **Privacy-focused**: Upload → Process → Delete (no permanent storage)
- **Optimized multi-language**: excellent Chinese support (PaddleOCR)
- **Two modes**: raw text OR searchable PDF
- **Public API**: integration into workflows
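
The Upload → Process → Delete promise can be enforced structurally by keeping each upload in a temporary directory that is removed as soon as processing ends, even on error. A minimal sketch, where `process_and_delete` and the `process` callback are hypothetical names:

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_and_delete(pdf_bytes: bytes, process: Callable[[Path], str]) -> str:
    """Write the upload to a temp dir, process it, then remove everything."""
    with tempfile.TemporaryDirectory(prefix="ocr_job_") as workdir:
        input_path = Path(workdir) / "input.pdf"
        input_path.write_bytes(pdf_bytes)
        # The directory and its contents are deleted when this block exits,
        # whether process() returns normally or raises.
        return process(input_path)


seen = []


def fake_process(path: Path) -> str:
    """Stand-in for the OCR step; remembers the path it was given."""
    seen.append(path)
    return f"{path.stat().st_size} bytes"


result = process_and_delete(b"%PDF-1.7 fake", fake_process)
print(result, seen[0].exists())  # → 13 bytes False
```

This only removes the files from the filesystem; results handed back for download would need their own expiry policy.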

---

## Potential Monetization

**Freemium Model**:
- **Free tier**: 10 PDFs/month, max 5MB, optional watermark
- **Pro tier**: 100 PDFs/month, max 50MB, no watermark, API access
- **Enterprise**: Unlimited, self-hosted option, support
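
If the freemium route is chosen, the tiers above could be expressed as data and checked before a job is accepted. This is a sketch only: the `Tier` class and `can_process` function are hypothetical, and treating Enterprise as having no size limit and API access, and Free as having no API access, are assumptions not stated in the list.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class Tier:
    name: str
    pdfs_per_month: Optional[int]  # None means unlimited
    max_bytes: Optional[int]       # None means no size limit (assumption)
    api_access: bool


TIERS = {
    "free": Tier("free", 10, 5 * MB, api_access=False),
    "pro": Tier("pro", 100, 50 * MB, api_access=True),
    "enterprise": Tier("enterprise", None, None, api_access=True),
}


def can_process(tier: Tier, used_this_month: int, size_bytes: int) -> bool:
    """Return True if one more PDF of this size fits within the tier's limits."""
    if tier.pdfs_per_month is not None and used_this_month >= tier.pdfs_per_month:
        return False
    if tier.max_bytes is not None and size_bytes > tier.max_bytes:
        return False
    return True


print(can_process(TIERS["free"], used_this_month=9, size_bytes=4 * MB))   # → True
print(can_process(TIERS["free"], used_this_month=10, size_bytes=4 * MB))  # → False
print(can_process(TIERS["pro"], used_this_month=10, size_bytes=20 * MB))  # → True
```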

**Alternative**:
- Completely free + donations
- Or completely free as a portfolio piece

---

## Estimated Timeline

**Phase 1 - MVP (1-2 weeks)**:
- Set up the Python backend (PaddleOCR already validated)
- Node.js upload/download API
- Raw text extraction mode
- Minimal web interface

**Phase 2 - Searchable PDF (1 week)**:
- Embed the text in the original PDF
- Quality tests (text/image alignment)

**Phase 3 - Polish (1 week)**:
- Improved UI/UX
- Robust error handling
- Rate limiting
- API documentation

**Total**: 3-4 weeks for a production-ready version

---

## Risks & Challenges

**Technical**:
- Aligning OCR text with its position in the PDF (complex)
- Performance on large PDFs (100+ pages)
- Memory management (PaddleOCR can be resource-hungry)
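
One way to bound memory on large PDFs is to render and OCR pages in small batches rather than loading the whole document at once. The sketch below covers only the batching logic; `page_batches` is a hypothetical helper, and the batch size of 4 is an arbitrary example value to tune against real memory usage.

```python
from typing import Iterator


def page_batches(page_count: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive ranges of zero-based page indices, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


for batch in page_batches(page_count=10, batch_size=4):
    # A real worker would render and OCR only these pages, then release them.
    print(list(batch))
# → [0, 1, 2, 3]
# → [4, 5, 6, 7]
# → [8, 9]
```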

**Business**:
- Server cost (OCR is CPU-intensive)
- Scaling if it succeeds
- Legal: respect the copyright of uploaded PDFs

**Product**:
- Lots of competition
- Needs a clear USP (why use ours?)

---

## Link with ClassGen

**Synergy**:
- OCR pipeline already validated (99.97% accuracy)
- Reusable code (PaddleOCR setup, AI correction)
- Same backend stack

**Difference**:
- ClassGen: OCR → structured JSON → gamification (personal use)
- OCR Service: OCR → PDF/text → download (general use)

---

## Decision to Make

**Questions**:
1. **Priority**: Before or after ClassGen is stable?
2. **Scope**: Simple MVP or full service?
3. **Monetization**: Free, freemium, or portfolio piece?
4. **Hosting**: VPS, serverless, or local first?

**Recommendation**:
- Wait until ClassGen has been delivered and used for 1-2 weeks
- Validate the OCR pipeline in real-world use
- Then decide whether this service makes commercial sense

---

## Notes

**Observed pattern**: Yet another brilliantly designed project. Be careful not to fall into the "designed but never shipped" trap.

**Solution**:
- Strict time-boxing (sessions of 4h max)
- Ultra-minimal MVP first
- Ship even if it is "not perfect"
- Improve based on real feedback

**Socratic question**: Why this project now? What concrete problem does it solve for you or for others? Or is it just "that would be cool"?