Add OCR PDF Service project documentation

New project: Online OCR service for PDFs with dual output modes
- Mode 1: Extract text from scanned PDFs
- Mode 2: Generate searchable PDFs with embedded OCR text layer

Key features:
- Multi-language support (CN/EN/FR) via PaddleOCR
- Two output formats: plain text or searchable PDF
- Reuses validated OCR pipeline from ClassGen (99.97% accuracy)
- Proposed architecture: Node.js API + Python OCR worker + job queue

Suggested stack:
- Backend: PaddleOCR (already validated), Node.js + Express
- PDF processing: pdf-lib, PyPDF2
- Queue: Redis + Bull for async processing

Timeline: 3-4 weeks for production-ready MVP
Status: Conception phase - awaiting prioritization decision

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 11:37:06 +08:00
commit 8600fbe23f
parent bdbe17a3a0
1 changed file with 222 additions and 0 deletions
--- a/Projects/ocr_pdf_service.md
+++ b/Projects/ocr_pdf_service.md
@@ -0,0 +1,222 @@
+# OCR PDF Service - Online OCR Service
+
+**Created**: 2025-11-19
+**Status**: Conception phase
+**Stack**: To be defined (probably Node.js + PaddleOCR Python backend)
+
+---
+
+## Concept
+
+Online OCR service for PDFs with two output modes:
+1. **Plain-text extraction** - PDF → extracted text
+2. **PDF with embedded text** - scanned PDF → searchable PDF (OCR text embedded in the PDF)
+
+---
+
+## Use Cases
+
+### Mode 1: Text Extraction
+- Upload a scanned PDF
+- The service runs OCR on every page
+- Returns a structured text file
+
+### Mode 2: Searchable PDF
+- Upload a scanned PDF (images only)
+- The service runs OCR and embeds invisible text in the PDF
+- Returns a PDF with a text layer (Ctrl+F works, text can be selected)
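For the invisible text layer in Mode 2 to line up with the scanned words, each OCR bounding box has to be mapped from image pixels into PDF coordinates. The sketch below shows one way to do that mapping. It assumes the page image covers the whole page with no rotation or crop offset, that OCR boxes use a top-left origin, and that the PDF uses its default bottom-left origin measured in points. The function name `pixel_box_to_pdf` is hypothetical and not part of the project.

```python
def pixel_box_to_pdf(box, image_size, page_size):
    """Map an OCR box from image pixels to PDF points.

    box:        (x0, y0, x1, y1) in pixels, origin at the top-left corner
    image_size: (width, height) of the rendered page image in pixels
    page_size:  (width, height) of the PDF page in points
    Returns (x0, y0, x1, y1) in points, origin at the bottom-left corner.
    Assumption: the image covers the full page, unrotated.
    """
    x0, y0, x1, y1 = box
    img_w, img_h = image_size
    page_w, page_h = page_size

    # Scale factors from pixels to points on each axis.
    sx = page_w / img_w
    sy = page_h / img_h

    # Flip the vertical axis: the bottom edge of the pixel box (y1)
    # becomes the lower y value in PDF space.
    return (x0 * sx, page_h - y1 * sy, x1 * sx, page_h - y0 * sy)
```

For example, on a 595 x 842 point page rendered at 1190 x 1684 pixels, the pixel box `(100, 200, 300, 250)` maps to `(50.0, 717.0, 150.0, 742.0)`.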
+
+---
+
+## Potential Stack
+
+**OCR Backend**:
+- PaddleOCR (already validated on ClassGen - 99.97% accuracy on Chinese)
+- Multilingual support (CN/EN/FR/etc.)
+- Python API
+
+**Web Service**:
+- Node.js + Express (REST API)
+- Upload handling (multipart/form-data)
+- Queue system for OCR jobs (Redis + Bull?)
+
+**PDF Processing**:
+- PDF.js or pdf-lib (PDF manipulation on the Node side)
+- PyPDF2 or reportlab (Python - embedding text in the PDF)
+
+**Frontend** (optional):
+- Simple upload form
+- Progress tracking
+- Result download
+
+---
+
+## Proposed Architecture
+
+```
+┌─────────────┐
+│   Client    │
+│  (Browser)  │
+└──────┬──────┘
+       │ Upload PDF
+       ▼
+┌─────────────────┐
+│   Node.js API   │
+│   (Express)     │
+└────────┬────────┘
+         │ Enqueue job
+         ▼
+┌─────────────────┐
+│   Job Queue     │
+│   (Redis/Bull)  │
+└────────┬────────┘
+         │ Process
+         ▼
+┌─────────────────┐
+│  Python Worker  │
+│  (PaddleOCR)    │
+└────────┬────────┘
+         │ OCR Result
+         ▼
+┌─────────────────┐
+│  PDF Generator  │
+│  (PyPDF2/etc)   │
+└────────┬────────┘
+         │ Output PDF
+         ▼
+┌─────────────────┐
+│  Storage/CDN    │
+│  (Download)     │
+└─────────────────┘
+```
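The flow in the diagram above (upload, enqueue, worker, stored result) can be sketched in a few lines. This is only an illustration under stated assumptions: an in-process `queue.Queue` stands in for Redis/Bull, a dict stands in for storage, pages are plain strings, and `fake_ocr` is a placeholder for the real PaddleOCR call. All names (`Job`, `enqueue`, `work_once`) are hypothetical.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    """One OCR request. Field names are illustrative, not from the project."""
    pages: list          # stand-in for the page images of the uploaded PDF
    mode: str            # "text" or "searchable_pdf"
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def fake_ocr(page: str) -> str:
    """Placeholder for the OCR step; it only upper-cases its input."""
    return page.upper()


def enqueue(jobs: queue.Queue, pages: list, mode: str) -> str:
    """API side: validate the request, enqueue a job, return its id."""
    if mode not in ("text", "searchable_pdf"):
        raise ValueError(f"unknown mode: {mode}")
    job = Job(pages=pages, mode=mode)
    jobs.put(job)
    return job.job_id


def work_once(jobs: queue.Queue, storage: dict) -> None:
    """Worker side: take one job, OCR each page, store the result."""
    job = jobs.get()
    texts = [fake_ocr(page) for page in job.pages]
    storage[job.job_id] = {"mode": job.mode, "pages": texts}
    jobs.task_done()
```

The client would keep the returned job id and poll storage (or wait for a webhook) until the result appears. Separating `enqueue` from `work_once` mirrors the API/worker split in the diagram.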
+
+---
+
+## MVP Features
+
+### Core
+- [x] Upload PDF (max size?)
+- [x] Automatic language detection
+- [x] OCR via PaddleOCR
+- [x] Plain-text export (.txt)
+- [x] Searchable PDF export
+
+### Nice-to-Have
+- [ ] Batch processing (multiple PDFs)
+- [ ] Image support (JPG, PNG) in addition to PDFs
+- [ ] Manual OCR language selection
+- [ ] Preview before download
+- [ ] API key for programmatic use
+- [ ] Webhook notification when a job finishes
+
+---
+
+## Differentiation vs. Competition
+
+**Competitors**:
+- Adobe Acrobat (paid, heavyweight)
+- Online OCR services (usage limits, confidentiality?)
+- Google Drive OCR (format limitations)
+
+**Our advantage**:
+- **Free** (or freemium)
+- **Open source** (if you want)
+- **Privacy-focused**: Upload → Process → Delete (no permanent storage)
+- **Optimized multi-language support**: excellent Chinese support (PaddleOCR)
+- **Two modes**: plain text OR searchable PDF
+- **Public API**: integration into workflows
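The "Upload → Process → Delete" promise in the list above can be enforced structurally rather than by convention. The sketch below is one possible approach, assuming the upload arrives as bytes and that processing is a callable taking a file path. `process_and_delete` is a hypothetical name. It relies on `tempfile.TemporaryDirectory`, which removes the directory and its contents when the `with` block exits, including on exceptions.

```python
import tempfile
from pathlib import Path


def process_and_delete(pdf_bytes: bytes, process):
    """Write the upload to a temporary directory, process it, then delete it.

    `process` is any callable that takes the path of the saved PDF and
    returns a result that does not depend on the file still existing.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "upload.pdf"
        path.write_bytes(pdf_bytes)
        # The directory, and the upload in it, is removed when this block
        # exits, whether `process` returns normally or raises.
        return process(path)
```

Any output file would need to be read into memory or moved to its own short-lived location before the block exits.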
+
+---
+
+## Potential Monetization
+
+**Freemium Model**:
+- **Free tier**: 10 PDFs/month, max 5MB, optional watermark
+- **Pro tier**: 100 PDFs/month, max 50MB, no watermark, API access
+- **Enterprise**: Unlimited, self-hosted option, support
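The tier limits above translate naturally into a small lookup table plus a check at upload time. The sketch below is an assumption-laden illustration: it treats 1MB as 1024 * 1024 bytes and reads "Unlimited" for Enterprise as no cap on either count or size. The names `Tier`, `TIERS` and `can_upload` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    monthly_pdfs: Optional[int]  # None means no monthly cap (assumption)
    max_mb: Optional[int]        # None means no size cap (assumption)


# Limits taken from the freemium list above.
TIERS = {
    "free": Tier(monthly_pdfs=10, max_mb=5),
    "pro": Tier(monthly_pdfs=100, max_mb=50),
    "enterprise": Tier(monthly_pdfs=None, max_mb=None),
}


def can_upload(tier_name: str, used_this_month: int, size_bytes: int) -> bool:
    """Return True if this upload fits within the tier's quota and size cap."""
    tier = TIERS[tier_name]
    if tier.monthly_pdfs is not None and used_this_month >= tier.monthly_pdfs:
        return False
    if tier.max_mb is not None and size_bytes > tier.max_mb * 1024 * 1024:
        return False
    return True
```

Keeping the limits in data rather than in branching code means the numbers can change without touching the check.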
+
+**Alternative**:
+- Entirely free + donations
+- Or entirely free as a portfolio piece
+
+---
+
+## Estimated Timeline
+
+**Phase 1 - MVP (1-2 weeks)**:
+- Set up the Python backend (PaddleOCR already validated)
+- Node.js upload/download API
+- Plain-text extraction mode
+- Minimal web interface
+
+**Phase 2 - Searchable PDF (1 week)**:
+- Embed text in the original PDF
+- Quality tests (text/image alignment)
+
+**Phase 3 - Polish (1 week)**:
+- Improved UI/UX
+- Robust error handling
+- Rate limiting
+- API documentation
+
+**Total**: 3-4 weeks for a production-ready version
+
+---
+
+## Risks & Challenges
+
+**Technical**:
+- Aligning OCR text with its position in the PDF (complex)
+- Performance on large PDFs (100+ pages)
+- Memory management (PaddleOCR can be memory-hungry)
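One common way to limit the memory and performance risks listed above is to process a large PDF in fixed-size page batches instead of rendering every page at once. The sketch below only computes the batches. The default batch size of 10 is an arbitrary assumption, and `page_batches` is a hypothetical helper.

```python
from typing import Iterator


def page_batches(page_count: int, batch_size: int = 10) -> Iterator[range]:
    """Yield consecutive ranges of zero-based page indices.

    Only one batch of page images needs to be held in memory at a time;
    the last batch may be shorter than `batch_size`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))
```

A worker could render and OCR the pages in each batch, write out the results, and release the images before moving to the next one.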
+
+**Business**:
+- Server cost (OCR is CPU-intensive)
+- Scaling if it succeeds
+- Legal: respecting the copyright of uploaded PDFs
+
+**Product**:
+- Lots of competition
+- Needs a clear USP (why use ours?)
+
+---
+
+## Relationship with ClassGen
+
+**Synergy**:
+- OCR pipeline already validated (99.97% accuracy)
+- Reusable code (PaddleOCR setup, AI correction)
+- Same backend stack
+
+**Difference**:
+- ClassGen: OCR → structured JSON → gamification (personal use)
+- OCR Service: OCR → PDF/text → download (general use)
+
+---
+
+## Decision to Make
+
+**Questions**:
+1. **Priority**: before or after ClassGen is stable?
+2. **Scope**: simple MVP or full service?
+3. **Monetization**: free, freemium, or portfolio piece?
+4. **Hosting**: VPS, serverless, or local first?
+
+**Recommendation**:
+- Wait until ClassGen has shipped and been used for 1-2 weeks
+- Validate the OCR pipeline in real-world use
+- Then decide whether this service makes commercial sense
+
+---
+
+## Notes
+
+**Observed pattern**: Yet another brilliantly designed project. Be careful not to fall into the "designed but never shipped" trap.
+
+**Solution**:
+- Strict time-boxing (4h sessions max)
+- Ultra-minimal MVP first
+- Ship even if it's "not perfect"
+- Improve based on real feedback
+
+**Socratic question**: Why this project now? What concrete problem does it solve for you or for others? Or is it just "that would be cool"?