Add OCR PDF Service project documentation

New project: Online OCR service for PDFs with dual output modes
- Mode 1: Extract text from scanned PDFs
- Mode 2: Generate searchable PDFs with embedded OCR text layer

Key features:
- Multi-language support (CN/EN/FR) via PaddleOCR
- Two output formats: plain text or searchable PDF
- Reuses validated OCR pipeline from ClassGen (99.97% accuracy)
- Proposed architecture: Node.js API + Python OCR worker + job queue

Suggested stack:
- Backend: PaddleOCR (already validated), Node.js + Express
- PDF processing: pdf-lib, PyPDF2
- Queue: Redis + Bull for async processing

Timeline: 3-4 weeks for production-ready MVP
Status: Conception phase - awaiting prioritization decision

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 11:37:06 +08:00
commit 8600fbe23f
parent bdbe17a3a0
1 changed files with 222 additions and 0 deletions
--- a/Projects/ocr_pdf_service.md
+++ b/Projects/ocr_pdf_service.md
@@ -0,0 +1,222 @@
 # OCR PDF Service - Online OCR Service
 **Created**: 19/11/2025
 **Status**: Conception phase
 **Stack**: To be decided (probably Node.js with a Python PaddleOCR backend)
 ---
 ## Concept
 Online OCR service for PDFs with two output modes:
 1. **Plain-text extraction** - PDF → extracted text
 2. **PDF with embedded text** - scanned PDF → searchable PDF (OCR text embedded in the PDF)
 ---
 ## Use Cases
 ### Mode 1: Text Extraction
 - Upload a scanned PDF
 - The service runs OCR on every page
 - Returns a structured text file
 ### Mode 2: Searchable PDF
 - Upload a scanned PDF (images only)
 - The service runs OCR and embeds invisible text in the PDF
 - Returns a PDF with a text layer (Ctrl+F works, text can be selected)
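
As a rough illustration of what the "structured text file" of Mode 1 could look like, the sketch below joins per-page OCR lines into one document with page markers. The `PageText` type, the `render_plain_text` name, and the `=== Page N ===` marker format are all assumptions made for this example; none of them come from PaddleOCR or from the project.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    """OCR output for one page (hypothetical shape, not a PaddleOCR type)."""
    page_number: int   # 1-based page index
    lines: list[str]   # recognized text lines, top to bottom


def render_plain_text(pages: list[PageText]) -> str:
    """Join per-page OCR lines into one text document with page markers."""
    chunks = []
    for page in pages:
        header = f"=== Page {page.page_number} ==="  # assumed marker format
        chunks.append("\n".join([header, *page.lines]))
    # A blank line between pages keeps the output easy to split again later.
    return "\n\n".join(chunks) + "\n"
```

Keeping the page boundaries explicit means the same intermediate structure could later feed Mode 2, where each page's text has to go back onto the matching PDF page.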
 ---
 ## Potential Stack
 **OCR Backend**:
 - PaddleOCR (already validated on ClassGen - 99.97% accuracy on Chinese)
 - Multi-language support (CN/EN/FR/etc.)
 - Python API
 **Web Service**:
 - Node.js + Express (REST API)
 - Upload handling (multipart/form-data)
 - Queue system for OCR jobs (Redis + Bull?)
 **PDF Processing**:
 - PDF.js or pdf-lib (PDF manipulation on the Node side)
 - PyPDF2 or reportlab (Python - embedding text in the PDF)
 **Frontend** (optional):
 - Simple upload form
 - Progress tracking
 - Download the result
 ---
 ## Proposed Architecture
 ```
 ┌─────────────┐
 │   Client    │
 │  (Browser)  │
 └──────┬──────┘
       │ Upload PDF
       ▼
 ┌─────────────────┐
 │   Node.js API   │
 │   (Express)     │
 └────────┬────────┘
         │ Enqueue job
         ▼
 ┌─────────────────┐
 │   Job Queue     │
 │   (Redis/Bull)  │
 └────────┬────────┘
         │ Process
         ▼
 ┌─────────────────┐
 │  Python Worker  │
 │  (PaddleOCR)    │
 └────────┬────────┘
         │ OCR Result
         ▼
 ┌─────────────────┐
 │  PDF Generator  │
 │  (PyPDF2/etc)   │
 └────────┬────────┘
         │ Output PDF
         ▼
 ┌─────────────────┐
 │  Storage/CDN    │
 │  (Download)     │
 └─────────────────┘
 ```
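
The job flow in the diagram can be prototyped without Redis. The sketch below is a minimal in-memory stand-in, assuming a single worker and using Python's standard `queue` module in place of Redis/Bull; `OcrJob`, `enqueue`, `run_worker_once`, and the two mode names are hypothetical and chosen only for this example.

```python
import queue
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class OcrJob:
    """One unit of work; all field names here are illustrative."""
    pdf_path: str
    mode: str  # "text" or "searchable_pdf" (assumed mode names)
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "queued"
    result_path: Optional[str] = None


def enqueue(jobs: queue.Queue, pdf_path: str, mode: str) -> OcrJob:
    """API side: validate the mode, create a job, and put it on the queue."""
    if mode not in ("text", "searchable_pdf"):
        raise ValueError(f"unknown mode: {mode}")
    job = OcrJob(pdf_path=pdf_path, mode=mode)
    jobs.put(job)
    return job


def run_worker_once(jobs: queue.Queue, ocr: Callable[[str, str], str]) -> OcrJob:
    """Worker side: take one job, run the OCR callable, record the outcome."""
    job = jobs.get()
    job.status = "processing"
    try:
        # `ocr` stands in for the real PaddleOCR + PDF generation step.
        job.result_path = ocr(job.pdf_path, job.mode)
        job.status = "done"
    except Exception:
        job.status = "failed"
    finally:
        jobs.task_done()
    return job
```

Passing the OCR step in as a callable keeps the queue logic testable with a fake function, so the flow can be checked before PaddleOCR or Redis is wired in.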
 ---
 ## MVP Features
 ### Core
 - [x] Upload PDF (max size?)
 - [x] Automatic language detection
 - [x] OCR via PaddleOCR
 - [x] Plain-text export (.txt)
 - [x] Searchable PDF export
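
For the upload step, a first-pass check might look like the sketch below. The maximum size is still an open question in this document, so the 5 MB default is only a placeholder borrowed from the free tier described later; `validate_upload` is a hypothetical helper name. The `%PDF-` check relies on the fact that PDF files begin with that marker, and it is deliberately strict.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # placeholder: the max size is not decided yet


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the upload is empty, too large, or not a PDF."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(f"file too large: {len(data)} bytes > {max_bytes}")
    # PDF files start with the "%PDF-" marker followed by a version number.
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF (missing %PDF- header)")
```

This only rejects obviously wrong uploads. A file that passes can still be corrupt or encrypted, so the worker would need its own error handling.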
 ### Nice-to-Have
 - [ ] Batch processing (multiple PDFs)
 - [ ] Image support (JPG, PNG) in addition to PDFs
 - [ ] Manual OCR language selection
 - [ ] Preview before download
 - [ ] API key for programmatic use
 - [ ] Webhook notification on job completion
 ---
 ## Differentiation vs Competition
 **Competitors**:
 - Adobe Acrobat (paid, heavyweight)
 - Online OCR services (usage limits, unclear confidentiality)
 - Google Drive OCR (format limitations)
 **Our advantage**:
 - **Free** (or freemium)
 - **Open source** (if desired)
 - **Privacy-focused**: Upload → Process → Delete (no permanent storage)
 - **Optimized multi-language support**: excellent Chinese support (PaddleOCR)
 - **Two modes**: plain text OR searchable PDF
 - **Public API**: integration into workflows
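
The "Upload → Process → Delete" promise can be enforced structurally rather than by convention. A minimal sketch, assuming the worker handles one upload at a time from bytes already in memory: the file only exists inside a temporary directory that is removed when processing ends, including when it fails. `ephemeral_upload` and `process` are hypothetical names.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def ephemeral_upload(data: bytes):
    """Write an upload to a private temp dir that is deleted on exit."""
    # TemporaryDirectory removes the directory and its contents when the
    # `with` block ends, whether processing succeeded or raised.
    with tempfile.TemporaryDirectory(prefix="ocr-job-") as workdir:
        path = os.path.join(workdir, "input.pdf")
        with open(path, "wb") as fh:
            fh.write(data)
        yield path


def process(data: bytes, ocr):
    """Run the OCR callable on the upload; nothing is left on disk afterwards."""
    with ephemeral_upload(data) as path:
        return ocr(path)
```

Any output the user needs to download would have to be returned or copied out before the block exits, and would need its own expiry policy.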
 ---
 ## Potential Monetization
 **Freemium Model**:
 - **Free tier**: 10 PDFs/month, max 5MB, optional watermark
 - **Pro tier**: 100 PDFs/month, max 50MB, no watermark, API access
 - **Enterprise**: Unlimited, self-hosted option, support
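
The tier limits above translate directly into a small lookup table. In the sketch below, the monthly quotas and size caps for Free and Pro come from the list; treating "Unlimited" as no cap on both count and size, giving Enterprise API access, and reading "MB" as 1024 × 1024 bytes are assumptions. `Tier`, `TIERS`, and `can_submit` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class Tier:
    pdfs_per_month: Optional[int]  # None means unlimited
    max_bytes: Optional[int]       # None means unlimited
    api_access: bool


TIERS = {
    "free": Tier(pdfs_per_month=10, max_bytes=5 * MB, api_access=False),
    "pro": Tier(pdfs_per_month=100, max_bytes=50 * MB, api_access=True),
    # Assumption: "Unlimited" covers both count and size, with API access.
    "enterprise": Tier(pdfs_per_month=None, max_bytes=None, api_access=True),
}


def can_submit(tier_name: str, used_this_month: int, file_bytes: int) -> bool:
    """Return True if one more upload of this size fits within the tier."""
    tier = TIERS[tier_name]
    if tier.pdfs_per_month is not None and used_this_month >= tier.pdfs_per_month:
        return False
    if tier.max_bytes is not None and file_bytes > tier.max_bytes:
        return False
    return True
```

For example, `can_submit("free", 10, 1024)` is rejected because the tenth PDF of the month has already been used, while `can_submit("pro", 99, 50 * MB)` is accepted.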
 **Alternative**:
 - Completely free + donations
 - Or completely free as a portfolio piece
 ---
 ## Estimated Timeline
 **Phase 1 - MVP (1-2 weeks)**:
 - Set up the Python backend (PaddleOCR already validated)
 - Node.js upload/download API
 - Plain-text extraction mode
 - Minimal web interface
 **Phase 2 - Searchable PDF (1 week)**:
 - Embed text in the original PDF
 - Quality tests (text/image alignment)
 **Phase 3 - Polish (1 week)**:
 - Improved UI/UX
 - Robust error handling
 - Rate limiting
 - API documentation
 **Total**: 3-4 weeks for a production-ready version
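
The rate-limiting item in Phase 3 could start as a per-client token bucket. This is one common approach, not something the plan specifies; the `TokenBucket` class, its parameters, and the injectable clock are assumptions for illustration, and a real deployment with several API processes would need shared state (for instance in Redis).

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock              # injectable so tests can fake time
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

One bucket per API key or client IP would be kept in a dictionary, and a request is rejected (typically with HTTP 429) when `allow()` returns `False`.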
 ---
 ## Risks & Challenges
 **Technical**:
 - Aligning OCR text with its position in the PDF (complex)
 - Performance on large PDFs (100+ pages)
 - Memory management (PaddleOCR can be memory-hungry)
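
The core of the alignment problem is a coordinate change. OCR boxes are measured in image pixels with the origin at the top-left, while PDF user space by default uses points (1/72 inch) with the origin at the bottom-left. The sketch below covers only the simplest case and assumes the page was rasterized at a known DPI, is not rotated, has its page origin at (0, 0), and that the OCR box has already been reduced to an axis-aligned rectangle. `PdfRect` and `pixel_box_to_pdf_rect` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfRect:
    x: float       # left edge, in points
    y: float       # bottom edge, in points (PDF origin is bottom-left)
    width: float
    height: float


def pixel_box_to_pdf_rect(left: float, top: float, right: float, bottom: float,
                          dpi: float, page_height_pt: float) -> PdfRect:
    """Convert a top-left-origin pixel box to a bottom-left-origin PDF rect."""
    scale = 72.0 / dpi  # points per pixel
    return PdfRect(
        x=left * scale,
        # Flip the y axis: the box's bottom edge in pixels becomes the
        # rectangle's lower edge measured up from the page bottom.
        y=page_height_pt - bottom * scale,
        width=(right - left) * scale,
        height=(bottom - top) * scale,
    )
```

For a page rendered at 144 DPI with a height of 792 pt, the pixel box (100, 200, 300, 240) maps to x = 50, y = 672, width = 100, height = 20. Rotated pages, cropped pages, and skewed text lines each need extra handling beyond this sketch.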
 **Business**:
 - Server cost (OCR is CPU-intensive)
 - Scaling if successful
 - Legal: respecting the copyright of uploaded PDFs
 **Product**:
 - Lots of competition
 - Needs a clear USP (why use ours?)
 ---
 ## Relationship to ClassGen
 **Synergy**:
 - OCR pipeline already validated (99.97% accuracy)
 - Reusable code (PaddleOCR setup, AI correction)
 - Same backend stack
 **Difference**:
 - ClassGen: OCR → structured JSON → gamification (personal use)
 - OCR Service: OCR → PDF/text → download (general use)
 ---
 ## Decision to Make
 **Questions**:
 1. **Priority**: before or after ClassGen is stable?
 2. **Scope**: simple MVP or full service?
 3. **Monetization**: free, freemium, or portfolio piece?
 4. **Hosting**: VPS, serverless, or local first?
 **Recommendation**:
 - Wait until ClassGen has shipped and been used for 1-2 weeks
 - Validate the OCR pipeline in real-world use
 - Then decide whether this service makes commercial sense
 ---
 ## Notes
 **Observed pattern**: Yet another brilliantly designed project. Beware of falling into the "designed but never shipped" trap.
 **Solution**:
 - Strict time-boxing (4h sessions max)
 - Ultra-minimal MVP first
 - Ship even if it is "not perfect"
 - Improve based on real feedback
 **Socratic question**: Why this project now? What concrete problem does it solve for you or for others? Or is it just "it would be cool"?