Reorganize repository structure

- Move all Python scripts to tools/ directory
- Move documentation files to docs/ directory
- Create exams/ and homework/ directories for future use
- Remove temporary test file (page1_preview.png)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-27 23:28:39 +08:00
commit a61a32b57f
parent acbe1b4769
14 changed files with 229 additions and 0 deletions
--- a/docs/Projet-Outil-Apprentissage.md
+++ b/docs/Projet-Outil-Apprentissage.md
--- a/docs/README-OCR.md
+++ b/docs/README-OCR.md
--- a/docs/chinese_audio_tts_pipeline.md
+++ b/docs/chinese_audio_tts_pipeline.md
@@ -0,0 +1,229 @@
+# Chinese Audio to Text Extractor - Simple Transcription
+
+## Goal
+
+Extract the text from MP3 recordings of Chinese lessons using Whisper.
+
+### Problem solved
+- Need to recover the text content of audio lessons
+- Quick and simple MP3 → text conversion
+
+### Solution
+Minimal pipeline: MP3 → Whisper → plain text
+
+---
+
+## Pipeline Architecture
+
+```
+┌─────────────────────────────────────────┐
+│  INPUT: cours_chinois.mp3 (45min)       │
+└──────────────┬──────────────────────────┘
+               │
+               ▼
+┌─────────────────────────────────────────┐
+│  Transcription (Whisper)                │
+│  ├─ Model: whisper-1 (OpenAI API)       │
+│  ├─ Language: zh (Mandarin)             │
+│  └─ Output: transcript.txt              │
+└──────────────┬──────────────────────────┘
+               │
+               ▼
+┌─────────────────────────────────────────┐
+│  OUTPUT: cours_chinois.txt              │
+│  你好。我叫Alexis。今天我们学习...      │
+└─────────────────────────────────────────┘
+```
+
+---
+
+## Python Implementation Plan
+
+### Project structure
+
+```
+chinese-transcriber/
+├── transcribe.py           # Main script
+├── input/                  # Source MP3 files
+├── output/                 # Generated .txt files
+├── .env                    # API key
+└── requirements.txt
+```
+
+### Dependencies (requirements.txt)
+
+```txt
+openai>=1.0.0              # Whisper API
+python-dotenv>=1.0.0       # Env variables
+```
+
+### Main script (transcribe.py)
+
+```python
+"""
+Simple MP3 → TXT transcription with Whisper
+"""
+import openai
+from pathlib import Path
+from dotenv import load_dotenv
+import os
+
+def transcribe_audio(audio_path: Path, api_key: str) -> str:
+    """
+    Transcribe a Chinese-language MP3 file.
+
+    Args:
+        audio_path: Path to the MP3 file
+        api_key: OpenAI API key
+
+    Returns:
+        The transcribed text
+    """
+    client = openai.OpenAI(api_key=api_key)
+
+    with open(audio_path, "rb") as audio_file:
+        transcript = client.audio.transcriptions.create(
+            model="whisper-1",
+            file=audio_file,
+            language="zh",  # Force Mandarin
+            response_format="text"  # Plain text
+        )
+
+    return transcript
+
+def main():
+    # Load API key
+    load_dotenv()
+    api_key = os.getenv("OPENAI_API_KEY")
+
+    if not api_key:
+        print("Error: OPENAI_API_KEY not found in .env")
+        return
+
+    # Setup paths
+    input_dir = Path("input")
+    output_dir = Path("output")
+    output_dir.mkdir(exist_ok=True)
+
+    # Get MP3 files (sorted so the processing order is deterministic)
+    mp3_files = sorted(input_dir.glob("*.mp3"))
+
+    if not mp3_files:
+        print(f"No MP3 files found in {input_dir}/")
+        return
+
+    print(f"Found {len(mp3_files)} MP3 files to transcribe\n")
+
+    # Process each file
+    for mp3_file in mp3_files:
+        print(f"Processing: {mp3_file.name}...")
+
+        try:
+            # Transcribe
+            text = transcribe_audio(mp3_file, api_key)
+
+            # Save to TXT
+            output_path = output_dir / f"{mp3_file.stem}.txt"
+            with open(output_path, "w", encoding="utf-8") as f:
+                f.write(text)
+
+            print(f"✓ Saved to: {output_path}\n")
+
+        except Exception as e:
+            print(f"✗ Error: {e}\n")
+
+    print("=== Transcription completed ===")
+
+if __name__ == "__main__":
+    main()
+```
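
The transcription endpoint documents a 25 MB limit on uploaded audio, so a 45-minute MP3 like the one in the diagram may be rejected depending on its bitrate. The following is a minimal pre-flight sketch, not part of the committed script. The helper names `exceeds_upload_limit` and `estimated_mp3_bytes` are hypothetical, and counting the limit in binary megabytes is an assumption.

```python
from pathlib import Path

# Assumption: the documented 25 MB upload limit is treated as 25 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def exceeds_upload_limit(audio_path: Path, limit_bytes: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file on disk is larger than the assumed upload limit."""
    return audio_path.stat().st_size > limit_bytes


def estimated_mp3_bytes(minutes: float, bitrate_kbps: int) -> int:
    """Rough size of a constant-bitrate MP3: duration in seconds x bitrate / 8."""
    return int(minutes * 60 * bitrate_kbps * 1000 / 8)
```

By this estimate a 45-minute lesson encoded at 128 kbps is about 43 MB, well above the limit, while the same lesson at 64 kbps is about 21.6 MB. Calling `exceeds_upload_limit(mp3_file)` before `transcribe_audio` would let the loop skip or flag oversized files instead of failing on the API call.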
+
+---
+
+### Environment Variables (.env)
+
+```bash
+OPENAI_API_KEY=sk-...
+```
+
+---
+
+## Cost Estimate
+
+### For 10 hours of lesson audio
+
+| Service | Cost | Calculation |
+|---------|------|-------------|
+| **Whisper API** | **$3.60** | 10 h × 60 min/h × $0.006/min |
+
+At $0.006 per minute, simple text extraction stays **very affordable**.
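
The figure in the table can be reproduced in a few lines. This is a sketch, assuming the $0.006 per minute rate quoted above still applies; `estimate_cost_usd` is a hypothetical helper and not part of the script.

```python
def estimate_cost_usd(hours: float, rate_per_minute: float = 0.006) -> float:
    """Estimate the transcription cost in USD for a given number of audio hours."""
    minutes = hours * 60
    return round(minutes * rate_per_minute, 2)


# 10 h x 60 min/h x $0.006/min
print(f"${estimate_cost_usd(10):.2f}")  # prints $3.60
```

Passing a different `rate_per_minute` makes it easy to re-run the estimate if the pricing changes.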
+
+---
+
+## Usage
+
+### Installation
+
+```bash
+mkdir chinese-transcriber
+cd chinese-transcriber
+
+# Create the directory structure
+mkdir input output
+
+# Install dependencies
+pip install openai python-dotenv
+
+# Create .env
+echo "OPENAI_API_KEY=sk-..." > .env
+
+# Copy the transcribe.py script into this directory
+```
+
+### Running
+
+```bash
+# 1. Put your MP3 files in input/
+cp /path/to/cours*.mp3 input/
+
+# 2. Run the script
+python transcribe.py
+
+# Output:
+# Found 3 MP3 files to transcribe
+#
+# Processing: cours_1.mp3...
+# ✓ Saved to: output/cours_1.txt
+#
+# Processing: cours_2.mp3...
+# ✓ Saved to: output/cours_2.txt
+# ...
+```
+
+### Output
+
+`.txt` files containing the raw Chinese text:
+
+```
+output/cours_1.txt:
+你好。我叫Alexis。今天我们学习汉语。
+第一课是关于问候的。你好吗？我很好，谢谢。
+...
+```
+
+---
+
+## Status
+
+✅ **SIMPLE PLAN - READY TO USE**
+
+Minimal script for MP3 → TXT text extraction.
+
+**Next steps, if needed**:
+1. Test on your Chinese MP3 files
+2. If automatic splitting is needed, see the full TTS pipeline options (commented out in earlier versions)
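
If automatic splitting does become necessary, the cut points can be planned before touching any audio. This is a minimal sketch assuming fixed-length chunks. `plan_chunks` and the 10-minute default are illustrative choices, and the actual cutting would still need an audio tool that this plan does not specify.

```python
def plan_chunks(total_seconds: int, chunk_seconds: int = 600) -> list[tuple[int, int]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if total_seconds < 0 or chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and chunk_seconds must be > 0")
    return [
        (start, min(start + chunk_seconds, total_seconds))
        for start in range(0, total_seconds, chunk_seconds)
    ]


# A 45-minute lesson in 10-minute chunks: four full chunks plus a 5-minute remainder.
for start, end in plan_chunks(45 * 60):
    print(f"{start:>5} s -> {end:>5} s")
```

Cutting on fixed offsets can split a sentence in half. Overlapping the chunks slightly or cutting at silences would likely reduce that, at the cost of extra post-processing when the transcripts are joined.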
+
+---
+
+*Created: October 27, 2025*
+*Stack: Python 3.10+, Whisper API only*
--- a/tools/debug_pdf_placement.py
+++ b/tools/debug_pdf_placement.py
--- a/tools/diagnose_alignment.py
+++ b/tools/diagnose_alignment.py
--- a/tools/download_audio_resources.py
+++ b/tools/download_audio_resources.py
--- a/tools/json_to_pdf_QUADRILATERAL.py
+++ b/tools/json_to_pdf_QUADRILATERAL.py
--- a/tools/json_to_pdf_SIMPLE.py
+++ b/tools/json_to_pdf_SIMPLE.py
--- a/tools/json_to_searchable_pdf.py
+++ b/tools/json_to_searchable_pdf.py
--- a/tools/ocr_pipeline.py
+++ b/tools/ocr_pipeline.py
--- a/tools/requirements-ocr.txt
+++ b/tools/requirements-ocr.txt
--- a/tools/test_center_scaling.py
+++ b/tools/test_center_scaling.py
--- a/tools/test_y_offset.py
+++ b/tools/test_y_offset.py
--- a/tools/test_y_scaling.py
+++ b/tools/test_y_scaling.py