secondvoice/docs/implementation_plan.md
StillHammer 5b60acaa73 feat: Implement complete MVP architecture for SecondVoice
Complete implementation of the real-time Chinese-to-French translation system:

Architecture:
- 3-threaded pipeline: Audio capture → AI processing → UI rendering
- Thread-safe queues for inter-thread communication
- Configurable audio chunk sizes for latency tuning

Core Features:
- Audio capture with PortAudio (configurable sample rate/channels)
- Whisper API integration for Chinese speech-to-text
- Claude API integration for Chinese-to-French translation
- ImGui real-time display with stop button
- Full recording saved to WAV on stop

Modules Implemented:
- audio/: AudioCapture (PortAudio wrapper) + AudioBuffer (WAV export)
- api/: WhisperClient + ClaudeClient (HTTP API wrappers)
- ui/: TranslationUI (ImGui interface)
- core/: Pipeline (orchestrates all threads)
- utils/: Config (JSON/.env loader) + ThreadSafeQueue (template)

Build System:
- CMake with vcpkg for dependency management
- vcpkg.json manifest for reproducible builds
- build.sh helper script

Configuration:
- config.json: Audio settings, API parameters, UI config
- .env: API keys (OpenAI + Anthropic)

Documentation:
- README.md: Setup instructions, usage, architecture
- docs/implementation_plan.md: Technical design document
- docs/SecondVoice.md: Project vision and motivation

Next Steps:
- Test build with vcpkg dependencies
- Test audio capture on real hardware
- Validate API integrations
- Tune chunk size for optimal latency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 03:08:03 +08:00


# SecondVoice - MVP Implementation Plan
**Date**: 20 November 2025
**Target**: Minimal functional MVP
**Platform**: Linux
**Package Manager**: vcpkg
---
## 🎯 Minimal MVP Goal
A desktop application that:
1. Continuously captures microphone audio
2. Transcribes Chinese speech to text (Whisper API)
3. Translates the text into French (Claude API)
4. Displays the translation in real time (ImGui)
5. Provides a Stop button to end the session (no summary in the MVP)
---
## 🏗️ Technical Architecture
### Pipeline
```
Audio Capture (PortAudio)
↓ (configurable audio chunks)
Whisper API (STT)
↓ (Chinese text)
Claude API (translation)
↓ (French text)
ImGui UI (real-time display + Stop button)
```
### Threading Model
```
Thread 1 - Audio Capture:
- PortAudio callback captures audio
- Accumulates chunks (configurable size)
- Pushes chunks into a thread-safe queue
- Saves a WAV backup in the background
Thread 2 - AI Processing:
- Pops a chunk from the audio queue
- POST Whisper API → Chinese transcription
- POST Claude API → French translation
- Pushes the result into the UI queue
Thread 3 - Main UI (ImGui):
- Renders the ImGui window
- Displays translations from the queue
- Handles the Stop button
- Updates status/duration
```
---
## 📁 Project Structure
```
secondvoice/
├── .env # API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY)
├── .gitignore
├── CMakeLists.txt # Build configuration
├── vcpkg.json # Dependencies manifest
├── config.json # Runtime config (audio chunk size, etc)
├── README.md
├── docs/
│ ├── SecondVoice.md # Vision document
│ └── implementation_plan.md # This document
├── src/
│ ├── main.cpp # Entry point + ImGui main loop
│ ├── audio/
│ │ ├── AudioCapture.h
│ │ ├── AudioCapture.cpp # PortAudio wrapper
│ │ ├── AudioBuffer.h
│ │ └── AudioBuffer.cpp # Thread-safe ring buffer
│ ├── api/
│ │ ├── WhisperClient.h
│ │ ├── WhisperClient.cpp # Whisper API client
│ │ ├── ClaudeClient.h
│ │ └── ClaudeClient.cpp # Claude API client
│ ├── ui/
│ │ ├── TranslationUI.h
│ │ └── TranslationUI.cpp # ImGui interface
│ ├── utils/
│ │ ├── Config.h
│ │ ├── Config.cpp # Load .env + config.json
│ │ ├── ThreadSafeQueue.h # Thread-safe queue template
│ │ └── Logger.h # Simple logging
│ └── core/
│ ├── Pipeline.h
│ └── Pipeline.cpp # Orchestrate threads
├── recordings/ # Output audio files
│ └── .gitkeep
└── build/ # CMake build output (ignored)
```
---
## 🔧 Dependencies
### vcpkg.json
```json
{
  "name": "secondvoice",
  "version": "0.1.0",
  "dependencies": [
    "portaudio",
    "cpp-httplib",
    "nlohmann-json",
    "imgui[glfw-binding,opengl3-binding]",
    "glfw3",
    "opengl"
  ]
}
```
### System Requirements (Linux)
```bash
# PortAudio dependencies
sudo apt install libasound2-dev
# OpenGL dependencies
sudo apt install libgl1-mesa-dev libglu1-mesa-dev
```
---
## ⚙️ Configuration
### .env (project root)
```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```
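As an illustration of how the `.env` file above could be read, here is a minimal sketch using only the standard library. The names `parseEnv` and `trim` are hypothetical helpers, not part of the planned `Config` class, and the format is assumed to be plain `KEY=VALUE` lines with optional `#` comments (no quoting or escaping).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: strips leading/trailing spaces, tabs and carriage returns.
inline std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: parses KEY=VALUE lines from a stream.
// Blank lines, '#' comments and lines without a key are skipped.
inline std::unordered_map<std::string, std::string> parseEnv(std::istream& in) {
    std::unordered_map<std::string, std::string> vars;
    std::string line;
    while (std::getline(in, line)) {
        const std::string t = trim(line);
        if (t.empty() || t[0] == '#') continue;
        const auto eq = t.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        vars[trim(t.substr(0, eq))] = trim(t.substr(eq + 1));
    }
    return vars;
}
```

Passing a `std::ifstream` opened on `.env` to `parseEnv` would yield a map from which `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` can be looked up; a missing key is then easy to report at startup.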
### config.json (project root)
```json
{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_duration_seconds": 10,
    "format": "wav"
  },
  "whisper": {
    "model": "whisper-1",
    "language": "zh",
    "temperature": 0.0
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
    "max_tokens": 1024,
    "temperature": 0.3,
    "system_prompt": "Tu es un traducteur professionnel chinois-français. Traduis le texte suivant de manière naturelle et contextuelle."
  },
  "ui": {
    "window_width": 800,
    "window_height": 600,
    "font_size": 16,
    "max_display_lines": 50
  },
  "recording": {
    "save_audio": true,
    "output_directory": "./recordings"
  }
}
```
---
## 🔌 API Clients
### Whisper API
```text
// POST https://api.openai.com/v1/audio/transcriptions
// Content-Type: multipart/form-data
// Authorization: Bearer {OPENAI_API_KEY}
Request:
- file: audio.wav (binary)
- model: whisper-1
- language: zh
- temperature: 0.0
Response:
{
  "text": "你好,今天我们讨论项目进度..."
}
```
### Claude API
```text
// POST https://api.anthropic.com/v1/messages
// Content-Type: application/json
// x-api-key: {ANTHROPIC_API_KEY}
// anthropic-version: 2023-06-01
Request:
{
"model": "claude-haiku-4-20250514",
"max_tokens": 1024,
"messages": [{
"role": "user",
"content": "Traduis en français: 你好,今天我们讨论项目进度..."
}]
}
Response:
{
"content": [{
"type": "text",
"text": "Bonjour, aujourd'hui nous discutons de l'avancement du projet..."
}],
"model": "claude-haiku-4-20250514",
"usage": {...}
}
```
---
## 🎨 ImGui Interface
### Minimal Layout
```
┌────────────────────────────────────────────┐
│ SecondVoice - Live Translation │
├────────────────────────────────────────────┤
│ │
│ [●] Recording... Duration: 00:05:23 │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ 中文: 你好,今天我们讨论项目进度... │ │
│ │ FR: Bonjour, aujourd'hui nous │ │
│ │ discutons de l'avancement... │ │
│ │ │ │
│ │ 中文: 关于预算的问题... │ │
│ │ FR: Concernant la question du budget.. │ │
│ │ │ │
│ │ [Auto-scroll enabled] │ │
│ │ │ │
│ └────────────────────────────────────────┘ │
│ │
│ [ STOP RECORDING ] │
│ │
│ Status: Processing chunk 12/12 │
│ Audio: 16kHz mono, chunk size: 10s │
└────────────────────────────────────────────┘
```
### UI Features
- **Scrollable text area**: Auto-scroll, can be disabled for review
- **Color coding**: Chinese (color 1), French (color 2)
- **Status bar**: Duration, chunk count, processing status
- **Stop button**: Stops capture and processing, saves the audio
- **Window resizable**: Adaptive layout
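The status bar's duration field (shown as `00:05:23` in the mockup) can be produced with a small formatting helper. The sketch below is illustrative only; `formatDuration` is a hypothetical name, and it assumes the elapsed time is tracked in whole seconds.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats elapsed seconds as HH:MM:SS for the status bar.
// Negative inputs are clamped to zero.
inline std::string formatDuration(long totalSeconds) {
    if (totalSeconds < 0) totalSeconds = 0;
    const long h = totalSeconds / 3600;
    const long m = (totalSeconds % 3600) / 60;
    const long s = totalSeconds % 60;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02ld:%02ld:%02ld", h, m, s);
    return buf;
}
```

The resulting string can be handed to the UI text call each frame; computing it from a monotonic clock rather than counting frames keeps the display accurate if rendering stalls.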
---
## 🚀 Implementation Order
### Phase 1 - Infrastructure Setup (Day 1)
**Todo**:
1. ✅ Create the project structure
2. ✅ Set up CMakeLists.txt with vcpkg
3. ✅ Create .gitignore (.env, build/, recordings/)
4. ✅ Create the config.json template
5. ✅ Set up .env (API keys)
6. ✅ Test a minimal build (hello world)
**Validation**: `cmake -B build && cmake --build build` compiles without errors
---
### Phase 2 - Audio Capture (Day 1-2)
**Todo**:
1. Implement `AudioCapture.h/cpp`:
   - Initialize PortAudio
   - Audio capture callback
   - Chunk accumulation (configurable duration)
   - Push into ThreadSafeQueue
2. Implement `AudioBuffer.h/cpp`:
   - Ring buffer for raw audio
   - Thread-safe operations
3. Standalone test: Capture 30s of audio → save as WAV
**Validation**: WAV file is playable, duration is correct, quality is OK
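To make the ring buffer idea from this phase concrete, here is one possible sketch. `SampleRingBuffer` is a hypothetical name, the samples are assumed to be 16-bit PCM, and the overwrite-oldest policy is an assumption matching the "drop old chunks if needed" mitigation rather than a confirmed design decision.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical fixed-capacity ring buffer for 16-bit samples.
// When full, new samples overwrite the oldest ones.
class SampleRingBuffer {
public:
    explicit SampleRingBuffer(std::size_t capacity) : data_(capacity) {}

    // Appends `count` samples, dropping the oldest data on overflow.
    void write(const std::int16_t* samples, std::size_t count) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (data_.empty()) return;
        for (std::size_t i = 0; i < count; ++i) {
            data_[head_] = samples[i];
            head_ = (head_ + 1) % data_.size();
            if (size_ < data_.size()) {
                ++size_;
            } else {
                tail_ = (tail_ + 1) % data_.size();  // oldest sample overwritten
            }
        }
    }

    // Removes and returns up to `maxCount` samples, oldest first.
    std::vector<std::int16_t> read(std::size_t maxCount) {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::size_t n = std::min(maxCount, size_);
        std::vector<std::int16_t> out;
        out.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            out.push_back(data_[tail_]);
            tail_ = (tail_ + 1) % data_.size();
        }
        size_ -= n;
        return out;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return size_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::int16_t> data_;
    std::size_t head_ = 0, tail_ = 0, size_ = 0;
};
```

A mutex keeps the sketch short, but taking a lock inside a real-time audio callback can cause glitches, so a lock-free single-producer/single-consumer variant may be preferable for the capture path.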
---
### Phase 3 - Whisper Client (Day 2)
**Todo**:
1. Implement `WhisperClient.h/cpp`:
   - Load the API key from .env
   - POST multipart/form-data (cpp-httplib)
   - Encode the audio as WAV in memory
   - Parse the JSON response
   - Error handling (retry, timeout)
2. Standalone test: Audio file → Whisper → Chinese text
**Validation**: Correct Chinese transcription on a sample audio file
---
### Phase 4 - Claude Client (Day 2-3)
**Todo**:
1. Implement `ClaudeClient.h/cpp`:
   - Load the API key from .env
   - POST JSON request (cpp-httplib)
   - Configurable system prompt
   - Parse the response (extract text)
   - Error handling
2. Standalone test: Chinese text → Claude → French text
**Validation**: Natural and correct French translation
---
### Phase 5 - ImGui UI (Day 3)
**Todo**:
1. Set up ImGui + GLFW + OpenGL:
   - Window creation
   - Render loop
   - Input handling
2. Implement `TranslationUI.h/cpp`:
   - Scrollable text area
   - Display messages (CN + FR)
   - Stop button
   - Status bar (duration, chunk count)
3. Standalone test: Display mock data
**Validation**: UI is responsive, text displays correctly, button works
---
### Phase 6 - Pipeline Integration (Day 4)
**Todo**:
1. Implement `Pipeline.h/cpp`:
   - Thread 1: AudioCapture loop
   - Thread 2: Processing loop (Whisper → Claude)
   - Thread 3: UI loop (ImGui)
   - ThreadSafeQueue between threads
   - Synchronization (start/stop)
2. Implement `Config.h/cpp`:
   - Load .env (API keys)
   - Load config.json (settings)
3. Implement `main.cpp`:
   - Init all components
   - Start pipeline
   - Handle graceful shutdown
**Validation**: The complete pipeline works end to end
---
### Phase 7 - Testing & Tuning (Day 5)
**Todo**:
1. Test with real Chinese audio:
   - Sample conversations
   - Different audio qualities
   - Different chunk sizes (5s, 10s, 30s)
2. Measure latency:
   - Audio → Whisper: X seconds
   - Whisper → Claude: Y seconds
   - Total: Z seconds
3. Debug and fix bugs:
   - Memory leaks
   - Thread safety issues
   - API error handling
4. Optimize:
   - Optimal chunk size (latency vs. accuracy tradeoff)
   - API timeout values
   - UI refresh rate
**Validation**:
- Total latency under 10s
- No crash during a 30-minute recording
- Transcription and translation are understandable
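The latency measurements in this phase could be collected with a small timing wrapper. The sketch below is one way to do it; `measureSeconds` and `ChunkLatency` are hypothetical names, and the 10-second default simply mirrors the validation target above.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `fn` and returns the elapsed time in seconds,
// measured with a monotonic clock.
template <typename Fn>
double measureSeconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical record of per-chunk processing latency.
struct ChunkLatency {
    double whisperSeconds = 0.0;  // time spent in the Whisper call
    double claudeSeconds = 0.0;   // time spent in the Claude call

    double total() const { return whisperSeconds + claudeSeconds; }

    // True when processing time stays under the given limit (default 10s).
    bool withinTarget(double limitSeconds = 10.0) const {
        return total() < limitSeconds;
    }
};
```

Wrapping each API call in `measureSeconds` fills in the X and Y values per chunk. Note that this captures processing time only; the delay a listener perceives also includes the time spent accumulating the chunk before it is sent.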
---
## 🧪 Test Plan
### Unit Tests (Phase 2+)
- `AudioCapture`: Captures audio in the correct format
- `WhisperClient`: Mocked API call, JSON parsing
- `ClaudeClient`: Mocked API call, JSON parsing
- `ThreadSafeQueue`: Thread safety, no data loss
### Integration Tests
- Audio → Whisper: Audio file → correct Chinese text
- Whisper → Claude: Chinese text → correct French translation
- Pipeline: Audio → complete UI display
### End-to-End Test
- Record a 5-minute real Chinese conversation
- Verify transcription accuracy (>85%)
- Verify the translation is understandable
- Verify the UI stays responsive
- Verify the audio is saved correctly
---
## 📊 Metrics to Track
### Performance
- **Whisper latency**: API call time (target: <3s for 10s of audio)
- **Claude latency**: API call time (target: <2s for 200 tokens)
- **Total latency**: Audio → display (target: <10s)
- **Memory usage**: Stable over long sessions (no leaks)
- **CPU usage**: Acceptable (<50% on a laptop)
### Quality
- **Whisper accuracy**: % of correct words (target: >85%)
- **Claude quality**: Natural translation (subjective)
- **Crash rate**: 0 crashes during a 1h recording
### Cost
- **Whisper**: $0.006/min of audio (about $0.36/h)
- **Claude**: ~$0.03-0.05/h (depends on text volume)
- **Total**: ~$0.40/h of meeting
---
## ⚠️ Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| **Whisper API timeout** | Blocking | Retry logic, 30s timeout, fallback queue |
| **Claude API rate limit** | Medium | Exponential backoff, queue requests |
| **Audio buffer overflow** | Medium | Adequately sized ring buffer, drop old chunks if needed |
| **Thread deadlock** | Blocking | Use std::lock_guard, avoid nested locks |
| **Memory leak** | Medium | Use smart pointers, valgrind tests |
| **Network interruption** | Medium | Retry logic, cache audio locally |
---
## 🎯 MVP Success Criteria
**The MVP is validated if**:
1. Microphone audio capture works
2. Chinese transcription is >85% accurate
3. French translation is understandable
4. The UI displays translations in real time
5. The Stop button shuts everything down cleanly
6. Audio is saved correctly
7. No crash during a 30-minute recording
8. Total latency under 10s
---
## 📝 Implementation Notes
### Thread Safety
- Use `std::mutex` + `std::lock_guard` for the queues
- No shared state without protection
- Use `std::atomic<bool>` for flags (running, stopping)
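A minimal sketch of what `ThreadSafeQueue` might look like under these notes is given below. The `std::condition_variable`, the `waitAndPop` and `close` methods, and the use of `std::optional` (C++17) are assumptions added for illustration; the plan itself only specifies a mutex-protected template queue.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// Illustrative sketch of a mutex-protected queue with blocking pop.
template <typename T>
class ThreadSafeQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        cv_.notify_one();  // notify outside the lock to avoid a wasted wake-up
    }

    // Blocks until an item is available or the queue has been closed.
    // Returns std::nullopt once the queue is closed and fully drained.
    std::optional<T> waitAndPop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

    // Marks the queue as closed and wakes every waiting consumer.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> items_;
    bool closed_ = false;  // guarded by mutex_
};
```

With this shape, the processing thread can loop on `waitAndPop()` and exit when it returns an empty optional, so the Stop button only needs to call `close()` on each queue and set the `std::atomic<bool>` running flag to false.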
### Error Handling
- Try/catch around API calls
- Log errors (spdlog or plain cout)
- Retry logic (max 3 attempts)
- Graceful degradation (skip the chunk if the error persists)
### Audio Format
- **Sample rate**: 16kHz (optimal for Whisper)
- **Channels**: Mono (sufficient, reduces bandwidth)
- **Format**: 16-bit PCM WAV
- **Chunk size**: Configurable (default 10s)
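For the in-memory WAV encoding this format implies, a sketch of a canonical 44-byte PCM header writer follows. `encodeWav`, `appendLE`, and `appendTag` are hypothetical names; the defaults (16 kHz, mono, 16-bit) come from the list above. At those settings a 10-second chunk holds 16000 × 10 = 160,000 samples, or 320,000 bytes of audio data.

```cpp
#include <cstdint>
#include <vector>

namespace {
// Appends the low `bytes` bytes of `value` in little-endian order.
void appendLE(std::vector<std::uint8_t>& out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Appends a four-character chunk tag such as "RIFF".
void appendTag(std::vector<std::uint8_t>& out, const char (&tag)[5]) {
    out.insert(out.end(), tag, tag + 4);
}
}  // namespace

// Hypothetical helper: wraps 16-bit PCM samples in a RIFF/WAVE container.
std::vector<std::uint8_t> encodeWav(const std::vector<std::int16_t>& samples,
                                    std::uint32_t sampleRate = 16000,
                                    std::uint16_t channels = 1) {
    const std::uint16_t bitsPerSample = 16;
    const std::uint16_t blockAlign =
        static_cast<std::uint16_t>(channels * bitsPerSample / 8);
    const std::uint32_t byteRate = sampleRate * blockAlign;
    const std::uint32_t dataSize = static_cast<std::uint32_t>(samples.size() * 2);

    std::vector<std::uint8_t> out;
    out.reserve(44 + dataSize);
    appendTag(out, "RIFF");
    appendLE(out, 36 + dataSize, 4);  // remaining file size after this field
    appendTag(out, "WAVE");
    appendTag(out, "fmt ");
    appendLE(out, 16, 4);             // size of the PCM fmt chunk
    appendLE(out, 1, 2);              // audio format 1 = uncompressed PCM
    appendLE(out, channels, 2);
    appendLE(out, sampleRate, 4);
    appendLE(out, byteRate, 4);
    appendLE(out, blockAlign, 2);
    appendLE(out, bitsPerSample, 2);
    appendTag(out, "data");
    appendLE(out, dataSize, 4);
    for (std::int16_t s : samples) {
        appendLE(out, static_cast<std::uint16_t>(s), 2);
    }
    return out;
}
```

The returned byte vector can serve both as the `file` part of the Whisper multipart request and as the content written to disk under `recordings/`.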
### API Best Practices
- **Timeout**: 30s for Whisper, 15s for Claude
- **Retry**: Exponential backoff (1s, 2s, 4s)
- **Rate limiting**: Respect API limits (monitor 429 errors)
- **Headers**: Always set User-Agent, API version
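The retry policy above can be sketched as a small generic helper. `retryWithBackoff` is a hypothetical name, and the callable is assumed to return `std::nullopt` (or throw) on failure. Note that with a cap of three attempts only the 1s and 2s waits are ever used; the 4s step would apply only if the cap were raised to four.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Hypothetical helper: calls `attempt` up to `maxAttempts` times, doubling the
// wait between failed attempts (baseDelay, 2x baseDelay, ...).
inline std::optional<std::string> retryWithBackoff(
    const std::function<std::optional<std::string>()>& attempt,
    int maxAttempts = 3,
    std::chrono::milliseconds baseDelay = std::chrono::milliseconds(1000)) {
    auto delay = baseDelay;
    for (int i = 0; i < maxAttempts; ++i) {
        try {
            if (auto result = attempt()) return result;
        } catch (const std::exception&) {
            // Treated as a failed attempt; a real client would log the error.
        }
        if (i + 1 < maxAttempts) {
            std::this_thread::sleep_for(delay);
            delay *= 2;
        }
    }
    return std::nullopt;  // caller may skip the chunk (graceful degradation)
}
```

A client would wrap its HTTP call in the lambda passed as `attempt`. Because the helper sleeps on the calling thread, it belongs on the processing thread, never on the UI thread.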
---
## 🔄 Post-MVP (Phase 2)
**Not included in MVP, but planned**:
- Automatic post-meeting summary (Claude summary)
- Structured export (transcripts + audio)
- Search system (backlog)
- Diarization (who is speaking)
- Replay mode
- More elaborate GUI (settings, etc.)
**MVP focus**: A working end-to-end pipeline, concept validation, real use in a first meeting.
---
*Document created: 20 November 2025*
*Status: Ready to implement*
*Estimated effort: 5 days of development + 2 days of testing*