secondvoice/docs/implementation_plan.md
StillHammer 5b60acaa73 feat: Implement complete MVP architecture for SecondVoice
2025-11-20 03:08:03 +08:00


SecondVoice - MVP Implementation Plan

Date: 20 November 2025
Target: minimal functional MVP
Platform: Linux
Package manager: vcpkg


🎯 Minimal MVP Goal

A desktop application that:

  1. Captures microphone audio continuously
  2. Transcribes Chinese speech to text (Whisper API)
  3. Translates the text into French (Claude API)
  4. Displays the translation in real time (ImGui)
  5. Provides a Stop button to end the session (no summary in the MVP)

🏗️ Technical Architecture

Pipeline

Audio Capture (PortAudio)
    ↓ (configurable audio chunks)
Whisper API (STT)
    ↓ (Chinese text)
Claude API (translation)
    ↓ (French text)
ImGui UI (real-time display + Stop button)

Threading Model

Thread 1 - Audio Capture:
  - PortAudio callback captures audio
  - Accumulates chunks (configurable size)
  - Pushes each chunk onto a thread-safe queue
  - Saves a WAV backup in the background

Thread 2 - AI Processing:
  - Pops a chunk from the audio queue
  - POSTs to the Whisper API → Chinese transcription
  - POSTs to the Claude API → French translation
  - Pushes the result onto the UI queue

Thread 3 - Main UI (ImGui):
  - Renders the ImGui window
  - Displays translations from the queue
  - Handles the Stop button
  - Updates status/duration
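
The queues connecting the three threads could look roughly like the sketch below. It is a minimal illustration, not the project's actual utils/ThreadSafeQueue.h: the method names push, pop and close are assumptions, and the close() call is one possible way to let blocked consumers exit when Stop is pressed.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical sketch of a blocking queue; push/pop/close are illustrative
// names, not taken from the real source.
template <typename T>
class ThreadSafeQueue {
public:
    // Producer side: enqueue one item and wake one waiting consumer.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (closed_) return;  // items pushed after close() are dropped
            items_.push_back(std::move(item));
        }
        ready_.notify_one();
    }

    // Consumer side: block until an item is available or the queue is
    // closed. Returns std::nullopt only when closed and fully drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

    // Called on Stop: wakes every blocked consumer so its thread can exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};
```

In this sketch, items already queued when close() is called are still delivered, so the last audio chunk can be processed before the processing thread exits.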

📁 Structure Projet

secondvoice/
├── .env                            # API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY)
├── .gitignore
├── CMakeLists.txt                  # Build configuration
├── vcpkg.json                      # Dependencies manifest
├── config.json                     # Runtime config (audio chunk size, etc)
├── README.md
├── docs/
│   ├── SecondVoice.md             # Vision document
│   └── implementation_plan.md      # This document
├── src/
│   ├── main.cpp                    # Entry point + ImGui main loop
│   ├── audio/
│   │   ├── AudioCapture.h
│   │   ├── AudioCapture.cpp        # PortAudio wrapper
│   │   ├── AudioBuffer.h
│   │   └── AudioBuffer.cpp         # Thread-safe ring buffer
│   ├── api/
│   │   ├── WhisperClient.h
│   │   ├── WhisperClient.cpp       # Whisper API client
│   │   ├── ClaudeClient.h
│   │   └── ClaudeClient.cpp        # Claude API client
│   ├── ui/
│   │   ├── TranslationUI.h
│   │   └── TranslationUI.cpp       # ImGui interface
│   ├── utils/
│   │   ├── Config.h
│   │   ├── Config.cpp              # Load .env + config.json
│   │   ├── ThreadSafeQueue.h       # Thread-safe queue template
│   │   └── Logger.h                # Simple logging
│   └── core/
│       ├── Pipeline.h
│       └── Pipeline.cpp            # Orchestrate threads
├── recordings/                     # Output audio files
│   └── .gitkeep
└── build/                          # CMake build output (ignored)

🔧 Dependencies

vcpkg.json

{
  "name": "secondvoice",
  "version": "0.1.0",
  "dependencies": [
    "portaudio",
    "cpp-httplib",
    "nlohmann-json",
    {
      "name": "imgui",
      "features": ["glfw-binding", "opengl3-binding"]
    },
    "glfw3",
    "opengl"
  ]
}

System Requirements (Linux)

# PortAudio dependencies
sudo apt install libasound2-dev

# OpenGL dependencies
sudo apt install libgl1-mesa-dev libglu1-mesa-dev

⚙️ Configuration

.env (project root)

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
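
Loading these keys could be done with a small parser like the one below. This is a sketch under assumptions: the function name parseEnv is hypothetical, and it only handles plain KEY=VALUE lines (no quoting, no variable expansion).

```cpp
#include <map>
#include <sstream>
#include <string>

// Trim spaces, tabs and carriage returns from both ends of a string.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: parse KEY=VALUE lines from the text of a .env file.
// Blank lines, '#' comments and lines without '=' are skipped.
std::map<std::string, std::string> parseEnv(const std::string& text) {
    std::map<std::string, std::string> vars;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = trim(line.substr(0, eq));
        if (!key.empty()) vars[key] = trim(line.substr(eq + 1));
    }
    return vars;
}
```

The caller would read the file into a string first and fail early if either OPENAI_API_KEY or ANTHROPIC_API_KEY is missing from the returned map.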

config.json (project root)

{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_duration_seconds": 10,
    "format": "wav"
  },
  "whisper": {
    "model": "whisper-1",
    "language": "zh",
    "temperature": 0.0
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
    "max_tokens": 1024,
    "temperature": 0.3,
    "system_prompt": "Tu es un traducteur professionnel chinois-français. Traduis le texte suivant de manière naturelle et contextuelle."
  },
  "ui": {
    "window_width": 800,
    "window_height": 600,
    "font_size": 16,
    "max_display_lines": 50
  },
  "recording": {
    "save_audio": true,
    "output_directory": "./recordings"
  }
}

🔌 API Clients

Whisper API

// POST https://api.openai.com/v1/audio/transcriptions
// Content-Type: multipart/form-data
// Authorization: Bearer {OPENAI_API_KEY}

Request:
- file: audio.wav (binary)
- model: whisper-1
- language: zh
- temperature: 0.0

Response:
{
  "text": "你好,今天我们讨论项目进度..."
}

Claude API

// POST https://api.anthropic.com/v1/messages
// Content-Type: application/json
// x-api-key: {ANTHROPIC_API_KEY}
// anthropic-version: 2023-06-01

Request:
{
  "model": "claude-haiku-4-20250514",
  "max_tokens": 1024,
  "messages": [{
    "role": "user",
    "content": "Traduis en français: 你好,今天我们讨论项目进度..."
  }]
}

Response:
{
  "content": [{
    "type": "text",
    "text": "Bonjour, aujourd'hui nous discutons de l'avancement du projet..."
  }],
  "model": "claude-haiku-4-20250514",
  "usage": {...}
}
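
The request above does not show where the system_prompt from config.json goes. One option is the top-level "system" field of the Messages API, as in the sketch below. The helper names jsonEscape and buildClaudeRequest are hypothetical, and the body is assembled by hand only to keep the example free of dependencies; in the project, nlohmann-json would normally build it.

```cpp
#include <cstdio>
#include <string>

// Escape a UTF-8 string for use inside a JSON string literal. Multi-byte
// UTF-8 sequences (e.g. Chinese text) pass through unchanged.
std::string jsonEscape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Hypothetical helper: build the request body, sending the configured
// system prompt in "system" and the transcription as the user message.
std::string buildClaudeRequest(const std::string& model, int maxTokens,
                               const std::string& systemPrompt,
                               const std::string& chineseText) {
    return "{\"model\":\"" + jsonEscape(model) + "\","
           "\"max_tokens\":" + std::to_string(maxTokens) + ","
           "\"system\":\"" + jsonEscape(systemPrompt) + "\","
           "\"messages\":[{\"role\":\"user\",\"content\":\"" +
           jsonEscape(chineseText) + "\"}]}";
}
```

With the prompt in "system", the user message can carry only the transcribed text, so the instruction is not repeated in every chunk's content.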

🎨 ImGui Interface

Minimalist Layout

┌────────────────────────────────────────────┐
│ SecondVoice - Live Translation             │
├────────────────────────────────────────────┤
│                                            │
│ [●] Recording...    Duration: 00:05:23     │
│                                            │
│ ┌────────────────────────────────────────┐ │
│ │ 中文: 你好,今天我们讨论项目进度...    │ │
│ │ FR: Bonjour, aujourd'hui nous          │ │
│ │     discutons de l'avancement...       │ │
│ │                                        │ │
│ │ 中文: 关于预算的问题...                │ │
│ │ FR: Concernant la question du budget.. │ │
│ │                                        │ │
│ │ [Auto-scroll enabled]                  │ │
│ │                                        │ │
│ └────────────────────────────────────────┘ │
│                                            │
│         [    STOP RECORDING    ]           │
│                                            │
│ Status: Processing chunk 12/12             │
│ Audio: 16kHz mono, chunk size: 10s         │
└────────────────────────────────────────────┘

UI Features

  • Scrollable text area: Auto-scrolls; can be disabled for review
  • Color coding: Chinese (color 1), French (color 2)
  • Status bar: Duration, chunk count, processing status
  • Stop button: Stops capture and processing, saves the audio
  • Resizable window: Adaptive layout
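
Two small pieces of state behind this UI can be sketched without ImGui: the HH:MM:SS duration shown in the status bar, and a history capped by max_display_lines. The names below are hypothetical, and the sketch assumes max_display_lines counts Chinese/French pairs rather than rendered text lines.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>
#include <string>
#include <utility>

// Format elapsed seconds as HH:MM:SS, as in "Duration: 00:05:23".
std::string formatDuration(long totalSeconds) {
    if (totalSeconds < 0) totalSeconds = 0;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02ld:%02ld:%02ld",
                  totalSeconds / 3600, (totalSeconds / 60) % 60,
                  totalSeconds % 60);
    return buf;
}

// One displayed entry: the Chinese transcription and its French translation.
struct TranslationEntry {
    std::string chinese;
    std::string french;
};

// Hypothetical history container capped at maxEntries; the oldest entry
// is discarded once the cap is reached.
class TranslationHistory {
public:
    explicit TranslationHistory(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    void add(TranslationEntry entry) {
        if (maxEntries_ == 0) return;
        if (entries_.size() == maxEntries_) entries_.pop_front();
        entries_.push_back(std::move(entry));
    }

    const std::deque<TranslationEntry>& entries() const { return entries_; }

private:
    std::size_t maxEntries_;
    std::deque<TranslationEntry> entries_;
};
```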

🚀 Implementation Order

Phase 1 - Infrastructure Setup (Day 1)

Todo:

  1. Create the project structure
  2. Set up CMakeLists.txt with vcpkg
  3. Create .gitignore (.env, build/, recordings/)
  4. Create a config.json template
  5. Set up .env (API keys)
  6. Test a minimal build (hello world)

Validation: cmake -B build && cmake --build build compiles without errors


Phase 2 - Audio Capture (Days 1-2)

Todo:

  1. Implement AudioCapture.h/cpp:
    • Initialize PortAudio
    • Audio capture callback
    • Chunk accumulation (configurable duration)
    • Push to ThreadSafeQueue
  2. Implement AudioBuffer.h/cpp:
    • Ring buffer for raw audio
    • Thread-safe operations
  3. Standalone test: capture 30s of audio → save as WAV

Validation: WAV file is playable, duration is correct, quality is acceptable
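
The chunk accumulation step can be sketched independently of PortAudio. The class below is hypothetical and assumes 16-bit samples and positive configuration values. A real PortAudio callback should avoid allocating or blocking, so the actual implementation would likely copy into preallocated storage instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical accumulator: collects 16-bit PCM samples and emits a chunk
// once chunk_duration_seconds of audio has been gathered. With 16 kHz mono
// and 10 s chunks, one chunk holds 16000 * 1 * 10 = 160000 samples
// (320000 bytes).
class ChunkAccumulator {
public:
    using Chunk = std::vector<std::int16_t>;

    ChunkAccumulator(int sampleRate, int channels, int chunkSeconds,
                     std::function<void(Chunk)> onChunk)
        : samplesPerChunk_(static_cast<std::size_t>(sampleRate) * channels * chunkSeconds),
          onChunk_(std::move(onChunk)) {
        current_.reserve(samplesPerChunk_);
    }

    // Append interleaved samples; emits zero or more complete chunks.
    void append(const std::int16_t* samples, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            current_.push_back(samples[i]);
            if (current_.size() == samplesPerChunk_) emit();
        }
    }

    // Emit whatever is left (a shorter final chunk) when recording stops.
    void flush() {
        if (!current_.empty()) emit();
    }

private:
    void emit() {
        Chunk done;
        done.swap(current_);
        current_.reserve(samplesPerChunk_);
        onChunk_(std::move(done));
    }

    std::size_t samplesPerChunk_;
    std::function<void(Chunk)> onChunk_;
    Chunk current_;
};
```

The onChunk callback is where a chunk would be pushed onto the audio queue; flush() covers the partial chunk left over when Stop is pressed.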


Phase 3 - Whisper Client (Day 2)

Todo:

  1. Implement WhisperClient.h/cpp:
    • Load the API key from .env
    • POST multipart/form-data (cpp-httplib)
    • Encode the audio as WAV in memory
    • Parse the JSON response
    • Error handling (retry, timeout)
  2. Standalone test: audio file → Whisper → Chinese text

Validation: Chinese transcription is correct on a sample audio file
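
Encoding the audio as WAV in memory amounts to prefixing the raw samples with a 44-byte RIFF header. The sketch below shows one way to do it for 16-bit PCM; the names encodeWav and putLE are hypothetical, not the project's API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append an integer in little-endian byte order, as the WAV format requires.
static void putLE(std::string& out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
    }
}

// Hypothetical helper: wrap raw 16-bit PCM samples in a canonical 44-byte
// RIFF/WAVE header and return the whole file as a byte string.
std::string encodeWav(const std::vector<std::int16_t>& samples,
                      std::uint32_t sampleRate, std::uint16_t channels) {
    const std::uint16_t bitsPerSample = 16;
    const std::uint16_t blockAlign =
        static_cast<std::uint16_t>(channels * (bitsPerSample / 8));
    const std::uint32_t byteRate = sampleRate * blockAlign;
    const std::uint32_t dataSize = static_cast<std::uint32_t>(samples.size() * 2);

    std::string wav;
    wav.reserve(44 + dataSize);
    wav += "RIFF";
    putLE(wav, 36 + dataSize, 4);   // size of everything after this field
    wav += "WAVE";
    wav += "fmt ";
    putLE(wav, 16, 4);              // fmt chunk size for PCM
    putLE(wav, 1, 2);               // audio format 1 = uncompressed PCM
    putLE(wav, channels, 2);
    putLE(wav, sampleRate, 4);
    putLE(wav, byteRate, 4);
    putLE(wav, blockAlign, 2);
    putLE(wav, bitsPerSample, 2);
    wav += "data";
    putLE(wav, dataSize, 4);
    for (std::int16_t s : samples) {
        putLE(wav, static_cast<std::uint16_t>(s), 2);
    }
    return wav;
}
```

The returned string can serve as the content of the file part in the multipart request, so no temporary file is needed per chunk.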


Phase 4 - Claude Client (Days 2-3)

Todo:

  1. Implement ClaudeClient.h/cpp:
    • Load the API key from .env
    • POST JSON request (cpp-httplib)
    • Configurable system prompt
    • Parse the response (extract the text)
    • Error handling
  2. Standalone test: Chinese text → Claude → French text

Validation: French translation is natural and correct


Phase 5 - ImGui UI (Day 3)

Todo:

  1. Set up ImGui + GLFW + OpenGL:
    • Window creation
    • Render loop
    • Input handling
  2. Implement TranslationUI.h/cpp:
    • Scrollable text area
    • Display messages (CN + FR)
    • Stop button
    • Status bar (duration, chunk count)
  3. Standalone test: display mock data

Validation: UI is responsive, text displays correctly, the button works


Phase 6 - Pipeline Integration (Day 4)

Todo:

  1. Implement Pipeline.h/cpp:
    • Thread 1: AudioCapture loop
    • Thread 2: Processing loop (Whisper → Claude)
    • Thread 3: UI loop (ImGui)
    • ThreadSafeQueue between threads
    • Synchronization (start/stop)
  2. Implement Config.h/cpp:
    • Load .env (API keys)
    • Load config.json (settings)
  3. Implement main.cpp:
    • Initialize all components
    • Start the pipeline
    • Handle graceful shutdown

Validation: the complete pipeline works end to end
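
The start/stop synchronization could be built from a small worker wrapper like the one below. It is a hypothetical building block, not the actual Pipeline class: it runs a step function on its own thread until a std::atomic<bool> flag is cleared, then joins.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical building block for Pipeline: runs `step` repeatedly on its
// own thread until stop() is called. The atomic flag is the only state
// shared between the controlling thread and the worker.
class StoppableWorker {
public:
    explicit StoppableWorker(std::function<void()> step) : step_(std::move(step)) {}
    ~StoppableWorker() { stop(); }  // never destroy a joinable std::thread

    StoppableWorker(const StoppableWorker&) = delete;
    StoppableWorker& operator=(const StoppableWorker&) = delete;

    void start() {
        if (running_.exchange(true)) return;  // already running
        thread_ = std::thread([this] {
            while (running_.load()) step_();
        });
    }

    // Request shutdown and wait for the current step to finish.
    void stop() {
        running_.store(false);
        if (thread_.joinable()) thread_.join();
    }

    bool running() const { return running_.load(); }

private:
    std::function<void()> step_;
    std::atomic<bool> running_{false};
    std::thread thread_;
};
```

If a step blocks while waiting on a queue, stop() only returns once that wait ends, so the queue has to be unblocked first. The UI loop would stay on the main thread rather than use a wrapper like this.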


Phase 7 - Testing & Tuning (Day 5)

Todo:

  1. Test with real Chinese audio:
    • Sample conversations
    • Different audio qualities
    • Different chunk sizes (5s, 10s, 30s)
  2. Measure latency:
    • Audio → Whisper: X seconds
    • Whisper → Claude: Y seconds
    • Total: Z seconds
  3. Debug & fix bugs:
    • Memory leaks
    • Thread safety issues
    • API error handling
  4. Optimize:
    • Optimal chunk size (latency vs. accuracy tradeoff)
    • API timeout values
    • UI refresh rate

Validation:

  • Total latency < 10s is acceptable
  • No crash during a 30-minute recording
  • Transcription and translation are understandable
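
The X/Y/Z latency figures could be collected with a helper like the one below. It is a sketch with hypothetical names; the total here is simply the sum of the two API call times and does not include time a chunk spends waiting in a queue.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: run a callable and return how long it took, in
// milliseconds. steady_clock is monotonic, so the result is unaffected
// by system clock adjustments.
template <typename Fn>
double measureMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Per-chunk latency record matching the figures listed in the plan.
struct ChunkLatency {
    double whisperMs = 0.0;  // audio -> Chinese text
    double claudeMs = 0.0;   // Chinese text -> French text
    double totalMs() const { return whisperMs + claudeMs; }
    // Target from the validation list: total under 10 s.
    bool withinTarget() const { return totalMs() < 10000.0; }
};
```

Wrapping each API call in measureMs and logging one ChunkLatency per chunk would make it straightforward to compare the 5s, 10s and 30s chunk sizes.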

🧪 Test Plan

Unit Tests (Phase 2+)

  • AudioCapture: captures audio in the correct format
  • WhisperClient: mocked API call, JSON parsing
  • ClaudeClient: mocked API call, JSON parsing
  • ThreadSafeQueue: thread safety, no data loss

Integration Tests

  • Audio → Whisper: audio file → correct Chinese text
  • Whisper → Claude: Chinese text → correct French translation
  • Pipeline: audio → UI display, end to end

End-to-End Test

  • Record a 5-minute real Chinese conversation
  • Verify transcription accuracy (>85%)
  • Verify the translation is understandable
  • Verify the UI stays responsive
  • Verify the audio is saved correctly

📊 Metrics to Track

Performance

  • Whisper latency: API call time (target: <3s for 10s of audio)
  • Claude latency: API call time (target: <2s for 200 tokens)
  • Total latency: Audio → Display (target: <10s)
  • Memory usage: Stable over long sessions (no leaks)
  • CPU usage: Acceptable (<50% on a laptop)

Quality

  • Whisper accuracy: % of words correct (target: >85%)
  • Claude quality: Natural translation (subjective)
  • Crash rate: 0 crashes over a 1h recording

Cost

  • Whisper: $0.006/min of audio (60 min × $0.006 = $0.36/h)
  • Claude: ~$0.03-0.05/h (depends on text volume)
  • Total: ~$0.40/h of meeting ($0.36 + $0.03-0.05)

⚠️ Risks & Mitigations

Risk                  | Impact   | Mitigation
Whisper API timeout   | Blocking | Retry logic, 30s timeout, fallback queue
Claude API rate limit | Medium   | Exponential backoff, queue requests
Audio buffer overflow | Medium   | Adequately sized ring buffer, drop old chunks if needed
Thread deadlock       | Blocking | Use std::lock_guard, avoid nested locks
Memory leak           | Medium   | Use smart pointers, valgrind tests
Network interruption  | Medium   | Retry logic, cache audio locally
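
The "drop old chunks if needed" mitigation for buffer overflow could be implemented with a fixed-capacity ring buffer that overwrites its oldest element, as sketched below. The class name is hypothetical, and the sketch requires the element type to be default-constructible.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity ring buffer that overwrites the oldest
// element when full, guarded by a single mutex.
template <typename T>
class DropOldestRing {
public:
    explicit DropOldestRing(std::size_t capacity) : slots_(capacity) {}

    // Returns true if something had to be dropped to make room.
    bool push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (slots_.empty()) return true;  // zero capacity: value is dropped
        bool dropped = false;
        if (count_ == slots_.size()) {
            head_ = (head_ + 1) % slots_.size();  // discard the oldest
            --count_;
            dropped = true;
        }
        slots_[(head_ + count_) % slots_.size()] = std::move(value);
        ++count_;
        return dropped;
    }

    // Remove and return the oldest element, or std::nullopt if empty.
    std::optional<T> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) return std::nullopt;
        T value = std::move(slots_[head_]);
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return value;
    }

private:
    std::mutex mutex_;
    std::vector<T> slots_;
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};
```

The boolean returned by push gives the caller a hook for logging or counting dropped chunks, which is useful when tuning the buffer size.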

🎯 MVP Success Criteria

The MVP is validated if:

  1. Microphone audio capture works
  2. Chinese transcription is >85% accurate
  3. French translation is understandable
  4. The UI displays translations in real time
  5. The Stop button shuts down cleanly
  6. Audio is saved correctly
  7. No crash during a 30-minute recording
  8. Total latency <10s is acceptable

📝 Implementation Notes

Thread Safety

  • Use std::mutex + std::lock_guard for the queues
  • No shared state without protection
  • Use std::atomic<bool> for flags (running, stopping)

Error Handling

  • Try/catch around API calls
  • Log errors (spdlog or plain std::cout)
  • Retry logic (max 3 attempts)
  • Graceful degradation (skip the chunk if the error persists)

Audio Format

  • Sample rate: 16kHz (optimal for Whisper)
  • Channels: Mono (sufficient, reduces bandwidth)
  • Format: 16-bit PCM WAV
  • Chunk size: Configurable (default 10s)

API Best Practices

  • Timeout: 30s pour Whisper, 15s pour Claude
  • Retry: Exponential backoff (1s, 2s, 4s)
  • Rate limiting: Respect API limits (monitor 429 errors)
  • Headers: Always set User-Agent, API version

🔄 Post-MVP (Phase 2)

Not included in MVP, but planned:

  • Automatic post-meeting summary (Claude summary)
  • Structured export (transcripts + audio)
  • Search system (backlog)
  • Diarization (who is speaking)
  • Replay mode
  • More elaborate GUI (settings, etc.)

MVP focus: a working end-to-end pipeline, concept validation, and real use in a first meeting.


Document created: 20 November 2025
Status: Ready to implement
Estimated effort: 5 days of development + 2 days of testing