# SecondVoice

Real-time Chinese to French translation system for live meetings.

## Overview

SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real time. Designed for understanding Chinese meetings, calls, and conversations on the fly.

### Why This Project?

Built to solve a real need: understanding Chinese meetings in real time without constant reliance on bilingual support. Perfect for:

- Business meetings with Chinese speakers
- Family/administrative calls
- Professional conferences
- Any live Chinese conversation where real-time comprehension is needed

**Status**: MVP complete, actively being debugged and improved based on real-world usage.

## Quick Start

### Windows (MinGW) - Recommended

```batch
REM First-time setup
.\setup_mingw.bat

REM Build
.\build_mingw.bat

REM Run
cd build\mingw-Release
SecondVoice.exe
```

**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.

See full setup instructions below for other platforms.

## Features

- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
- 🔇 **Noise reduction** using RNNoise neural network
- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
- 🧠 **Hallucination filtering** - removes known Whisper artifacts
- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
- 💾 **Full session recording** with structured logging
- 📊 **Session archival** - audio, transcripts, translations, and metadata
- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
- ⚙️ **Configurable settings** via config.json

## Requirements

### Cross-Platform Support

SecondVoice works on **Windows** and **Linux**.
#### Windows

- Visual Studio 2019 or later (with C++ tools)
- vcpkg package manager
- See detailed guide: [docs/build_windows.md](docs/build_windows.md)

#### Linux

- GCC/Clang with C++17 support
- System dependencies: `libasound2-dev`, `libgl1-mesa-dev`, `libglu1-mesa-dev`
- vcpkg package manager

### vcpkg Installation

**Linux**:

```bash
git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
export VCPKG_ROOT=$(pwd)
```

**Windows**:

```powershell
git clone https://github.com/microsoft/vcpkg.git C:\vcpkg
cd C:\vcpkg
.\bootstrap-vcpkg.bat
setx VCPKG_ROOT "C:\vcpkg"
```

## Setup

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd secondvoice
   ```

2. **Create `.env` file** (copy from `.env.example`)

   **Linux**:

   ```bash
   cp .env.example .env
   nano .env
   # Add your API keys:
   # OPENAI_API_KEY=sk-...
   # ANTHROPIC_API_KEY=sk-ant-...
   ```

   **Windows**:

   ```powershell
   copy .env.example .env
   notepad .env
   # Add your API keys
   ```

3. **Build the project**

   **Linux**:

   ```bash
   ./build.sh
   # Or manually:
   # cmake -B build -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
   # cmake --build build -j$(nproc)
   ```

   **Windows**:

   ```batch
   build.bat --release
   REM Or see detailed guide: docs/build_windows.md
   ```

## Usage

**Linux**:

```bash
cd build
./SecondVoice
```

**Windows**:

```batch
cd build\windows-release\Release
SecondVoice.exe
```

The application will:

1. Open an ImGui window
2. Start capturing audio from your microphone
3. Display Chinese transcriptions and French translations in real time
4. Wait for you to click the **STOP RECORDING** button to finish
5.
   Save the full audio recording to `recordings/recording_YYYYMMDD_HHMMSS.wav`

## Architecture

```
Audio Input (16kHz mono)
    ↓
Voice Activity Detection (VAD) - RMS + Peak thresholds
    ↓
Noise Reduction (RNNoise) - 16→48→16 kHz resampling
    ↓
Opus Encoding (24kbps OGG) - 46x compression
    ↓
Whisper API (gpt-4o-mini-transcribe) - Chinese STT
    ↓
Hallucination Filter - Remove known artifacts
    ↓
Claude API (claude-haiku-4) - Chinese → French translation
    ↓
ImGui UI Display + Session Logging
```

### Threading Model (3 threads)

1. **Audio Thread** (`Pipeline::audioThread`)
   - PortAudio callback captures 16kHz mono audio
   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
   - Pushes speech chunks to processing queue

2. **Processing Thread** (`Pipeline::processingThread`)
   - Consumes audio chunks from queue
   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
   - Encodes to Opus/OGG for bandwidth efficiency
   - Calls Whisper API for Chinese transcription
   - Filters known hallucinations (YouTube phrases, music markers, etc.)
   - Calls Claude API for French translation
   - Logs to session files

3. **UI Thread** (main)
   - GLFW/ImGui rendering loop (must run on main thread)
   - Displays real-time transcription and translation
   - Allows runtime VAD threshold adjustment
   - Handles user controls (stop recording, etc.)
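
The VAD step performed by the audio thread can be pictured with a minimal sketch. It assumes float samples in the range [-1, 1], treats a frame as speech only when both the RMS and the peak level reach their thresholds (the text says only "RMS + Peak", so the AND rule is a guess), and adds a hypothetical hangover counter that keeps a segment open for a few silent frames. The names `VadThresholds`, `measureLevels`, `isSpeech`, and `Segmenter`, and the default values, are illustrative and are not taken from `AudioCapture.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical threshold values; the real ones are adjustable at runtime in the UI.
struct VadThresholds {
    float rms = 0.02f;
    float peak = 0.08f;
};

struct FrameLevels {
    float rms;
    float peak;
};

// Compute RMS and absolute peak of one frame of float samples in [-1, 1].
FrameLevels measureLevels(const std::vector<float>& samples) {
    if (samples.empty()) {
        return {0.0f, 0.0f};
    }
    double sum_sq = 0.0;
    float peak = 0.0f;
    for (float s : samples) {
        sum_sq += static_cast<double>(s) * s;
        peak = std::max(peak, std::fabs(s));
    }
    const double mean_sq = sum_sq / static_cast<double>(samples.size());
    return {static_cast<float>(std::sqrt(mean_sq)), peak};
}

// Assumption: a frame counts as speech only if BOTH levels reach their thresholds.
bool isSpeech(const std::vector<float>& samples, const VadThresholds& t) {
    assert(t.rms >= 0.0f && t.peak >= 0.0f);
    const FrameLevels levels = measureLevels(samples);
    return levels.rms >= t.rms && levels.peak >= t.peak;
}

// Hypothetical hangover logic: a segment ends only after `hangover_frames`
// consecutive non-speech frames, so short pauses do not split a phrase.
class Segmenter {
public:
    explicit Segmenter(int hangover_frames) : hangover_(hangover_frames) {}

    // Feed one frame decision; returns true when a segment has just ended.
    bool push(bool speech) {
        if (speech) {
            in_segment_ = true;
            silent_run_ = 0;
            return false;
        }
        if (!in_segment_) {
            return false;
        }
        if (++silent_run_ >= hangover_) {
            in_segment_ = false;
            silent_run_ = 0;
            return true;
        }
        return false;
    }

    bool inSegment() const { return in_segment_; }

private:
    int hangover_;
    int silent_run_ = 0;
    bool in_segment_ = false;
};
```

In this sketch, raising the hangover count trades latency for longer segments, while raising the thresholds rejects more background noise at the risk of dropping quiet speech.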
### Core Components

**Audio Processing**:

- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
- `NoiseReducer.cpp` - RNNoise denoising with resampling

**API Clients**:

- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)

**Core Logic**:

- `Pipeline.cpp` - Orchestrates audio → transcription → translation flow
- `TranslationUI.cpp` - ImGui interface with VAD controls

**Utilities**:

- `Config.cpp` - Loads config.json + .env
- `ThreadSafeQueue.h` - Lock-free queue for audio chunks

## Known Issues & Active Debugging

**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).

### Current Problems

Based on transcript analysis from actual meetings (November 2025):

1. **VAD cutting speech too early**
   - Voice Activity Detection triggers end-of-segment prematurely
   - Results in fragmented phrases ("我很。" → "Je suis.")
   - **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios

2. **Segments too short for context**
   - Whisper receives insufficient audio context for accurate Chinese transcription
   - Single-word or two-word segments lack conversational context
   - **Impact**: Lower accuracy, especially with homophones

3. **Ambient noise interpreted as speech**
   - Background sounds trigger false VAD positives
   - Test transcript shows "太多声音了" (too much noise) being captured
   - **Mitigation**: RNNoise helps but is not sufficient for very noisy environments

4.
   **Loss of inter-segment context**
   - Each audio chunk processed independently
   - Whisper cannot use previous context for better transcription
   - **Potential solution**: Pass previous 2-3 transcriptions in prompt

### Test Conditions

Testing has been performed under **deliberately degraded conditions** to ensure robustness:

- Multiple simultaneous speakers
- Variable microphone distance
- Variable volume levels
- Fast-paced conversations
- Low-quality microphone

These conditions are intentionally harsh to validate real-world meeting scenarios.

### Debug Plan

See `PLAN_DEBUG.md` for:

- Detailed session logging implementation (JSON per segment + metadata)
- Improved Whisper prompt engineering
- VAD threshold tuning recommendations
- Context propagation strategies

## Session Logging

### Structure

```
sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json         # Session metadata
    ├── segments/
    │   ├── 001.json         # Segment: Chinese + French + metadata
    │   ├── 002.json
    │   └── ...
    └── transcript.txt       # Final export
```

### Segment Format

```json
{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}
```

**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.

## Configuration

### config.json

```json
{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_duration_seconds": 10
  },
  "whisper": {
    "model": "gpt-4o-mini-transcribe",
    "language": "zh",
    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
    "max_tokens": 1024
  }
}
```

### .env

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

## Cost Estimation

- **Whisper**: ~$0.006/minute (~$0.36/hour)
- **Claude Haiku**: ~$0.03-0.05/hour
- **Total**: ~$0.40/hour of recording

## Advanced Features

### GPU Forcing (Hybrid Graphics Systems)

`main.cpp` exports symbols to force dedicated GPU on Optimus/PowerXpress systems:

- `NvOptimusEnablement` - Forces NVIDIA GPU
- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU

Critical for laptops with both integrated and dedicated GPUs.

### Hallucination Filtering

`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:

- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
- Music symbols: "♪♪", "🎵"
- Silence markers: "...", "silence", "inaudible"

These are automatically filtered before translation to avoid wasting API calls.

### Console-Only Build

A `SecondVoice_Console` target exists for headless testing:

- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout
- Useful for debugging and automated testing

## Development

### Building in Debug Mode

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
cmake --build build
```

### Running Tests

```bash
# TODO: Add tests
```

## Troubleshooting

### No audio capture

- Check microphone permissions
- Verify PortAudio is properly installed: `pa_devs` (if available)
- Try different audio device in code

### API errors

- Verify API keys in `.env` are correct
- Check internet connection
- Monitor API rate limits

### Build errors

- Ensure vcpkg is properly set up
- Check all system dependencies are installed
- Try `cmake --build build --clean-first`

## Project Structure

```
secondvoice/
├── src/
│   ├── main.cpp                 # Entry point, forces NVIDIA GPU
│   ├── core/
│   │   └── Pipeline.cpp         # Audio→Transcription→Translation orchestration
│   ├── audio/
│   │   ├── AudioCapture.cpp     # PortAudio + VAD segmentation
│   │   ├── AudioBuffer.cpp      # Sample accumulation, WAV/Opus export
│   │   └── NoiseReducer.cpp     # RNNoise (16→48→16 kHz)
│   ├── api/
│   │   ├── WhisperClient.cpp    # OpenAI Whisper (multipart/form-data)
│   │   ├── ClaudeClient.cpp     # Anthropic Claude (JSON)
│   │   └── WinHttpClient.cpp    # Native Windows HTTP
│   ├── ui/
│   │   └── TranslationUI.cpp    # ImGui interface + VAD controls
│   └── utils/
│       ├── Config.cpp           # config.json + .env loader
│       └── ThreadSafeQueue.h    # Lock-free audio queue
├── docs/                        # Build guides
├── sessions/                    # Session recordings + logs
├── recordings/                  # Legacy recordings directory
├── denoised/                    # Denoised audio outputs
├── config.json                  # Runtime configuration
├── .env                         # API keys (not committed)
├── CLAUDE.md                    # Development guide for Claude Code
├── PLAN_DEBUG.md                # Active debugging plan
└── CMakeLists.txt               # Build configuration
```

### External Dependencies

**Fetched via CMake FetchContent**:

- ImGui v1.90.1 - UI framework
- Opus v1.5.2 - Audio encoding
- Ogg v1.3.6 - Container format
- RNNoise v0.1.1 - Neural network noise reduction

**vcpkg Dependencies** (x64-mingw-static triplet):

- portaudio - Cross-platform audio I/O
- nlohmann_json - JSON parsing
- glfw3 - Windowing/input
- glad - OpenGL loader

## Roadmap

### Phase 1 - MVP ✅ (Complete)

- ✅ Audio capture with VAD
- ✅ Noise reduction (RNNoise)
- ✅ Whisper API integration
- ✅ Claude API integration
- ✅ ImGui UI with runtime VAD adjustment
- ✅ Opus compression
- ✅ Hallucination filtering
- ✅ Session recording

### Phase 2 - Debugging 🔄 (Current)

- 🔄 Session logging (JSON per segment)
- 🔄 Improved Whisper prompt engineering
- 🔄 VAD threshold optimization
- 🔄 Context propagation between segments
- ⬜ Automated testing with sample audio

### Phase 3 - Enhancement

- ⬜ Auto-summary
  post-meeting (Claude analysis)
- ⬜ Full-text search (SQLite FTS5)
- ⬜ Semantic search (embeddings)
- ⬜ Speaker diarization
- ⬜ Replay mode with synced transcripts
- ⬜ Multi-language support extension

## Development Documentation

- **CLAUDE.md** - Development guide for Claude Code AI assistant
- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
- **WINDOWS_BUILD.md** - Detailed Windows build instructions
- **WINDOWS_MINGW.md** - MinGW-specific build guide
- **WINDOWS_QUICK_START.md** - Quick start for Windows users

## Contributing

This is a personal project built to solve a real need. Bug reports and suggestions welcome:

- **Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
- **Architecture**: See `CLAUDE.md` for detailed system design

## License

See LICENSE file.

## Acknowledgments

- OpenAI Whisper for excellent Chinese transcription
- Anthropic Claude for context-aware translation
- RNNoise for neural network-based noise reduction
- ImGui for clean, immediate-mode UI