SecondVoice
Real-time Chinese to French translation system for live meetings.
Overview
SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
Why This Project?
Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:
- Business meetings with Chinese speakers
- Family/administrative calls
- Professional conferences
- Any live Chinese conversation where real-time comprehension is needed
Status: MVP complete, actively being debugged and improved based on real-world usage.
Quick Start
Windows (MinGW) - Recommended
# First-time setup
.\setup_mingw.bat
# Build
.\build_mingw.bat
# Run
cd build\mingw-Release
SecondVoice.exe
Requirements: .env file with OPENAI_API_KEY and ANTHROPIC_API_KEY, plus a working microphone.
See full setup instructions below for other platforms.
Features
- 🎤 Real-time audio capture with Voice Activity Detection (VAD)
- 🔇 Noise reduction using RNNoise neural network
- 🗣️ Chinese speech-to-text via Whisper API (gpt-4o-mini-transcribe)
- 🧠 Hallucination filtering - removes known Whisper artifacts
- 🌐 Chinese to French translation via Claude AI (claude-haiku-4-20250514)
- 🖥️ Clean ImGui interface with adjustable VAD thresholds
- 💾 Full session recording with structured logging
- 📊 Session archival - audio, transcripts, translations, and metadata
- ⚡ Opus compression - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
- ⚙️ Configurable settings via config.json
Requirements
Cross-Platform Support
SecondVoice works on Windows and Linux.
Windows
- Visual Studio 2019 or later (with C++ tools)
- vcpkg package manager
- See detailed guide: docs/build_windows.md
Linux
- GCC/Clang with C++17 support
- System dependencies: libasound2-dev, libgl1-mesa-dev, libglu1-mesa-dev
- vcpkg package manager
vcpkg Installation
Linux:
git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
export VCPKG_ROOT=$(pwd)
Windows:
git clone https://github.com/microsoft/vcpkg.git C:\vcpkg
cd C:\vcpkg
.\bootstrap-vcpkg.bat
setx VCPKG_ROOT "C:\vcpkg"
Setup
- Clone the repository
git clone <repository-url>
cd secondvoice
- Create a .env file (copy from .env.example)
Linux:
cp .env.example .env
nano .env
# Add your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
Windows:
copy .env.example .env
notepad .env
# Add your API keys
- Build the project
Linux:
./build.sh
# Or manually:
# cmake -B build -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
# cmake --build build -j$(nproc)
Windows:
build.bat --release
REM Or see detailed guide: docs/build_windows.md
Usage
Linux:
cd build
./SecondVoice
Windows:
cd build\windows-release\Release
SecondVoice.exe
The application will:
- Open an ImGui window
- Start capturing audio from your microphone
- Display Chinese transcriptions and French translations in real-time
- Save the full audio recording to recordings/recording_YYYYMMDD_HHMMSS.wav once you click the STOP RECORDING button
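The timestamped filename pattern above can be produced with `std::strftime`. This is a minimal sketch, not the project's code: `recordingPath` is a hypothetical helper name, and the fallback filename is an assumption.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: formats a timestamp into the
// recordings/recording_YYYYMMDD_HHMMSS.wav pattern described above.
std::string recordingPath(const std::tm& when) {
    char stamp[32];
    // strftime returns 0 when the buffer is too small; 15 characters are needed here.
    if (std::strftime(stamp, sizeof(stamp), "%Y%m%d_%H%M%S", &when) == 0) {
        return "recordings/recording_unknown.wav";  // assumed fallback name
    }
    return std::string("recordings/recording_") + stamp + ".wav";
}
```

In practice the `std::tm` would come from `std::localtime` applied to the time at which recording started.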
Architecture
Audio Input (16kHz mono)
↓
Voice Activity Detection (VAD) - RMS + Peak thresholds
↓
Noise Reduction (RNNoise) - 16→48→16 kHz resampling
↓
Opus Encoding (24kbps OGG) - 46x compression
↓
Whisper API (gpt-4o-mini-transcribe) - Chinese STT
↓
Hallucination Filter - Remove known artifacts
↓
Claude API (claude-haiku-4) - Chinese → French translation
↓
ImGui UI Display + Session Logging
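The VAD stage in the pipeline above relies on RMS and peak thresholds. A minimal sketch of that kind of check follows. The threshold values, the `VadThresholds` struct, and the rule that both levels must pass are assumptions for illustration; the README does not state the project's actual defaults or decision rule.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical thresholds; the real defaults are not given in this README.
struct VadThresholds {
    float rms = 0.02f;
    float peak = 0.08f;
};

// Root-mean-square level of a block of samples in [-1, 1].
float rmsLevel(const std::vector<float>& samples) {
    if (samples.empty()) return 0.0f;
    double sumSquares = 0.0;
    for (float s : samples) sumSquares += static_cast<double>(s) * s;
    return static_cast<float>(std::sqrt(sumSquares / samples.size()));
}

// Largest absolute sample value in the block.
float peakLevel(const std::vector<float>& samples) {
    float peak = 0.0f;
    for (float s : samples) peak = std::max(peak, std::fabs(s));
    return peak;
}

// Assumed rule: a block counts as speech only when both levels clear their thresholds.
bool isSpeech(const std::vector<float>& samples, const VadThresholds& t) {
    return rmsLevel(samples) >= t.rms && peakLevel(samples) >= t.peak;
}
```

Requiring both levels rejects short clicks (high peak, low RMS) as well as steady low-level hum (some RMS, low peak), which is one plausible reason to combine the two measures.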
Threading Model (3 threads)
- Audio Thread (Pipeline::audioThread)
  - PortAudio callback captures 16kHz mono audio
  - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
  - Pushes speech chunks to processing queue
- Processing Thread (Pipeline::processingThread)
  - Consumes audio chunks from queue
  - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
  - Encodes to Opus/OGG for bandwidth efficiency
  - Calls Whisper API for Chinese transcription
  - Filters known hallucinations (YouTube phrases, music markers, etc.)
  - Calls Claude API for French translation
  - Logs to session files
- UI Thread (main)
  - GLFW/ImGui rendering loop (must run on main thread)
  - Displays real-time transcription and translation
  - Allows runtime VAD threshold adjustment
  - Handles user controls (stop recording, etc.)
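RNNoise operates on 48 kHz audio, which is why the processing thread resamples 16 → 48 → 16 kHz around the denoiser. The sketch below shows only that factor-of-3 resampling, using linear interpolation and plain decimation. Both function names and the interpolation method are assumptions; the project's NoiseReducer.cpp may use a different resampler.

```cpp
#include <cstddef>
#include <vector>

// Upsample by 3 (16 kHz -> 48 kHz) with linear interpolation between neighbours.
std::vector<float> upsample3x(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() * 3);
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float a = in[i];
        const float b = (i + 1 < in.size()) ? in[i + 1] : in[i];  // hold the last sample
        out.push_back(a);
        out.push_back(a + (b - a) / 3.0f);
        out.push_back(a + 2.0f * (b - a) / 3.0f);
    }
    return out;
}

// Downsample by 3 (48 kHz -> 16 kHz) by keeping every third sample.
// No anti-aliasing filter is applied here; a production resampler would low-pass first.
std::vector<float> downsample3x(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() / 3 + 1);
    for (std::size_t i = 0; i < in.size(); i += 3) out.push_back(in[i]);
    return out;
}
```

The denoiser would run between the two calls, on 480-sample frames (10 ms at 48 kHz).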
Core Components
Audio Processing:
- AudioCapture.cpp - PortAudio wrapper with VAD-based segmentation
- AudioBuffer.cpp - Accumulates samples, exports WAV/Opus
- NoiseReducer.cpp - RNNoise denoising with resampling

API Clients:
- WhisperClient.cpp - OpenAI Whisper API (multipart/form-data)
- ClaudeClient.cpp - Anthropic Claude API (JSON)
- WinHttpClient.cpp - Native Windows HTTP client (replaced libcurl)

Core Logic:
- Pipeline.cpp - Orchestrates audio → transcription → translation flow
- TranslationUI.cpp - ImGui interface with VAD controls

Utilities:
- Config.cpp - Loads config.json + .env
- ThreadSafeQueue.h - Lock-free queue for audio chunks
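To illustrate the hand-off between the audio thread and the processing thread, here is a minimal queue sketch. ThreadSafeQueue.h is described above as lock-free; this example deliberately swaps in a simpler mutex-and-condition-variable design, and `BlockingQueue` with its `close()` method is a hypothetical name, not the project's API.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    // Producer side: enqueue one item and wake one waiting consumer.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Consumer side: blocks until an item arrives or close() is called.
    // Returns std::nullopt once the queue is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

    // Signals shutdown; remaining items can still be popped.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};
```

A blocking queue like this is easier to reason about than a lock-free one, at the cost of taking a lock on the audio path. That trade-off matters mostly when the producer runs inside a real-time audio callback.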
Known Issues & Active Debugging
Status: Real-world testing has identified issues with degraded audio conditions (see PLAN_DEBUG.md for details).
Current Problems
Based on transcript analysis from actual meetings (November 2025):
- VAD cutting speech too early
  - Voice Activity Detection triggers end-of-segment prematurely
  - Results in fragmented phrases ("我很。" → "Je suis.")
  - Hypothesis: Silence threshold too aggressive for multi-speaker scenarios
- Segments too short for context
  - Whisper receives insufficient audio context for accurate Chinese transcription
  - Single-word or two-word segments lack conversational context
  - Impact: Lower accuracy, especially with homonyms
- Ambient noise interpreted as speech
  - Background sounds trigger false VAD positives
  - Test transcript shows "太多声音了" (too much noise) being captured
  - Mitigation: RNNoise helps but not sufficient for very noisy environments
- Loss of inter-segment context
  - Each audio chunk processed independently
  - Whisper cannot use previous context for better transcription
  - Potential solution: Pass previous 2-3 transcriptions in prompt
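The potential solution to the last issue, passing the previous 2-3 transcriptions in the prompt, could be sketched as a small rolling buffer. `TranscriptContext` and its methods are hypothetical names, and joining the segments with spaces after the base prompt is an assumption about how the prompt would be assembled.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical helper: keeps the last few transcriptions for use as prompt context.
class TranscriptContext {
public:
    explicit TranscriptContext(std::size_t maxSegments) : maxSegments_(maxSegments) {}

    // Records a finished transcription, dropping the oldest once the limit is reached.
    void add(const std::string& transcription) {
        if (transcription.empty() || maxSegments_ == 0) return;
        recent_.push_back(transcription);
        if (recent_.size() > maxSegments_) recent_.pop_front();
    }

    // Base prompt followed by the recent transcriptions, oldest first.
    std::string buildPrompt(const std::string& basePrompt) const {
        std::string prompt = basePrompt;
        for (const std::string& text : recent_) {
            prompt += ' ';
            prompt += text;
        }
        return prompt;
    }

private:
    std::size_t maxSegments_;
    std::deque<std::string> recent_;
};
```

Only transcriptions that survive hallucination filtering should be added, otherwise a bad segment would be fed back into the following requests.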
Test Conditions
Testing has been performed under deliberately degraded conditions to ensure robustness:
- Multiple simultaneous speakers
- Variable microphone distance
- Variable volume levels
- Fast-paced conversations
- Low-quality microphone
These conditions are intentionally harsh to validate real-world meeting scenarios.
Debug Plan
See PLAN_DEBUG.md for:
- Detailed session logging implementation (JSON per segment + metadata)
- Improved Whisper prompt engineering
- VAD threshold tuning recommendations
- Context propagation strategies
Session Logging
Structure
sessions/
└── YYYY-MM-DD_HHMMSS/
├── session.json # Session metadata
├── segments/
│ ├── 001.json # Segment: Chinese + French + metadata
│ ├── 002.json
│ └── ...
└── transcript.txt # Final export
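The zero-padded segment filenames in the layout above can be generated with `std::snprintf`. This is a small sketch, assuming three-digit padding as shown in the tree; `segmentPath` is a hypothetical helper name.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: builds <sessionDir>/segments/NNN.json for a segment id.
std::string segmentPath(const std::string& sessionDir, int id) {
    char name[32];
    // %03d pads to three digits; larger ids simply use more digits.
    std::snprintf(name, sizeof(name), "%03d.json", id);
    return sessionDir + "/segments/" + name;
}
```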
Segment Format
{
"id": 1,
"chinese": "两个老鼠求我",
"french": "Deux souris me supplient"
}
Future enhancements: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
Configuration
config.json
{
"audio": {
"sample_rate": 16000,
"channels": 1,
"chunk_duration_seconds": 10
},
"whisper": {
"model": "gpt-4o-mini-transcribe",
"language": "zh",
"prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
},
"claude": {
"model": "claude-haiku-4-20250514",
"max_tokens": 1024
}
}
.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
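Reading the KEY=VALUE lines of a .env file takes only a few lines of standard C++. The sketch below is an illustration, not the project's Config.cpp: `parseEnv` and `trim` are hypothetical names, and quoting or variable expansion is not handled.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strips leading and trailing spaces, tabs, and carriage returns.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parses KEY=VALUE lines; blank lines and lines starting with '#' are skipped.
std::map<std::string, std::string> parseEnv(std::istream& in) {
    std::map<std::string, std::string> values;
    std::string line;
    while (std::getline(in, line)) {
        const std::string t = trim(line);
        if (t.empty() || t[0] == '#') continue;
        const auto eq = t.find('=');
        if (eq == std::string::npos) continue;  // not a KEY=VALUE line
        values[trim(t.substr(0, eq))] = trim(t.substr(eq + 1));
    }
    return values;
}
```

Passing a `std::ifstream` opened on `.env` (from `<fstream>`) works unchanged, since the function only needs a `std::istream`.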
Cost Estimation
- Whisper: ~$0.006/minute (~$0.36/hour)
- Claude Haiku: ~$0.03-0.05/hour
- Total: ~$0.40/hour of recording
Advanced Features
GPU Forcing (Hybrid Graphics Systems)
main.cpp exports symbols to force dedicated GPU on Optimus/PowerXpress systems:
- NvOptimusEnablement - Forces NVIDIA GPU
- AmdPowerXpressRequestHighPerformance - Forces AMD GPU
Critical for laptops with both integrated and dedicated GPUs.
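The two exported symbols are conventionally declared as shown below. This is a sketch of the usual pattern rather than a copy of main.cpp; the `SV_EXPORT` macro is a hypothetical name added so the snippet also compiles on non-Windows toolchains, where the symbols have no effect.

```cpp
// The GPU drivers look for these exported globals in the executable on Windows.
#if defined(_WIN32)
#define SV_EXPORT __declspec(dllexport)
#else
#define SV_EXPORT  // hypothetical no-op so the snippet builds elsewhere
#endif

extern "C" {
// Non-zero asks the NVIDIA Optimus driver to prefer the dedicated GPU.
SV_EXPORT unsigned long NvOptimusEnablement = 0x00000001;
// Non-zero asks the AMD PowerXpress driver to prefer the high-performance GPU.
SV_EXPORT int AmdPowerXpressRequestHighPerformance = 1;
}
```

The symbols must be exported from the executable itself, not from a DLL it loads, because the driver inspects the process image at startup.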
Hallucination Filtering
Pipeline.cpp maintains an extensive list (~65 patterns) of known Whisper hallucinations:
- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
- Music symbols: "♪♪", "🎵"
- Silence markers: "...", "silence", "inaudible"
These are automatically filtered before translation to avoid wasting API calls.
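A filter of this kind can be as simple as a substring scan over a pattern list. The sketch below uses a small subset of the examples listed above. The function names and the case-sensitive substring matching are assumptions; the actual logic in Pipeline.cpp may differ.

```cpp
#include <string>
#include <vector>

// Illustrative subset of patterns taken from the examples above;
// the list in Pipeline.cpp is longer.
const std::vector<std::string>& hallucinationPatterns() {
    static const std::vector<std::string> patterns = {
        "Thank you for watching", "Subscribe", "Like and comment",
        "谢谢观看", "订阅我的频道", "♪♪",
    };
    return patterns;
}

// True when the transcription is empty or contains any known pattern.
bool isHallucination(const std::string& text) {
    if (text.empty()) return true;
    for (const std::string& p : hallucinationPatterns()) {
        if (text.find(p) != std::string::npos) return true;
    }
    return false;
}
```

Substring matching is a blunt tool for short patterns: a phrase like "再见" can occur in genuine speech, so short entries are probably safer when compared against the whole segment rather than searched for inside it.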
Console-Only Build
A SecondVoice_Console target exists for headless testing:
- Uses main_console.cpp
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout
- Useful for debugging and automated testing
Development
Building in Debug Mode
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
cmake --build build
Running Tests
# TODO: Add tests
Troubleshooting
No audio capture
- Check microphone permissions
- Verify PortAudio is properly installed: pa_devs (if available)
- Try a different audio device in code
API errors
- Verify API keys in .env are correct
- Check internet connection
- Monitor API rate limits
Build errors
- Ensure vcpkg is properly set up
- Check all system dependencies are installed
- Try cmake --build build --clean-first
Project Structure
secondvoice/
├── src/
│ ├── main.cpp # Entry point, forces dedicated GPU (NVIDIA/AMD)
│ ├── core/
│ │ └── Pipeline.cpp # Audio→Transcription→Translation orchestration
│ ├── audio/
│ │ ├── AudioCapture.cpp # PortAudio + VAD segmentation
│ │ ├── AudioBuffer.cpp # Sample accumulation, WAV/Opus export
│ │ └── NoiseReducer.cpp # RNNoise (16→48→16 kHz)
│ ├── api/
│ │ ├── WhisperClient.cpp # OpenAI Whisper (multipart/form-data)
│ │ ├── ClaudeClient.cpp # Anthropic Claude (JSON)
│ │ └── WinHttpClient.cpp # Native Windows HTTP
│ ├── ui/
│ │ └── TranslationUI.cpp # ImGui interface + VAD controls
│ └── utils/
│ ├── Config.cpp # config.json + .env loader
│ └── ThreadSafeQueue.h # Lock-free audio queue
├── docs/ # Build guides
├── sessions/ # Session recordings + logs
├── recordings/ # Legacy recordings directory
├── denoised/ # Denoised audio outputs
├── config.json # Runtime configuration
├── .env # API keys (not committed)
├── CLAUDE.md # Development guide for Claude Code
├── PLAN_DEBUG.md # Active debugging plan
└── CMakeLists.txt # Build configuration
External Dependencies
Fetched via CMake FetchContent:
- ImGui v1.90.1 - UI framework
- Opus v1.5.2 - Audio encoding
- Ogg v1.3.6 - Container format
- RNNoise v0.1.1 - Neural network noise reduction
vcpkg Dependencies (x64-mingw-static triplet):
- portaudio - Cross-platform audio I/O
- nlohmann_json - JSON parsing
- glfw3 - Windowing/input
- glad - OpenGL loader
Roadmap
Phase 1 - MVP ✅ (Complete)
- ✅ Audio capture with VAD
- ✅ Noise reduction (RNNoise)
- ✅ Whisper API integration
- ✅ Claude API integration
- ✅ ImGui UI with runtime VAD adjustment
- ✅ Opus compression
- ✅ Hallucination filtering
- ✅ Session recording
Phase 2 - Debugging 🔄 (Current)
- 🔄 Session logging (JSON per segment)
- 🔄 Improved Whisper prompt engineering
- 🔄 VAD threshold optimization
- 🔄 Context propagation between segments
- ⬜ Automated testing with sample audio
Phase 3 - Enhancement
- ⬜ Auto-summary post-meeting (Claude analysis)
- ⬜ Full-text search (SQLite FTS5)
- ⬜ Semantic search (embeddings)
- ⬜ Speaker diarization
- ⬜ Replay mode with synced transcripts
- ⬜ Multi-language support extension
Development Documentation
- CLAUDE.md - Development guide for Claude Code AI assistant
- PLAN_DEBUG.md - Active debugging plan with identified issues and solutions
- WINDOWS_BUILD.md - Detailed Windows build instructions
- WINDOWS_MINGW.md - MinGW-specific build guide
- WINDOWS_QUICK_START.md - Quick start for Windows users
Contributing
This is a personal project built to solve a real need. Bug reports and suggestions welcome:
- Known issues: See PLAN_DEBUG.md for current debugging efforts
- Architecture: See CLAUDE.md for detailed system design
License
See LICENSE file.
Acknowledgments
- OpenAI Whisper for excellent Chinese transcription
- Anthropic Claude for context-aware translation
- RNNoise for neural network-based noise reduction
- ImGui for clean, immediate-mode UI