SecondVoice

Real-time Chinese to French translation system for live meetings.

Overview

SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.

Why This Project?

Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:

  • Business meetings with Chinese speakers
  • Family/administrative calls
  • Professional conferences
  • Any live Chinese conversation where real-time comprehension is needed

Status: MVP complete, actively being debugged and improved based on real-world usage.

Quick Start

# First-time setup
.\setup_mingw.bat

# Build
.\build_mingw.bat

# Run
cd build\mingw-Release
SecondVoice.exe

Requirements: .env file with OPENAI_API_KEY and ANTHROPIC_API_KEY, plus a working microphone.

See full setup instructions below for other platforms.

Features

  • 🎤 Real-time audio capture with Voice Activity Detection (VAD)
  • 🔇 Noise reduction using RNNoise neural network
  • 🗣️ Chinese speech-to-text via Whisper API (gpt-4o-mini-transcribe)
  • 🧠 Hallucination filtering - removes known Whisper artifacts
  • 🌐 Chinese to French translation via Claude AI (claude-haiku-4-20250514)
  • 🖥️ Clean ImGui interface with adjustable VAD thresholds
  • 💾 Full session recording with structured logging
  • 📊 Session archival - audio, transcripts, translations, and metadata
  • Opus compression - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
  • ⚙️ Configurable settings via config.json

Requirements

Cross-Platform Support

SecondVoice works on Windows and Linux.

Windows

  • Visual Studio 2019 or later (with C++ tools), or MinGW as used in the Quick Start (see WINDOWS_MINGW.md)
  • vcpkg package manager
  • See detailed guide: docs/build_windows.md

Linux

  • GCC/Clang with C++17 support
  • System dependencies: libasound2-dev, libgl1-mesa-dev, libglu1-mesa-dev
  • vcpkg package manager

vcpkg Installation

Linux:

git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
export VCPKG_ROOT=$(pwd)

Windows:

git clone https://github.com/microsoft/vcpkg.git C:\vcpkg
cd C:\vcpkg
.\bootstrap-vcpkg.bat
setx VCPKG_ROOT "C:\vcpkg"

Setup

  1. Clone the repository
git clone <repository-url>
cd secondvoice
  2. Create a .env file (copy from .env.example)

Linux:

cp .env.example .env
nano .env
# Add your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...

Windows:

copy .env.example .env
notepad .env
REM Add your API keys
  3. Build the project

Linux:

./build.sh
# Or manually:
# cmake -B build -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
# cmake --build build -j$(nproc)

Windows:

build.bat --release
REM Or see detailed guide: docs/build_windows.md

Usage

Linux:

cd build
./SecondVoice

Windows:

cd build\windows-release\Release
SecondVoice.exe

The application will:

  1. Open an ImGui window
  2. Start capturing audio from your microphone
  3. Display Chinese transcriptions and French translations in real-time
  4. Keep running until you click the STOP RECORDING button
  5. Save the full audio recording to recordings/recording_YYYYMMDD_HHMMSS.wav

Architecture

Audio Input (16kHz mono)
    ↓
Voice Activity Detection (VAD) - RMS + Peak thresholds
    ↓
Noise Reduction (RNNoise) - 16→48→16 kHz resampling
    ↓
Opus Encoding (24kbps OGG) - 46x compression
    ↓
Whisper API (gpt-4o-mini-transcribe) - Chinese STT
    ↓
Hallucination Filter - Remove known artifacts
    ↓
Claude API (claude-haiku-4) - Chinese → French translation
    ↓
ImGui UI Display + Session Logging
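
The VAD stage at the top of this pipeline can be sketched roughly as follows. This is a minimal illustration, not the code in AudioCapture.cpp: the struct name, the threshold values, and the choice to require both RMS and peak to pass are assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical thresholds; the real values come from config.json and the UI.
struct VadThresholds {
    float rms = 0.02f;
    float peak = 0.08f;
};

// Root-mean-square level of a frame of normalized samples in [-1, 1].
float frameRms(const std::vector<float>& frame) {
    if (frame.empty()) return 0.0f;
    double sumSquares = 0.0;
    for (float s : frame) sumSquares += static_cast<double>(s) * s;
    return static_cast<float>(std::sqrt(sumSquares / frame.size()));
}

// Largest absolute sample value in the frame.
float framePeak(const std::vector<float>& frame) {
    float peak = 0.0f;
    for (float s : frame) peak = std::max(peak, std::fabs(s));
    return peak;
}

// Assumption: a frame counts as speech only when both levels clear their thresholds.
bool isSpeech(const std::vector<float>& frame, const VadThresholds& t) {
    return frameRms(frame) >= t.rms && framePeak(frame) >= t.peak;
}
```

Requiring both measures is one way to reject short clicks (high peak, low RMS); an OR combination would be more sensitive to quiet speech but also to noise.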

Threading Model (3 threads)

  1. Audio Thread (Pipeline::audioThread)

    • PortAudio callback captures 16kHz mono audio
    • Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
    • Pushes speech chunks to processing queue
  2. Processing Thread (Pipeline::processingThread)

    • Consumes audio chunks from queue
    • Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
    • Encodes to Opus/OGG for bandwidth efficiency
    • Calls Whisper API for Chinese transcription
    • Filters known hallucinations (YouTube phrases, music markers, etc.)
    • Calls Claude API for French translation
    • Logs to session files
  3. UI Thread (main)

    • GLFW/ImGui rendering loop (must run on main thread)
    • Displays real-time transcription and translation
    • Allows runtime VAD threshold adjustment
    • Handles user controls (stop recording, etc.)
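
The hand-off between the audio thread and the processing thread could look something like the sketch below. It is a mutex-and-condition-variable stand-in written for illustration, not the contents of ThreadSafeQueue.h; the class name BlockingQueue and the close() method are assumptions.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    // Producer side (audio thread): enqueue one chunk and wake a consumer.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Consumer side (processing thread): blocks until an item arrives or the
    // queue is closed. An empty optional means "closed and fully drained".
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

    // Called once on shutdown so a waiting consumer can exit its loop.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};
```

With this shape, the processing thread can loop on `while (auto chunk = queue.pop())` and stop cleanly once the UI thread calls close().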

Core Components

Audio Processing:

  • AudioCapture.cpp - PortAudio wrapper with VAD-based segmentation
  • AudioBuffer.cpp - Accumulates samples, exports WAV/Opus
  • NoiseReducer.cpp - RNNoise denoising with resampling

API Clients:

  • WhisperClient.cpp - OpenAI Whisper API (multipart/form-data)
  • ClaudeClient.cpp - Anthropic Claude API (JSON)
  • WinHttpClient.cpp - Native Windows HTTP client (replaced libcurl)

Core Logic:

  • Pipeline.cpp - Orchestrates audio → transcription → translation flow
  • TranslationUI.cpp - ImGui interface with VAD controls

Utilities:

  • Config.cpp - Loads config.json + .env
  • ThreadSafeQueue.h - Lock-free queue for audio chunks

Known Issues & Active Debugging

Status: Real-world testing has identified issues with degraded audio conditions (see PLAN_DEBUG.md for details).

Current Problems

Based on transcript analysis from actual meetings (November 2025):

  1. VAD cutting speech too early

    • Voice Activity Detection triggers end-of-segment prematurely
    • Results in fragmented phrases ("我很。" → "Je suis.")
    • Hypothesis: Silence threshold too aggressive for multi-speaker scenarios
  2. Segments too short for context

    • Whisper receives insufficient audio context for accurate Chinese transcription
    • Single-word or two-word segments lack conversational context
    • Impact: Lower accuracy, especially with homonyms
  3. Ambient noise interpreted as speech

    • Background sounds trigger false VAD positives
    • Test transcript shows "太多声音了" (too much noise) being captured
    • Mitigation: RNNoise helps but not sufficient for very noisy environments
  4. Loss of inter-segment context

    • Each audio chunk processed independently
    • Whisper cannot use previous context for better transcription
    • Potential solution: Pass previous 2-3 transcriptions in prompt
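
The potential solution for problem 4 might be sketched like this: keep the last few transcriptions and append them to the base Whisper prompt. The class and method names are hypothetical, and this is not how Pipeline.cpp is known to do it.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Keeps the most recent transcriptions so they can be passed as prompt context.
class TranscriptContext {
public:
    explicit TranscriptContext(std::size_t maxSegments) : maxSegments_(maxSegments) {}

    // Remember a finished transcription, dropping the oldest when full.
    void add(const std::string& transcription) {
        if (transcription.empty() || maxSegments_ == 0) return;
        recent_.push_back(transcription);
        if (recent_.size() > maxSegments_) recent_.pop_front();
    }

    // Base prompt followed by the remembered segments, oldest first.
    std::string buildPrompt(const std::string& basePrompt) const {
        std::string prompt = basePrompt;
        for (const std::string& previous : recent_) {
            prompt += ' ';
            prompt += previous;
        }
        return prompt;
    }

private:
    std::size_t maxSegments_;
    std::deque<std::string> recent_;
};
```

A window of 2 or 3 segments keeps the prompt short; hallucinated segments should probably be filtered out before they are added, otherwise they may reinforce themselves.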

Test Conditions

Testing has been performed under deliberately degraded conditions to ensure robustness:

  • Multiple simultaneous speakers
  • Variable microphone distance
  • Variable volume levels
  • Fast-paced conversations
  • Low-quality microphone

These conditions are intentionally harsh to validate real-world meeting scenarios.

Debug Plan

See PLAN_DEBUG.md for:

  • Detailed session logging implementation (JSON per segment + metadata)
  • Improved Whisper prompt engineering
  • VAD threshold tuning recommendations
  • Context propagation strategies

Session Logging

Structure

sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json           # Session metadata
    ├── segments/
    │   ├── 001.json          # Segment: Chinese + French + metadata
    │   ├── 002.json
    │   └── ...
    └── transcript.txt         # Final export

Segment Format

{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}

Future enhancements: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
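
As a rough illustration of the layout above, the sketch below builds a zero-padded segment file name and a JSON body with the three fields shown. It is dependency-free on purpose; the project itself lists nlohmann_json, which would normally handle escaping. All function names here are hypothetical.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Zero-padded file name, e.g. id 1 -> "001.json".
std::string segmentFileName(int id) {
    std::ostringstream name;
    name << std::setw(3) << std::setfill('0') << id << ".json";
    return name.str();
}

// Escapes quotes, backslashes and control characters; UTF-8 bytes pass through.
std::string jsonEscape(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    std::ostringstream hex;
                    hex << "\\u" << std::hex << std::setw(4)
                        << std::setfill('0') << static_cast<int>(c);
                    out += hex.str();
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Serializes one segment using the same three fields as the format above.
std::string segmentToJson(int id, const std::string& chinese,
                          const std::string& french) {
    std::ostringstream json;
    json << "{\n"
         << "  \"id\": " << id << ",\n"
         << "  \"chinese\": \"" << jsonEscape(chinese) << "\",\n"
         << "  \"french\": \"" << jsonEscape(french) << "\"\n"
         << "}\n";
    return json.str();
}
```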

Configuration

config.json

{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_duration_seconds": 10
  },
  "whisper": {
    "model": "gpt-4o-mini-transcribe",
    "language": "zh",
    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
    "max_tokens": 1024
  }
}

.env

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
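
For readers curious how a .env file in this shape can be read, here is a minimal sketch. It assumes plain KEY=VALUE lines with optional # comments and no quoting; it is not the loader in Config.cpp, and the function names are made up for the example.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strips leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parses KEY=VALUE lines; blank lines, comments and lines without '=' are skipped.
std::map<std::string, std::string> parseEnv(std::istream& in) {
    std::map<std::string, std::string> values;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = trim(line.substr(0, eq));
        if (!key.empty()) values[key] = trim(line.substr(eq + 1));
    }
    return values;
}
```

Passing a std::ifstream opened on .env works the same way as the string stream used for testing, since both are std::istream.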

Cost Estimation

  • Whisper: ~$0.006/minute (~$0.36/hour)
  • Claude Haiku: ~$0.03-0.05/hour
  • Total: ~$0.40/hour of recording

Advanced Features

GPU Forcing (Hybrid Graphics Systems)

main.cpp exports symbols to force dedicated GPU on Optimus/PowerXpress systems:

  • NvOptimusEnablement - Forces NVIDIA GPU
  • AmdPowerXpressRequestHighPerformance - Forces AMD GPU

Critical for laptops with both integrated and dedicated GPUs.
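
The exports typically look like the sketch below. The two symbol names and values are the ones the NVIDIA and AMD drivers look for; the SV_EXPORT macro is a hypothetical helper added here so the snippet also compiles on non-Windows toolchains, and the exact form in main.cpp may differ.

```cpp
#if defined(_WIN32)
#define SV_EXPORT __declspec(dllexport)
#else
#define SV_EXPORT  // no-op elsewhere; the drivers only read these on Windows
#endif

extern "C" {
// Read by the NVIDIA driver on Optimus laptops; 1 requests the discrete GPU.
SV_EXPORT unsigned long NvOptimusEnablement = 0x00000001;
// Read by the AMD driver on PowerXpress laptops; 1 requests high performance.
SV_EXPORT int AmdPowerXpressRequestHighPerformance = 1;
}
```

The symbols must be exported from the executable itself (not from a DLL it loads), which is why they live in main.cpp.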

Hallucination Filtering

Pipeline.cpp maintains an extensive list (~65 patterns) of known Whisper hallucinations:

  • YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
  • Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
  • Music symbols: "♪♪", "🎵"
  • Silence markers: "...", "silence", "inaudible"

These are automatically filtered before translation to avoid wasting API calls.
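
A filter of this kind can be sketched as a substring check against a pattern list. The patterns below are a small subset taken from the examples above; the case-insensitive substring matching and the function names are assumptions, and the real list in Pipeline.cpp is much longer.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Lowercases ASCII letters only; multi-byte UTF-8 sequences are left untouched.
std::string toLowerAscii(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) {
        return c < 0x80 ? static_cast<char>(std::tolower(c)) : static_cast<char>(c);
    });
    return text;
}

// Returns true when the transcription contains a known hallucination phrase.
bool isLikelyHallucination(const std::string& transcription) {
    static const std::array<const char*, 6> kPatterns = {
        "thank you for watching", "subscribe", "like and comment",
        "谢谢观看", "订阅我的频道", "♪"};
    const std::string lowered = toLowerAscii(transcription);
    return std::any_of(kPatterns.begin(), kPatterns.end(), [&](const char* p) {
        return lowered.find(p) != std::string::npos;
    });
}
```

Substring matching is deliberately aggressive. Short everyday phrases such as "再见" are probably safer as whole-segment matches, since they can also occur in genuine speech.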

Console-Only Build

A SecondVoice_Console target exists for headless testing:

  • Uses main_console.cpp
  • No ImGui/GLFW dependencies
  • Outputs transcriptions to stdout
  • Useful for debugging and automated testing

Development

Building in Debug Mode

cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
cmake --build build

Running Tests

# TODO: Add tests

Troubleshooting

No audio capture

  • Check microphone permissions
  • Verify PortAudio is properly installed: pa_devs (if available)
  • Try a different audio device in code

API errors

  • Verify API keys in .env are correct
  • Check internet connection
  • Monitor API rate limits

Build errors

  • Ensure vcpkg is properly set up
  • Check all system dependencies are installed
  • Try cmake --build build --clean-first

Project Structure

secondvoice/
├── src/
│   ├── main.cpp                    # Entry point, forces NVIDIA GPU
│   ├── core/
│   │   └── Pipeline.cpp           # Audio→Transcription→Translation orchestration
│   ├── audio/
│   │   ├── AudioCapture.cpp       # PortAudio + VAD segmentation
│   │   ├── AudioBuffer.cpp        # Sample accumulation, WAV/Opus export
│   │   └── NoiseReducer.cpp       # RNNoise (16→48→16 kHz)
│   ├── api/
│   │   ├── WhisperClient.cpp      # OpenAI Whisper (multipart/form-data)
│   │   ├── ClaudeClient.cpp       # Anthropic Claude (JSON)
│   │   └── WinHttpClient.cpp      # Native Windows HTTP
│   ├── ui/
│   │   └── TranslationUI.cpp      # ImGui interface + VAD controls
│   └── utils/
│       ├── Config.cpp             # config.json + .env loader
│       └── ThreadSafeQueue.h      # Lock-free audio queue
├── docs/                          # Build guides
├── sessions/                      # Session recordings + logs
├── recordings/                    # Legacy recordings directory
├── denoised/                      # Denoised audio outputs
├── config.json                    # Runtime configuration
├── .env                           # API keys (not committed)
├── CLAUDE.md                      # Development guide for Claude Code
├── PLAN_DEBUG.md                  # Active debugging plan
└── CMakeLists.txt                 # Build configuration

External Dependencies

Fetched via CMake FetchContent:

  • ImGui v1.90.1 - UI framework
  • Opus v1.5.2 - Audio encoding
  • Ogg v1.3.6 - Container format
  • RNNoise v0.1.1 - Neural network noise reduction

vcpkg Dependencies (x64-mingw-static triplet):

  • portaudio - Cross-platform audio I/O
  • nlohmann_json - JSON parsing
  • glfw3 - Windowing/input
  • glad - OpenGL loader

Roadmap

Phase 1 - MVP (Complete)

  • Audio capture with VAD
  • Noise reduction (RNNoise)
  • Whisper API integration
  • Claude API integration
  • ImGui UI with runtime VAD adjustment
  • Opus compression
  • Hallucination filtering
  • Session recording

Phase 2 - Debugging 🔄 (Current)

  • 🔄 Session logging (JSON per segment)
  • 🔄 Improved Whisper prompt engineering
  • 🔄 VAD threshold optimization
  • 🔄 Context propagation between segments
  • Automated testing with sample audio

Phase 3 - Enhancement

  • Auto-summary post-meeting (Claude analysis)
  • Full-text search (SQLite FTS5)
  • Semantic search (embeddings)
  • Speaker diarization
  • Replay mode with synced transcripts
  • Multi-language support extension

Development Documentation

  • CLAUDE.md - Development guide for Claude Code AI assistant
  • PLAN_DEBUG.md - Active debugging plan with identified issues and solutions
  • WINDOWS_BUILD.md - Detailed Windows build instructions
  • WINDOWS_MINGW.md - MinGW-specific build guide
  • WINDOWS_QUICK_START.md - Quick start for Windows users

Contributing

This is a personal project built to solve a real need. Bug reports and suggestions welcome:

  • Known issues: See PLAN_DEBUG.md for current debugging efforts
  • Architecture: See CLAUDE.md for detailed system design

License

See LICENSE file.

Acknowledgments

  • OpenAI Whisper for excellent Chinese transcription
  • Anthropic Claude for context-aware translation
  • RNNoise for neural network-based noise reduction
  • ImGui for clean, immediate-mode UI