SecondVoice

Real-time Chinese to French translation system for live meetings.

Overview

SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI, all in real time. It is designed for following Chinese meetings, calls, and conversations on the fly.

Why This Project?

Built to solve a real need: understanding Chinese meetings in real time without constantly relying on bilingual support. Perfect for:

Business meetings with Chinese speakers
Family/administrative calls
Professional conferences
Any live Chinese conversation where real-time comprehension is needed

Status: MVP complete, actively being debugged and improved based on real-world usage.

Quick Start

Windows (MinGW) - Recommended

# First-time setup
.\setup_mingw.bat

# Build
.\build_mingw.bat

# Run
cd build\mingw-Release
.\SecondVoice.exe

Requirements: a .env file containing OPENAI_API_KEY and ANTHROPIC_API_KEY, plus a working microphone.

See the full setup instructions below for other platforms.

Features

🎤 Real-time audio capture with Voice Activity Detection (VAD)
🔇 Noise reduction using RNNoise neural network
🗣️ Chinese speech-to-text via Whisper API (gpt-4o-mini-transcribe)
🧠 Hallucination filtering - removes known Whisper artifacts
🌐 Chinese to French translation via Claude AI (claude-haiku-4-20250514)
🖥️ Clean ImGui interface with adjustable VAD thresholds
💾 Full session recording with structured logging
📊 Session archival - audio, transcripts, translations, and metadata
⚡ Opus compression - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
⚙️ Configurable settings via config.json

Requirements

Cross-Platform Support

SecondVoice works on Windows and Linux.

Windows

Visual Studio 2019 or later (with C++ tools)
vcpkg package manager
See detailed guide: docs/build_windows.md

Linux

GCC/Clang with C++17 support
System dependencies: libasound2-dev, libgl1-mesa-dev, libglu1-mesa-dev
vcpkg package manager

vcpkg Installation

Linux:

git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
export VCPKG_ROOT=$(pwd)

Windows:

git clone https://github.com/microsoft/vcpkg.git C:\vcpkg
cd C:\vcpkg
.\bootstrap-vcpkg.bat
setx VCPKG_ROOT "C:\vcpkg"

Setup

Clone the repository

git clone <repository-url>
cd secondvoice

Create .env file (copy from .env.example)

Linux:

cp .env.example .env
nano .env
# Add your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...

Windows:

copy .env.example .env
notepad .env
REM Add your API keys

Build the project

Linux:

./build.sh
# Or manually:
# cmake -B build -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
# cmake --build build -j$(nproc)

Windows:

build.bat --release
REM Or see detailed guide: docs/build_windows.md

Usage

Linux:

cd build
./SecondVoice

Windows:

cd build\windows-release\Release
SecondVoice.exe

Once started, the application:

Opens an ImGui window
Starts capturing audio from your microphone
Displays Chinese transcriptions and French translations in real time
Keeps recording until you click the STOP RECORDING button
Saves the full audio recording to recordings/recording_YYYYMMDD_HHMMSS.wav

Architecture

Audio Input (16kHz mono)
    ↓
Voice Activity Detection (VAD) - RMS + Peak thresholds
    ↓
Noise Reduction (RNNoise) - 16→48→16 kHz resampling
    ↓
Opus Encoding (24kbps OGG) - 46x compression
    ↓
Whisper API (gpt-4o-mini-transcribe) - Chinese STT
    ↓
Hallucination Filter - Remove known artifacts
    ↓
Claude API (claude-haiku-4) - Chinese → French translation
    ↓
ImGui UI Display + Session Logging
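
The VAD stage at the top of this pipeline can be illustrated with a minimal sketch. It assumes float samples in [-1, 1] and treats a frame as speech only when both the RMS and the peak level reach their thresholds. The struct and function names and the default threshold values are hypothetical; the actual logic in AudioCapture.cpp may combine the two measurements differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical default thresholds; the real values are tuned at runtime in the UI.
struct VadThresholds {
    float rms = 0.02f;
    float peak = 0.08f;
};

struct FrameLevels {
    float rms;
    float peak;
};

// Compute the RMS and the peak absolute amplitude of one frame.
FrameLevels measureFrame(const std::vector<float>& samples) {
    if (samples.empty()) return {0.0f, 0.0f};
    double sumSquares = 0.0;
    float peak = 0.0f;
    for (float s : samples) {
        sumSquares += static_cast<double>(s) * s;
        peak = std::max(peak, std::fabs(s));
    }
    float rms = static_cast<float>(std::sqrt(sumSquares / samples.size()));
    return {rms, peak};
}

// Assumption: a frame is speech only if BOTH levels reach their thresholds.
bool isSpeech(const std::vector<float>& samples, const VadThresholds& t) {
    FrameLevels levels = measureFrame(samples);
    return levels.rms >= t.rms && levels.peak >= t.peak;
}
```

Requiring both conditions is one way to reject short clicks (high peak, low RMS) and steady hum (moderate RMS, low peak); an OR combination would be more sensitive but noisier.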

Threading Model (3 threads)

Audio Thread (Pipeline::audioThread)
- PortAudio callback captures 16kHz mono audio
- Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
- Pushes speech chunks to processing queue
Processing Thread (Pipeline::processingThread)
- Consumes audio chunks from queue
- Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
- Encodes to Opus/OGG for bandwidth efficiency
- Calls Whisper API for Chinese transcription
- Filters known hallucinations (YouTube phrases, music markers, etc.)
- Calls Claude API for French translation
- Logs to session files
UI Thread (main)
- GLFW/ImGui rendering loop (must run on main thread)
- Displays real-time transcription and translation
- Allows runtime VAD threshold adjustment
- Handles user controls (stop recording, etc.)
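
The hand-off between the audio thread and the processing thread can be sketched as a producer/consumer queue. The version below is a simplified illustration that uses a mutex and a condition variable rather than a lock-free design, so it is not the project's ThreadSafeQueue.h; the names BlockingQueue and demoSamplesProcessed are hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available or close() has been called.
    // Returns std::nullopt once the queue is closed and fully drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Usage sketch: the caller plays the audio thread, the spawned thread plays
// the processing thread and counts the samples it receives.
std::size_t demoSamplesProcessed() {
    BlockingQueue<std::vector<float>> queue;
    std::size_t total = 0;
    std::thread consumer([&] {
        while (auto chunk = queue.pop()) total += chunk->size();
    });
    queue.push(std::vector<float>(160));
    queue.push(std::vector<float>(320));
    queue.close();      // lets the consumer exit after draining
    consumer.join();
    return total;
}
```

Closing the queue instead of pushing a sentinel chunk keeps shutdown explicit: the consumer finishes whatever is already queued and then leaves its loop.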

Core Components

Audio Processing:

AudioCapture.cpp - PortAudio wrapper with VAD-based segmentation
AudioBuffer.cpp - Accumulates samples, exports WAV/Opus
NoiseReducer.cpp - RNNoise denoising with resampling

API Clients:

WhisperClient.cpp - OpenAI Whisper API (multipart/form-data)
ClaudeClient.cpp - Anthropic Claude API (JSON)
WinHttpClient.cpp - Native Windows HTTP client (replaced libcurl)

Core Logic:

Pipeline.cpp - Orchestrates audio → transcription → translation flow
TranslationUI.cpp - ImGui interface with VAD controls

Utilities:

Config.cpp - Loads config.json + .env
ThreadSafeQueue.h - Lock-free queue for audio chunks

Known Issues & Active Debugging

Status: Real-world testing has identified issues with degraded audio conditions (see PLAN_DEBUG.md for details).

Current Problems

Based on transcript analysis from actual meetings (November 2025):

VAD cutting speech too early
- Voice Activity Detection triggers end-of-segment prematurely
- Results in fragmented phrases ("我很。" → "Je suis.")
- Hypothesis: Silence threshold too aggressive for multi-speaker scenarios
Segments too short for context
- Whisper receives insufficient audio context for accurate Chinese transcription
- Single-word or two-word segments lack conversational context
- Impact: Lower accuracy, especially with homonyms
Ambient noise interpreted as speech
- Background sounds trigger false VAD positives
- Test transcript shows "太多声音了" (too much noise) being captured
- Mitigation: RNNoise helps but not sufficient for very noisy environments
Loss of inter-segment context
- Each audio chunk processed independently
- Whisper cannot use previous context for better transcription
- Potential solution: Pass previous 2-3 transcriptions in prompt
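
The potential solution for the last problem, passing the previous two or three transcriptions in the prompt, could look roughly like the sketch below. It keeps a bounded history and appends it to a base prompt. The class name TranscriptContext and the space-separated prompt layout are assumptions for illustration, not the project's implementation.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Keeps the most recent transcriptions so they can be replayed as context.
class TranscriptContext {
public:
    explicit TranscriptContext(std::size_t maxSegments) : maxSegments_(maxSegments) {}

    // Record a finished transcription; empty results carry no context.
    void add(const std::string& transcription) {
        if (transcription.empty() || maxSegments_ == 0) return;
        recent_.push_back(transcription);
        if (recent_.size() > maxSegments_) recent_.pop_front();
    }

    // Base prompt followed by the retained transcriptions, oldest first.
    std::string buildPrompt(const std::string& basePrompt) const {
        std::string prompt = basePrompt;
        for (const std::string& text : recent_) {
            prompt += ' ';
            prompt += text;
        }
        return prompt;
    }

private:
    std::size_t maxSegments_;
    std::deque<std::string> recent_;
};
```

The history should only be fed with segments that survived hallucination filtering; otherwise a single bad segment could bias the following transcriptions.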

Test Conditions

Testing has been performed under deliberately degraded conditions to ensure robustness:

Multiple simultaneous speakers
Variable microphone distance
Variable volume levels
Fast-paced conversations
Low-quality microphone

These conditions are intentionally harsh to validate real-world meeting scenarios.

Debug Plan

See PLAN_DEBUG.md for:

Detailed session logging implementation (JSON per segment + metadata)
Improved Whisper prompt engineering
VAD threshold tuning recommendations
Context propagation strategies

Session Logging

Structure

sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json           # Session metadata
    ├── segments/
    │   ├── 001.json          # Segment: Chinese + French + metadata
    │   ├── 002.json
    │   └── ...
    └── transcript.txt         # Final export

Segment Format

{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}

Future enhancements: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
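
As an illustration of the layout and segment format above, the following sketch builds the zero-padded file name and the JSON body for one segment. It deliberately avoids nlohmann_json to stay self-contained, so the escaping is hand-written; the Segment struct and the function names are hypothetical.

```cpp
#include <cstdio>
#include <string>

struct Segment {
    int id;
    std::string chinese;
    std::string french;
};

// Escape what JSON requires inside a string: quotes, backslashes, control
// characters. UTF-8 bytes (e.g. Chinese text) pass through unchanged.
std::string escapeJson(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {
                    char buf[7];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
                break;
        }
    }
    return out;
}

// 1 -> "001.json", matching the segments/ directory listing.
std::string segmentFileName(int id) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%03d.json", id);
    return buf;
}

std::string toJson(const Segment& s) {
    return "{\n  \"id\": " + std::to_string(s.id) +
           ",\n  \"chinese\": \"" + escapeJson(s.chinese) +
           "\",\n  \"french\": \"" + escapeJson(s.french) + "\"\n}\n";
}
```

In the real project the same result is more safely obtained with nlohmann_json, which handles escaping and additional fields such as the planned timestamps and latencies.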

Configuration

config.json

{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_duration_seconds": 10
  },
  "whisper": {
    "model": "gpt-4o-mini-transcribe",
    "language": "zh",
    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
    "max_tokens": 1024
  }
}

.env

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
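
For reference, a .env file in this KEY=VALUE form can be read with a few lines of standard C++. The sketch below is an assumption about how such a loader might work, not a copy of Config.cpp: it skips blank lines and # comments, trims whitespace, and ignores lines without an equals sign. The function names are hypothetical.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Remove leading and trailing whitespace (including a stray '\r' from CRLF files).
std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    auto begin = s.find_first_not_of(whitespace);
    if (begin == std::string::npos) return "";
    auto end = s.find_last_not_of(whitespace);
    return s.substr(begin, end - begin + 1);
}

// Parse KEY=VALUE lines; later duplicates overwrite earlier ones.
std::map<std::string, std::string> parseEnv(std::istream& in) {
    std::map<std::string, std::string> values;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;   // blank line or comment
        auto eq = line.find('=');
        if (eq == std::string::npos) continue;          // not a KEY=VALUE line
        std::string key = trim(line.substr(0, eq));
        if (!key.empty()) values[key] = trim(line.substr(eq + 1));
    }
    return values;
}
```

Passing a std::ifstream opened on .env to parseEnv yields a map from which OPENAI_API_KEY and ANTHROPIC_API_KEY can be looked up; quoted values and inline comments are not handled in this sketch.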

Cost Estimation

Whisper: ~$0.006/minute (~$0.36/hour)
Claude Haiku: ~$0.03-0.05/hour
Total: ~$0.40/hour of recording

Advanced Features

GPU Forcing (Hybrid Graphics Systems)

main.cpp exports two symbols that ask the graphics driver to use the dedicated GPU on Optimus/PowerXpress systems:

NvOptimusEnablement - Forces NVIDIA GPU
AmdPowerXpressRequestHighPerformance - Forces AMD GPU

Critical for laptops with both integrated and dedicated GPUs.
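
A typical way to export these two symbols is sketched below. The symbol names and values follow the vendors' documented conventions; the SV_EXPORT macro is a hypothetical helper added here so the snippet also compiles on non-Windows toolchains, where the symbols have no effect.

```cpp
#include <cstdint>

#if defined(_WIN32)
#define SV_EXPORT __declspec(dllexport)
#else
#define SV_EXPORT  // The hint is only read by Windows drivers; no-op elsewhere.
#endif

extern "C" {
// NVIDIA Optimus: a value of 1 requests the high-performance GPU.
SV_EXPORT std::uint32_t NvOptimusEnablement = 0x00000001;

// AMD PowerXpress: a value of 1 requests the high-performance GPU.
SV_EXPORT int AmdPowerXpressRequestHighPerformance = 1;
}
```

The variables must be exported from the executable itself at global scope; the driver inspects them when the process starts, so they cannot be set later at runtime.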

Hallucination Filtering

Pipeline.cpp maintains an extensive list (~65 patterns) of known Whisper hallucinations:

YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
Music symbols: "♪♪", "🎵"
Silence markers: "...", "silence", "inaudible"

These are automatically filtered before translation to avoid wasting API calls.
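
A minimal sketch of such a filter is shown below, using only the example patterns listed above. It assumes whole-segment matching after trimming and ASCII lowercasing, so that a legitimate sentence merely containing "再见" is not discarded; the matching policy in Pipeline.cpp may differ, and the names here are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>
#include <string_view>

// A subset of known artifacts, taken from the examples above.
constexpr std::array<std::string_view, 10> kPatterns = {
    "thank you for watching", "subscribe", "like and comment",
    "谢谢观看", "再见", "订阅我的频道",
    "♪♪", "🎵", "silence", "inaudible",
};

// Trim ASCII whitespace and basic punctuation, then lowercase ASCII letters.
std::string normalize(std::string_view text) {
    constexpr std::string_view strip = " \t\r\n.!?,";
    auto begin = text.find_first_not_of(strip);
    if (begin == std::string_view::npos) return "";
    auto end = text.find_last_not_of(strip);
    std::string out(text.substr(begin, end - begin + 1));
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// Assumption: a segment is dropped only when the WHOLE normalized text is a
// known pattern, or when nothing remains after trimming (e.g. "...").
bool isHallucination(std::string_view text) {
    std::string normalized = normalize(text);
    if (normalized.empty()) return true;
    return std::find(kPatterns.begin(), kPatterns.end(),
                     std::string_view(normalized)) != kPatterns.end();
}
```

Calling isHallucination on each transcription before the Claude request lets the pipeline skip the translation call for filtered segments.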

Console-Only Build

A SecondVoice_Console target exists for headless testing:

Uses main_console.cpp
No ImGui/GLFW dependencies
Outputs transcriptions to stdout
Useful for debugging and automated testing

Development

Building in Debug Mode

cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
cmake --build build

Running Tests

# TODO: Add tests

Troubleshooting

No audio capture

Check microphone permissions
Verify that PortAudio is installed correctly, for example by running pa_devs (if available)
Try a different audio device in code

API errors

Verify API keys in .env are correct
Check internet connection
Monitor API rate limits

Build errors

Ensure vcpkg is properly set up
Check all system dependencies are installed
Try cmake --build build --clean-first

Project Structure

secondvoice/
├── src/
│   ├── main.cpp                    # Entry point, exports GPU-forcing symbols
│   ├── core/
│   │   └── Pipeline.cpp           # Audio→Transcription→Translation orchestration
│   ├── audio/
│   │   ├── AudioCapture.cpp       # PortAudio + VAD segmentation
│   │   ├── AudioBuffer.cpp        # Sample accumulation, WAV/Opus export
│   │   └── NoiseReducer.cpp       # RNNoise (16→48→16 kHz)
│   ├── api/
│   │   ├── WhisperClient.cpp      # OpenAI Whisper (multipart/form-data)
│   │   ├── ClaudeClient.cpp       # Anthropic Claude (JSON)
│   │   └── WinHttpClient.cpp      # Native Windows HTTP
│   ├── ui/
│   │   └── TranslationUI.cpp      # ImGui interface + VAD controls
│   └── utils/
│       ├── Config.cpp             # config.json + .env loader
│       └── ThreadSafeQueue.h      # Lock-free audio queue
├── docs/                          # Build guides
├── sessions/                      # Session recordings + logs
├── recordings/                    # Legacy recordings directory
├── denoised/                      # Denoised audio outputs
├── config.json                    # Runtime configuration
├── .env                           # API keys (not committed)
├── CLAUDE.md                      # Development guide for Claude Code
├── PLAN_DEBUG.md                  # Active debugging plan
└── CMakeLists.txt                 # Build configuration

External Dependencies

Fetched via CMake FetchContent:

ImGui v1.90.1 - UI framework
Opus v1.5.2 - Audio encoding
Ogg v1.3.6 - Container format
RNNoise v0.1.1 - Neural network noise reduction

vcpkg Dependencies (x64-mingw-static triplet):

portaudio - Cross-platform audio I/O
nlohmann_json - JSON parsing
glfw3 - Windowing/input
glad - OpenGL loader

Roadmap

Phase 1 - MVP ✅ (Complete)

✅ Audio capture with VAD
✅ Noise reduction (RNNoise)
✅ Whisper API integration
✅ Claude API integration
✅ ImGui UI with runtime VAD adjustment
✅ Opus compression
✅ Hallucination filtering
✅ Session recording

Phase 2 - Debugging 🔄 (Current)

🔄 Session logging (JSON per segment)
🔄 Improved Whisper prompt engineering
🔄 VAD threshold optimization
🔄 Context propagation between segments
⬜ Automated testing with sample audio

Phase 3 - Enhancement

⬜ Auto-summary post-meeting (Claude analysis)
⬜ Full-text search (SQLite FTS5)
⬜ Semantic search (embeddings)
⬜ Speaker diarization
⬜ Replay mode with synced transcripts
⬜ Multi-language support extension

Development Documentation

CLAUDE.md - Development guide for Claude Code AI assistant
PLAN_DEBUG.md - Active debugging plan with identified issues and solutions
WINDOWS_BUILD.md - Detailed Windows build instructions
WINDOWS_MINGW.md - MinGW-specific build guide
WINDOWS_QUICK_START.md - Quick start for Windows users

Contributing

This is a personal project built to solve a real need. Bug reports and suggestions are welcome:

Known issues: See PLAN_DEBUG.md for current debugging efforts
Architecture: See CLAUDE.md for detailed system design

License

See LICENSE file.

Acknowledgments

OpenAI Whisper for excellent Chinese transcription
Anthropic Claude for context-aware translation
RNNoise for neural network-based noise reduction
ImGui for clean, immediate-mode UI