# SecondVoice

Real-time Chinese-to-French translation system for live meetings.

## Overview

SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real time. It is designed for following Chinese meetings, calls, and conversations on the fly.

### Why This Project?

Built to solve a real need: understanding Chinese meetings in real time without constant reliance on bilingual support. Perfect for:
- Business meetings with Chinese speakers
- Family/administrative calls
- Professional conferences
- Any live Chinese conversation where real-time comprehension is needed

**Status**: MVP complete, actively being debugged and improved based on real-world usage.

## Quick Start

### Windows (MinGW) - Recommended

```batch
REM First-time setup
.\setup_mingw.bat

REM Build
.\build_mingw.bat

REM Run
cd build\mingw-Release
SecondVoice.exe
```

**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.

See full setup instructions below for other platforms.

## Features

- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
- 🔇 **Noise reduction** using RNNoise neural network
- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
- 🧠 **Hallucination filtering** - removes known Whisper artifacts
- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
- 💾 **Full session recording** with structured logging
- 📊 **Session archival** - audio, transcripts, translations, and metadata
- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
- ⚙️ **Configurable settings** via config.json

## Requirements

### Cross-Platform Support

SecondVoice works on **Windows** and **Linux**.

#### Windows
- Visual Studio 2019 or later (with C++ tools)
- vcpkg package manager
- See detailed guide: [docs/build_windows.md](docs/build_windows.md)

#### Linux
- GCC/Clang with C++17 support
- System dependencies: `libasound2-dev`, `libgl1-mesa-dev`, `libglu1-mesa-dev`
- vcpkg package manager

### vcpkg Installation

**Linux**:
```bash
git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
export VCPKG_ROOT=$(pwd)
```

**Windows**:
```powershell
git clone https://github.com/microsoft/vcpkg.git C:\vcpkg
cd C:\vcpkg
.\bootstrap-vcpkg.bat
setx VCPKG_ROOT "C:\vcpkg"
```

## Setup

1. **Clone the repository**

```bash
git clone <repository-url>
cd secondvoice
```

2. **Create `.env` file** (copy from `.env.example`)

**Linux**:
```bash
cp .env.example .env
nano .env
# Add your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
```

**Windows**:
```powershell
copy .env.example .env
notepad .env
# Add your API keys
```

3. **Build the project**

**Linux**:
```bash
./build.sh
# Or manually:
# cmake -B build -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
# cmake --build build -j$(nproc)
```

**Windows**:
```batch
build.bat --release
REM Or see detailed guide: docs/build_windows.md
```

## Usage

**Linux**:
```bash
cd build
./SecondVoice
```

**Windows**:
```batch
cd build\windows-release\Release
SecondVoice.exe
```

Once launched, the application:
1. Opens an ImGui window
2. Starts capturing audio from your microphone
3. Displays Chinese transcriptions and French translations in real time
4. Keeps running until you click the **STOP RECORDING** button
5. Saves the full audio recording to `recordings/recording_YYYYMMDD_HHMMSS.wav`
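
The timestamped recording filename above can be produced with the standard library alone. The following is a minimal sketch; `recordingPath` is a hypothetical helper name used for illustration and is not taken from the project's source.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: formats a broken-down time as
// recordings/recording_YYYYMMDD_HHMMSS.wav.
std::string recordingPath(const std::tm& when) {
    char stamp[32];
    // strftime returns 0 if the buffer is too small; 32 bytes is ample here.
    if (std::strftime(stamp, sizeof(stamp), "%Y%m%d_%H%M%S", &when) == 0) {
        return "recordings/recording_unknown.wav";
    }
    return std::string("recordings/recording_") + stamp + ".wav";
}
```

Passing the result of `std::localtime` for the current time would give a name for the session being saved.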

## Architecture

```
Audio Input (16kHz mono)
    ↓
Voice Activity Detection (VAD) - RMS + Peak thresholds
    ↓
Noise Reduction (RNNoise) - 16→48→16 kHz resampling
    ↓
Opus Encoding (24kbps OGG) - 46x compression
    ↓
Whisper API (gpt-4o-mini-transcribe) - Chinese STT
    ↓
Hallucination Filter - Remove known artifacts
    ↓
Claude API (claude-haiku-4) - Chinese → French translation
    ↓
ImGui UI Display + Session Logging
```
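
The VAD stage in the diagram combines an RMS threshold with a peak threshold. As a rough illustration of that idea, the sketch below classifies one frame of normalized samples. The struct and function names are hypothetical, and requiring both thresholds to be exceeded is an assumption; the project may combine them differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical threshold container; names and values are illustrative,
// not the project's defaults.
struct VadThresholds {
    float rms;   // minimum root-mean-square level
    float peak;  // minimum absolute peak level
};

// Returns true when a frame of normalized samples (range [-1, 1]) reaches
// both the RMS threshold and the peak threshold (assumed combination rule).
bool isSpeechFrame(const std::vector<float>& frame, const VadThresholds& t) {
    if (frame.empty()) return false;
    double sumSquares = 0.0;
    float peak = 0.0f;
    for (float s : frame) {
        sumSquares += static_cast<double>(s) * s;
        peak = std::max(peak, std::fabs(s));
    }
    const float rms = static_cast<float>(std::sqrt(sumSquares / frame.size()));
    return rms >= t.rms && peak >= t.peak;
}
```

RMS tracks sustained energy while the peak check reacts to short transients, so adjusting the two values trades missed quiet speech against false triggers on noise.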

### Threading Model (3 threads)

1. **Audio Thread** (`Pipeline::audioThread`)
   - PortAudio callback captures 16kHz mono audio
   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
   - Pushes speech chunks to processing queue

2. **Processing Thread** (`Pipeline::processingThread`)
   - Consumes audio chunks from queue
   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
   - Encodes to Opus/OGG for bandwidth efficiency
   - Calls Whisper API for Chinese transcription
   - Filters known hallucinations (YouTube phrases, music markers, etc.)
   - Calls Claude API for French translation
   - Logs to session files

3. **UI Thread** (main)
   - GLFW/ImGui rendering loop (must run on main thread)
   - Displays real-time transcription and translation
   - Allows runtime VAD threshold adjustment
   - Handles user controls (stop recording, etc.)
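
The hand-off between the audio thread and the processing thread is a producer/consumer queue. The sketch below shows the general pattern with a mutex and a condition variable (C++17). It is deliberately simpler than a lock-free design and is not the project's `ThreadSafeQueue.h`; the class name `BlockingQueue` and its `close()` method are assumptions made for illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// Illustrative blocking queue: one thread pushes audio chunks,
// another pops them. Mutex-based for clarity, not lock-free.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available or close() has been called.
    // Returns std::nullopt once the queue is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

    // Wakes all waiting consumers so they can exit after draining.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};
```

A consumer loop would typically call `pop()` until it returns `std::nullopt`, which gives the processing thread a clean way to stop when recording ends.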

### Core Components

**Audio Processing**:
- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
- `NoiseReducer.cpp` - RNNoise denoising with resampling
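
RNNoise operates on 48 kHz audio, which is why the 16 kHz capture is resampled by a factor of three in each direction. The sketch below illustrates only that 3x round trip, using linear interpolation and simple averaging. Both function names are hypothetical, and this crude filtering is an assumption for illustration; a production resampler would normally use a proper low-pass filter.

```cpp
#include <cstddef>
#include <vector>

// Upsamples by 3 (16 kHz -> 48 kHz) using linear interpolation
// between neighbouring input samples.
std::vector<float> upsample3x(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() * 3);
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float a = in[i];
        const float b = (i + 1 < in.size()) ? in[i + 1] : in[i];
        out.push_back(a);
        out.push_back(a + (b - a) / 3.0f);
        out.push_back(a + 2.0f * (b - a) / 3.0f);
    }
    return out;
}

// Downsamples by 3 (48 kHz -> 16 kHz) by averaging each group of
// three samples; a trailing partial group is dropped.
std::vector<float> downsample3x(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() / 3);
    for (std::size_t i = 0; i + 2 < in.size(); i += 3) {
        out.push_back((in[i] + in[i + 1] + in[i + 2]) / 3.0f);
    }
    return out;
}
```

A 10 ms frame of 160 samples at 16 kHz becomes 480 samples at 48 kHz, which matches the frame size RNNoise processes.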

**API Clients**:
- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
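
To make the multipart/form-data note concrete, here is a minimal sketch of how a transcription request body could be assembled. The helper names, the `segment.ogg` filename, and the choice of fields are assumptions for illustration; `file`, `model`, and `language` are form fields accepted by OpenAI's transcription endpoint, but this is not the project's `WhisperClient.cpp`.

```cpp
#include <string>

// Hypothetical helper: appends one plain-text form field to a multipart body.
void appendField(std::string& body, const std::string& boundary,
                 const std::string& name, const std::string& value) {
    body += "--" + boundary + "\r\n";
    body += "Content-Disposition: form-data; name=\"" + name + "\"\r\n\r\n";
    body += value + "\r\n";
}

// Builds a multipart/form-data body carrying the model, the language,
// and the encoded audio bytes as a file part.
std::string buildTranscriptionBody(const std::string& boundary,
                                   const std::string& oggBytes,
                                   const std::string& model,
                                   const std::string& language) {
    std::string body;
    appendField(body, boundary, "model", model);
    appendField(body, boundary, "language", language);
    body += "--" + boundary + "\r\n";
    body += "Content-Disposition: form-data; name=\"file\"; "
            "filename=\"segment.ogg\"\r\n";
    body += "Content-Type: audio/ogg\r\n\r\n";
    body += oggBytes + "\r\n";
    body += "--" + boundary + "--\r\n";  // closing delimiter
    return body;
}
```

The request itself would also need a `Content-Type: multipart/form-data; boundary=...` header that uses the same boundary string, and the boundary must not occur inside the audio bytes.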

**Core Logic**:
- `Pipeline.cpp` - Orchestrates audio → transcription → translation flow
- `TranslationUI.cpp` - ImGui interface with VAD controls

**Utilities**:
- `Config.cpp` - Loads config.json + .env
- `ThreadSafeQueue.h` - Lock-free queue for audio chunks

## Known Issues & Active Debugging

**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).

### Current Problems

Based on transcript analysis from actual meetings (November 2025):

1. **VAD cutting speech too early**
   - Voice Activity Detection triggers end-of-segment prematurely
   - Results in fragmented phrases ("我很。" → "Je suis.")
   - **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios

2. **Segments too short for context**
   - Whisper receives insufficient audio context for accurate Chinese transcription
   - Single-word or two-word segments lack conversational context
   - **Impact**: Lower accuracy, especially with homonyms

3. **Ambient noise interpreted as speech**
   - Background sounds trigger false VAD positives
   - Test transcript shows "太多声音了" (too much noise) being captured
   - **Mitigation**: RNNoise helps but not sufficient for very noisy environments

4. **Loss of inter-segment context**
   - Each audio chunk processed independently
   - Whisper cannot use previous context for better transcription
   - **Potential solution**: Pass previous 2-3 transcriptions in prompt
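
The potential solution in item 4 could look roughly like the sketch below: keep the last few transcriptions and append them to the base prompt for the next request. `TranscriptContext` is a hypothetical class, not existing project code, and whether this actually improves accuracy would need to be tested.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical helper that remembers the most recent transcriptions
// and joins them into a prompt for the next transcription request.
class TranscriptContext {
public:
    explicit TranscriptContext(std::size_t maxSegments)
        : maxSegments_(maxSegments) {}

    // Stores a transcription, discarding the oldest when the limit is reached.
    void add(const std::string& transcript) {
        if (transcript.empty() || maxSegments_ == 0) return;
        recent_.push_back(transcript);
        if (recent_.size() > maxSegments_) recent_.pop_front();
    }

    // Returns the base prompt followed by the remembered segments, oldest first.
    std::string buildPrompt(const std::string& basePrompt) const {
        std::string prompt = basePrompt;
        for (const std::string& segment : recent_) {
            if (!prompt.empty()) prompt += ' ';
            prompt += segment;
        }
        return prompt;
    }

private:
    std::size_t maxSegments_;
    std::deque<std::string> recent_;
};
```

With `TranscriptContext ctx(3)`, calling `ctx.add(...)` after each successful transcription keeps at most the last three segments available to `buildPrompt`.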

### Test Conditions

Testing has been performed under **deliberately degraded conditions** to ensure robustness:
- Multiple simultaneous speakers
- Variable microphone distance
- Variable volume levels
- Fast-paced conversations
- Low-quality microphone

These conditions are intentionally harsh to validate real-world meeting scenarios.

### Debug Plan

See `PLAN_DEBUG.md` for:
- Detailed session logging implementation (JSON per segment + metadata)
- Improved Whisper prompt engineering
- VAD threshold tuning recommendations
- Context propagation strategies
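
One common way to address premature end-of-segment decisions is a hangover period: the segment ends only after several consecutive silent frames rather than on the first one. The sketch below illustrates that idea. It is an assumption about a possible tuning approach, not a description of what `PLAN_DEBUG.md` recommends, and `HangoverSegmenter` is a hypothetical name.

```cpp
#include <cstddef>

// Hypothetical segmenter state: a segment ends only after
// `hangoverFrames` consecutive non-speech frames.
class HangoverSegmenter {
public:
    explicit HangoverSegmenter(std::size_t hangoverFrames)
        : hangoverFrames_(hangoverFrames) {}

    // Feed one VAD decision per frame.
    // Returns true exactly on the frame where a segment ends.
    bool update(bool frameIsSpeech) {
        if (frameIsSpeech) {
            inSegment_ = true;
            silentFrames_ = 0;  // any speech resets the silence counter
            return false;
        }
        if (!inSegment_) return false;
        if (++silentFrames_ >= hangoverFrames_) {
            inSegment_ = false;
            silentFrames_ = 0;
            return true;
        }
        return false;
    }

    bool inSegment() const { return inSegment_; }

private:
    std::size_t hangoverFrames_;
    std::size_t silentFrames_ = 0;
    bool inSegment_ = false;
};
```

A longer hangover tolerates pauses between speakers at the cost of added latency before each segment is sent for transcription.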

## Session Logging

### Structure

```
sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json           # Session metadata
    ├── segments/
    │   ├── 001.json          # Segment: Chinese + French + metadata
    │   ├── 002.json
    │   └── ...
    └── transcript.txt         # Final export
```

### Segment Format

```json
{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}
```

**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
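
As an illustration of the layout above, the following sketch builds a zero-padded segment filename and the matching JSON text. The function names are hypothetical. The project lists nlohmann_json as a dependency, which would normally handle serialization; this version uses only the standard library so that it stands alone.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: zero-pads the segment id, e.g. 1 -> "001.json".
std::string segmentFileName(int id) {
    char name[32];
    std::snprintf(name, sizeof(name), "%03d.json", id);
    return name;
}

// Escapes characters that must not appear raw inside a JSON string.
// Bytes >= 0x20 (including UTF-8 sequences) pass through unchanged.
std::string escapeJson(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x",
                                  static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Produces the three-field segment object shown in the format above.
std::string segmentJson(int id, const std::string& chinese,
                        const std::string& french) {
    return "{\n  \"id\": " + std::to_string(id) +
           ",\n  \"chinese\": \"" + escapeJson(chinese) +
           "\",\n  \"french\": \"" + escapeJson(french) + "\"\n}\n";
}
```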

## Configuration

### config.json

```json
{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_duration_seconds": 10
  },
  "whisper": {
    "model": "gpt-4o-mini-transcribe",
    "language": "zh",
    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
    "max_tokens": 1024
  }
}
```

### .env

```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

## Cost Estimation

- **Whisper**: ~$0.006/minute (~$0.36/hour)
- **Claude Haiku**: ~$0.03-0.05/hour
- **Total**: ~$0.40/hour of recording

## Advanced Features

### GPU Forcing (Hybrid Graphics Systems)

`main.cpp` exports symbols to force dedicated GPU on Optimus/PowerXpress systems:
- `NvOptimusEnablement` - Forces NVIDIA GPU
- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU

Critical for laptops with both integrated and dedicated GPUs.
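
For reference, exporting these two symbols typically looks like the sketch below. The `SV_EXPORT` macro is a hypothetical name introduced so the snippet also compiles on non-Windows platforms, where the symbols have no effect; the project's `main.cpp` may declare them differently.

```cpp
// Export macro: __declspec(dllexport) on Windows, empty elsewhere so the
// file still compiles on other platforms.
#ifdef _WIN32
#define SV_EXPORT __declspec(dllexport)
#else
#define SV_EXPORT
#endif

extern "C" {
// Read by the NVIDIA Optimus driver; 1 requests the dedicated GPU.
SV_EXPORT unsigned long NvOptimusEnablement = 0x00000001;
// Read by the AMD PowerXpress driver; 1 requests high performance.
SV_EXPORT int AmdPowerXpressRequestHighPerformance = 1;
}
```

The drivers look these symbols up in the executable when the process starts, so they must be exported from the main program rather than set at runtime.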

### Hallucination Filtering

`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
- Music symbols: "♪♪", "🎵"
- Silence markers: "...", "silence", "inaudible"

These are automatically filtered before translation to avoid wasting API calls.
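
A filter of this kind can be sketched as a lookup against a fixed pattern list. The version below uses a handful of the phrases quoted above and substring matching; both the function name and the matching rule are assumptions, and the real list in `Pipeline.cpp` is longer.

```cpp
#include <string>

// Returns true when the transcript contains one of a few known
// hallucination phrases (illustrative subset, substring match).
bool isKnownHallucination(const std::string& transcript) {
    static const char* const patterns[] = {
        "Thank you for watching", "Subscribe", "谢谢观看",
        "订阅我的频道", "♪♪", "inaudible"};
    for (const char* pattern : patterns) {
        if (transcript.find(pattern) != std::string::npos) return true;
    }
    return false;
}
```

Substring matching is aggressive: a short everyday phrase such as "再见" would also discard legitimate speech that contains it, so exact matching may be the safer choice for short patterns.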

### Console-Only Build

A `SecondVoice_Console` target exists for headless testing:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout
- Useful for debugging and automated testing

## Development

### Building in Debug Mode

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
cmake --build build
```

### Running Tests

```bash
# TODO: Add tests
```

## Troubleshooting

### No audio capture

- Check microphone permissions
- Verify PortAudio is properly installed: `pa_devs` (if available)
- Try a different audio device in the code

### API errors

- Verify API keys in `.env` are correct
- Check internet connection
- Monitor API rate limits

### Build errors

- Ensure vcpkg is properly set up
- Check all system dependencies are installed
- Try `cmake --build build --clean-first`

## Project Structure

```
secondvoice/
├── src/
│   ├── main.cpp                    # Entry point, forces dedicated GPU (NVIDIA/AMD)
│   ├── core/
│   │   └── Pipeline.cpp           # Audio→Transcription→Translation orchestration
│   ├── audio/
│   │   ├── AudioCapture.cpp       # PortAudio + VAD segmentation
│   │   ├── AudioBuffer.cpp        # Sample accumulation, WAV/Opus export
│   │   └── NoiseReducer.cpp       # RNNoise (16→48→16 kHz)
│   ├── api/
│   │   ├── WhisperClient.cpp      # OpenAI Whisper (multipart/form-data)
│   │   ├── ClaudeClient.cpp       # Anthropic Claude (JSON)
│   │   └── WinHttpClient.cpp      # Native Windows HTTP
│   ├── ui/
│   │   └── TranslationUI.cpp      # ImGui interface + VAD controls
│   └── utils/
│       ├── Config.cpp             # config.json + .env loader
│       └── ThreadSafeQueue.h      # Lock-free audio queue
├── docs/                          # Build guides
├── sessions/                      # Session recordings + logs
├── recordings/                    # Legacy recordings directory
├── denoised/                      # Denoised audio outputs
├── config.json                    # Runtime configuration
├── .env                           # API keys (not committed)
├── CLAUDE.md                      # Development guide for Claude Code
├── PLAN_DEBUG.md                  # Active debugging plan
└── CMakeLists.txt                 # Build configuration
```

### External Dependencies

**Fetched via CMake FetchContent**:
- ImGui v1.90.1 - UI framework
- Opus v1.5.2 - Audio encoding
- Ogg v1.3.6 - Container format
- RNNoise v0.1.1 - Neural network noise reduction

**vcpkg Dependencies** (x64-mingw-static triplet):
- portaudio - Cross-platform audio I/O
- nlohmann_json - JSON parsing
- glfw3 - Windowing/input
- glad - OpenGL loader

## Roadmap

### Phase 1 - MVP ✅ (Complete)
- ✅ Audio capture with VAD
- ✅ Noise reduction (RNNoise)
- ✅ Whisper API integration
- ✅ Claude API integration
- ✅ ImGui UI with runtime VAD adjustment
- ✅ Opus compression
- ✅ Hallucination filtering
- ✅ Session recording

### Phase 2 - Debugging 🔄 (Current)
- 🔄 Session logging (JSON per segment)
- 🔄 Improved Whisper prompt engineering
- 🔄 VAD threshold optimization
- 🔄 Context propagation between segments
- ⬜ Automated testing with sample audio

### Phase 3 - Enhancement
- ⬜ Auto-summary post-meeting (Claude analysis)
- ⬜ Full-text search (SQLite FTS5)
- ⬜ Semantic search (embeddings)
- ⬜ Speaker diarization
- ⬜ Replay mode with synced transcripts
- ⬜ Multi-language support extension

## Development Documentation

- **CLAUDE.md** - Development guide for Claude Code AI assistant
- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
- **WINDOWS_BUILD.md** - Detailed Windows build instructions
- **WINDOWS_MINGW.md** - MinGW-specific build guide
- **WINDOWS_QUICK_START.md** - Quick start for Windows users

## Contributing

This is a personal project built to solve a real need. Bug reports and suggestions are welcome:

- **Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
- **Architecture**: See `CLAUDE.md` for detailed system design

## License

See LICENSE file.

## Acknowledgments

- OpenAI Whisper for excellent Chinese transcription
- Anthropic Claude for context-aware translation
- RNNoise for neural network-based noise reduction
- ImGui for clean, immediate-mode UI