
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI.

Build Commands

Windows (MinGW) - Primary Build

# First-time setup
.\setup_mingw.bat

# Build (Release)
.\build_mingw.bat

# Build (Debug)
.\build_mingw.bat --debug

# Clean rebuild
.\build_mingw.bat --clean

Running the Application

cd build\mingw-Release
SecondVoice.exe

Requires:

  • .env file with OPENAI_API_KEY and ANTHROPIC_API_KEY
  • config.json (copied automatically during build)
  • A microphone
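
A minimal `.env` might look like the following — the key values are placeholders, not real credentials:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```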

Architecture

Threading Model (3 threads)

  1. Audio Thread (Pipeline::audioThread) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), pushes chunks to queue
  2. Processing Thread (Pipeline::processingThread) - Consumes audio chunks, calls Whisper API for transcription, then Claude API for translation
  3. UI Thread (main) - GLFW/ImGui rendering loop, must run on main thread
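
The hand-off between the audio and processing threads can be sketched with a small blocking queue. This is a mutex-based stand-in for illustration only; the repository's actual `ThreadSafeQueue.h` is described as lock-free, so the real implementation differs.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Minimal blocking queue: the audio thread pushes chunks, the
// processing thread pops them. Illustrative stand-in only — the
// real src/utils/ThreadSafeQueue.h is described as lock-free.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Blocks until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::queue<T> items_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// The payload type is an assumption; the real queue carries audio chunks.
using AudioChunk = std::vector<float>;
```

The audio callback would `push` completed VAD segments, and the processing thread would block on `pop()` before calling the transcription API.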

Core Components

src/
├── main.cpp              # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp     # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp  # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp   # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp  # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp  # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp  # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp        # Loads config.json + .env
    └── ThreadSafeQueue.h # Lock-free queue for audio chunks

Key Data Flow

  1. AudioCapture detects speech via VAD thresholds (RMS + Peak)
  2. Speech segments sent to NoiseReducer (RNNoise) for denoising
  3. Denoised audio encoded to Opus/OGG for bandwidth efficiency (46x reduction)
  4. WhisperClient sends audio to gpt-4o-mini-transcribe
  5. Pipeline filters Whisper hallucinations (known garbage phrases)
  6. ClaudeClient translates Chinese text to French
  7. TranslationUI displays accumulated transcription/translation
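
Steps 4–6 make up one iteration of the processing thread. The sketch below shows that shape with the API clients stubbed as callables; the struct and its member names are hypothetical, not the actual `Pipeline` interface.

```cpp
#include <functional>
#include <string>

// Hypothetical shape of one processing-thread iteration. The real
// Pipeline uses WhisperClient and ClaudeClient; here they are
// stubbed as std::function slots for illustration.
struct ProcessingStep {
    std::function<std::string(const std::string&)> transcribe;   // Opus audio -> Chinese text
    std::function<bool(const std::string&)> isHallucination;     // known-garbage filter
    std::function<std::string(const std::string&)> translate;    // Chinese -> French

    // Returns an empty string when the filter drops the segment.
    std::string run(const std::string& audioChunk) const {
        const std::string chinese = transcribe(audioChunk);
        if (chinese.empty() || isHallucination(chinese)) {
            return "";
        }
        return translate(chinese);
    }
};
```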

External Dependencies (fetched via CMake FetchContent)

  • ImGui v1.90.1 - UI framework
  • Opus v1.5.2 - Audio encoding
  • Ogg v1.3.6 - Container format
  • RNNoise v0.1.1 - Neural network noise reduction

vcpkg Dependencies (x64-mingw-static triplet)

  • portaudio, nlohmann_json, glfw3, glad

Configuration

config.json

  • audio.sample_rate: 16000 Hz (required for Whisper)
  • whisper.model: "gpt-4o-mini-transcribe"
  • whisper.language: "zh" (Chinese)
  • claude.model: "claude-3-5-haiku-20241022"
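
Based on the keys above, a `config.json` might look roughly like this (the exact nesting and any additional fields are assumptions):

```json
{
  "audio": { "sample_rate": 16000 },
  "whisper": { "model": "gpt-4o-mini-transcribe", "language": "zh" },
  "claude": { "model": "claude-3-5-haiku-20241022" }
}
```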

VAD Tuning

VAD thresholds are adjustable in the UI at runtime:

  • RMS threshold: speech detection sensitivity
  • Peak threshold: transient/click rejection
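
One plausible way the two thresholds interact is sketched below. The threshold values and the exact combination of RMS and peak are assumptions for illustration, not the repository's actual `AudioCapture` logic.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Frame-level VAD check using RMS energy plus a peak guard.
// Illustrative only: how AudioCapture actually combines the two
// thresholds may differ.
bool isSpeechFrame(const float* samples, std::size_t count,
                   float rmsThreshold, float peakThreshold) {
    if (count == 0) return false;
    float sumSquares = 0.0f;
    float peak = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sumSquares += samples[i] * samples[i];
        peak = std::max(peak, std::fabs(samples[i]));
    }
    const float rms = std::sqrt(sumSquares / static_cast<float>(count));
    // Sustained energy above the RMS threshold counts as speech;
    // an isolated spike (high peak, low RMS) is rejected as a click.
    return rms >= rmsThreshold && peak <= peakThreshold;
}
```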

Important Implementation Details

Whisper Hallucination Filtering

Pipeline.cpp contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out:

  • "Thank you for watching", "Subscribe", YouTube phrases
  • Chinese video endings: "谢谢观看", "再见", "订阅"
  • Music symbols, silence markers
  • Single-word interjections
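
A compressed sketch of that filter, using only the example phrases listed above (the real list in Pipeline.cpp is far longer):

```cpp
#include <array>
#include <string>

// Substring match against a few known hallucination phrases.
// The entries here are just the examples mentioned in this document;
// Pipeline.cpp maintains a much longer list.
bool isKnownHallucination(const std::string& text) {
    static const std::array<const char*, 4> garbage = {
        "Thank you for watching", "谢谢观看", "再见", "订阅"
    };
    for (const char* phrase : garbage) {
        if (text.find(phrase) != std::string::npos) return true;
    }
    return false;
}
```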

GPU Forcing (Optimus/PowerXpress)

main.cpp exports NvOptimusEnablement and AmdPowerXpressRequestHighPerformance symbols to force dedicated GPU usage on hybrid graphics systems.
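
The two exports follow a standard idiom for these drivers; the snippet below is guarded so it also compiles off Windows, and the helper function is added here only to make the snippet checkable:

```cpp
#ifdef _WIN32
// Exported symbols that the NVIDIA Optimus and AMD PowerXpress
// drivers look for in the executable to select the dedicated GPU.
extern "C" {
    __declspec(dllexport) unsigned long NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
#endif

// True when the export symbols are compiled in (Windows builds only).
// This helper is illustrative and not part of the real main.cpp.
constexpr bool gpuExportsEnabled() {
#ifdef _WIN32
    return true;
#else
    return false;
#endif
}
```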

Audio Processing Pipeline

  1. 16kHz mono input → Upsampled to 48kHz for RNNoise
  2. RNNoise denoising (480-sample frames at 48kHz)
  3. Transient suppression (claps, clicks, pops)
  4. Downsampled back to 16kHz
  5. Opus encoding at 24kbps for API transmission
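
Steps 1 and 4 are simple rate conversions. A naive linear-interpolation 3x upsampler conveys the idea, though a real pipeline at RNNoise quality would use a band-limited (polyphase or windowed-sinc) resampler; this sketch is not the repository's implementation.

```cpp
#include <cstddef>
#include <vector>

// Naive 16 kHz -> 48 kHz upsampling by linear interpolation (factor 3).
// Illustration only; production resamplers are band-limited to avoid
// imaging artifacts. The last input sample is simply held.
std::vector<float> upsample3x(const std::vector<float>& input) {
    std::vector<float> output;
    output.reserve(input.size() * 3);
    for (std::size_t i = 0; i < input.size(); ++i) {
        const float a = input[i];
        const float b = (i + 1 < input.size()) ? input[i + 1] : a;
        output.push_back(a);
        output.push_back(a + (b - a) * (1.0f / 3.0f));
        output.push_back(a + (b - a) * (2.0f / 3.0f));
    }
    return output;
}
```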

Console-Only Build

A SecondVoice_Console target exists for testing without the UI:

  • Uses main_console.cpp
  • No ImGui/GLFW dependencies
  • Outputs transcriptions to stdout