
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI.

Build Commands

Windows (MinGW) - Primary Build

# First-time setup
.\setup_mingw.bat

# Build (Release)
.\build_mingw.bat

# Build (Debug)
.\build_mingw.bat --debug

# Clean rebuild
.\build_mingw.bat --clean

Running the Application

cd build\mingw-Release
SecondVoice.exe

Requires:

  • .env file with OPENAI_API_KEY and ANTHROPIC_API_KEY
  • config.json (copied automatically during build)
  • A microphone
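
A minimal `.env` might look like the following — the key values are placeholders, not real credentials:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```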

Architecture

Threading Model (3 threads)

  1. Audio Thread (Pipeline::audioThread) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), pushes chunks to queue
  2. Processing Thread (Pipeline::processingThread) - Consumes audio chunks, calls Whisper API for transcription, then Claude API for translation
  3. UI Thread (main) - GLFW/ImGui rendering loop, must run on main thread
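
The hand-off between the audio and processing threads can be sketched with a small blocking queue. This is a mutex-based stand-in for illustration only; the repository's actual `ThreadSafeQueue.h` is described as lock-free, so the real implementation differs.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Minimal blocking queue: the audio thread pushes chunks, the
// processing thread pops them. Illustrative stand-in only — the
// real src/utils/ThreadSafeQueue.h is described as lock-free.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Blocks until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::queue<T> items_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// The payload type is an assumption; the real queue carries audio chunks.
using AudioChunk = std::vector<float>;
```

The audio callback would `push` completed VAD segments, and the processing thread would block on `pop()` before calling the transcription API.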

Core Components

src/
├── main.cpp              # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp     # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp  # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp   # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp  # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp  # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp  # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp        # Loads config.json + .env
    └── ThreadSafeQueue.h # Lock-free queue for audio chunks

Key Data Flow

  1. AudioCapture detects speech via VAD thresholds (RMS + Peak)
  2. Speech segments sent to NoiseReducer (RNNoise) for denoising
  3. Denoised audio encoded to Opus/OGG for bandwidth efficiency (46x reduction)
  4. WhisperClient sends audio to gpt-4o-mini-transcribe
  5. Pipeline filters Whisper hallucinations (known garbage phrases)
  6. ClaudeClient translates Chinese text to French
  7. TranslationUI displays accumulated transcription/translation
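
Steps 4–6 make up one iteration of the processing thread. The sketch below shows that shape with the API clients stubbed as callables; the struct and its member names are hypothetical, not the actual `Pipeline` interface.

```cpp
#include <functional>
#include <string>

// Hypothetical shape of one processing-thread iteration. The real
// Pipeline uses WhisperClient and ClaudeClient; here they are
// stubbed as std::function slots for illustration.
struct ProcessingStep {
    std::function<std::string(const std::string&)> transcribe;   // Opus audio -> Chinese text
    std::function<bool(const std::string&)> isHallucination;     // known-garbage filter
    std::function<std::string(const std::string&)> translate;    // Chinese -> French

    // Returns an empty string when the filter drops the segment.
    std::string run(const std::string& audioChunk) const {
        const std::string chinese = transcribe(audioChunk);
        if (chinese.empty() || isHallucination(chinese)) {
            return "";
        }
        return translate(chinese);
    }
};
```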

External Dependencies (fetched via CMake FetchContent)

  • ImGui v1.90.1 - UI framework
  • Opus v1.5.2 - Audio encoding
  • Ogg v1.3.6 - Container format
  • RNNoise v0.1.1 - Neural network noise reduction

vcpkg Dependencies (x64-mingw-static triplet)

  • portaudio, nlohmann_json, glfw3, glad

Configuration

config.json

  • audio.sample_rate: 16000 Hz (required for Whisper)
  • whisper.model: "gpt-4o-mini-transcribe"
  • whisper.language: "zh" (Chinese)
  • claude.model: "claude-3-5-haiku-20241022"
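
Based on the keys above, a `config.json` might look roughly like this (the exact nesting and any additional fields are assumptions):

```json
{
  "audio": { "sample_rate": 16000 },
  "whisper": { "model": "gpt-4o-mini-transcribe", "language": "zh" },
  "claude": { "model": "claude-3-5-haiku-20241022" }
}
```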

VAD Tuning

VAD thresholds are adjustable in the UI at runtime:

  • RMS threshold: speech detection sensitivity
  • Peak threshold: transient/click rejection
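
One plausible way the two thresholds interact is sketched below. The threshold values and the exact combination of RMS and peak are assumptions for illustration, not the repository's actual `AudioCapture` logic.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Frame-level VAD check using RMS energy plus a peak guard.
// Illustrative only: how AudioCapture actually combines the two
// thresholds may differ.
bool isSpeechFrame(const float* samples, std::size_t count,
                   float rmsThreshold, float peakThreshold) {
    if (count == 0) return false;
    float sumSquares = 0.0f;
    float peak = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sumSquares += samples[i] * samples[i];
        peak = std::max(peak, std::fabs(samples[i]));
    }
    const float rms = std::sqrt(sumSquares / static_cast<float>(count));
    // Sustained energy above the RMS threshold counts as speech;
    // an isolated spike (high peak, low RMS) is rejected as a click.
    return rms >= rmsThreshold && peak <= peakThreshold;
}
```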

Important Implementation Details

Whisper Hallucination Filtering

Pipeline.cpp contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out:

  • "Thank you for watching", "Subscribe", YouTube phrases
  • Chinese video endings: "谢谢观看", "再见", "订阅"
  • Music symbols, silence markers
  • Single-word interjections
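
A compressed sketch of that filter, using only the example phrases listed above (the real list in Pipeline.cpp is far longer):

```cpp
#include <array>
#include <string>

// Substring match against a few known hallucination phrases.
// The entries here are just the examples mentioned in this document;
// Pipeline.cpp maintains a much longer list.
bool isKnownHallucination(const std::string& text) {
    static const std::array<const char*, 4> garbage = {
        "Thank you for watching", "谢谢观看", "再见", "订阅"
    };
    for (const char* phrase : garbage) {
        if (text.find(phrase) != std::string::npos) return true;
    }
    return false;
}
```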

GPU Forcing (Optimus/PowerXpress)

main.cpp exports NvOptimusEnablement and AmdPowerXpressRequestHighPerformance symbols to force dedicated GPU usage on hybrid graphics systems.
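
The two exports follow a standard idiom for these drivers; the snippet below is guarded so it also compiles off Windows, and the helper function is added here only to make the snippet checkable:

```cpp
#ifdef _WIN32
// Exported symbols that the NVIDIA Optimus and AMD PowerXpress
// drivers look for in the executable to select the dedicated GPU.
extern "C" {
    __declspec(dllexport) unsigned long NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
#endif

// True when the export symbols are compiled in (Windows builds only).
// This helper is illustrative and not part of the real main.cpp.
constexpr bool gpuExportsEnabled() {
#ifdef _WIN32
    return true;
#else
    return false;
#endif
}
```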

Audio Processing Pipeline

  1. 16kHz mono input → Upsampled to 48kHz for RNNoise
  2. RNNoise denoising (480-sample frames at 48kHz)
  3. Transient suppression (claps, clicks, pops)
  4. Downsampled back to 16kHz
  5. Opus encoding at 24kbps for API transmission
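
Steps 1 and 4 are simple rate conversions. A naive linear-interpolation 3x upsampler conveys the idea, though a real pipeline at RNNoise quality would use a band-limited (polyphase or windowed-sinc) resampler; this sketch is not the repository's implementation.

```cpp
#include <cstddef>
#include <vector>

// Naive 16 kHz -> 48 kHz upsampling by linear interpolation (factor 3).
// Illustration only; production resamplers are band-limited to avoid
// imaging artifacts. The last input sample is simply held.
std::vector<float> upsample3x(const std::vector<float>& input) {
    std::vector<float> output;
    output.reserve(input.size() * 3);
    for (std::size_t i = 0; i < input.size(); ++i) {
        const float a = input[i];
        const float b = (i + 1 < input.size()) ? input[i + 1] : a;
        output.push_back(a);
        output.push_back(a + (b - a) * (1.0f / 3.0f));
        output.push_back(a + (b - a) * (2.0f / 3.0f));
    }
    return output;
}
```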

Console-Only Build

A SecondVoice_Console target exists for testing without the UI:

  • Uses main_console.cpp
  • No ImGui/GLFW dependencies
  • Outputs transcriptions to stdout