- Add CLAUDE.md with project documentation for AI assistance
- Add PLAN_DEBUG.md with debugging hypotheses and logging plan
- Update Pipeline and TranslationUI with transcript export functionality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI.

## Build Commands

### Windows (MinGW) - Primary Build

```batch
# First-time setup
.\setup_mingw.bat

# Build (Release)
.\build_mingw.bat

# Build (Debug)
.\build_mingw.bat --debug

# Clean rebuild
.\build_mingw.bat --clean
```

### Running the Application

```batch
cd build\mingw-Release
SecondVoice.exe
```

Requires:
- `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`
- `config.json` (copied automatically during build)
- A microphone

## Architecture

### Threading Model (3 threads)
1. **Audio Thread** (`Pipeline::audioThread`) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), and pushes chunks to a queue (see the capture sketch below)
2. **Processing Thread** (`Pipeline::processingThread`) - Consumes audio chunks, calls the Whisper API for transcription, then the Claude API for translation
3. **UI Thread** (main) - GLFW/ImGui rendering loop; must run on the main thread

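As a point of orientation, here is a minimal sketch of the producer side of this model: a PortAudio input callback pushing captured frames into a shared queue. The PortAudio calls are the library's real API, but `AudioChunk`, `ChunkQueue`, and the stream parameters are illustrative stand-ins. In particular, the mutex/condition-variable queue below is only a readable substitute for the project's `ThreadSafeQueue.h`, which is described as lock-free; blocking inside an audio callback is exactly what a lock-free queue avoids.

```cpp
// Sketch only: PortAudio capture callback feeding a shared queue.
// ChunkQueue is a mutex-based stand-in for utils/ThreadSafeQueue.h.
#include <portaudio.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct AudioChunk { std::vector<float> samples; };   // hypothetical chunk type

class ChunkQueue {
public:
    void push(AudioChunk chunk) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(chunk)); }
        cv_.notify_one();
    }
    AudioChunk pop() {                               // blocks until a chunk arrives
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        AudioChunk chunk = std::move(queue_.front());
        queue_.pop();
        return chunk;
    }
private:
    std::queue<AudioChunk> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// PortAudio input callback: copy the incoming mono samples and enqueue them.
static int captureCallback(const void* input, void* /*output*/,
                           unsigned long frameCount,
                           const PaStreamCallbackTimeInfo* /*timeInfo*/,
                           PaStreamCallbackFlags /*statusFlags*/,
                           void* userData) {
    auto* queue = static_cast<ChunkQueue*>(userData);
    const float* in = static_cast<const float*>(input);
    if (in != nullptr) {
        queue->push(AudioChunk{std::vector<float>(in, in + frameCount)});
    }
    return paContinue;
}

int main() {
    ChunkQueue queue;
    Pa_Initialize();
    PaStream* stream = nullptr;
    // 1 input channel, 0 outputs, float32 samples, 16 kHz, 256 frames per buffer (illustrative).
    Pa_OpenDefaultStream(&stream, 1, 0, paFloat32, 16000, 256, captureCallback, &queue);
    Pa_StartStream(stream);
    // ... a processing thread would call queue.pop() in a loop here ...
    Pa_StopStream(stream);
    Pa_CloseStream(stream);
    Pa_Terminate();
}
```
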
### Core Components

```
src/
├── main.cpp                 # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp        # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp     # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp      # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp     # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp    # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp     # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp    # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp     # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp           # Loads config.json + .env
    └── ThreadSafeQueue.h    # Lock-free queue for audio chunks
```

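The `ui/TranslationUI.cpp` entry corresponds to the UI thread's GLFW/ImGui render loop described above. As a minimal, self-contained illustration of that loop with runtime VAD-threshold sliders, see the sketch below; the window layout, widget names, and default values are assumptions, not the project's actual UI code.

```cpp
// Minimal GLFW/ImGui loop with VAD-threshold sliders, in the spirit of
// TranslationUI. Layout and values are illustrative only.
#include <GLFW/glfw3.h>   // GLFW pulls in the system OpenGL header for glClear
#include "imgui.h"
#include "imgui_impl_glfw.h"
#include "imgui_impl_opengl3.h"

int main() {
    glfwInit();
    GLFWwindow* window = glfwCreateWindow(900, 600, "SecondVoice (sketch)", nullptr, nullptr);
    glfwMakeContextCurrent(window);

    ImGui::CreateContext();
    ImGui_ImplGlfw_InitForOpenGL(window, true);
    ImGui_ImplOpenGL3_Init("#version 130");

    float rmsThreshold  = 0.01f;   // illustrative defaults
    float peakThreshold = 0.50f;

    while (!glfwWindowShouldClose(window)) {
        glfwPollEvents();
        ImGui_ImplOpenGL3_NewFrame();
        ImGui_ImplGlfw_NewFrame();
        ImGui::NewFrame();

        ImGui::Begin("VAD");
        ImGui::SliderFloat("RMS threshold", &rmsThreshold, 0.0f, 0.1f);   // speech sensitivity
        ImGui::SliderFloat("Peak threshold", &peakThreshold, 0.0f, 1.0f); // click rejection
        ImGui::End();

        ImGui::Render();
        glClear(GL_COLOR_BUFFER_BIT);
        ImGui_ImplOpenGL3_RenderDrawData(ImGui::GetDrawData());
        glfwSwapBuffers(window);
    }

    ImGui_ImplOpenGL3_Shutdown();
    ImGui_ImplGlfw_Shutdown();
    ImGui::DestroyContext();
    glfwDestroyWindow(window);
    glfwTerminate();
}
```
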
### Key Data Flow
1. `AudioCapture` detects speech via VAD thresholds (RMS + Peak)
2. Speech segments are sent to `NoiseReducer` (RNNoise) for denoising
3. Denoised audio is encoded to Opus/OGG for bandwidth efficiency (46x reduction)
4. `WhisperClient` sends the audio to gpt-4o-mini-transcribe
5. `Pipeline` filters out Whisper hallucinations (known garbage phrases)
6. `ClaudeClient` translates the Chinese text to French
7. `TranslationUI` displays the accumulated transcription/translation (a simplified sketch of this loop follows the list)

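To make the ordering concrete, below is a deliberately simplified sketch of what the processing thread's loop might look like. Every function it calls is a hypothetical placeholder declared only for this example; the real interfaces live in `Pipeline.cpp` and the classes listed above, so treat this as flow pseudocode in C++ syntax rather than the project's API.

```cpp
// Simplified flow sketch. All functions below are hypothetical placeholders
// standing in for methods of Pipeline, NoiseReducer, WhisperClient, etc.
#include <cstdint>
#include <string>
#include <vector>

struct AudioChunk { std::vector<float> samples; };

AudioChunk           popChunk();                                   // blocks on the audio queue (steps 1-2)
std::vector<float>   denoise(const AudioChunk& chunk);             // RNNoise pass
std::vector<uint8_t> encodeOpusOgg(const std::vector<float>& pcm); // step 3: Opus/OGG
std::string          transcribe(const std::vector<uint8_t>& ogg);  // step 4: Whisper
bool                 isHallucination(const std::string& text);     // step 5: garbage-phrase filter
std::string          translate(const std::string& chinese);        // step 6: Claude
void                 appendResult(const std::string& zh, const std::string& fr); // step 7: UI
bool                 running();

void processingLoop() {
    while (running()) {
        AudioChunk chunk = popChunk();
        std::vector<uint8_t> ogg = encodeOpusOgg(denoise(chunk));
        std::string chinese = transcribe(ogg);
        if (chinese.empty() || isHallucination(chinese))
            continue;                                              // drop hallucinated output
        appendResult(chinese, translate(chinese));
    }
}
```
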
### External Dependencies (fetched via CMake FetchContent)
- **ImGui** v1.90.1 - UI framework
- **Opus** v1.5.2 - Audio encoding
- **Ogg** v1.3.6 - Container format
- **RNNoise** v0.1.1 - Neural network noise reduction

### vcpkg Dependencies (x64-mingw-static triplet)
- portaudio, nlohmann_json, glfw3, glad

## Configuration

### config.json
- `audio.sample_rate`: 16000 Hz (required for Whisper)
- `whisper.model`: "gpt-4o-mini-transcribe"
- `whisper.language`: "zh" (Chinese)
- `claude.model`: "claude-3-5-haiku-20241022"

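As a rough illustration of how `Config.cpp` might combine its two sources, the sketch below reads the settings above from `config.json` with nlohmann::json (a listed vcpkg dependency) and pulls the API keys out of a `KEY=value` style `.env` file. The `AppConfig` struct, function name, and the absence of error handling are assumptions made for brevity, not the real `Config.cpp` interface.

```cpp
// Hedged sketch of a config loader: config.json via nlohmann::json plus a
// KEY=value .env file. Struct and function names are illustrative only.
#include <fstream>
#include <string>
#include <nlohmann/json.hpp>

struct AppConfig {
    int         sampleRate   = 16000;                  // audio.sample_rate
    std::string whisperModel = "gpt-4o-mini-transcribe";
    std::string whisperLang  = "zh";
    std::string claudeModel  = "claude-3-5-haiku-20241022";
    std::string openaiKey;                             // from .env: OPENAI_API_KEY
    std::string anthropicKey;                          // from .env: ANTHROPIC_API_KEY
};

AppConfig loadConfig(const std::string& jsonPath, const std::string& envPath) {
    AppConfig cfg;

    std::ifstream jsonFile(jsonPath);
    nlohmann::json j = nlohmann::json::parse(jsonFile);
    cfg.sampleRate   = j["audio"]["sample_rate"].get<int>();
    cfg.whisperModel = j["whisper"]["model"].get<std::string>();
    cfg.whisperLang  = j["whisper"]["language"].get<std::string>();
    cfg.claudeModel  = j["claude"]["model"].get<std::string>();

    std::ifstream envFile(envPath);                    // simple KEY=value parser
    std::string line;
    while (std::getline(envFile, line)) {
        auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        std::string key   = line.substr(0, eq);
        std::string value = line.substr(eq + 1);
        if (key == "OPENAI_API_KEY")    cfg.openaiKey    = value;
        if (key == "ANTHROPIC_API_KEY") cfg.anthropicKey = value;
    }
    return cfg;
}
```
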
### VAD Tuning
VAD thresholds are adjustable in the UI at runtime (see the measurement sketch below):
- RMS threshold: speech detection sensitivity
- Peak threshold: transient/click rejection

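The decision rule and default values used by `AudioCapture` are not documented here, but the underlying per-frame measurement is standard: compute the RMS level and the peak absolute sample, then gate on both. The sketch below shows one plausible way to combine the two thresholds; the actual rule, hysteresis, and numbers in the project may differ.

```cpp
// Per-frame RMS + peak measurement for VAD. The combination rule and threshold
// values are assumptions for illustration, not AudioCapture.cpp's actual logic.
#include <algorithm>
#include <cmath>
#include <vector>

struct VadThresholds {
    float rms  = 0.01f;   // speech-detection sensitivity (illustrative default)
    float peak = 0.50f;   // transient/click rejection level (illustrative default)
};

bool isSpeechFrame(const std::vector<float>& frame, const VadThresholds& t) {
    if (frame.empty()) return false;

    float sumSquares = 0.0f;
    float peak = 0.0f;
    for (float s : frame) {
        sumSquares += s * s;
        peak = std::max(peak, std::fabs(s));
    }
    const float rms = std::sqrt(sumSquares / static_cast<float>(frame.size()));

    const bool loudEnough = rms > t.rms;                         // sustained energy: likely speech
    const bool transient  = peak > t.peak && rms < 2.0f * t.rms; // spiky but low energy: likely a click
    return loudEnough && !transient;
}
```
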
## Important Implementation Details

### Whisper Hallucination Filtering
`Pipeline.cpp` contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out (a reduced sketch follows the list):
- "Thank you for watching", "Subscribe", and similar YouTube phrases
- Chinese video sign-offs: "谢谢观看" ("thanks for watching"), "再见" ("goodbye"), "订阅" ("subscribe")
- Music symbols, silence markers
- Single-word interjections

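The reduced sketch below filters only the examples mentioned above; the real list in `Pipeline.cpp` is far longer, and the matching rules here (plain substring search plus a short-text cutoff) are assumptions.

```cpp
// Reduced sketch of a hallucination filter; the real phrase list in Pipeline.cpp
// is far longer, and the matching/length rules here are assumptions.
#include <array>
#include <string>

bool isLikelyHallucination(const std::string& text) {
    // Known garbage phrases Whisper tends to emit on silence or background noise.
    static const std::array<const char*, 6> kGarbagePhrases = {
        "Thank you for watching",
        "Subscribe",
        "谢谢观看",   // "thanks for watching"
        "再见",       // "goodbye"
        "订阅",       // "subscribe"
        "♪",          // music symbol
    };
    for (const char* phrase : kGarbagePhrases) {
        if (text.find(phrase) != std::string::npos)
            return true;
    }
    // Near-empty results / single interjections are also dropped
    // (3 bytes ~= one UTF-8 Chinese character; the cutoff is an assumption).
    return text.size() <= 3;
}
```
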
### GPU Forcing (Optimus/PowerXpress)
`main.cpp` exports `NvOptimusEnablement` and `AmdPowerXpressRequestHighPerformance` symbols to force dedicated GPU usage on hybrid graphics systems.

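These are the vendor-documented exports for NVIDIA Optimus and AMD PowerXpress; in source form they look roughly like this (exact placement in `main.cpp` may differ):

```cpp
// Exported globals that ask hybrid-graphics drivers to run this executable on
// the dedicated GPU. Symbol names and values are the vendor-documented ones.
#include <windows.h>

extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;         // NVIDIA Optimus
    __declspec(dllexport) int   AmdPowerXpressRequestHighPerformance = 1; // AMD PowerXpress
}
```
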
### Audio Processing Pipeline
1. 16kHz mono input → upsampled to 48kHz for RNNoise
2. RNNoise denoising (480-sample frames at 48kHz)
3. Transient suppression (claps, clicks, pops)
4. Downsampled back to 16kHz
5. Opus encoding at 24kbps for API transmission (see the sketch below)

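A compact sketch of the two library calls at the heart of steps 2 and 5 follows. It assumes the commonly used RNNoise and Opus C APIs: depending on the vendored RNNoise version, `rnnoise_create` may or may not take a model argument, and the Opus header may live at `<opus/opus.h>` instead. Resampling (steps 1 and 4), transient suppression, and the OGG container packaging are deliberately omitted.

```cpp
// Sketch of steps 2 and 5: RNNoise on 480-sample frames at 48 kHz, then Opus at
// 24 kbps on 20 ms frames at 16 kHz. Resampling, transient suppression, and OGG
// paging are omitted; buffer handling is simplified for illustration.
#include <vector>
#include <rnnoise.h>
#include <opus.h>   // may be <opus/opus.h> depending on how Opus is installed

void denoise48k(std::vector<float>& pcm48k) {
    // Note: some RNNoise versions declare rnnoise_create(void) instead.
    DenoiseState* st = rnnoise_create(nullptr);
    const size_t kFrame = 480;                             // 10 ms at 48 kHz
    for (size_t i = 0; i + kFrame <= pcm48k.size(); i += kFrame) {
        // RNNoise expects samples scaled like 16-bit PCM (roughly -32768..32767).
        rnnoise_process_frame(st, &pcm48k[i], &pcm48k[i]); // in-place denoise
    }
    rnnoise_destroy(st);
}

std::vector<unsigned char> encodeOpus16k(const std::vector<float>& pcm16k) {
    int err = 0;
    OpusEncoder* enc = opus_encoder_create(16000, 1, OPUS_APPLICATION_VOIP, &err);
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));        // 24 kbps, as noted above

    std::vector<unsigned char> out;
    const size_t kFrame = 320;                             // 20 ms at 16 kHz
    unsigned char packet[4000];
    for (size_t i = 0; i + kFrame <= pcm16k.size(); i += kFrame) {
        opus_int32 n = opus_encode_float(enc, &pcm16k[i], static_cast<int>(kFrame),
                                         packet, sizeof(packet));
        if (n > 0) out.insert(out.end(), packet, packet + n); // raw packets; OGG container omitted
    }
    opus_encoder_destroy(enc);
    return out;
}
```
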
## Console-Only Build

A `SecondVoice_Console` target exists for testing without the UI:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout