

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech with OpenAI's `gpt-4o-mini-transcribe` model via the Whisper transcription API, and translates the result into French using Anthropic's Claude API.
## Build Commands
### Windows (MinGW) - Primary Build
```batch
:: First-time setup
.\setup_mingw.bat

:: Build (Release)
.\build_mingw.bat

:: Build (Debug)
.\build_mingw.bat --debug

:: Clean rebuild
.\build_mingw.bat --clean
```
### Running the Application
```batch
cd build\mingw-Release
SecondVoice.exe
```
Requires:
- `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`
- `config.json` (copied automatically during build)
- A microphone
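A minimal `.env`, assuming the common `KEY=value` dotenv format (the key names come from this section; the values below are placeholders):

```
# .env — placed next to SecondVoice.exe (placeholder values)
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
```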
## Architecture
### Threading Model (3 threads)
1. **Audio Thread** (`Pipeline::audioThread`) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), pushes chunks to queue
2. **Processing Thread** (`Pipeline::processingThread`) - Consumes audio chunks, calls Whisper API for transcription, then Claude API for translation
3. **UI Thread** (main) - GLFW/ImGui rendering loop, must run on main thread
### Core Components
```
src/
├── main.cpp                 # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp        # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp     # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp      # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp     # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp    # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp     # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp    # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp     # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp           # Loads config.json + .env
    └── ThreadSafeQueue.h    # Lock-free queue for audio chunks
```
### Key Data Flow
1. `AudioCapture` detects speech via VAD thresholds (RMS + Peak)
2. Speech segments sent to `NoiseReducer` (RNNoise) for denoising
3. Denoised audio encoded to Opus/OGG for bandwidth efficiency (46x reduction)
4. `WhisperClient` sends audio to gpt-4o-mini-transcribe
5. `Pipeline` filters Whisper hallucinations (known garbage phrases)
6. `ClaudeClient` translates Chinese text to French
7. `TranslationUI` displays accumulated transcription/translation
### External Dependencies (fetched via CMake FetchContent)
- **ImGui** v1.90.1 - UI framework
- **Opus** v1.5.2 - Audio encoding
- **Ogg** v1.3.6 - Container format
- **RNNoise** v0.1.1 - Neural network noise reduction
### vcpkg Dependencies (x64-mingw-static triplet)
- portaudio, nlohmann_json, glfw3, glad
## Configuration
### config.json
- `audio.sample_rate`: 16000 Hz (required for Whisper)
- `whisper.model`: "gpt-4o-mini-transcribe"
- `whisper.language`: "zh" (Chinese)
- `claude.model`: "claude-3-5-haiku-20241022"
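Assembled from the keys above, a minimal `config.json` might look like this (the nesting shape and absence of other keys are assumptions; only the listed keys and values come from this document):

```json
{
  "audio": { "sample_rate": 16000 },
  "whisper": { "model": "gpt-4o-mini-transcribe", "language": "zh" },
  "claude": { "model": "claude-3-5-haiku-20241022" }
}
```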
### VAD Tuning
VAD thresholds are adjustable in the UI at runtime:
- RMS threshold: speech detection sensitivity
- Peak threshold: transient/click rejection
## Important Implementation Details
### Whisper Hallucination Filtering
`Pipeline.cpp` contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out:
- "Thank you for watching", "Subscribe", YouTube phrases
- Chinese video endings: "谢谢观看", "再见", "订阅"
- Music symbols, silence markers
- Single-word interjections
### GPU Forcing (Optimus/PowerXpress)
`main.cpp` exports `NvOptimusEnablement` and `AmdPowerXpressRequestHighPerformance` symbols to force dedicated GPU usage on hybrid graphics systems.
### Audio Processing Pipeline
1. 16kHz mono input → Upsampled to 48kHz for RNNoise
2. RNNoise denoising (480-sample frames at 48kHz)
3. Transient suppression (claps, clicks, pops)
4. Downsampled back to 16kHz
5. Opus encoding at 24kbps for API transmission
## Console-Only Build
A `SecondVoice_Console` target exists for testing without UI:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout