refactor: Improve VAD trailing silence detection and update docs
- Replace hang time logic with a consecutive silence frame counter for more precise speech-end detection
- Update the Whisper prompt to use previous context for better transcription coherence
- Expand README with a comprehensive feature list, architecture details, debugging status, and session logging structure
- Add a troubleshooting section for real-world testing conditions and known issues
parent a28bb89913
commit db0f8e5990

README.md: 334 changes
@@ -4,16 +4,50 @@ Real-time Chinese to French translation system for live meetings.
 
 ## Overview
 
-SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API, and translates it to French using Claude AI in real-time. Perfect for understanding Chinese meetings on the fly.
+SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
+
+### Why This Project?
+
+Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:
+
+- Business meetings with Chinese speakers
+- Family/administrative calls
+- Professional conferences
+- Any live Chinese conversation where real-time comprehension is needed
+
+**Status**: MVP complete, actively being debugged and improved based on real-world usage.
+
+## Quick Start
+
+### Windows (MinGW) - Recommended
+
+```batch
+# First-time setup
+.\setup_mingw.bat
+
+# Build
+.\build_mingw.bat
+
+# Run
+cd build\mingw-Release
+SecondVoice.exe
+```
+
+**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.
+
+See full setup instructions below for other platforms.
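For reference, the `.env` file the Quick Start expects would look like this sketch (placeholder values, matching the two key names the README lists):

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```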
 
 ## Features
 
-- 🎤 Real-time audio capture
-- 🗣️ Chinese speech-to-text (Whisper API)
-- 🌐 Chinese to French translation (Claude API)
-- 🖥️ Clean ImGui interface
-- 💾 Full recording saved to disk
-- ⚙️ Configurable chunk sizes and settings
+- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
+- 🔇 **Noise reduction** using RNNoise neural network
+- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
+- 🧠 **Hallucination filtering** - removes known Whisper artifacts
+- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
+- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
+- 💾 **Full session recording** with structured logging
+- 📊 **Session archival** - audio, transcripts, translations, and metadata
+- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
+- ⚙️ **Configurable settings** via config.json
 
 ## Requirements
@@ -116,20 +150,138 @@ The application will:
 
 ## Architecture
 
 ```
-Audio Capture (PortAudio)
-    ↓
-Whisper API (Speech-to-Text)
-    ↓
-Claude API (Translation)
-    ↓
-ImGui UI (Display)
+Audio Input (16kHz mono)
+    ↓
+Voice Activity Detection (VAD) - RMS + Peak thresholds
+    ↓
+Noise Reduction (RNNoise) - 16→48→16 kHz resampling
+    ↓
+Opus Encoding (24kbps OGG) - 46x compression
+    ↓
+Whisper API (gpt-4o-mini-transcribe) - Chinese STT
+    ↓
+Hallucination Filter - Remove known artifacts
+    ↓
+Claude API (claude-haiku-4) - Chinese → French translation
+    ↓
+ImGui UI Display + Session Logging
 ```
 
-### Threading Model
+### Threading Model (3 threads)
 
-- **Thread 1**: Audio capture (PortAudio callback)
-- **Thread 2**: AI processing (Whisper + Claude API calls)
-- **Thread 3**: UI rendering (ImGui + OpenGL)
+1. **Audio Thread** (`Pipeline::audioThread`)
+   - PortAudio callback captures 16kHz mono audio
+   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
+   - Pushes speech chunks to the processing queue
+
+2. **Processing Thread** (`Pipeline::processingThread`)
+   - Consumes audio chunks from the queue
+   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
+   - Encodes to Opus/OGG for bandwidth efficiency
+   - Calls the Whisper API for Chinese transcription
+   - Filters known hallucinations (YouTube phrases, music markers, etc.)
+   - Calls the Claude API for French translation
+   - Logs to session files
+
+3. **UI Thread** (main)
+   - GLFW/ImGui rendering loop (must run on the main thread)
+   - Displays real-time transcription and translation
+   - Allows runtime VAD threshold adjustment
+   - Handles user controls (stop recording, etc.)
+
+### Core Components
+
+**Audio Processing**:
+- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
+- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
+- `NoiseReducer.cpp` - RNNoise denoising with resampling
+
+**API Clients**:
+- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
+- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
+- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
+
+**Core Logic**:
+- `Pipeline.cpp` - Orchestrates the audio → transcription → translation flow
+- `TranslationUI.cpp` - ImGui interface with VAD controls
+
+**Utilities**:
+- `Config.cpp` - Loads config.json + .env
+- `ThreadSafeQueue.h` - Lock-free queue for audio chunks
+## Known Issues & Active Debugging
+
+**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).
+
+### Current Problems
+
+Based on transcript analysis from actual meetings (November 2025):
+
+1. **VAD cutting speech too early**
+   - Voice Activity Detection triggers end-of-segment prematurely
+   - Results in fragmented phrases ("我很。" → "Je suis.")
+   - **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios
+
+2. **Segments too short for context**
+   - Whisper receives insufficient audio context for accurate Chinese transcription
+   - Single-word or two-word segments lack conversational context
+   - **Impact**: Lower accuracy, especially with homonyms
+
+3. **Ambient noise interpreted as speech**
+   - Background sounds trigger false VAD positives
+   - Test transcript shows "太多声音了" (too much noise) being captured
+   - **Mitigation**: RNNoise helps but is not sufficient for very noisy environments
+
+4. **Loss of inter-segment context**
+   - Each audio chunk is processed independently
+   - Whisper cannot use previous context for better transcription
+   - **Potential solution**: Pass the previous 2-3 transcriptions in the prompt
+
+### Test Conditions
+
+Testing has been performed under **deliberately degraded conditions** to ensure robustness:
+- Multiple simultaneous speakers
+- Variable microphone distance
+- Variable volume levels
+- Fast-paced conversations
+- Low-quality microphone
+
+These conditions are intentionally harsh to validate real-world meeting scenarios.
+
+### Debug Plan
+
+See `PLAN_DEBUG.md` for:
+- Detailed session logging implementation (JSON per segment + metadata)
+- Improved Whisper prompt engineering
+- VAD threshold tuning recommendations
+- Context propagation strategies
+## Session Logging
+
+### Structure
+
+```
+sessions/
+└── YYYY-MM-DD_HHMMSS/
+    ├── session.json      # Session metadata
+    ├── segments/
+    │   ├── 001.json      # Segment: Chinese + French + metadata
+    │   ├── 002.json
+    │   └── ...
+    └── transcript.txt    # Final export
+```
+
+### Segment Format
+
+```json
+{
+  "id": 1,
+  "chinese": "两个老鼠求我",
+  "french": "Deux souris me supplient"
+}
+```
+
+**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
 
 ## Configuration
@@ -143,8 +295,9 @@ ImGui UI (Display)
     "chunk_duration_seconds": 10
   },
   "whisper": {
-    "model": "whisper-1",
-    "language": "zh"
+    "model": "gpt-4o-mini-transcribe",
+    "language": "zh",
+    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
   },
   "claude": {
     "model": "claude-haiku-4-20250514",
@@ -166,23 +319,33 @@ ANTHROPIC_API_KEY=sk-ant-...
 
 - **Claude Haiku**: ~$0.03-0.05/hour
 - **Total**: ~$0.40/hour of recording
 
-## Project Structure
+## Advanced Features
 
-```
-secondvoice/
-├── src/
-│   ├── main.cpp        # Entry point
-│   ├── audio/          # Audio capture & buffer
-│   ├── api/            # Whisper & Claude clients
-│   ├── ui/             # ImGui interface
-│   ├── utils/          # Config & thread-safe queue
-│   └── core/           # Pipeline orchestration
-├── docs/               # Documentation
-├── recordings/         # Output recordings
-├── config.json         # Runtime configuration
-├── .env                # API keys (not committed)
-└── CMakeLists.txt      # Build configuration
-```
+### GPU Forcing (Hybrid Graphics Systems)
+
+`main.cpp` exports symbols to force the dedicated GPU on Optimus/PowerXpress systems:
+- `NvOptimusEnablement` - Forces NVIDIA GPU
+- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU
+
+Critical for laptops with both integrated and dedicated GPUs.
+
+### Hallucination Filtering
+
+`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
+- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
+- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
+- Music symbols: "♪♪", "🎵"
+- Silence markers: "...", "silence", "inaudible"
+
+These are automatically filtered before translation to avoid wasting API calls.
+### Console-Only Build
+
+A `SecondVoice_Console` target exists for headless testing:
+- Uses `main_console.cpp`
+- No ImGui/GLFW dependencies
+- Outputs transcriptions to stdout
+- Useful for debugging and automated testing
 
 ## Development
@@ -219,30 +382,101 @@ cmake --build build
 
 - Check all system dependencies are installed
 - Try `cmake --build build --clean-first`
 
+## Project Structure
+
+```
+secondvoice/
+├── src/
+│   ├── main.cpp                # Entry point, forces NVIDIA GPU
+│   ├── core/
+│   │   └── Pipeline.cpp        # Audio→Transcription→Translation orchestration
+│   ├── audio/
+│   │   ├── AudioCapture.cpp    # PortAudio + VAD segmentation
+│   │   ├── AudioBuffer.cpp     # Sample accumulation, WAV/Opus export
+│   │   └── NoiseReducer.cpp    # RNNoise (16→48→16 kHz)
+│   ├── api/
+│   │   ├── WhisperClient.cpp   # OpenAI Whisper (multipart/form-data)
+│   │   ├── ClaudeClient.cpp    # Anthropic Claude (JSON)
+│   │   └── WinHttpClient.cpp   # Native Windows HTTP
+│   ├── ui/
+│   │   └── TranslationUI.cpp   # ImGui interface + VAD controls
+│   └── utils/
+│       ├── Config.cpp          # config.json + .env loader
+│       └── ThreadSafeQueue.h   # Lock-free audio queue
+├── docs/                       # Build guides
+├── sessions/                   # Session recordings + logs
+├── recordings/                 # Legacy recordings directory
+├── denoised/                   # Denoised audio outputs
+├── config.json                 # Runtime configuration
+├── .env                        # API keys (not committed)
+├── CLAUDE.md                   # Development guide for Claude Code
+├── PLAN_DEBUG.md               # Active debugging plan
+└── CMakeLists.txt              # Build configuration
+```
+
+### External Dependencies
+
+**Fetched via CMake FetchContent**:
+- ImGui v1.90.1 - UI framework
+- Opus v1.5.2 - Audio encoding
+- Ogg v1.3.6 - Container format
+- RNNoise v0.1.1 - Neural network noise reduction
+
+**vcpkg Dependencies** (x64-mingw-static triplet):
+- portaudio - Cross-platform audio I/O
+- nlohmann_json - JSON parsing
+- glfw3 - Windowing/input
+- glad - OpenGL loader
 
 ## Roadmap
 
-### Phase 1 - MVP (Current)
+### Phase 1 - MVP ✅ (Complete)
-- ✅ Audio capture
-- ✅ Whisper integration
-- ✅ Claude integration
-- ✅ ImGui UI
-- ✅ Stop button
+- ✅ Audio capture with VAD
+- ✅ Noise reduction (RNNoise)
+- ✅ Whisper API integration
+- ✅ Claude API integration
+- ✅ ImGui UI with runtime VAD adjustment
+- ✅ Opus compression
+- ✅ Hallucination filtering
+- ✅ Session recording
 
-### Phase 2 - Enhancement
+### Phase 2 - Debugging 🔄 (Current)
-- ⬜ Auto-summary post-meeting
-- ⬜ Export transcripts
-- ⬜ Search functionality
+- 🔄 Session logging (JSON per segment)
+- 🔄 Improved Whisper prompt engineering
+- 🔄 VAD threshold optimization
+- 🔄 Context propagation between segments
+- ⬜ Automated testing with sample audio
+
+### Phase 3 - Enhancement
+- ⬜ Auto-summary post-meeting (Claude analysis)
+- ⬜ Full-text search (SQLite FTS5)
+- ⬜ Semantic search (embeddings)
 - ⬜ Speaker diarization
-- ⬜ Replay mode
+- ⬜ Replay mode with synced transcripts
+- ⬜ Multi-language support extension
+
+## Development Documentation
+
+- **CLAUDE.md** - Development guide for Claude Code AI assistant
+- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
+- **WINDOWS_BUILD.md** - Detailed Windows build instructions
+- **WINDOWS_MINGW.md** - MinGW-specific build guide
+- **WINDOWS_QUICK_START.md** - Quick start for Windows users
+
+## Contributing
+
+This is a personal project built to solve a real need. Bug reports and suggestions welcome:
+
+**Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
+**Architecture**: See `CLAUDE.md` for detailed system design
 
 ## License
 
 See LICENSE file.
 
-## Contributing
+## Acknowledgments
 
-This is a personal project, but suggestions and bug reports are welcome via issues.
+- OpenAI Whisper for excellent Chinese transcription
+- Anthropic Claude for context-aware translation
+- RNNoise for neural network-based noise reduction
+- ImGui for clean, immediate-mode UI
-
-## Contact
-
-See docs/SecondVoice.md for project context and motivation.
@@ -10,7 +10,7 @@
     "model": "gpt-4o-mini-transcribe",
     "language": "zh",
     "temperature": 0.0,
-    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
+    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. Si un contexte de phrases précédentes est fourni, utilise-le pour maintenir la cohérence (noms propres, sujets, terminologie). RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
     "stream": false,
     "response_format": "text"
   },
@@ -135,16 +135,12 @@ int AudioCapture::audioCallback(const void* input, void* output,
     // Speech = energy OK AND (ZCR OK or very high energy)
     bool frame_has_speech = energy_ok && (zcr_ok || denoised_rms > adaptive_rms_thresh * 3.0f);
 
-    // Hang time logic: don't immediately cut on silence
+    // Reset trailing silence counter when speech detected
     if (frame_has_speech) {
-        self->hang_frames_ = self->hang_frames_threshold_; // Reset hang counter
-    } else if (self->hang_frames_ > 0) {
-        self->hang_frames_--;
-        frame_has_speech = true; // Keep "speaking" during hang time
+        self->consecutive_silence_frames_ = 0;
     }
 
     // Calculate durations in samples
-    int silence_samples_threshold = (self->silence_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int min_speech_samples = (self->min_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int max_speech_samples = (self->max_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
@@ -183,16 +179,17 @@ int AudioCapture::audioCallback(const void* input, void* output,
             }
             self->speech_buffer_.clear();
             self->speech_samples_count_ = 0;
+            self->consecutive_silence_frames_ = 0; // Reset after forced flush
             // Reset stream for next segment
             if (self->noise_reducer_) {
                 self->noise_reducer_->resetStream();
             }
         }
     } else {
-        // True silence (after hang time expired)
+        // Silence detected
         self->silence_samples_count_ += sample_count;
 
-        // If we were speaking and now have enough silence, flush
+        // If we were speaking and now have silence, track consecutive silence frames
         if (self->speech_buffer_.size() > 0) {
             // Add trailing silence (denoised)
             if (!denoised_samples.empty()) {
@@ -204,7 +201,16 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 }
             }
 
-            if (self->silence_samples_count_ >= silence_samples_threshold) {
+            // Increment consecutive silence frame counter
+            self->consecutive_silence_frames_++;
+
+            // Calculate threshold in frames (callbacks)
+            // frames_per_buffer = frame_count from callback
+            int frames_per_buffer = static_cast<int>(frame_count);
+            int silence_threshold_frames = (self->silence_duration_ms_ * self->sample_rate_) / (1000 * frames_per_buffer);
+
+            // Flush when consecutive silence exceeds threshold
+            if (self->consecutive_silence_frames_ >= silence_threshold_frames) {
                 self->is_speech_active_.store(false, std::memory_order_relaxed);
 
                 // Flush if we have enough speech
@@ -220,7 +226,9 @@ int AudioCapture::audioCallback(const void* input, void* output,
 
                 float duration = static_cast<float>(self->speech_buffer_.size()) /
                                  (self->sample_rate_ * self->channels_);
-                std::cout << "[VAD] Speech ended (noise_floor=" << self->noise_floor_
+                std::cout << "[VAD] Speech ended (trailing silence detected, "
+                          << self->consecutive_silence_frames_ << " frames, "
+                          << "noise_floor=" << self->noise_floor_
                           << "), flushing " << duration << "s (denoised)" << std::endl;
 
                 if (self->callback_) {
@@ -233,6 +241,7 @@ int AudioCapture::audioCallback(const void* input, void* output,
 
                 self->speech_buffer_.clear();
                 self->speech_samples_count_ = 0;
+                self->consecutive_silence_frames_ = 0; // Reset after flush
                 // Reset stream for next segment
                 if (self->noise_reducer_) {
                     self->noise_reducer_->resetStream();
@@ -77,9 +77,8 @@ private:
     float noise_floor_ = 0.005f;       // Estimated background noise level
     float noise_floor_alpha_ = 0.001f; // Slower adaptation
 
-    // Hang time - wait before cutting to avoid mid-sentence cuts
-    int hang_frames_ = 0;
-    int hang_frames_threshold_ = 35; // ~350ms tolerance for pauses (was 20)
+    // Trailing silence detection - count consecutive silence frames after speech
+    int consecutive_silence_frames_ = 0;
 
     // Zero-crossing rate for speech vs noise discrimination
    float last_zcr_ = 0.0f;