feat: Add VAD metrics tracking to session logs

chore: Ignore .claudiomiro directory
refactor: Add VAD configuration accessors to Config class
2025-12-02 10:03:20 +08:00 · 2025-12-02 09:54:39 +08:00 · 2025-12-02 09:53:53 +08:00 · 2025-12-02 09:48:44 +08:00 · 2025-12-02 09:44:06 +08:00 · 2025-11-23 22:08:01 +08:00
12 changed files with 803 additions and 85 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -64,9 +64,11 @@ imgui.ini
 *.aac
 *.m4a
 denoised/
+sessions/

 # Claude Code local settings
 .claude/settings.local.json
+.claudiomiro/

 # Build scripts (local)
 run_build.ps1
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -108,6 +108,7 @@ set(SOURCES_UI
    src/ui/TranslationUI.cpp
    # Utils
    src/utils/Config.cpp
+    src/utils/SessionLogger.cpp
    # Core
    src/core/Pipeline.cpp
 )
--- a/README.md
+++ b/README.md
@@ -4,16 +4,50 @@ Real-time Chinese to French translation system for live meetings.

 ## Overview

-SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API, and translates it to French using Claude AI in real-time. Perfect for understanding Chinese meetings on the fly.
+SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
+
+### Why This Project?
+
+Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:
+- Business meetings with Chinese speakers
+- Family/administrative calls
+- Professional conferences
+- Any live Chinese conversation where real-time comprehension is needed
+
+**Status**: MVP complete, actively being debugged and improved based on real-world usage.
+
+## Quick Start
+
+### Windows (MinGW) - Recommended
+
+```batch
+REM First-time setup
+.\setup_mingw.bat
+
+REM Build
+.\build_mingw.bat
+
+REM Run
+cd build\mingw-Release
+SecondVoice.exe
+```
+
+**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.
+
+See full setup instructions below for other platforms.

 ## Features

- 🎤 Real-time audio capture
- 🗣️ Chinese speech-to-text (Whisper API)
- 🌐 Chinese to French translation (Claude API)
- 🖥️ Clean ImGui interface
- 💾 Full recording saved to disk
- ⚙️ Configurable chunk sizes and settings
+- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
+- 🔇 **Noise reduction** using RNNoise neural network
+- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
+- 🧠 **Hallucination filtering** - removes known Whisper artifacts
+- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
+- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
+- 💾 **Full session recording** with structured logging
+- 📊 **Session archival** - audio, transcripts, translations, and metadata
+- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
+- ⚙️ **Configurable settings** via config.json

 ## Requirements

@@ -116,20 +150,138 @@ The application will:
 ## Architecture

 ```
-Audio Capture (PortAudio)
+Audio Input (16kHz mono)
    ↓
-Whisper API (Speech-to-Text)
+Voice Activity Detection (VAD) - RMS + Peak thresholds
    ↓
-Claude API (Translation)
+Noise Reduction (RNNoise) - 16→48→16 kHz resampling
    ↓
-ImGui UI (Display)
+Opus Encoding (24kbps OGG) - 46x compression
+    ↓
+Whisper API (gpt-4o-mini-transcribe) - Chinese STT
+    ↓
+Hallucination Filter - Remove known artifacts
+    ↓
+Claude API (claude-haiku-4) - Chinese → French translation
+    ↓
+ImGui UI Display + Session Logging
 ```

-### Threading Model
+### Threading Model (3 threads)

- **Thread 1**: Audio capture (PortAudio callback)
- **Thread 2**: AI processing (Whisper + Claude API calls)
- **Thread 3**: UI rendering (ImGui + OpenGL)
+1. **Audio Thread** (`Pipeline::audioThread`)
+   - PortAudio callback captures 16kHz mono audio
+   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
+   - Pushes speech chunks to processing queue
+
+2. **Processing Thread** (`Pipeline::processingThread`)
+   - Consumes audio chunks from queue
+   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
+   - Encodes to Opus/OGG for bandwidth efficiency
+   - Calls Whisper API for Chinese transcription
+   - Filters known hallucinations (YouTube phrases, music markers, etc.)
+   - Calls Claude API for French translation
+   - Logs to session files
+
+3. **UI Thread** (main)
+   - GLFW/ImGui rendering loop (must run on main thread)
+   - Displays real-time transcription and translation
+   - Allows runtime VAD threshold adjustment
+   - Handles user controls (stop recording, etc.)
+
+### Core Components
+
+**Audio Processing**:
+- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
+- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
+- `NoiseReducer.cpp` - RNNoise denoising with resampling
+
+**API Clients**:
+- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
+- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
+- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
+
+**Core Logic**:
+- `Pipeline.cpp` - Orchestrates audio → transcription → translation flow
+- `TranslationUI.cpp` - ImGui interface with VAD controls
+
+**Utilities**:
+- `Config.cpp` - Loads config.json + .env
+- `ThreadSafeQueue.h` - Thread-safe queue for audio chunks
+
+## Known Issues & Active Debugging
+
+**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).
+
+### Current Problems
+
+Based on transcript analysis from actual meetings (November 2025):
+
+1. **VAD cutting speech too early**
+   - Voice Activity Detection triggers end-of-segment prematurely
+   - Results in fragmented phrases ("我很。" → "Je suis.")
+   - **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios
+
+2. **Segments too short for context**
+   - Whisper receives insufficient audio context for accurate Chinese transcription
+   - Single-word or two-word segments lack conversational context
+   - **Impact**: Lower accuracy, especially with homonyms
+
+3. **Ambient noise interpreted as speech**
+   - Background sounds trigger false VAD positives
+   - Test transcript shows "太多声音了" (too much noise) being captured
+   - **Mitigation**: RNNoise helps but not sufficient for very noisy environments
+
+4. **Loss of inter-segment context**
+   - Each audio chunk processed independently
+   - Whisper cannot use previous context for better transcription
+   - **Potential solution**: Pass previous 2-3 transcriptions in prompt
+
+### Test Conditions
+
+Testing has been performed under **deliberately degraded conditions** to ensure robustness:
+- Multiple simultaneous speakers
+- Variable microphone distance
+- Variable volume levels
+- Fast-paced conversations
+- Low-quality microphone
+
+These conditions are intentionally harsh to validate real-world meeting scenarios.
+
+### Debug Plan
+
+See `PLAN_DEBUG.md` for:
+- Detailed session logging implementation (JSON per segment + metadata)
+- Improved Whisper prompt engineering
+- VAD threshold tuning recommendations
+- Context propagation strategies
+
+## Session Logging
+
+### Structure
+
+```
+sessions/
+└── YYYY-MM-DD_HHMMSS/
+    ├── session.json           # Session metadata
+    ├── segments/
+    │   ├── 001.json          # Segment: Chinese + French + metadata
+    │   ├── 002.json
+    │   └── ...
+    └── transcript.txt         # Final export
+```
+
+### Segment Format
+
+```json
+{
+  "id": 1,
+  "chinese": "两个老鼠求我",
+  "french": "Deux souris me supplient"
+}
+```
+
+**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.

 ## Configuration

@@ -143,8 +295,9 @@ ImGui UI (Display)
    "chunk_duration_seconds": 10
  },
  "whisper": {
-    "model": "whisper-1",
-    "language": "zh"
+    "model": "gpt-4o-mini-transcribe",
+    "language": "zh",
+    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
  },
  "claude": {
    "model": "claude-haiku-4-20250514",
@@ -166,23 +319,33 @@ ANTHROPIC_API_KEY=sk-ant-...
 - **Claude Haiku**: ~$0.03-0.05/hour
 - **Total**: ~$0.40/hour of recording

-## Project Structure
+## Advanced Features

-```
-secondvoice/
-├── src/
-│   ├── main.cpp                 # Entry point
-│   ├── audio/                   # Audio capture & buffer
-│   ├── api/                     # Whisper & Claude clients
-│   ├── ui/                      # ImGui interface
-│   ├── utils/                   # Config & thread-safe queue
-│   └── core/                    # Pipeline orchestration
-├── docs/                        # Documentation
-├── recordings/                  # Output recordings
-├── config.json                  # Runtime configuration
-├── .env                         # API keys (not committed)
-└── CMakeLists.txt              # Build configuration
-```
+### GPU Forcing (Hybrid Graphics Systems)
+
+`main.cpp` exports symbols to force dedicated GPU on Optimus/PowerXpress systems:
+- `NvOptimusEnablement` - Forces NVIDIA GPU
+- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU
+
+Critical for laptops with both integrated and dedicated GPUs.
+
+### Hallucination Filtering
+
+`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
+- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
+- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
+- Music symbols: "♪♪", "🎵"
+- Silence markers: "...", "silence", "inaudible"
+
+These are automatically filtered before translation to avoid wasting API calls.
+
+### Console-Only Build
+
+A `SecondVoice_Console` target exists for headless testing:
+- Uses `main_console.cpp`
+- No ImGui/GLFW dependencies
+- Outputs transcriptions to stdout
+- Useful for debugging and automated testing

 ## Development

@@ -219,30 +382,101 @@ cmake --build build
 - Check all system dependencies are installed
 - Try `cmake --build build --clean-first`

+## Project Structure
+
+```
+secondvoice/
+├── src/
+│   ├── main.cpp                    # Entry point, forces dedicated GPU (NVIDIA/AMD)
+│   ├── core/
+│   │   └── Pipeline.cpp           # Audio→Transcription→Translation orchestration
+│   ├── audio/
+│   │   ├── AudioCapture.cpp       # PortAudio + VAD segmentation
+│   │   ├── AudioBuffer.cpp        # Sample accumulation, WAV/Opus export
+│   │   └── NoiseReducer.cpp       # RNNoise (16→48→16 kHz)
+│   ├── api/
+│   │   ├── WhisperClient.cpp      # OpenAI Whisper (multipart/form-data)
+│   │   ├── ClaudeClient.cpp       # Anthropic Claude (JSON)
+│   │   └── WinHttpClient.cpp      # Native Windows HTTP
+│   ├── ui/
+│   │   └── TranslationUI.cpp      # ImGui interface + VAD controls
+│   └── utils/
+│       ├── Config.cpp             # config.json + .env loader
+│       └── ThreadSafeQueue.h      # Thread-safe audio queue
+├── docs/                          # Build guides
+├── sessions/                      # Session recordings + logs
+├── recordings/                    # Legacy recordings directory
+├── denoised/                      # Denoised audio outputs
+├── config.json                    # Runtime configuration
+├── .env                           # API keys (not committed)
+├── CLAUDE.md                      # Development guide for Claude Code
+├── PLAN_DEBUG.md                  # Active debugging plan
+└── CMakeLists.txt                 # Build configuration
+```
+
+### External Dependencies
+
+**Fetched via CMake FetchContent**:
+- ImGui v1.90.1 - UI framework
+- Opus v1.5.2 - Audio encoding
+- Ogg v1.3.6 - Container format
+- RNNoise v0.1.1 - Neural network noise reduction
+
+**vcpkg Dependencies** (x64-mingw-static triplet):
+- portaudio - Cross-platform audio I/O
+- nlohmann_json - JSON parsing
+- glfw3 - Windowing/input
+- glad - OpenGL loader
+
 ## Roadmap

-### Phase 1 - MVP (Current)
- ✅ Audio capture
- ✅ Whisper integration
- ✅ Claude integration
- ✅ ImGui UI
- ✅ Stop button
+### Phase 1 - MVP ✅ (Complete)
+- ✅ Audio capture with VAD
+- ✅ Noise reduction (RNNoise)
+- ✅ Whisper API integration
+- ✅ Claude API integration
+- ✅ ImGui UI with runtime VAD adjustment
+- ✅ Opus compression
+- ✅ Hallucination filtering
+- ✅ Session recording

-### Phase 2 - Enhancement
- ⬜ Auto-summary post-meeting
- ⬜ Export transcripts
- ⬜ Search functionality
+### Phase 2 - Debugging 🔄 (Current)
+- 🔄 Session logging (JSON per segment)
+- 🔄 Improved Whisper prompt engineering
+- 🔄 VAD threshold optimization
+- 🔄 Context propagation between segments
+- ⬜ Automated testing with sample audio
+
+### Phase 3 - Enhancement
+- ⬜ Auto-summary post-meeting (Claude analysis)
+- ⬜ Full-text search (SQLite FTS5)
+- ⬜ Semantic search (embeddings)
 - ⬜ Speaker diarization
- ⬜ Replay mode
+- ⬜ Replay mode with synced transcripts
+- ⬜ Multi-language support extension
+
+## Development Documentation
+
+- **CLAUDE.md** - Development guide for Claude Code AI assistant
+- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
+- **WINDOWS_BUILD.md** - Detailed Windows build instructions
+- **WINDOWS_MINGW.md** - MinGW-specific build guide
+- **WINDOWS_QUICK_START.md** - Quick start for Windows users
+
+## Contributing
+
+This is a personal project built to solve a real need. Bug reports and suggestions are welcome.
+
+- **Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
+- **Architecture**: See `CLAUDE.md` for detailed system design

 ## License

 See LICENSE file.

-## Contributing
+## Acknowledgments

-This is a personal project, but suggestions and bug reports are welcome via issues.
-
-## Contact
-
-See docs/SecondVoice.md for project context and motivation.
+- OpenAI Whisper for excellent Chinese transcription
+- Anthropic Claude for context-aware translation
+- RNNoise for neural network-based noise reduction
+- ImGui for clean, immediate-mode UI
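
The README changes above describe hallucination filtering as a pattern list checked before translation. A minimal sketch of that idea follows, assuming plain case-sensitive substring matching; the function name `isKnownHallucination` and the matching strategy are illustrative assumptions, not the code in `Pipeline.cpp`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true when the transcription contains any
// known hallucination pattern. Substring matching is an assumption here;
// the real filter may use exact or normalized comparisons instead.
bool isKnownHallucination(const std::string& text,
                          const std::vector<std::string>& patterns) {
    for (const auto& pattern : patterns) {
        // Skip empty patterns: find("") matches every string.
        if (!pattern.empty() && text.find(pattern) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

Substring matching is cheap but blunt: a short pattern such as "再见" would also drop legitimate sentences that contain it, so short patterns may be better handled with an exact-match rule.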
--- a/config.json
+++ b/config.json
@@ -6,11 +6,16 @@
    "chunk_step_seconds": 5,
    "format": "ogg"
  },
+  "vad": {
+    "silence_duration_ms": 700,
+    "min_speech_duration_ms": 2000,
+    "max_speech_duration_ms": 30000
+  },
  "whisper": {
    "model": "gpt-4o-mini-transcribe",
    "language": "zh",
    "temperature": 0.0,
-    "prompt": "The following is a conversation in Mandarin Chinese about business, family, and daily life. Common names: Tingting, Alexis.",
+    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. Si un contexte de phrases précédentes est fourni, utilise-le pour maintenir la cohérence (noms propres, sujets, terminologie). RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
    "stream": false,
    "response_format": "text"
  },
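
The `vad` values added to `config.json` above are expressed in milliseconds, while the capture code compares them against sample counts. The sketch below shows that conversion under stated assumptions: `VadConfig`, `msToSamples`, and `isValid` are hypothetical names for illustration, and the 64-bit intermediate is a defensive choice rather than what the project does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the "vad" block in config.json (milliseconds).
struct VadConfig {
    int silence_duration_ms = 700;
    int min_speech_duration_ms = 2000;
    int max_speech_duration_ms = 30000;
};

// Convert a duration in milliseconds to a count of interleaved samples.
// The multiplication is done in 64 bits so high sample rates cannot overflow.
std::int64_t msToSamples(int duration_ms, int sample_rate, int channels) {
    assert(duration_ms >= 0 && sample_rate > 0 && channels > 0);
    return static_cast<std::int64_t>(duration_ms) * sample_rate * channels / 1000;
}

// Basic sanity check: all durations positive and min speech below max speech.
bool isValid(const VadConfig& c) {
    return c.silence_duration_ms > 0 &&
           c.min_speech_duration_ms > 0 &&
           c.min_speech_duration_ms < c.max_speech_duration_ms;
}
```

At 16 kHz mono, the defaults correspond to 11200 samples of silence, 32000 samples of minimum speech, and 480000 samples before a forced flush.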
--- a/src/audio/AudioCapture.cpp
+++ b/src/audio/AudioCapture.cpp
@@ -4,9 +4,15 @@

 namespace secondvoice {

-AudioCapture::AudioCapture(int sample_rate, int channels)
+AudioCapture::AudioCapture(int sample_rate, int channels,
+                           int silence_duration_ms,
+                           int min_speech_duration_ms,
+                           int max_speech_duration_ms)
    : sample_rate_(sample_rate)
    , channels_(channels)
+    , silence_duration_ms_(silence_duration_ms)
+    , min_speech_duration_ms_(min_speech_duration_ms)
+    , max_speech_duration_ms_(max_speech_duration_ms)
    , noise_reducer_(std::make_unique<NoiseReducer>()) {
    std::cout << "[Audio] Noise reduction enabled (RNNoise)" << std::endl;
 }
@@ -135,16 +141,12 @@ int AudioCapture::audioCallback(const void* input, void* output,
    // Speech = energy OK AND (ZCR OK or very high energy)
    bool frame_has_speech = energy_ok && (zcr_ok || denoised_rms > adaptive_rms_thresh * 3.0f);

-    // Hang time logic: don't immediately cut on silence
+    // Reset trailing silence counter when speech detected
    if (frame_has_speech) {
-        self->hang_frames_ = self->hang_frames_threshold_;  // Reset hang counter
-    } else if (self->hang_frames_ > 0) {
-        self->hang_frames_--;
-        frame_has_speech = true;  // Keep "speaking" during hang time
+        self->consecutive_silence_frames_ = 0;
    }

    // Calculate durations in samples
-    int silence_samples_threshold = (self->silence_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
    int min_speech_samples = (self->min_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
    int max_speech_samples = (self->max_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;

@@ -170,6 +172,11 @@ int AudioCapture::audioCallback(const void* input, void* output,
            std::cout << "[VAD] Max duration reached, forcing flush ("
                      << self->speech_samples_count_ / (self->sample_rate_ * self->channels_) << "s)" << std::endl;

+            // Calculate metrics BEFORE flushing
+            self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
+            self->last_silence_duration_ms_ = 0;  // No trailing silence in forced flush
+            self->last_flush_reason_ = "max_duration";
+
            if (self->callback_ && self->speech_buffer_.size() >= static_cast<size_t>(min_speech_samples)) {
                // Flush any remaining samples from the denoiser
                if (self->noise_reducer_ && self->noise_reducer_->isEnabled()) {
@@ -183,16 +190,17 @@ int AudioCapture::audioCallback(const void* input, void* output,
            }
            self->speech_buffer_.clear();
            self->speech_samples_count_ = 0;
+            self->consecutive_silence_frames_ = 0;  // Reset after forced flush
            // Reset stream for next segment
            if (self->noise_reducer_) {
                self->noise_reducer_->resetStream();
            }
        }
    } else {
-        // True silence (after hang time expired)
+        // Silence detected
        self->silence_samples_count_ += sample_count;

-        // If we were speaking and now have enough silence, flush
+        // If we were speaking and now have silence, track consecutive silence frames
        if (self->speech_buffer_.size() > 0) {
            // Add trailing silence (denoised)
            if (!denoised_samples.empty()) {
@@ -204,9 +212,23 @@ int AudioCapture::audioCallback(const void* input, void* output,
                }
            }

-            if (self->silence_samples_count_ >= silence_samples_threshold) {
+            // Increment consecutive silence frame counter
+            self->consecutive_silence_frames_++;
+
+            // Calculate threshold in frames (callbacks)
+            // frames_per_buffer = frame_count from callback
+            int frames_per_buffer = static_cast<int>(frame_count);
+            int silence_threshold_frames = (self->silence_duration_ms_ * self->sample_rate_) / (1000 * frames_per_buffer);
+
+            // Flush when consecutive silence exceeds threshold
+            if (self->consecutive_silence_frames_ >= silence_threshold_frames) {
                self->is_speech_active_.store(false, std::memory_order_relaxed);

+                // Calculate metrics BEFORE flushing
+                self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
+                self->last_silence_duration_ms_ = (self->silence_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
+                self->last_flush_reason_ = "silence_threshold";
+
                // Flush if we have enough speech
                if (self->speech_samples_count_ >= min_speech_samples) {
                    // Flush any remaining samples from the denoiser
@@ -220,7 +242,9 @@ int AudioCapture::audioCallback(const void* input, void* output,

                    float duration = static_cast<float>(self->speech_buffer_.size()) /
                                   (self->sample_rate_ * self->channels_);
-                    std::cout << "[VAD] Speech ended (noise_floor=" << self->noise_floor_
+                    std::cout << "[VAD] Speech ended (trailing silence detected, "
+                              << self->consecutive_silence_frames_ << " frames, "
+                              << "noise_floor=" << self->noise_floor_
                              << "), flushing " << duration << "s (denoised)" << std::endl;

                    if (self->callback_) {
@@ -233,6 +257,7 @@ int AudioCapture::audioCallback(const void* input, void* output,

                self->speech_buffer_.clear();
                self->speech_samples_count_ = 0;
+                self->consecutive_silence_frames_ = 0;  // Reset after flush
                // Reset stream for next segment
                if (self->noise_reducer_) {
                    self->noise_reducer_->resetStream();
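
The trailing-silence logic in the hunks above counts consecutive silent callbacks and flushes once the count reaches a threshold derived from `silence_duration_ms_`. A minimal standalone sketch of that arithmetic and counter follows; `silenceThresholdFrames`, `effectiveSilenceMs`, and `TrailingSilenceTracker` are hypothetical names, and the sketch ignores the speech buffer, denoiser, and minimum-speech checks that the real callback also handles.

```cpp
#include <cassert>
#include <cstdint>

// Number of consecutive silent callbacks required before flushing.
// Same integer formula as the diff, with a 64-bit intermediate.
int silenceThresholdFrames(int silence_duration_ms, int sample_rate,
                           int frames_per_buffer) {
    assert(frames_per_buffer > 0 && sample_rate > 0);
    return static_cast<int>(
        static_cast<std::int64_t>(silence_duration_ms) * sample_rate /
        (1000LL * frames_per_buffer));
}

// Silence actually covered by that many callbacks, in milliseconds.
int effectiveSilenceMs(int frames, int sample_rate, int frames_per_buffer) {
    assert(sample_rate > 0);
    return static_cast<int>(
        static_cast<std::int64_t>(frames) * frames_per_buffer * 1000 / sample_rate);
}

// Simplified counter: speech resets it, silence increments it.
struct TrailingSilenceTracker {
    explicit TrailingSilenceTracker(int threshold) : threshold_frames(threshold) {}

    // Returns true when the current segment should be flushed.
    bool onFrame(bool frame_has_speech) {
        if (frame_has_speech) {
            consecutive_silence_frames = 0;
            return false;
        }
        ++consecutive_silence_frames;
        if (consecutive_silence_frames >= threshold_frames) {
            consecutive_silence_frames = 0;  // reset after flush
            return true;
        }
        return false;
    }

    int threshold_frames;
    int consecutive_silence_frames = 0;
};
```

Because of integer truncation, the effective window can be slightly shorter than configured: with an assumed 512-frame buffer at 16 kHz, 700 ms maps to 21 callbacks, which cover 672 ms.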
--- a/src/audio/AudioCapture.h
+++ b/src/audio/AudioCapture.h
@@ -16,7 +16,10 @@ class AudioCapture {
 public:
    using AudioCallback = std::function<void(const std::vector<float>&)>;

-    AudioCapture(int sample_rate, int channels);
+    AudioCapture(int sample_rate, int channels,
+                 int silence_duration_ms = 700,
+                 int min_speech_duration_ms = 2000,
+                 int max_speech_duration_ms = 30000);
    ~AudioCapture();

    bool initialize();
@@ -44,6 +47,11 @@ public:
    void setDenoiseEnabled(bool enabled);
    bool isDenoiseEnabled() const;

+    // Get metrics from last flushed segment
+    int getLastSpeechDuration() const { return last_speech_duration_ms_; }
+    int getLastSilenceDuration() const { return last_silence_duration_ms_; }
+    std::string getLastFlushReason() const { return last_flush_reason_; }
+
 private:
    static int audioCallback(const void* input, void* output,
                            unsigned long frame_count,
@@ -69,17 +77,21 @@ private:
    // VAD parameters - Higher threshold to avoid false triggers on filtered noise
    std::atomic<float> vad_rms_threshold_{0.02f};   // Was 0.01f
    std::atomic<float> vad_peak_threshold_{0.08f};  // Was 0.04f
-    int silence_duration_ms_ = 400;      // Wait 400ms of silence before cutting
-    int min_speech_duration_ms_ = 300;   // Minimum speech to send
-    int max_speech_duration_ms_ = 25000; // 25s max before forced flush
+    int silence_duration_ms_;     // Silence required before cutting (default 700 ms, was 400)
+    int min_speech_duration_ms_;  // Minimum speech to send (default 2000 ms, was 300)
+    int max_speech_duration_ms_;  // Max speech before forced flush (default 30000 ms, was 25000)

    // Adaptive noise floor
    float noise_floor_ = 0.005f;         // Estimated background noise level
    float noise_floor_alpha_ = 0.001f;   // Slower adaptation

-    // Hang time - wait before cutting to avoid mid-sentence cuts
-    int hang_frames_ = 0;
-    int hang_frames_threshold_ = 20;     // ~200ms tolerance for pauses
+    // Trailing silence detection - count consecutive silence frames after speech
+    int consecutive_silence_frames_ = 0;
+
+    // Metrics for last flushed segment (set in callback, read in processing thread)
+    int last_speech_duration_ms_ = 0;
+    int last_silence_duration_ms_ = 0;
+    std::string last_flush_reason_;

    // Zero-crossing rate for speech vs noise discrimination
    float last_zcr_ = 0.0f;
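
The header comment above notes that the segment metrics are set in the audio callback and read in the processing thread. Plain `int` and `std::string` members shared that way are not synchronized, so one possible hardening is to publish them as a single snapshot under a mutex. The sketch below is an assumption-laden illustration: `VadMetrics` and `VadMetricsStore` are hypothetical types, not part of the commit.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical snapshot of the metrics for one flushed segment.
struct VadMetrics {
    int speech_duration_ms = 0;
    int silence_duration_ms = 0;
    std::string flush_reason;
};

// Hypothetical holder: writer and reader always see a consistent snapshot.
class VadMetricsStore {
public:
    // Called by the producer (e.g. when a segment is flushed).
    void set(VadMetrics metrics) {
        assert(metrics.speech_duration_ms >= 0);
        std::lock_guard<std::mutex> lock(mutex_);
        metrics_ = std::move(metrics);
    }

    // Called by the consumer; returns a copy so no reference escapes the lock.
    VadMetrics get() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return metrics_;
    }

private:
    mutable std::mutex mutex_;
    VadMetrics metrics_;
};
```

Taking a lock inside a real-time audio callback is generally discouraged, so an alternative worth considering is attaching the metrics to the chunk that is pushed to the queue. That would also guarantee each logged segment carries its own metrics rather than those of whichever segment was flushed most recently.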
--- a/src/core/Pipeline.cpp
+++ b/src/core/Pipeline.cpp
@@ -24,12 +24,23 @@ Pipeline::~Pipeline() {
 bool Pipeline::initialize() {
    auto& config = Config::getInstance();

+    // Load VAD parameters from config (with fallbacks if missing)
+    int silence_duration = config.getVadSilenceDurationMs();
+    int min_speech = config.getVadMinSpeechDurationMs();
+    int max_speech = config.getVadMaxSpeechDurationMs();
+
    // Initialize audio capture with VAD-based segmentation
    audio_capture_ = std::make_unique<AudioCapture>(
        config.getAudioConfig().sample_rate,
-        config.getAudioConfig().channels
+        config.getAudioConfig().channels,
+        silence_duration,
+        min_speech,
+        max_speech
    );

+    std::cout << "[Pipeline] VAD configured: silence=" << silence_duration
+              << "ms, min_speech=" << min_speech
+              << "ms, max_speech=" << max_speech << "ms" << std::endl;
    std::cout << "[Pipeline] VAD-based audio segmentation enabled" << std::endl;

    if (!audio_capture_->initialize()) {
@@ -70,6 +81,10 @@ bool Pipeline::start() {
    }

    running_ = true;
+    segment_id_ = 0;
+
+    // Start session logging
+    session_logger_.startSession();

    // Start background threads
    audio_thread_ = std::thread(&Pipeline::audioThread, this);
@@ -126,6 +141,9 @@ void Pipeline::stop() {
        transcript_ss << "transcripts/transcript_" << timestamp.str() << ".txt";
        ui_->exportTranscript(transcript_ss.str());
    }
+
+    // End session logging
+    session_logger_.endSession();
 }

 void Pipeline::audioThread() {
@@ -143,6 +161,8 @@ void Pipeline::audioThread() {
        chunk.sample_rate = config.getAudioConfig().sample_rate;
        chunk.channels = config.getAudioConfig().channels;

+        float push_duration = static_cast<float>(audio_data.size()) / (chunk.sample_rate * chunk.channels);
+        std::cout << "[Queue] Pushing " << push_duration << "s chunk, queue size: " << audio_queue_.size() << std::endl;
        audio_queue_.push(std::move(chunk));
    });

@@ -159,6 +179,7 @@ void Pipeline::audioThread() {

 void Pipeline::processingThread() {
    auto& config = Config::getInstance();
+    int audio_segment_id = 0;

    while (running_) {
        auto chunk_opt = audio_queue_.wait_and_pop();
@@ -168,7 +189,42 @@ void Pipeline::processingThread() {

        auto& chunk = chunk_opt.value();
        float duration = static_cast<float>(chunk.data.size()) / (chunk.sample_rate * chunk.channels);
-        std::cout << "[Processing] Speech segment: " << duration << "s" << std::endl;
+
+        // Debug: log queue size to detect double-push
+        std::cout << "[Queue] Processing chunk, " << audio_queue_.size() << " remaining" << std::endl;
+
+        // Save audio segment to session directory for debugging
+        audio_segment_id++;
+        if (session_logger_.isActive()) {
+            std::stringstream audio_path;
+            audio_path << session_logger_.getSessionPath() << "/audio_"
+                       << std::setfill('0') << std::setw(3) << audio_segment_id << ".ogg";
+
+            AudioBuffer segment_buffer(chunk.sample_rate, chunk.channels);
+            segment_buffer.addSamples(chunk.data);
+            if (segment_buffer.saveToOpus(audio_path.str())) {
+                std::cout << "[Session] Saved audio segment: " << audio_path.str() << std::endl;
+            }
+        }
+
+        // Calculate audio RMS for logging
+        float audio_rms = 0.0f;
+        if (!chunk.data.empty()) {
+            float sum_sq = 0.0f;
+            for (float s : chunk.data) sum_sq += s * s;
+            audio_rms = std::sqrt(sum_sq / chunk.data.size());
+        }
+
+        std::cout << "[Processing] Speech segment: " << duration << "s (RMS=" << audio_rms << ")" << std::endl;
+
+        // Time Whisper
+        auto whisper_start = std::chrono::steady_clock::now();
+
+        // Build dynamic prompt with recent context
+        std::string dynamic_prompt = buildDynamicPrompt();
+        if (!recent_transcriptions_.empty()) {
+            std::cout << "[Context] Using " << recent_transcriptions_.size() << " previous segments" << std::endl;
+        }

        // Transcribe with Whisper
        auto whisper_result = whisper_client_->transcribe(
@@ -178,12 +234,17 @@ void Pipeline::processingThread() {
            config.getWhisperConfig().model,
            config.getWhisperConfig().language,
            config.getWhisperConfig().temperature,
-            config.getWhisperConfig().prompt,
+            dynamic_prompt,
            config.getWhisperConfig().response_format
        );

+        auto whisper_end = std::chrono::steady_clock::now();
+        int64_t whisper_latency = std::chrono::duration_cast<std::chrono::milliseconds>(
+            whisper_end - whisper_start).count();
+
        if (!whisper_result.has_value()) {
            std::cerr << "Whisper transcription failed" << std::endl;
+            session_logger_.logFilteredSegment("", "whisper_failed", duration, audio_rms);
            continue;
        }

@@ -195,6 +256,7 @@ void Pipeline::processingThread() {
        size_t end = text.find_last_not_of(" \t\n\r");
        if (start == std::string::npos) {
            std::cout << "[Skip] Empty transcription" << std::endl;
+            session_logger_.logFilteredSegment("", "empty", duration, audio_rms);
            continue;
        }
        text = text.substr(start, end - start + 1);
@@ -267,14 +329,32 @@ void Pipeline::processingThread() {

        if (is_garbage) {
            std::cout << "[Skip] Filtered: " << text << std::endl;
+            session_logger_.logFilteredSegment(text, "hallucination", duration, audio_rms);
            continue;
        }

+        // Deduplication: skip if exact same as last transcription
+        if (text == last_transcription_) {
+            std::cout << "[Skip] Duplicate: " << text << std::endl;
+            session_logger_.logFilteredSegment(text, "duplicate", duration, audio_rms);
+            continue;
+        }
+        last_transcription_ = text;
+
+        // Update dynamic context for next Whisper call
+        recent_transcriptions_.push_back(text);
+        if (recent_transcriptions_.size() > MAX_CONTEXT_SEGMENTS) {
+            recent_transcriptions_.erase(recent_transcriptions_.begin());
+        }
+
        // Track audio cost
        if (ui_) {
            ui_->addAudioCost(duration);
        }

+        // Time Claude
+        auto claude_start = std::chrono::steady_clock::now();
+
        // Translate with Claude
        auto claude_result = claude_client_->translate(
            text,
@@ -283,8 +363,13 @@ void Pipeline::processingThread() {
            config.getClaudeConfig().temperature
        );

+        auto claude_end = std::chrono::steady_clock::now();
+        int64_t claude_latency = std::chrono::duration_cast<std::chrono::milliseconds>(
+            claude_end - claude_start).count();
+
        if (!claude_result.has_value()) {
            std::cerr << "Claude translation failed" << std::endl;
+            session_logger_.logFilteredSegment(text, "claude_failed", duration, audio_rms);
            continue;
        }

@@ -308,8 +393,28 @@ void Pipeline::processingThread() {
        ui_->setAccumulatedText(accumulated_chinese_, accumulated_french_);
        ui_->addTranslation(text, claude_result->text);

+        // Log successful segment
+        segment_id_++;
+        SegmentLog seg;
+        seg.id = segment_id_;
+        seg.chinese = text;
+        seg.french = claude_result->text;
+        seg.audio_duration_sec = duration;
+        seg.audio_rms = audio_rms;
+        seg.whisper_latency_ms = whisper_latency;
+        seg.claude_latency_ms = claude_latency;
+        seg.was_filtered = false;
+        seg.filter_reason = "";
+        seg.timestamp = "";  // Will be set by logger
+        // Add VAD metrics from AudioCapture
+        seg.speech_duration_ms = audio_capture_->getLastSpeechDuration();
+        seg.silence_duration_ms = audio_capture_->getLastSilenceDuration();
+        seg.flush_reason = audio_capture_->getLastFlushReason();
+        session_logger_.logSegment(seg);
+
        std::cout << "CN: " << text << std::endl;
        std::cout << "FR: " << claude_result->text << std::endl;
+        std::cout << "[Latency] Whisper: " << whisper_latency << "ms, Claude: " << claude_latency << "ms" << std::endl;
        std::cout << "---" << std::endl;
    }
 }
@ -358,10 +463,34 @@ bool Pipeline::shouldClose() const {
 void Pipeline::clearAccumulated() {
    accumulated_chinese_.clear();
    accumulated_french_.clear();
+    recent_transcriptions_.clear();
+    last_transcription_.clear();
    if (ui_) {
        ui_->setAccumulatedText("", "");
    }
-    std::cout << "[Pipeline] Cleared accumulated text" << std::endl;
+    std::cout << "[Pipeline] Cleared accumulated text and context" << std::endl;
+}
+
+std::string Pipeline::buildDynamicPrompt() const {
+    auto& config = Config::getInstance();
+    std::string base_prompt = config.getWhisperConfig().prompt;
+
+    // If no recent transcriptions, just return base prompt
+    if (recent_transcriptions_.empty()) {
+        return base_prompt;
+    }
+
+    // Build context from recent transcriptions
+    std::stringstream context;
+    context << base_prompt;
+    context << "\n\nContexte des phrases précédentes:\n";
+
+    for (size_t i = 0; i < recent_transcriptions_.size(); ++i) {
+        context << std::to_string(i + 1) << ". "
+                << recent_transcriptions_[i] << "\n";
+    }
+
+    return context.str();
 }

 } // namespace secondvoice
--- a/src/core/Pipeline.h
+++ b/src/core/Pipeline.h
@ -6,6 +6,7 @@
 #include <string>
 #include <vector>
 #include "../utils/ThreadSafeQueue.h"
+#include "../utils/SessionLogger.h"

 namespace secondvoice {

@ -60,6 +61,20 @@ private:
    // Simple accumulation
    std::string accumulated_chinese_;
    std::string accumulated_french_;
+
+    // Dynamic context for Whisper (last N transcriptions)
+    std::vector<std::string> recent_transcriptions_;
+    static constexpr size_t MAX_CONTEXT_SEGMENTS = 3;
+
+    // Deduplication: skip if same as last transcription
+    std::string last_transcription_;
+
+    // Build dynamic prompt with recent context
+    std::string buildDynamicPrompt() const;
+
+    // Session logging
+    SessionLogger session_logger_;
+    int segment_id_ = 0;
 };

 } // namespace secondvoice
--- a/src/utils/Config.cpp
+++ b/src/utils/Config.cpp
@ -52,10 +52,9 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    }
    std::cerr << "[Config] File opened successfully" << std::endl;

-    json config_json;
    try {
        std::cerr << "[Config] About to parse JSON..." << std::endl;
-        config_file >> config_json;
+        config_file >> config_;
        std::cerr << "[Config] JSON parsed successfully" << std::endl;
    } catch (const json::parse_error& e) {
        std::cerr << "Error parsing config.json: " << e.what() << std::endl;
@ -66,8 +65,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    }

    // Parse audio config
-    if (config_json.contains("audio")) {
-        auto& audio = config_json["audio"];
+    if (config_.contains("audio")) {
+        auto& audio = config_["audio"];
        audio_config_.sample_rate = audio.value("sample_rate", 16000);
        audio_config_.channels = audio.value("channels", 1);
        audio_config_.chunk_duration_seconds = audio.value("chunk_duration_seconds", 10);
@ -76,8 +75,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    }

    // Parse whisper config
-    if (config_json.contains("whisper")) {
-        auto& whisper = config_json["whisper"];
+    if (config_.contains("whisper")) {
+        auto& whisper = config_["whisper"];
        whisper_config_.model = whisper.value("model", "whisper-1");
        whisper_config_.language = whisper.value("language", "zh");
        whisper_config_.temperature = whisper.value("temperature", 0.0f);
@ -87,8 +86,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    }

    // Parse claude config
-    if (config_json.contains("claude")) {
-        auto& claude = config_json["claude"];
+    if (config_.contains("claude")) {
+        auto& claude = config_["claude"];
        claude_config_.model = claude.value("model", "claude-haiku-4-20250514");
        claude_config_.max_tokens = claude.value("max_tokens", 1024);
        claude_config_.temperature = claude.value("temperature", 0.3f);
@ -96,8 +95,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    }

    // Parse UI config
-    if (config_json.contains("ui")) {
-        auto& ui = config_json["ui"];
+    if (config_.contains("ui")) {
+        auto& ui = config_["ui"];
        ui_config_.window_width = ui.value("window_width", 800);
        ui_config_.window_height = ui.value("window_height", 600);
        ui_config_.font_size = ui.value("font_size", 16);
@ -105,8 +104,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    }

    // Parse recording config
-    if (config_json.contains("recording")) {
-        auto& recording = config_json["recording"];
+    if (config_.contains("recording")) {
+        auto& recording = config_["recording"];
        recording_config_.save_audio = recording.value("save_audio", true);
        recording_config_.output_directory = recording.value("output_directory", "./recordings");
    }
@ -114,4 +113,25 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
    return true;
 }

+int Config::getVadSilenceDurationMs() const {
+    if (config_.contains("vad") && config_["vad"].contains("silence_duration_ms")) {
+        return config_["vad"]["silence_duration_ms"].get<int>();
+    }
+    return 700;  // Default from AudioCapture.h:72 (unchanged)
+}
+
+int Config::getVadMinSpeechDurationMs() const {
+    if (config_.contains("vad") && config_["vad"].contains("min_speech_duration_ms")) {
+        return config_["vad"]["min_speech_duration_ms"].get<int>();
+    }
+    return 2000;  // Default from AudioCapture.h:73 (updated in TASK2)
+}
+
+int Config::getVadMaxSpeechDurationMs() const {
+    if (config_.contains("vad") && config_["vad"].contains("max_speech_duration_ms")) {
+        return config_["vad"]["max_speech_duration_ms"].get<int>();
+    }
+    return 30000;  // Default from AudioCapture.h:74 (updated in TASK2)
+}
+
 } // namespace secondvoice
--- a/src/utils/Config.h
+++ b/src/utils/Config.h
@ -1,6 +1,7 @@
 #pragma once

 #include <string>
+#include <nlohmann/json.hpp>

 namespace secondvoice {

@ -55,6 +56,10 @@ public:
    const std::string& getOpenAIKey() const { return openai_key_; }
    const std::string& getAnthropicKey() const { return anthropic_key_; }

+    int getVadSilenceDurationMs() const;
+    int getVadMinSpeechDurationMs() const;
+    int getVadMaxSpeechDurationMs() const;
+
 private:
    Config() = default;
    Config(const Config&) = delete;
@ -68,6 +73,7 @@ private:

    std::string openai_key_;
    std::string anthropic_key_;
+    nlohmann::json config_;
 };

 } // namespace secondvoice
--- a/src/utils/SessionLogger.cpp
+++ b/src/utils/SessionLogger.cpp
@ -0,0 +1,201 @@
+#include "SessionLogger.h"
+#include <nlohmann/json.hpp>
+#include <filesystem>
+#include <iostream>
+#include <iomanip>
+#include <sstream>
+
+namespace secondvoice {
+
+using json = nlohmann::json;
+
+SessionLogger::SessionLogger() = default;
+
+SessionLogger::~SessionLogger() {
+    if (is_active_) {
+        endSession();
+    }
+}
+
+std::string SessionLogger::getCurrentTimestamp() const {
+    // Format the local wall-clock time at one-second resolution; the result is
+    // used both as the session directory name and as a per-segment timestamp.
+    auto now = std::chrono::system_clock::now();
+    auto now_c = std::chrono::system_clock::to_time_t(now);
+
+    std::stringstream ss;
+    ss << std::put_time(std::localtime(&now_c), "%Y-%m-%d_%H%M%S");
+    return ss.str();
+}
+
+void SessionLogger::startSession() {
+    if (is_active_) {
+        endSession();
+    }
+
+    session_start_time_ = getCurrentTimestamp();
+    session_path_ = "./sessions/" + session_start_time_;
+
+    // Create directories
+    std::filesystem::create_directories(session_path_ + "/segments");
+
+    is_active_ = true;
+    segment_count_ = 0;
+    filtered_count_ = 0;
+    total_audio_sec_ = 0.0f;
+    total_whisper_ms_ = 0;
+    total_claude_ms_ = 0;
+    segments_.clear();
+
+    std::cout << "[Session] Started: " << session_path_ << std::endl;
+}
+
+void SessionLogger::endSession() {
+    if (!is_active_) return;
+
+    writeSessionJson();
+    is_active_ = false;
+
+    std::cout << "[Session] Ended: " << segment_count_ << " segments, "
+              << filtered_count_ << " filtered, "
+              << total_audio_sec_ << "s audio" << std::endl;
+}
+
+void SessionLogger::logSegment(const SegmentLog& segment) {
+    if (!is_active_) return;
+
+    // Update counters
+    segment_count_++;
+    total_audio_sec_ += segment.audio_duration_sec;
+    total_whisper_ms_ += segment.whisper_latency_ms;
+    total_claude_ms_ += segment.claude_latency_ms;
+
+    // Store segment
+    segments_.push_back(segment);
+
+    // Write individual segment JSON
+    std::stringstream filename;
+    filename << session_path_ << "/segments/"
+             << std::setfill('0') << std::setw(3) << segment.id << ".json";
+
+    json j;
+    j["id"] = segment.id;
+    j["chinese"] = segment.chinese;
+    j["french"] = segment.french;
+    j["audio_duration_sec"] = segment.audio_duration_sec;
+    j["audio_rms"] = segment.audio_rms;
+    j["whisper_latency_ms"] = segment.whisper_latency_ms;
+    j["claude_latency_ms"] = segment.claude_latency_ms;
+    j["was_filtered"] = segment.was_filtered;
+    j["filter_reason"] = segment.filter_reason;
+    j["timestamp"] = segment.timestamp.empty() ? getCurrentTimestamp() : segment.timestamp;  // fill in when the caller left it empty
+    j["vad_metrics"] = {
+        {"speech_duration_ms", segment.speech_duration_ms},
+        {"silence_duration_ms", segment.silence_duration_ms},
+        {"flush_reason", segment.flush_reason}
+    };
+
+    std::ofstream file(filename.str());
+    if (file.is_open()) {
+        file << j.dump(2);
+        file.close();
+    }
+
+    std::cout << "[Session] Logged segment #" << segment.id
+              << " (" << segment.audio_duration_sec << "s)" << std::endl;
+}
+
+void SessionLogger::logFilteredSegment(const std::string& chinese, const std::string& reason,
+                                       float audio_duration, float audio_rms) {
+    if (!is_active_) return;
+
+    filtered_count_++;
+    total_audio_sec_ += audio_duration;
+
+    // Log filtered segment with special marker
+    SegmentLog seg;
+    seg.id = segment_count_ + filtered_count_;
+    seg.chinese = chinese;
+    seg.french = "[FILTERED]";
+    seg.audio_duration_sec = audio_duration;
+    seg.audio_rms = audio_rms;
+    seg.whisper_latency_ms = 0;
+    seg.claude_latency_ms = 0;
+    seg.was_filtered = true;
+    seg.filter_reason = reason;
+    seg.timestamp = getCurrentTimestamp();
+
+    segments_.push_back(seg);
+
+    // Write filtered segment JSON
+    std::stringstream filename;
+    filename << session_path_ << "/segments/"
+             << std::setfill('0') << std::setw(3) << seg.id << "_filtered.json";
+
+    json j;
+    j["id"] = seg.id;
+    j["chinese"] = seg.chinese;
+    j["filter_reason"] = reason;
+    j["audio_duration_sec"] = audio_duration;
+    j["audio_rms"] = audio_rms;
+    j["timestamp"] = seg.timestamp;
+
+    std::ofstream file(filename.str());
+    if (file.is_open()) {
+        file << j.dump(2);
+        file.close();
+    }
+}
+
+void SessionLogger::writeSessionJson() {
+    json session;
+    session["start_time"] = session_start_time_;
+    session["end_time"] = getCurrentTimestamp();
+    session["total_segments"] = segment_count_;
+    session["filtered_segments"] = filtered_count_;
+    session["total_audio_seconds"] = total_audio_sec_;
+    session["avg_whisper_latency_ms"] = segment_count_ > 0 ?
+        total_whisper_ms_ / segment_count_ : 0;
+    session["avg_claude_latency_ms"] = segment_count_ > 0 ?
+        total_claude_ms_ / segment_count_ : 0;
+
+    // Summary of all segments
+    json segments_summary = json::array();
+    for (const auto& seg : segments_) {
+        json s;
+        s["id"] = seg.id;
+        s["chinese"] = seg.chinese;
+        s["french"] = seg.french;
+        s["duration"] = seg.audio_duration_sec;
+        s["filtered"] = seg.was_filtered;
+        if (seg.was_filtered) {
+            s["filter_reason"] = seg.filter_reason;
+        }
+        segments_summary.push_back(s);
+    }
+    session["segments"] = segments_summary;
+
+    std::string filepath = session_path_ + "/session.json";
+    std::ofstream file(filepath);
+    if (file.is_open()) {
+        file << session.dump(2);
+        file.close();
+        std::cout << "[Session] Wrote " << filepath << std::endl;
+    }
+
+    // Also write plain text transcript
+    std::string transcript_path = session_path_ + "/transcript.txt";
+    std::ofstream transcript(transcript_path);
+    if (transcript.is_open()) {
+        transcript << "=== SecondVoice Session " << session_start_time_ << " ===\n\n";
+        for (const auto& seg : segments_) {
+            if (!seg.was_filtered) {
+                transcript << "CN: " << seg.chinese << "\n";
+                transcript << "FR: " << seg.french << "\n\n";
+            }
+        }
+        transcript.close();
+    }
+}
+
+} // namespace secondvoice
--- a/src/utils/SessionLogger.h
+++ b/src/utils/SessionLogger.h
@ -0,0 +1,68 @@
+#pragma once
+
+#include <string>
+#include <vector>
+#include <chrono>
+#include <fstream>
+
+namespace secondvoice {
+
+struct SegmentLog {
+    int id;
+    std::string chinese;
+    std::string french;
+    float audio_duration_sec;
+    float audio_rms;
+    int64_t whisper_latency_ms;
+    int64_t claude_latency_ms;
+    bool was_filtered;
+    std::string filter_reason;
+    std::string timestamp;
+
+    // VAD metrics (added for TASK8)
+    int speech_duration_ms = 0;
+    int silence_duration_ms = 0;
+    std::string flush_reason = "";
+};
+
+class SessionLogger {
+public:
+    SessionLogger();
+    ~SessionLogger();
+
+    // Start a new session (creates directory)
+    void startSession();
+
+    // End session (writes session.json summary)
+    void endSession();
+
+    // Log a segment
+    void logSegment(const SegmentLog& segment);
+
+    // Log a filtered/skipped segment
+    void logFilteredSegment(const std::string& chinese, const std::string& reason,
+                           float audio_duration, float audio_rms);
+
+    // Get current session path
+    std::string getSessionPath() const { return session_path_; }
+
+    // Check if session is active
+    bool isActive() const { return is_active_; }
+
+private:
+    std::string getCurrentTimestamp() const;
+    void writeSessionJson();
+
+    bool is_active_ = false;
+    std::string session_path_;
+    std::string session_start_time_;
+    int segment_count_ = 0;
+    int filtered_count_ = 0;
+    float total_audio_sec_ = 0.0f;
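+    int64_t total_whisper_ms_ = 0;  // same width as SegmentLog::whisper_latency_ms
+    int64_t total_claude_ms_ = 0;   // same width as SegmentLog::claude_latency_ms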
+
+    std::vector<SegmentLog> segments_;
+};
+
+} // namespace secondvoice
Author	SHA1	Message	Date
StillHammer	e8dd7f840e	feat: Add VAD metrics tracking to session logs	2025-12-02 10:03:20 +08:00
StillHammer	a1b4e335c8	chore: Ignore .claudiomiro directory	2025-12-02 09:54:39 +08:00
StillHammer	aac5602722	refactor: Add VAD configuration accessors to Config class	2025-12-02 09:53:53 +08:00
StillHammer	49f9cb906e	tune: Extend VAD speech duration and improve context prompt formatting	2025-12-02 09:48:44 +08:00
StillHammer	db0f8e5990	refactor: Improve VAD trailing silence detection and update docs - Replace hang time logic with consecutive silence frame counter for more precise speech end detection - Update Whisper prompt to utilize previous context for better transcription coherence - Expand README with comprehensive feature list, architecture details, debugging status, and session logging structure - Add troubleshooting section for real-world testing conditions and known issues	2025-12-02 09:44:06 +08:00
Trouve Alexis	a28bb89913	tune: Adjust VAD parameters for longer segments - min_speech_duration: 300ms → 1000ms (avoid tiny segments) - silence_duration: 400ms → 700ms (wait longer before cutting) - hang_frames_threshold: 20 → 35 (~350ms pause tolerance) This should reduce mid-sentence cuts and give Whisper more context. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-23 22:08:01 +08:00
Trouve Alexis	53b21b94d6	feat: Add dynamic Whisper context and audio segment saving - Pass last 3 transcriptions to Whisper for better context - Add deduplication filter to skip identical consecutive segments - Save each audio segment as .ogg in session directory for debugging - Add queue debug logging to detect double-push issues 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-23 22:02:49 +08:00
Trouve Alexis	9baa213a82	feat: Add session logging system with per-segment metrics - Add SessionLogger class for structured debug logging - Log each segment with: chinese, french, audio duration, RMS, latency - Track filtered segments with reasons (hallucination, empty, failed) - Create session directories with JSON files per segment - Update Whisper prompt with anti-hallucination rules - Integrate timing measurements for Whisper and Claude calls 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-23 21:37:55 +08:00
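
The dynamic Whisper context and the deduplication filter described in commit 53b21b94d6 can be sketched in isolation. The block below is a minimal illustration, not the code in Pipeline.cpp: the `ContextWindow` class, its method names, the English context header, and the choice to reject empty strings are all assumptions made for the example. Only the window size of 3 and the numbered-list prompt layout come from the diff above.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <sstream>
#include <string>

// Hypothetical helper, not part of the SecondVoice sources: keeps the last
// `max_segments` transcriptions and rejects consecutive duplicates.
class ContextWindow {
public:
    explicit ContextWindow(std::size_t max_segments) : max_segments_(max_segments) {
        assert(max_segments_ > 0);
    }

    // Returns false (and stores nothing) when `text` is empty or repeats the
    // previously accepted transcription. Rejecting empty text is an assumption.
    bool push(const std::string& text) {
        if (text.empty() || text == last_) {
            return false;
        }
        last_ = text;
        recent_.push_back(text);
        if (recent_.size() > max_segments_) {
            recent_.pop_front();  // drop the oldest entry
        }
        return true;
    }

    // Appends a numbered list of the retained transcriptions to the base prompt.
    std::string buildPrompt(const std::string& base_prompt) const {
        if (recent_.empty()) {
            return base_prompt;
        }
        std::ostringstream out;
        out << base_prompt << "\n\nContext from previous sentences:\n";
        for (std::size_t i = 0; i < recent_.size(); ++i) {
            out << (i + 1) << ". " << recent_[i] << "\n";
        }
        return out.str();
    }

    std::size_t size() const { return recent_.size(); }

private:
    std::size_t max_segments_;
    std::deque<std::string> recent_;
    std::string last_;
};
```

A `std::deque` is used here so that dropping the oldest entry is O(1); the diff uses a `std::vector` with `erase(begin())`, which is equally reasonable for a window of three elements.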