Compare commits


No commits in common. "e8dd7f840e5305ab15440bedf930dc1104b1486d" and "a28bb89913faea24b7c24b63a84b04bb05179d62" have entirely different histories.

10 changed files with 85 additions and 413 deletions

.gitignore

@@ -68,7 +68,6 @@ sessions/
 # Claude Code local settings
 .claude/settings.local.json
-.claudiomiro/
 # Build scripts (local)
 run_build.ps1

README.md

@@ -4,50 +4,16 @@ Real-time Chinese to French translation system for live meetings.
 ## Overview
-SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
-### Why This Project?
-Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:
-- Business meetings with Chinese speakers
-- Family/administrative calls
-- Professional conferences
-- Any live Chinese conversation where real-time comprehension is needed
-**Status**: MVP complete, actively being debugged and improved based on real-world usage.
-## Quick Start
-### Windows (MinGW) - Recommended
-```batch
-# First-time setup
-.\setup_mingw.bat
-# Build
-.\build_mingw.bat
-# Run
-cd build\mingw-Release
-SecondVoice.exe
-```
-**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.
-See full setup instructions below for other platforms.
+SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API, and translates it to French using Claude AI in real-time. Perfect for understanding Chinese meetings on the fly.
 ## Features
-- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
-- 🔇 **Noise reduction** using RNNoise neural network
-- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
-- 🧠 **Hallucination filtering** - removes known Whisper artifacts
-- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
-- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
-- 💾 **Full session recording** with structured logging
-- 📊 **Session archival** - audio, transcripts, translations, and metadata
-- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
-- ⚙️ **Configurable settings** via config.json
+- 🎤 Real-time audio capture
+- 🗣️ Chinese speech-to-text (Whisper API)
+- 🌐 Chinese to French translation (Claude API)
+- 🖥️ Clean ImGui interface
+- 💾 Full recording saved to disk
+- ⚙️ Configurable chunk sizes and settings
 ## Requirements
@@ -150,138 +116,20 @@ The application will:
 ## Architecture
 ```
-Audio Input (16kHz mono)
-Voice Activity Detection (VAD) - RMS + Peak thresholds
-Noise Reduction (RNNoise) - 16→48→16 kHz resampling
-Opus Encoding (24kbps OGG) - 46x compression
-Whisper API (gpt-4o-mini-transcribe) - Chinese STT
-Hallucination Filter - Remove known artifacts
-Claude API (claude-haiku-4) - Chinese → French translation
-ImGui UI Display + Session Logging
+Audio Capture (PortAudio)
+Whisper API (Speech-to-Text)
+Claude API (Translation)
+ImGui UI (Display)
 ```
-### Threading Model (3 threads)
-1. **Audio Thread** (`Pipeline::audioThread`)
-   - PortAudio callback captures 16kHz mono audio
-   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
-   - Pushes speech chunks to processing queue
-2. **Processing Thread** (`Pipeline::processingThread`)
-   - Consumes audio chunks from queue
-   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
-   - Encodes to Opus/OGG for bandwidth efficiency
-   - Calls Whisper API for Chinese transcription
-   - Filters known hallucinations (YouTube phrases, music markers, etc.)
-   - Calls Claude API for French translation
-   - Logs to session files
-3. **UI Thread** (main)
-   - GLFW/ImGui rendering loop (must run on main thread)
-   - Displays real-time transcription and translation
-   - Allows runtime VAD threshold adjustment
-   - Handles user controls (stop recording, etc.)
+### Threading Model
+- **Thread 1**: Audio capture (PortAudio callback)
+- **Thread 2**: AI processing (Whisper + Claude API calls)
+- **Thread 3**: UI rendering (ImGui + OpenGL)
-### Core Components
-**Audio Processing**:
-- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
-- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
-- `NoiseReducer.cpp` - RNNoise denoising with resampling
-**API Clients**:
-- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
-- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
-- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
-**Core Logic**:
-- `Pipeline.cpp` - Orchestrates audio → transcription → translation flow
-- `TranslationUI.cpp` - ImGui interface with VAD controls
-**Utilities**:
-- `Config.cpp` - Loads config.json + .env
-- `ThreadSafeQueue.h` - Lock-free queue for audio chunks
-## Known Issues & Active Debugging
-**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).
-### Current Problems
-Based on transcript analysis from actual meetings (November 2025):
-1. **VAD cutting speech too early**
-   - Voice Activity Detection triggers end-of-segment prematurely
-   - Results in fragmented phrases ("我很。" → "Je suis.")
-   - **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios
-2. **Segments too short for context**
-   - Whisper receives insufficient audio context for accurate Chinese transcription
-   - Single-word or two-word segments lack conversational context
-   - **Impact**: Lower accuracy, especially with homonyms
-3. **Ambient noise interpreted as speech**
-   - Background sounds trigger false VAD positives
-   - Test transcript shows "太多声音了" (too much noise) being captured
-   - **Mitigation**: RNNoise helps but not sufficient for very noisy environments
-4. **Loss of inter-segment context**
-   - Each audio chunk processed independently
-   - Whisper cannot use previous context for better transcription
-   - **Potential solution**: Pass previous 2-3 transcriptions in prompt
-### Test Conditions
-Testing has been performed under **deliberately degraded conditions** to ensure robustness:
-- Multiple simultaneous speakers
-- Variable microphone distance
-- Variable volume levels
-- Fast-paced conversations
-- Low-quality microphone
-These conditions are intentionally harsh to validate real-world meeting scenarios.
-### Debug Plan
-See `PLAN_DEBUG.md` for:
-- Detailed session logging implementation (JSON per segment + metadata)
-- Improved Whisper prompt engineering
-- VAD threshold tuning recommendations
-- Context propagation strategies
-## Session Logging
-### Structure
-```
-sessions/
-└── YYYY-MM-DD_HHMMSS/
-    ├── session.json     # Session metadata
-    ├── segments/
-    │   ├── 001.json     # Segment: Chinese + French + metadata
-    │   ├── 002.json
-    │   └── ...
-    └── transcript.txt   # Final export
-```
-### Segment Format
-```json
-{
-  "id": 1,
-  "chinese": "两个老鼠求我",
-  "french": "Deux souris me supplient"
-}
-```
-**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
 ## Configuration
@@ -295,9 +143,8 @@ sessions/
     "chunk_duration_seconds": 10
   },
   "whisper": {
-    "model": "gpt-4o-mini-transcribe",
-    "language": "zh",
-    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
+    "model": "whisper-1",
+    "language": "zh"
   },
   "claude": {
     "model": "claude-haiku-4-20250514",
@@ -319,33 +166,23 @@ ANTHROPIC_API_KEY=sk-ant-...
 - **Claude Haiku**: ~$0.03-0.05/hour
 - **Total**: ~$0.40/hour of recording
-## Advanced Features
-### GPU Forcing (Hybrid Graphics Systems)
-`main.cpp` exports symbols to force dedicated GPU on Optimus/PowerXpress systems:
-- `NvOptimusEnablement` - Forces NVIDIA GPU
-- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU
-Critical for laptops with both integrated and dedicated GPUs.
-### Hallucination Filtering
-`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
-- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
-- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
-- Music symbols: "♪♪", "🎵"
-- Silence markers: "...", "silence", "inaudible"
-These are automatically filtered before translation to avoid wasting API calls.
-### Console-Only Build
-A `SecondVoice_Console` target exists for headless testing:
-- Uses `main_console.cpp`
-- No ImGui/GLFW dependencies
-- Outputs transcriptions to stdout
-- Useful for debugging and automated testing
+## Project Structure
+```
+secondvoice/
+├── src/
+│   ├── main.cpp      # Entry point
+│   ├── audio/        # Audio capture & buffer
+│   ├── api/          # Whisper & Claude clients
+│   ├── ui/           # ImGui interface
+│   ├── utils/        # Config & thread-safe queue
+│   └── core/         # Pipeline orchestration
+├── docs/             # Documentation
+├── recordings/       # Output recordings
+├── config.json       # Runtime configuration
+├── .env              # API keys (not committed)
+└── CMakeLists.txt    # Build configuration
+```
 ## Development
@@ -382,101 +219,30 @@ cmake --build build
 - Check all system dependencies are installed
 - Try `cmake --build build --clean-first`
-## Project Structure
-```
-secondvoice/
-├── src/
-│   ├── main.cpp              # Entry point, forces NVIDIA GPU
-│   ├── core/
-│   │   └── Pipeline.cpp      # Audio→Transcription→Translation orchestration
-│   ├── audio/
-│   │   ├── AudioCapture.cpp  # PortAudio + VAD segmentation
-│   │   ├── AudioBuffer.cpp   # Sample accumulation, WAV/Opus export
-│   │   └── NoiseReducer.cpp  # RNNoise (16→48→16 kHz)
-│   ├── api/
-│   │   ├── WhisperClient.cpp # OpenAI Whisper (multipart/form-data)
-│   │   ├── ClaudeClient.cpp  # Anthropic Claude (JSON)
-│   │   └── WinHttpClient.cpp # Native Windows HTTP
-│   ├── ui/
-│   │   └── TranslationUI.cpp # ImGui interface + VAD controls
-│   └── utils/
-│       ├── Config.cpp        # config.json + .env loader
-│       └── ThreadSafeQueue.h # Lock-free audio queue
-├── docs/                     # Build guides
-├── sessions/                 # Session recordings + logs
-├── recordings/               # Legacy recordings directory
-├── denoised/                 # Denoised audio outputs
-├── config.json               # Runtime configuration
-├── .env                      # API keys (not committed)
-├── CLAUDE.md                 # Development guide for Claude Code
-├── PLAN_DEBUG.md             # Active debugging plan
-└── CMakeLists.txt            # Build configuration
-```
-### External Dependencies
-**Fetched via CMake FetchContent**:
-- ImGui v1.90.1 - UI framework
-- Opus v1.5.2 - Audio encoding
-- Ogg v1.3.6 - Container format
-- RNNoise v0.1.1 - Neural network noise reduction
-**vcpkg Dependencies** (x64-mingw-static triplet):
-- portaudio - Cross-platform audio I/O
-- nlohmann_json - JSON parsing
-- glfw3 - Windowing/input
-- glad - OpenGL loader
 ## Roadmap
-### Phase 1 - MVP ✅ (Complete)
-- ✅ Audio capture with VAD
-- ✅ Noise reduction (RNNoise)
-- ✅ Whisper API integration
-- ✅ Claude API integration
-- ✅ ImGui UI with runtime VAD adjustment
-- ✅ Opus compression
-- ✅ Hallucination filtering
-- ✅ Session recording
+### Phase 1 - MVP (Current)
+- ✅ Audio capture
+- ✅ Whisper integration
+- ✅ Claude integration
+- ✅ ImGui UI
+- ✅ Stop button
-### Phase 2 - Debugging 🔄 (Current)
-- 🔄 Session logging (JSON per segment)
-- 🔄 Improved Whisper prompt engineering
-- 🔄 VAD threshold optimization
-- 🔄 Context propagation between segments
-- ⬜ Automated testing with sample audio
-### Phase 3 - Enhancement
-- ⬜ Auto-summary post-meeting (Claude analysis)
-- ⬜ Full-text search (SQLite FTS5)
-- ⬜ Semantic search (embeddings)
+### Phase 2 - Enhancement
+- ⬜ Auto-summary post-meeting
+- ⬜ Export transcripts
+- ⬜ Search functionality
 - ⬜ Speaker diarization
-- ⬜ Replay mode with synced transcripts
+- ⬜ Replay mode
-- ⬜ Multi-language support extension
-## Development Documentation
-- **CLAUDE.md** - Development guide for Claude Code AI assistant
-- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
-- **WINDOWS_BUILD.md** - Detailed Windows build instructions
-- **WINDOWS_MINGW.md** - MinGW-specific build guide
-- **WINDOWS_QUICK_START.md** - Quick start for Windows users
-## Contributing
-This is a personal project built to solve a real need. Bug reports and suggestions welcome:
-**Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
-**Architecture**: See `CLAUDE.md` for detailed system design
 ## License
 See LICENSE file.
-## Acknowledgments
-- OpenAI Whisper for excellent Chinese transcription
-- Anthropic Claude for context-aware translation
-- RNNoise for neural network-based noise reduction
-- ImGui for clean, immediate-mode UI
+## Contributing
+This is a personal project, but suggestions and bug reports are welcome via issues.
+## Contact
+See docs/SecondVoice.md for project context and motivation.
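For context on the three-thread design that the new README compresses into three bullets: a minimal C++ sketch of that producer/consumer layout, using a simplified blocking queue. Names and stubs here are illustrative stand-ins, not the project's actual `Pipeline`, `ThreadSafeQueue`, or API clients.

```cpp
// Minimal sketch of the 3-thread pipeline (assumed shape, not the real code).
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(item)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }
private:
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

int main() {
    BlockingQueue<std::vector<float>> chunks; // VAD-segmented speech chunks

    // Thread 1: audio capture pushes segmented chunks (stubbed here).
    std::thread audio([&] { chunks.push(std::vector<float>(16000, 0.0f)); });

    // Thread 2: processing pops a chunk and would call Whisper, then Claude.
    std::thread processing([&] { auto chunk = chunks.pop(); (void)chunk; });

    audio.join();
    processing.join();
    // Thread 3 (the main thread) would run the ImGui/GLFW render loop here.
}
```

The project's `ThreadSafeQueue.h` is described as lock-free; a mutex/condition-variable queue like this is simply the shortest stand-in with the same push/pop interface.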

config.json

@@ -6,16 +6,11 @@
     "chunk_step_seconds": 5,
     "format": "ogg"
   },
-  "vad": {
-    "silence_duration_ms": 700,
-    "min_speech_duration_ms": 2000,
-    "max_speech_duration_ms": 30000
-  },
   "whisper": {
     "model": "gpt-4o-mini-transcribe",
     "language": "zh",
     "temperature": 0.0,
-    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. Si un contexte de phrases précédentes est fourni, utilise-le pour maintenir la cohérence (noms propres, sujets, terminologie). RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
+    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
     "stream": false,
     "response_format": "text"
   },

src/audio/AudioCapture.cpp

@@ -4,15 +4,9 @@
 namespace secondvoice {
-AudioCapture::AudioCapture(int sample_rate, int channels,
-                           int silence_duration_ms,
-                           int min_speech_duration_ms,
-                           int max_speech_duration_ms)
+AudioCapture::AudioCapture(int sample_rate, int channels)
     : sample_rate_(sample_rate)
     , channels_(channels)
-    , silence_duration_ms_(silence_duration_ms)
-    , min_speech_duration_ms_(min_speech_duration_ms)
-    , max_speech_duration_ms_(max_speech_duration_ms)
     , noise_reducer_(std::make_unique<NoiseReducer>()) {
     std::cout << "[Audio] Noise reduction enabled (RNNoise)" << std::endl;
 }
@@ -141,12 +135,16 @@ int AudioCapture::audioCallback(const void* input, void* output,
     // Speech = energy OK AND (ZCR OK or very high energy)
     bool frame_has_speech = energy_ok && (zcr_ok || denoised_rms > adaptive_rms_thresh * 3.0f);
-    // Reset trailing silence counter when speech detected
+    // Hang time logic: don't immediately cut on silence
     if (frame_has_speech) {
-        self->consecutive_silence_frames_ = 0;
+        self->hang_frames_ = self->hang_frames_threshold_; // Reset hang counter
+    } else if (self->hang_frames_ > 0) {
+        self->hang_frames_--;
+        frame_has_speech = true; // Keep "speaking" during hang time
     }
     // Calculate durations in samples
+    int silence_samples_threshold = (self->silence_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int min_speech_samples = (self->min_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int max_speech_samples = (self->max_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
@@ -172,11 +170,6 @@ int AudioCapture::audioCallback(const void* input, void* output,
     std::cout << "[VAD] Max duration reached, forcing flush ("
               << self->speech_samples_count_ / (self->sample_rate_ * self->channels_) << "s)" << std::endl;
-    // Calculate metrics BEFORE flushing
-    self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
-    self->last_silence_duration_ms_ = 0; // No trailing silence in forced flush
-    self->last_flush_reason_ = "max_duration";
     if (self->callback_ && self->speech_buffer_.size() >= static_cast<size_t>(min_speech_samples)) {
         // Flush any remaining samples from the denoiser
         if (self->noise_reducer_ && self->noise_reducer_->isEnabled()) {
@@ -190,17 +183,16 @@ int AudioCapture::audioCallback(const void* input, void* output,
         }
         self->speech_buffer_.clear();
         self->speech_samples_count_ = 0;
-        self->consecutive_silence_frames_ = 0; // Reset after forced flush
         // Reset stream for next segment
         if (self->noise_reducer_) {
             self->noise_reducer_->resetStream();
         }
     }
 } else {
-    // Silence detected
+    // True silence (after hang time expired)
     self->silence_samples_count_ += sample_count;
-    // If we were speaking and now have silence, track consecutive silence frames
+    // If we were speaking and now have enough silence, flush
     if (self->speech_buffer_.size() > 0) {
         // Add trailing silence (denoised)
         if (!denoised_samples.empty()) {
@@ -212,23 +204,9 @@ int AudioCapture::audioCallback(const void* input, void* output,
             }
         }
-        // Increment consecutive silence frame counter
-        self->consecutive_silence_frames_++;
-        // Calculate threshold in frames (callbacks)
-        // frames_per_buffer = frame_count from callback
-        int frames_per_buffer = static_cast<int>(frame_count);
-        int silence_threshold_frames = (self->silence_duration_ms_ * self->sample_rate_) / (1000 * frames_per_buffer);
-        // Flush when consecutive silence exceeds threshold
-        if (self->consecutive_silence_frames_ >= silence_threshold_frames) {
+        if (self->silence_samples_count_ >= silence_samples_threshold) {
             self->is_speech_active_.store(false, std::memory_order_relaxed);
-            // Calculate metrics BEFORE flushing
-            self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
-            self->last_silence_duration_ms_ = (self->silence_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
-            self->last_flush_reason_ = "silence_threshold";
             // Flush if we have enough speech
             if (self->speech_samples_count_ >= min_speech_samples) {
                 // Flush any remaining samples from the denoiser
@@ -242,9 +220,7 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 float duration = static_cast<float>(self->speech_buffer_.size()) /
                                  (self->sample_rate_ * self->channels_);
-                std::cout << "[VAD] Speech ended (trailing silence detected, "
-                          << self->consecutive_silence_frames_ << " frames, "
-                          << "noise_floor=" << self->noise_floor_
+                std::cout << "[VAD] Speech ended (noise_floor=" << self->noise_floor_
                           << "), flushing " << duration << "s (denoised)" << std::endl;
                 if (self->callback_) {
@@ -257,7 +233,6 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 self->speech_buffer_.clear();
                 self->speech_samples_count_ = 0;
-                self->consecutive_silence_frames_ = 0; // Reset after flush
                 // Reset stream for next segment
                 if (self->noise_reducer_) {
                     self->noise_reducer_->resetStream();
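The change above swaps the consecutive-silence-frame counter for hang-time smoothing: every speech frame re-arms a countdown, and silence only starts accumulating once the countdown expires. A self-contained sketch of just that mechanism; the threshold is taken from the header, the frame duration is an assumption:

```cpp
// Illustrative hang-time ("hangover") smoothing of a frame-wise VAD decision.
#include <cstdio>
#include <vector>

int main() {
    const int hang_frames_threshold = 35; // ~350ms if one frame is ~10ms
    int hang_frames = 0;

    // Raw per-frame VAD output with short false gaps inside one utterance.
    std::vector<bool> raw_vad = {true, true, false, false, true, false, false};
    for (bool frame_has_speech : raw_vad) {
        if (frame_has_speech) {
            hang_frames = hang_frames_threshold; // re-arm on every speech frame
        } else if (hang_frames > 0) {
            hang_frames--;
            frame_has_speech = true; // brief pauses stay inside the segment
        }
        std::printf("%d", frame_has_speech ? 1 : 0);
    }
    std::printf("\n"); // prints 1111111: short gaps no longer split the segment
}
```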

src/audio/AudioCapture.h

@@ -16,10 +16,7 @@ class AudioCapture {
 public:
     using AudioCallback = std::function<void(const std::vector<float>&)>;
-    AudioCapture(int sample_rate, int channels,
-                 int silence_duration_ms = 700,
-                 int min_speech_duration_ms = 2000,
-                 int max_speech_duration_ms = 30000);
+    AudioCapture(int sample_rate, int channels);
     ~AudioCapture();
     bool initialize();
@@ -47,11 +44,6 @@
     void setDenoiseEnabled(bool enabled);
     bool isDenoiseEnabled() const;
-    // Get metrics from last flushed segment
-    int getLastSpeechDuration() const { return last_speech_duration_ms_; }
-    int getLastSilenceDuration() const { return last_silence_duration_ms_; }
-    std::string getLastFlushReason() const { return last_flush_reason_; }
 private:
     static int audioCallback(const void* input, void* output,
                              unsigned long frame_count,
@@ -77,21 +69,17 @@
     // VAD parameters - Higher threshold to avoid false triggers on filtered noise
     std::atomic<float> vad_rms_threshold_{0.02f};  // Was 0.01f
     std::atomic<float> vad_peak_threshold_{0.08f}; // Was 0.04f
-    int silence_duration_ms_;    // Wait 700ms of silence before cutting (was 400)
-    int min_speech_duration_ms_; // Minimum 2s speech to send (was 1000)
-    int max_speech_duration_ms_; // 30s max before forced flush (was 25000)
+    int silence_duration_ms_ = 700;     // Wait 700ms of silence before cutting (was 400)
+    int min_speech_duration_ms_ = 1000; // Minimum 1s speech to send (was 300)
+    int max_speech_duration_ms_ = 25000; // 25s max before forced flush
     // Adaptive noise floor
     float noise_floor_ = 0.005f;       // Estimated background noise level
     float noise_floor_alpha_ = 0.001f; // Slower adaptation
-    // Trailing silence detection - count consecutive silence frames after speech
-    int consecutive_silence_frames_ = 0;
+    // Hang time - wait before cutting to avoid mid-sentence cuts
+    int hang_frames_ = 0;
+    int hang_frames_threshold_ = 35; // ~350ms tolerance for pauses (was 20)
-    // Metrics for last flushed segment (set in callback, read in processing thread)
-    int last_speech_duration_ms_ = 0;
-    int last_silence_duration_ms_ = 0;
-    std::string last_flush_reason_;
     // Zero-crossing rate for speech vs noise discrimination
     float last_zcr_ = 0.0f;
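With the thresholds hard-coded as members again, the remaining tuning is arithmetic. A worked example of the ms-to-sample conversions the callback applies to these fields, assuming the 16 kHz mono defaults from the README; the ~10 ms per frame figure behind the `hang_frames_threshold_` comment is an assumption, not something stated in the diff:

```cpp
// Worked example of the duration conversions used by the audio callback.
#include <cstdio>

int main() {
    const int sample_rate = 16000, channels = 1; // README defaults (assumed)
    auto ms_to_samples = [&](int ms) { return ms * sample_rate * channels / 1000; };

    std::printf("silence threshold: %d samples\n", ms_to_samples(700));   // 11200
    std::printf("min speech:        %d samples\n", ms_to_samples(1000));  // 16000
    std::printf("max speech:        %d samples\n", ms_to_samples(25000)); // 400000
    // hang_frames_threshold_ = 35 counts callback frames, not samples:
    // at roughly 10 ms per frame that matches the "~350ms tolerance" comment.
}
```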

src/core/Pipeline.cpp

@@ -24,23 +24,12 @@ Pipeline::~Pipeline() {
 bool Pipeline::initialize() {
     auto& config = Config::getInstance();
-    // Load VAD parameters from config (with fallbacks if missing)
-    int silence_duration = config.getVadSilenceDurationMs();
-    int min_speech = config.getVadMinSpeechDurationMs();
-    int max_speech = config.getVadMaxSpeechDurationMs();
     // Initialize audio capture with VAD-based segmentation
     audio_capture_ = std::make_unique<AudioCapture>(
         config.getAudioConfig().sample_rate,
-        config.getAudioConfig().channels,
-        silence_duration,
-        min_speech,
-        max_speech
+        config.getAudioConfig().channels
     );
-    std::cout << "[Pipeline] VAD configured: silence=" << silence_duration
-              << "ms, min_speech=" << min_speech
-              << "ms, max_speech=" << max_speech << "ms" << std::endl;
     std::cout << "[Pipeline] VAD-based audio segmentation enabled" << std::endl;
     if (!audio_capture_->initialize()) {
@@ -406,10 +395,6 @@ void Pipeline::processingThread() {
     seg.was_filtered = false;
     seg.filter_reason = "";
     seg.timestamp = ""; // Will be set by logger
-    // Add VAD metrics from AudioCapture
-    seg.speech_duration_ms = audio_capture_->getLastSpeechDuration();
-    seg.silence_duration_ms = audio_capture_->getLastSilenceDuration();
-    seg.flush_reason = audio_capture_->getLastFlushReason();
     session_logger_.logSegment(seg);
     std::cout << "CN: " << text << std::endl;
@@ -483,11 +468,11 @@ std::string Pipeline::buildDynamicPrompt() const {
     // Build context from recent transcriptions
     std::stringstream context;
     context << base_prompt;
-    context << "\n\nContexte des phrases précédentes:\n";
+    context << "\n\nContexte des phrases précédentes: ";
     for (size_t i = 0; i < recent_transcriptions_.size(); ++i) {
-        context << std::to_string(i + 1) << ". "
-                << recent_transcriptions_[i] << "\n";
+        if (i > 0) context << " ";
+        context << recent_transcriptions_[i];
     }
     return context.str();
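The `buildDynamicPrompt()` change switches the injected context from a numbered list to a space-joined run of recent segments. A sketch of how a rolling window could feed it; the window size of 3 is an assumption, loosely based on the old README's note about passing the previous 2-3 transcriptions:

```cpp
// Sketch of a rolling transcription window feeding a dynamic Whisper prompt.
#include <deque>
#include <sstream>
#include <string>

std::deque<std::string> recent_transcriptions;

void remember(const std::string& text, size_t max_window = 3) {
    recent_transcriptions.push_back(text);
    if (recent_transcriptions.size() > max_window)
        recent_transcriptions.pop_front(); // drop the oldest segment
}

std::string buildPrompt(const std::string& base_prompt) {
    std::stringstream context;
    context << base_prompt << "\n\nContexte des phrases précédentes: ";
    for (size_t i = 0; i < recent_transcriptions.size(); ++i) {
        if (i > 0) context << " ";   // space-joined, as in the new code above
        context << recent_transcriptions[i];
    }
    return context.str();
}
```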

src/utils/Config.cpp

@@ -52,9 +52,10 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     std::cerr << "[Config] File opened successfully" << std::endl;
+    json config_json;
     try {
         std::cerr << "[Config] About to parse JSON..." << std::endl;
-        config_file >> config_;
+        config_file >> config_json;
         std::cerr << "[Config] JSON parsed successfully" << std::endl;
     } catch (const json::parse_error& e) {
         std::cerr << "Error parsing config.json: " << e.what() << std::endl;
@@ -65,8 +66,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     // Parse audio config
-    if (config_.contains("audio")) {
-        auto& audio = config_["audio"];
+    if (config_json.contains("audio")) {
+        auto& audio = config_json["audio"];
         audio_config_.sample_rate = audio.value("sample_rate", 16000);
         audio_config_.channels = audio.value("channels", 1);
         audio_config_.chunk_duration_seconds = audio.value("chunk_duration_seconds", 10);
@@ -75,8 +76,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     // Parse whisper config
-    if (config_.contains("whisper")) {
-        auto& whisper = config_["whisper"];
+    if (config_json.contains("whisper")) {
+        auto& whisper = config_json["whisper"];
         whisper_config_.model = whisper.value("model", "whisper-1");
         whisper_config_.language = whisper.value("language", "zh");
         whisper_config_.temperature = whisper.value("temperature", 0.0f);
@@ -86,8 +87,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     // Parse claude config
-    if (config_.contains("claude")) {
-        auto& claude = config_["claude"];
+    if (config_json.contains("claude")) {
+        auto& claude = config_json["claude"];
         claude_config_.model = claude.value("model", "claude-haiku-4-20250514");
         claude_config_.max_tokens = claude.value("max_tokens", 1024);
         claude_config_.temperature = claude.value("temperature", 0.3f);
@@ -95,8 +96,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     // Parse UI config
-    if (config_.contains("ui")) {
-        auto& ui = config_["ui"];
+    if (config_json.contains("ui")) {
+        auto& ui = config_json["ui"];
         ui_config_.window_width = ui.value("window_width", 800);
         ui_config_.window_height = ui.value("window_height", 600);
         ui_config_.font_size = ui.value("font_size", 16);
@@ -104,8 +105,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     // Parse recording config
-    if (config_.contains("recording")) {
-        auto& recording = config_["recording"];
+    if (config_json.contains("recording")) {
+        auto& recording = config_json["recording"];
         recording_config_.save_audio = recording.value("save_audio", true);
         recording_config_.output_directory = recording.value("output_directory", "./recordings");
     }
@@ -113,25 +114,4 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     return true;
 }
-int Config::getVadSilenceDurationMs() const {
-    if (config_.contains("vad") && config_["vad"].contains("silence_duration_ms")) {
-        return config_["vad"]["silence_duration_ms"].get<int>();
-    }
-    return 700; // Default from AudioCapture.h:72 (unchanged)
-}
-int Config::getVadMinSpeechDurationMs() const {
-    if (config_.contains("vad") && config_["vad"].contains("min_speech_duration_ms")) {
-        return config_["vad"]["min_speech_duration_ms"].get<int>();
-    }
-    return 2000; // Default from AudioCapture.h:73 (updated in TASK2)
-}
-int Config::getVadMaxSpeechDurationMs() const {
-    if (config_.contains("vad") && config_["vad"].contains("max_speech_duration_ms")) {
-        return config_["vad"]["max_speech_duration_ms"].get<int>();
-    }
-    return 30000; // Default from AudioCapture.h:74 (updated in TASK2)
-}
 } // namespace secondvoice

src/utils/Config.h

@@ -1,7 +1,6 @@
 #pragma once
 #include <string>
-#include <nlohmann/json.hpp>
 namespace secondvoice {
@@ -56,10 +55,6 @@
     const std::string& getOpenAIKey() const { return openai_key_; }
     const std::string& getAnthropicKey() const { return anthropic_key_; }
-    int getVadSilenceDurationMs() const;
-    int getVadMinSpeechDurationMs() const;
-    int getVadMaxSpeechDurationMs() const;
 private:
     Config() = default;
     Config(const Config&) = delete;
@@ -73,7 +68,6 @@
     std::string openai_key_;
     std::string anthropic_key_;
-    nlohmann::json config_;
 };
 } // namespace secondvoice

SessionLogger.cpp

@@ -89,11 +89,6 @@ void SessionLogger::logSegment(const SegmentLog& segment) {
     j["was_filtered"] = segment.was_filtered;
     j["filter_reason"] = segment.filter_reason;
     j["timestamp"] = segment.timestamp;
-    j["vad_metrics"] = {
-        {"speech_duration_ms", segment.speech_duration_ms},
-        {"silence_duration_ms", segment.silence_duration_ms},
-        {"flush_reason", segment.flush_reason}
-    };
     std::ofstream file(filename.str());
     if (file.is_open()) {
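With the `vad_metrics` block gone, each segment file is back to its core fields. A hedged sketch of the per-segment write path, using the field names visible in this diff and the zero-padded `segments/001.json` naming from the old README; the exact filename logic is assumed:

```cpp
// Sketch of writing one JSON file per segment with nlohmann::json.
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <nlohmann/json.hpp>

void writeSegment(int id, const std::string& chinese, const std::string& french,
                  const std::string& dir) {
    nlohmann::json j;
    j["id"] = id;
    j["chinese"] = chinese;
    j["french"] = french;

    // Zero-padded name matching the sessions/segments/001.json layout.
    std::ostringstream filename;
    filename << dir << "/" << std::setw(3) << std::setfill('0') << id << ".json";

    std::ofstream file(filename.str());
    if (file.is_open()) file << j.dump(2); // pretty-print with 2-space indent
}
```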

SessionLogger.h

@@ -18,11 +18,6 @@ struct SegmentLog {
     bool was_filtered;
     std::string filter_reason;
     std::string timestamp;
-    // VAD metrics (added for TASK8)
-    int speech_duration_ms = 0;
-    int silence_duration_ms = 0;
-    std::string flush_reason = "";
 };
 class SessionLogger {