Compare commits


5 Commits

SHA1 Message Date
e8dd7f840e feat: Add VAD metrics tracking to session logs 2025-12-02 10:03:20 +08:00
a1b4e335c8 chore: Ignore .claudiomiro directory 2025-12-02 09:54:39 +08:00
aac5602722 refactor: Add VAD configuration accessors to Config class 2025-12-02 09:53:53 +08:00
49f9cb906e tune: Extend VAD speech duration and improve context prompt formatting 2025-12-02 09:48:44 +08:00
db0f8e5990 refactor: Improve VAD trailing silence detection and update docs
- Replace hang time logic with consecutive silence frame counter for more precise speech end detection
- Update Whisper prompt to utilize previous context for better transcription coherence
- Expand README with comprehensive feature list, architecture details, debugging status, and session logging structure
- Add troubleshooting section for real-world testing conditions and known issues
2025-12-02 09:44:06 +08:00
10 changed files with 413 additions and 85 deletions

.gitignore (1 changed line)

@@ -68,6 +68,7 @@ sessions/
# Claude Code local settings
.claude/settings.local.json
+.claudiomiro/
# Build scripts (local)
run_build.ps1

README.md (334 changed lines)

@@ -4,16 +4,50 @@ Real-time Chinese to French translation system for live meetings.
## Overview
-SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API, and translates it to French using Claude AI in real-time. Perfect for understanding Chinese meetings on the fly.
+SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
### Why This Project?
Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:
- Business meetings with Chinese speakers
- Family/administrative calls
- Professional conferences
- Any live Chinese conversation where real-time comprehension is needed
**Status**: MVP complete, actively being debugged and improved based on real-world usage.
## Quick Start
### Windows (MinGW) - Recommended
```batch
# First-time setup
.\setup_mingw.bat
# Build
.\build_mingw.bat
# Run
cd build\mingw-Release
SecondVoice.exe
```
**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.
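For reference, a minimal `.env` sketch (key values elided; variable names as used in this README):
```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```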
See full setup instructions below for other platforms.
## Features
-- 🎤 Real-time audio capture
-- 🗣️ Chinese speech-to-text (Whisper API)
-- 🌐 Chinese to French translation (Claude API)
-- 🖥️ Clean ImGui interface
-- 💾 Full recording saved to disk
-- ⚙️ Configurable chunk sizes and settings
+- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
+- 🔇 **Noise reduction** using RNNoise neural network
+- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
+- 🧠 **Hallucination filtering** - removes known Whisper artifacts
+- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
+- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
+- 💾 **Full session recording** with structured logging
+- 📊 **Session archival** - audio, transcripts, translations, and metadata
+- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
+- ⚙️ **Configurable settings** via config.json
## Requirements
@@ -116,20 +150,138 @@ The application will:
## Architecture
```
-Audio Capture (PortAudio)
-Whisper API (Speech-to-Text)
-Claude API (Translation)
-ImGui UI (Display)
+Audio Input (16kHz mono)
+Voice Activity Detection (VAD) - RMS + Peak thresholds
+Noise Reduction (RNNoise) - 16→48→16 kHz resampling
+Opus Encoding (24kbps OGG) - 46x compression
+Whisper API (gpt-4o-mini-transcribe) - Chinese STT
+Hallucination Filter - Remove known artifacts
+Claude API (claude-haiku-4) - Chinese → French translation
+ImGui UI Display + Session Logging
```
-### Threading Model
-- **Thread 1**: Audio capture (PortAudio callback)
-- **Thread 2**: AI processing (Whisper + Claude API calls)
-- **Thread 3**: UI rendering (ImGui + OpenGL)
+### Threading Model (3 threads)
1. **Audio Thread** (`Pipeline::audioThread`)
- PortAudio callback captures 16kHz mono audio
- Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
- Pushes speech chunks to processing queue
2. **Processing Thread** (`Pipeline::processingThread`)
- Consumes audio chunks from queue
- Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
- Encodes to Opus/OGG for bandwidth efficiency
- Calls Whisper API for Chinese transcription
- Filters known hallucinations (YouTube phrases, music markers, etc.)
- Calls Claude API for French translation
- Logs to session files
3. **UI Thread** (main)
- GLFW/ImGui rendering loop (must run on main thread)
- Displays real-time transcription and translation
- Allows runtime VAD threshold adjustment
- Handles user controls (stop recording, etc.)
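The handoff described above runs through `ThreadSafeQueue.h`. A minimal sketch of such a producer/consumer queue, assuming a mutex-plus-condition-variable design (`BlockingQueue` is an illustrative name; the project's header, described below as lock-free, may be implemented differently):
```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Illustrative blocking queue: the audio thread push()es speech chunks,
// the processing thread pop()s them. Assumes mutex + condition_variable;
// the real ThreadSafeQueue.h may differ.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(item));
        }
        cv_.notify_one();  // wake the processing thread
    }

    T pop() {  // blocks until the audio thread pushes a chunk
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};
```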
### Core Components
**Audio Processing**:
- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
- `NoiseReducer.cpp` - RNNoise denoising with resampling
**API Clients**:
- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
**Core Logic**:
- `Pipeline.cpp` - Orchestrates audio → transcription → translation flow
- `TranslationUI.cpp` - ImGui interface with VAD controls
**Utilities**:
- `Config.cpp` - Loads config.json + .env
- `ThreadSafeQueue.h` - Lock-free queue for audio chunks
## Known Issues & Active Debugging
**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).
### Current Problems
Based on transcript analysis from actual meetings (November 2025):
1. **VAD cutting speech too early**
- Voice Activity Detection triggers end-of-segment prematurely
- Results in fragmented phrases ("我很。" → "Je suis.")
- **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios
2. **Segments too short for context**
- Whisper receives insufficient audio context for accurate Chinese transcription
- Single-word or two-word segments lack conversational context
- **Impact**: Lower accuracy, especially with homonyms
3. **Ambient noise interpreted as speech**
- Background sounds trigger false VAD positives
- Test transcript shows "太多声音了" (too much noise) being captured
- **Mitigation**: RNNoise helps but not sufficient for very noisy environments
4. **Loss of inter-segment context**
- Each audio chunk processed independently
- Whisper cannot use previous context for better transcription
- **Potential solution**: Pass previous 2-3 transcriptions in prompt
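A sketch of the potential solution from point 4, assuming the last few transcriptions are kept in a deque (this mirrors the `Pipeline::buildDynamicPrompt` change later in this diff; the function name is illustrative):
```cpp
#include <deque>
#include <sstream>
#include <string>

// Append the most recent transcriptions, numbered, to the Whisper prompt
// so each segment is transcribed with conversational context.
std::string buildPromptWithContext(const std::string& base_prompt,
                                   const std::deque<std::string>& recent) {
    std::stringstream prompt;
    prompt << base_prompt << "\n\nContexte des phrases précédentes:\n";
    for (size_t i = 0; i < recent.size(); ++i) {
        prompt << (i + 1) << ". " << recent[i] << "\n";
    }
    return prompt.str();
}
```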
### Test Conditions
Testing has been performed under **deliberately degraded conditions** to ensure robustness:
- Multiple simultaneous speakers
- Variable microphone distance
- Variable volume levels
- Fast-paced conversations
- Low-quality microphone
These conditions are intentionally harsh to validate real-world meeting scenarios.
### Debug Plan
See `PLAN_DEBUG.md` for:
- Detailed session logging implementation (JSON per segment + metadata)
- Improved Whisper prompt engineering
- VAD threshold tuning recommendations
- Context propagation strategies
## Session Logging
### Structure
```
sessions/
└── YYYY-MM-DD_HHMMSS/
├── session.json # Session metadata
├── segments/
│ ├── 001.json # Segment: Chinese + French + metadata
│ ├── 002.json
│ └── ...
└── transcript.txt # Final export
```
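A hypothetical sketch of creating this layout with `std::filesystem` (the helper is illustrative, not the project's actual API):
```cpp
#include <ctime>
#include <filesystem>

// Create sessions/YYYY-MM-DD_HHMMSS/segments/ and return the session root.
std::filesystem::path createSessionDir() {
    std::time_t t = std::time(nullptr);
    char name[32];
    std::strftime(name, sizeof(name), "%Y-%m-%d_%H%M%S", std::localtime(&t));
    std::filesystem::path dir = std::filesystem::path("sessions") / name;
    std::filesystem::create_directories(dir / "segments");  // creates parents too
    return dir;
}
```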
### Segment Format
```json
{
"id": 1,
"chinese": "两个老鼠求我",
"french": "Deux souris me supplient"
}
```
**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
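A minimal sketch of writing one segment file in this format with `nlohmann::json` (already a project dependency); `writeSegment` is an illustrative helper, not the project's actual API:
```cpp
#include <fstream>
#include <nlohmann/json.hpp>
#include <string>

// Serialize one segment (id + Chinese source + French translation) to disk.
void writeSegment(const std::string& path, int id,
                  const std::string& chinese, const std::string& french) {
    nlohmann::json j;
    j["id"] = id;
    j["chinese"] = chinese;
    j["french"] = french;
    std::ofstream(path) << j.dump(2);  // pretty-print with 2-space indent
}
```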
## Configuration
@@ -143,8 +295,9 @@ ImGui UI (Display)
"chunk_duration_seconds": 10
},
"whisper": {
"model": "whisper-1",
"language": "zh"
"model": "gpt-4o-mini-transcribe",
"language": "zh",
"prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
},
"claude": {
"model": "claude-haiku-4-20250514",
@@ -166,23 +319,33 @@ ANTHROPIC_API_KEY=sk-ant-...
- **Claude Haiku**: ~$0.03-0.05/hour
- **Total**: ~$0.40/hour of recording
-## Project Structure
+## Advanced Features
-```
-secondvoice/
-├── src/
-│   ├── main.cpp        # Entry point
-│   ├── audio/          # Audio capture & buffer
-│   ├── api/            # Whisper & Claude clients
-│   ├── ui/             # ImGui interface
-│   ├── utils/          # Config & thread-safe queue
-│   └── core/           # Pipeline orchestration
-├── docs/               # Documentation
-├── recordings/         # Output recordings
-├── config.json         # Runtime configuration
-├── .env                # API keys (not committed)
-└── CMakeLists.txt      # Build configuration
-```
### GPU Forcing (Hybrid Graphics Systems)
`main.cpp` exports symbols to force dedicated GPU on Optimus/PowerXpress systems:
- `NvOptimusEnablement` - Forces NVIDIA GPU
- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU
Critical for laptops with both integrated and dedicated GPUs.
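The conventional form of these exports, per the vendor-documented pattern (a sketch, not necessarily `main.cpp` verbatim):
```cpp
// Exported globals that hybrid-graphics drivers look up at load time to
// select the dedicated GPU. Values follow the NVIDIA/AMD documentation.
extern "C" {
    __declspec(dllexport) unsigned long NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
```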
### Hallucination Filtering
`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
- Music symbols: "♪♪", "🎵"
- Silence markers: "...", "silence", "inaudible"
These are automatically filtered before translation to avoid wasting API calls.
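A sketch of how such a filter can work, assuming simple substring matching against the pattern list (only a handful of the ~65 patterns shown; the function name is illustrative):
```cpp
#include <string>
#include <string_view>
#include <vector>

// Returns true if the transcription contains any known Whisper artifact,
// in which case the segment is dropped before the Claude call.
bool looksLikeHallucination(const std::string& text) {
    static const std::vector<std::string_view> patterns = {
        "Thank you for watching", "Subscribe",
        "谢谢观看", "再见", "订阅我的频道", "♪♪",
    };
    for (std::string_view p : patterns) {
        if (text.find(p) != std::string::npos) return true;
    }
    return false;
}
```
The check is a handful of substring scans per segment, negligible next to the API round-trips it saves.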
### Console-Only Build
A `SecondVoice_Console` target exists for headless testing:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout
- Useful for debugging and automated testing
## Development
@@ -219,30 +382,101 @@ cmake --build build
- Check all system dependencies are installed
- Try `cmake --build build --clean-first`
## Project Structure
```
secondvoice/
├── src/
│ ├── main.cpp # Entry point, forces NVIDIA GPU
│ ├── core/
│ │ └── Pipeline.cpp # Audio→Transcription→Translation orchestration
│ ├── audio/
│ │ ├── AudioCapture.cpp # PortAudio + VAD segmentation
│ │ ├── AudioBuffer.cpp # Sample accumulation, WAV/Opus export
│ │ └── NoiseReducer.cpp # RNNoise (16→48→16 kHz)
│ ├── api/
│ │ ├── WhisperClient.cpp # OpenAI Whisper (multipart/form-data)
│ │ ├── ClaudeClient.cpp # Anthropic Claude (JSON)
│ │ └── WinHttpClient.cpp # Native Windows HTTP
│ ├── ui/
│ │ └── TranslationUI.cpp # ImGui interface + VAD controls
│ └── utils/
│ ├── Config.cpp # config.json + .env loader
│ └── ThreadSafeQueue.h # Lock-free audio queue
├── docs/ # Build guides
├── sessions/ # Session recordings + logs
├── recordings/ # Legacy recordings directory
├── denoised/ # Denoised audio outputs
├── config.json # Runtime configuration
├── .env # API keys (not committed)
├── CLAUDE.md # Development guide for Claude Code
├── PLAN_DEBUG.md # Active debugging plan
└── CMakeLists.txt # Build configuration
```
### External Dependencies
**Fetched via CMake FetchContent**:
- ImGui v1.90.1 - UI framework
- Opus v1.5.2 - Audio encoding
- Ogg v1.3.6 - Container format
- RNNoise v0.1.1 - Neural network noise reduction
**vcpkg Dependencies** (x64-mingw-static triplet):
- portaudio - Cross-platform audio I/O
- nlohmann_json - JSON parsing
- glfw3 - Windowing/input
- glad - OpenGL loader
## Roadmap
-### Phase 1 - MVP (Current)
-- ✅ Audio capture
-- ✅ Whisper integration
-- ✅ Claude integration
-- ✅ ImGui UI
-- ✅ Stop button
+### Phase 1 - MVP ✅ (Complete)
+- ✅ Audio capture with VAD
+- ✅ Noise reduction (RNNoise)
+- ✅ Whisper API integration
+- ✅ Claude API integration
+- ✅ ImGui UI with runtime VAD adjustment
+- ✅ Opus compression
+- ✅ Hallucination filtering
+- ✅ Session recording
-### Phase 2 - Enhancement
-- ⬜ Auto-summary post-meeting
-- ⬜ Export transcripts
-- ⬜ Search functionality
+### Phase 2 - Debugging 🔄 (Current)
+- 🔄 Session logging (JSON per segment)
+- 🔄 Improved Whisper prompt engineering
+- 🔄 VAD threshold optimization
+- 🔄 Context propagation between segments
+- ⬜ Automated testing with sample audio
### Phase 3 - Enhancement
- ⬜ Auto-summary post-meeting (Claude analysis)
- ⬜ Full-text search (SQLite FTS5)
- ⬜ Semantic search (embeddings)
- ⬜ Speaker diarization
-- ⬜ Replay mode
+- ⬜ Replay mode with synced transcripts
- ⬜ Multi-language support extension
## Development Documentation
- **CLAUDE.md** - Development guide for Claude Code AI assistant
- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
- **WINDOWS_BUILD.md** - Detailed Windows build instructions
- **WINDOWS_MINGW.md** - MinGW-specific build guide
- **WINDOWS_QUICK_START.md** - Quick start for Windows users
## Contributing
This is a personal project built to solve a real need. Bug reports and suggestions welcome:
**Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
**Architecture**: See `CLAUDE.md` for detailed system design
## License
See LICENSE file.
-## Contributing
-This is a personal project, but suggestions and bug reports are welcome via issues.
-## Contact
-See docs/SecondVoice.md for project context and motivation.
+## Acknowledgments
- OpenAI Whisper for excellent Chinese transcription
- Anthropic Claude for context-aware translation
- RNNoise for neural network-based noise reduction
- ImGui for clean, immediate-mode UI

config.json

@@ -6,11 +6,16 @@
"chunk_step_seconds": 5,
"format": "ogg"
},
"vad": {
"silence_duration_ms": 700,
"min_speech_duration_ms": 2000,
"max_speech_duration_ms": 30000
},
"whisper": {
"model": "gpt-4o-mini-transcribe",
"language": "zh",
"temperature": 0.0,
"prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
"prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. Si un contexte de phrases précédentes est fourni, utilise-le pour maintenir la cohérence (noms propres, sujets, terminologie). RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
"stream": false,
"response_format": "text"
},

AudioCapture.cpp

@ -4,9 +4,15 @@
namespace secondvoice {
-AudioCapture::AudioCapture(int sample_rate, int channels)
+AudioCapture::AudioCapture(int sample_rate, int channels,
+int silence_duration_ms,
+int min_speech_duration_ms,
+int max_speech_duration_ms)
: sample_rate_(sample_rate)
, channels_(channels)
, silence_duration_ms_(silence_duration_ms)
, min_speech_duration_ms_(min_speech_duration_ms)
, max_speech_duration_ms_(max_speech_duration_ms)
, noise_reducer_(std::make_unique<NoiseReducer>()) {
std::cout << "[Audio] Noise reduction enabled (RNNoise)" << std::endl;
}
@@ -135,16 +141,12 @@ int AudioCapture::audioCallback(const void* input, void* output,
// Speech = energy OK AND (ZCR OK or very high energy)
bool frame_has_speech = energy_ok && (zcr_ok || denoised_rms > adaptive_rms_thresh * 3.0f);
-// Hang time logic: don't immediately cut on silence
+// Reset trailing silence counter when speech detected
if (frame_has_speech) {
-self->hang_frames_ = self->hang_frames_threshold_; // Reset hang counter
-} else if (self->hang_frames_ > 0) {
-self->hang_frames_--;
-frame_has_speech = true; // Keep "speaking" during hang time
+self->consecutive_silence_frames_ = 0;
}
// Calculate durations in samples
-int silence_samples_threshold = (self->silence_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
int min_speech_samples = (self->min_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
int max_speech_samples = (self->max_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
@@ -170,6 +172,11 @@ int AudioCapture::audioCallback(const void* input, void* output,
std::cout << "[VAD] Max duration reached, forcing flush ("
<< self->speech_samples_count_ / (self->sample_rate_ * self->channels_) << "s)" << std::endl;
// Calculate metrics BEFORE flushing
self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
self->last_silence_duration_ms_ = 0; // No trailing silence in forced flush
self->last_flush_reason_ = "max_duration";
if (self->callback_ && self->speech_buffer_.size() >= static_cast<size_t>(min_speech_samples)) {
// Flush any remaining samples from the denoiser
if (self->noise_reducer_ && self->noise_reducer_->isEnabled()) {
@@ -183,16 +190,17 @@ int AudioCapture::audioCallback(const void* input, void* output,
}
self->speech_buffer_.clear();
self->speech_samples_count_ = 0;
self->consecutive_silence_frames_ = 0; // Reset after forced flush
// Reset stream for next segment
if (self->noise_reducer_) {
self->noise_reducer_->resetStream();
}
}
} else {
-// True silence (after hang time expired)
+// Silence detected
self->silence_samples_count_ += sample_count;
-// If we were speaking and now have enough silence, flush
+// If we were speaking and now have silence, track consecutive silence frames
if (self->speech_buffer_.size() > 0) {
// Add trailing silence (denoised)
if (!denoised_samples.empty()) {
@@ -204,9 +212,23 @@ int AudioCapture::audioCallback(const void* input, void* output,
}
}
-if (self->silence_samples_count_ >= silence_samples_threshold) {
+// Increment consecutive silence frame counter
+self->consecutive_silence_frames_++;
+// Calculate threshold in frames (callbacks)
+// frames_per_buffer = frame_count from callback
+int frames_per_buffer = static_cast<int>(frame_count);
+int silence_threshold_frames = (self->silence_duration_ms_ * self->sample_rate_) / (1000 * frames_per_buffer);
+// Flush when consecutive silence exceeds threshold
+if (self->consecutive_silence_frames_ >= silence_threshold_frames) {
self->is_speech_active_.store(false, std::memory_order_relaxed);
// Calculate metrics BEFORE flushing
self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
self->last_silence_duration_ms_ = (self->silence_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
self->last_flush_reason_ = "silence_threshold";
// Flush if we have enough speech
if (self->speech_samples_count_ >= min_speech_samples) {
// Flush any remaining samples from the denoiser
@@ -220,7 +242,9 @@ int AudioCapture::audioCallback(const void* input, void* output,
float duration = static_cast<float>(self->speech_buffer_.size()) /
(self->sample_rate_ * self->channels_);
std::cout << "[VAD] Speech ended (noise_floor=" << self->noise_floor_
std::cout << "[VAD] Speech ended (trailing silence detected, "
<< self->consecutive_silence_frames_ << " frames, "
<< "noise_floor=" << self->noise_floor_
<< "), flushing " << duration << "s (denoised)" << std::endl;
if (self->callback_) {
@@ -233,6 +257,7 @@ int AudioCapture::audioCallback(const void* input, void* output,
self->speech_buffer_.clear();
self->speech_samples_count_ = 0;
self->consecutive_silence_frames_ = 0; // Reset after flush
// Reset stream for next segment
if (self->noise_reducer_) {
self->noise_reducer_->resetStream();
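For intuition, the frame threshold above works out as follows, assuming a 256-frame PortAudio buffer (an assumption; the actual buffer size is not shown in this diff):

```cpp
// silence_threshold_frames with the config.json defaults.
constexpr int silence_duration_ms = 700;    // "vad" -> silence_duration_ms
constexpr int sample_rate        = 16000;   // Hz, mono
constexpr int frames_per_buffer  = 256;     // assumed PortAudio callback size
constexpr int threshold_frames =
    (silence_duration_ms * sample_rate) / (1000 * frames_per_buffer);
static_assert(threshold_frames == 43,
              "43 callbacks * 256 frames / 16000 Hz ~ 0.69 s of trailing silence");
```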

AudioCapture.h

@@ -16,7 +16,10 @@ class AudioCapture {
public:
using AudioCallback = std::function<void(const std::vector<float>&)>;
-AudioCapture(int sample_rate, int channels);
+AudioCapture(int sample_rate, int channels,
+int silence_duration_ms = 700,
+int min_speech_duration_ms = 2000,
+int max_speech_duration_ms = 30000);
~AudioCapture();
bool initialize();
@@ -44,6 +47,11 @@ public:
void setDenoiseEnabled(bool enabled);
bool isDenoiseEnabled() const;
// Get metrics from last flushed segment
int getLastSpeechDuration() const { return last_speech_duration_ms_; }
int getLastSilenceDuration() const { return last_silence_duration_ms_; }
std::string getLastFlushReason() const { return last_flush_reason_; }
private:
static int audioCallback(const void* input, void* output,
unsigned long frame_count,
@@ -69,17 +77,21 @@ private:
// VAD parameters - Higher threshold to avoid false triggers on filtered noise
std::atomic<float> vad_rms_threshold_{0.02f}; // Was 0.01f
std::atomic<float> vad_peak_threshold_{0.08f}; // Was 0.04f
-int silence_duration_ms_ = 700; // Wait 700ms of silence before cutting (was 400)
-int min_speech_duration_ms_ = 1000; // Minimum 1s speech to send (was 300)
-int max_speech_duration_ms_ = 25000; // 25s max before forced flush
+int silence_duration_ms_; // Wait 700ms of silence before cutting (was 400)
+int min_speech_duration_ms_; // Minimum 2s speech to send (was 1000)
+int max_speech_duration_ms_; // 30s max before forced flush (was 25000)
// Adaptive noise floor
float noise_floor_ = 0.005f; // Estimated background noise level
float noise_floor_alpha_ = 0.001f; // Slower adaptation
-// Hang time - wait before cutting to avoid mid-sentence cuts
-int hang_frames_ = 0;
-int hang_frames_threshold_ = 35; // ~350ms tolerance for pauses (was 20)
+// Trailing silence detection - count consecutive silence frames after speech
+int consecutive_silence_frames_ = 0;
// Metrics for last flushed segment (set in callback, read in processing thread)
int last_speech_duration_ms_ = 0;
int last_silence_duration_ms_ = 0;
std::string last_flush_reason_;
// Zero-crossing rate for speech vs noise discrimination
float last_zcr_ = 0.0f;

Pipeline.cpp

@@ -24,12 +24,23 @@ Pipeline::~Pipeline() {
bool Pipeline::initialize() {
auto& config = Config::getInstance();
// Load VAD parameters from config (with fallbacks if missing)
int silence_duration = config.getVadSilenceDurationMs();
int min_speech = config.getVadMinSpeechDurationMs();
int max_speech = config.getVadMaxSpeechDurationMs();
// Initialize audio capture with VAD-based segmentation
audio_capture_ = std::make_unique<AudioCapture>(
config.getAudioConfig().sample_rate,
-config.getAudioConfig().channels
+config.getAudioConfig().channels,
+silence_duration,
+min_speech,
+max_speech
);
std::cout << "[Pipeline] VAD configured: silence=" << silence_duration
<< "ms, min_speech=" << min_speech
<< "ms, max_speech=" << max_speech << "ms" << std::endl;
std::cout << "[Pipeline] VAD-based audio segmentation enabled" << std::endl;
if (!audio_capture_->initialize()) {
@@ -395,6 +406,10 @@ void Pipeline::processingThread() {
seg.was_filtered = false;
seg.filter_reason = "";
seg.timestamp = ""; // Will be set by logger
// Add VAD metrics from AudioCapture
seg.speech_duration_ms = audio_capture_->getLastSpeechDuration();
seg.silence_duration_ms = audio_capture_->getLastSilenceDuration();
seg.flush_reason = audio_capture_->getLastFlushReason();
session_logger_.logSegment(seg);
std::cout << "CN: " << text << std::endl;
@@ -468,11 +483,11 @@ std::string Pipeline::buildDynamicPrompt() const {
// Build context from recent transcriptions
std::stringstream context;
context << base_prompt;
-context << "\n\nContexte des phrases précédentes: ";
+context << "\n\nContexte des phrases précédentes:\n";
for (size_t i = 0; i < recent_transcriptions_.size(); ++i) {
-if (i > 0) context << " ";
-context << recent_transcriptions_[i];
+context << std::to_string(i + 1) << ". "
+<< recent_transcriptions_[i] << "\n";
}
return context.str();

Config.cpp

@@ -52,10 +52,9 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
}
std::cerr << "[Config] File opened successfully" << std::endl;
-json config_json;
try {
std::cerr << "[Config] About to parse JSON..." << std::endl;
-config_file >> config_json;
+config_file >> config_;
std::cerr << "[Config] JSON parsed successfully" << std::endl;
} catch (const json::parse_error& e) {
std::cerr << "Error parsing config.json: " << e.what() << std::endl;
@@ -66,8 +65,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
}
// Parse audio config
if (config_json.contains("audio")) {
auto& audio = config_json["audio"];
if (config_.contains("audio")) {
auto& audio = config_["audio"];
audio_config_.sample_rate = audio.value("sample_rate", 16000);
audio_config_.channels = audio.value("channels", 1);
audio_config_.chunk_duration_seconds = audio.value("chunk_duration_seconds", 10);
@@ -76,8 +75,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
}
// Parse whisper config
if (config_json.contains("whisper")) {
auto& whisper = config_json["whisper"];
if (config_.contains("whisper")) {
auto& whisper = config_["whisper"];
whisper_config_.model = whisper.value("model", "whisper-1");
whisper_config_.language = whisper.value("language", "zh");
whisper_config_.temperature = whisper.value("temperature", 0.0f);
@@ -87,8 +86,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
}
// Parse claude config
if (config_json.contains("claude")) {
auto& claude = config_json["claude"];
if (config_.contains("claude")) {
auto& claude = config_["claude"];
claude_config_.model = claude.value("model", "claude-haiku-4-20250514");
claude_config_.max_tokens = claude.value("max_tokens", 1024);
claude_config_.temperature = claude.value("temperature", 0.3f);
@@ -96,8 +95,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
}
// Parse UI config
if (config_json.contains("ui")) {
auto& ui = config_json["ui"];
if (config_.contains("ui")) {
auto& ui = config_["ui"];
ui_config_.window_width = ui.value("window_width", 800);
ui_config_.window_height = ui.value("window_height", 600);
ui_config_.font_size = ui.value("font_size", 16);
@@ -105,8 +104,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
}
// Parse recording config
if (config_json.contains("recording")) {
auto& recording = config_json["recording"];
if (config_.contains("recording")) {
auto& recording = config_["recording"];
recording_config_.save_audio = recording.value("save_audio", true);
recording_config_.output_directory = recording.value("output_directory", "./recordings");
}
@@ -114,4 +113,25 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
return true;
}
int Config::getVadSilenceDurationMs() const {
if (config_.contains("vad") && config_["vad"].contains("silence_duration_ms")) {
return config_["vad"]["silence_duration_ms"].get<int>();
}
return 700; // Default from AudioCapture.h:72 (unchanged)
}
int Config::getVadMinSpeechDurationMs() const {
if (config_.contains("vad") && config_["vad"].contains("min_speech_duration_ms")) {
return config_["vad"]["min_speech_duration_ms"].get<int>();
}
return 2000; // Default from AudioCapture.h:73 (updated in TASK2)
}
int Config::getVadMaxSpeechDurationMs() const {
if (config_.contains("vad") && config_["vad"].contains("max_speech_duration_ms")) {
return config_["vad"]["max_speech_duration_ms"].get<int>();
}
return 30000; // Default from AudioCapture.h:74 (updated in TASK2)
}
} // namespace secondvoice

Config.h

@@ -1,6 +1,7 @@
#pragma once
#include <string>
+#include <nlohmann/json.hpp>
namespace secondvoice {
@@ -55,6 +56,10 @@ public:
const std::string& getOpenAIKey() const { return openai_key_; }
const std::string& getAnthropicKey() const { return anthropic_key_; }
int getVadSilenceDurationMs() const;
int getVadMinSpeechDurationMs() const;
int getVadMaxSpeechDurationMs() const;
private:
Config() = default;
Config(const Config&) = delete;
@@ -68,6 +73,7 @@ private:
std::string openai_key_;
std::string anthropic_key_;
nlohmann::json config_;
};
} // namespace secondvoice

SessionLogger.cpp

@@ -89,6 +89,11 @@ void SessionLogger::logSegment(const SegmentLog& segment) {
j["was_filtered"] = segment.was_filtered;
j["filter_reason"] = segment.filter_reason;
j["timestamp"] = segment.timestamp;
j["vad_metrics"] = {
{"speech_duration_ms", segment.speech_duration_ms},
{"silence_duration_ms", segment.silence_duration_ms},
{"flush_reason", segment.flush_reason}
};
std::ofstream file(filename.str());
if (file.is_open()) {

SessionLogger.h

@@ -18,6 +18,11 @@ struct SegmentLog {
bool was_filtered;
std::string filter_reason;
std::string timestamp;
// VAD metrics (added for TASK8)
int speech_duration_ms = 0;
int silence_duration_ms = 0;
std::string flush_reason = "";
};
class SessionLogger {