Compare commits
8 Commits: master...vad-improv

| Author | SHA1 | Date |
|---|---|---|
| | e8dd7f840e | |
| | a1b4e335c8 | |
| | aac5602722 | |
| | 49f9cb906e | |
| | db0f8e5990 | |
| | a28bb89913 | |
| | 53b21b94d6 | |
| | 9baa213a82 | |
.gitignore (vendored, 2 changes)
@@ -64,9 +64,11 @@ imgui.ini
*.aac
*.m4a
denoised/
sessions/

# Claude Code local settings
.claude/settings.local.json
.claudiomiro/

# Build scripts (local)
run_build.ps1
CMakeLists.txt

@@ -108,6 +108,7 @@ set(SOURCES_UI
 src/ui/TranslationUI.cpp
 # Utils
 src/utils/Config.cpp
+src/utils/SessionLogger.cpp
 # Core
 src/core/Pipeline.cpp
 )
README.md (334 changes)
@@ -4,16 +4,50 @@ Real-time Chinese to French translation system for live meetings.
 
 ## Overview
 
-SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API, and translates it to French using Claude AI in real time. Perfect for understanding Chinese meetings on the fly.
+SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
+
+### Why This Project?
+
+Built to solve a real need: understanding Chinese meetings in real time without constant reliance on bilingual support. Well suited for:
+- Business meetings with Chinese speakers
+- Family/administrative calls
+- Professional conferences
+- Any live Chinese conversation where real-time comprehension is needed
+
+**Status**: MVP complete, actively being debugged and improved based on real-world usage.
+
+## Quick Start
+
+### Windows (MinGW) - Recommended
+
+```batch
+# First-time setup
+.\setup_mingw.bat
+
+# Build
+.\build_mingw.bat
+
+# Run
+cd build\mingw-Release
+SecondVoice.exe
+```
+
+**Requirements**: a `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.
+
+See the full setup instructions below for other platforms.
 
 ## Features
 
-- 🎤 Real-time audio capture
-- 🗣️ Chinese speech-to-text (Whisper API)
-- 🌐 Chinese to French translation (Claude API)
-- 🖥️ Clean ImGui interface
-- 💾 Full recording saved to disk
-- ⚙️ Configurable chunk sizes and settings
+- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
+- 🔇 **Noise reduction** using the RNNoise neural network
+- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
+- 🧠 **Hallucination filtering** - removes known Whisper artifacts
+- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
+- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
+- 💾 **Full session recording** with structured logging
+- 📊 **Session archival** - audio, transcripts, translations, and metadata
+- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus; an encoder sketch follows this list)
+- ⚙️ **Configurable settings** via config.json
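The compression claim is easiest to see at the encoder. Below is a minimal sketch of a 24 kbps mono Opus encoder at 16 kHz; the project's real setup lives in `AudioBuffer.cpp` and additionally muxes packets into an OGG container via libogg, which is omitted here:

```cpp
#include <opus.h>
#include <cstdio>
#include <vector>

int main() {
    int err = 0;
    // 16 kHz mono, VOIP tuning (assumed flags; the project's actual choices may differ)
    OpusEncoder* enc = opus_encoder_create(16000, 1, OPUS_APPLICATION_VOIP, &err);
    if (err != OPUS_OK) return 1;
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));  // 24 kbps target

    // One 20 ms frame at 16 kHz = 320 samples
    std::vector<float> pcm(320, 0.0f);
    unsigned char packet[4000];
    opus_int32 n = opus_encode_float(enc, pcm.data(), 320, packet, sizeof(packet));
    if (n > 0) std::printf("encoded %d bytes for 20 ms of audio\n", (int)n);

    opus_encoder_destroy(enc);
    return 0;
}
```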
 
 ## Requirements
@@ -116,20 +150,138 @@ The application will:
 ## Architecture
 
 ```
-Audio Capture (PortAudio)
+Audio Input (16kHz mono)
 ↓
-Whisper API (Speech-to-Text)
+Voice Activity Detection (VAD) - RMS + Peak thresholds
 ↓
-Claude API (Translation)
+Noise Reduction (RNNoise) - 16→48→16 kHz resampling
 ↓
-ImGui UI (Display)
+Opus Encoding (24kbps OGG) - 46x compression
+↓
+Whisper API (gpt-4o-mini-transcribe) - Chinese STT
+↓
+Hallucination Filter - Remove known artifacts
+↓
+Claude API (claude-haiku-4) - Chinese → French translation
+↓
+ImGui UI Display + Session Logging
 ```
 
-### Threading Model
+### Threading Model (3 threads)
 
-- **Thread 1**: Audio capture (PortAudio callback)
-- **Thread 2**: AI processing (Whisper + Claude API calls)
-- **Thread 3**: UI rendering (ImGui + OpenGL)
+1. **Audio Thread** (`Pipeline::audioThread`)
+   - PortAudio callback captures 16kHz mono audio
+   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
+   - Pushes speech chunks to the processing queue (see the queue sketch after this list)
+
+2. **Processing Thread** (`Pipeline::processingThread`)
+   - Consumes audio chunks from the queue
+   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
+   - Encodes to Opus/OGG for bandwidth efficiency
+   - Calls the Whisper API for Chinese transcription
+   - Filters known hallucinations (YouTube phrases, music markers, etc.)
+   - Calls the Claude API for French translation
+   - Logs to session files
+
+3. **UI Thread** (main)
+   - GLFW/ImGui rendering loop (must run on the main thread)
+   - Displays real-time transcription and translation
+   - Allows runtime VAD threshold adjustment
+   - Handles user controls (stop recording, etc.)
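`ThreadSafeQueue.h` itself is not part of this diff. A minimal mutex-and-condition-variable sketch matching the `push` / `wait_and_pop` / `size` calls used in `Pipeline.cpp` might look like this (an assumption about the interface; the real header may differ, e.g. if it is genuinely lock-free, and `shutdown()` is added here so `wait_and_pop` can return empty on stop):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class ThreadSafeQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(item));
        }
        cv_.notify_one();  // wake one waiting consumer
    }

    // Blocks until an item arrives or shutdown() is called.
    std::optional<T> wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !queue_.empty() || shutdown_; });
        if (queue_.empty()) return std::nullopt;  // woken by shutdown
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }

    size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.size();
    }

    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shutdown_ = true;
        }
        cv_.notify_all();  // release all blocked consumers
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
    bool shutdown_ = false;
};
```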
+### Core Components
+
+**Audio Processing**:
+- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
+- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
+- `NoiseReducer.cpp` - RNNoise denoising with resampling (see the round-trip sketch below)
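RNNoise operates on fixed 480-sample frames at 48 kHz, which is why `NoiseReducer.cpp` resamples 16→48→16 kHz. A minimal sketch of that round-trip follows, assuming the upstream rnnoise API (`rnnoise_create` to allocate a `DenoiseState*`, `rnnoise_process_frame` per 480-sample frame; the `create` signature varies slightly across rnnoise versions). Linear interpolation is used only to keep the sketch short; the real code presumably uses a proper resampling filter:

```cpp
#include <rnnoise.h>
#include <vector>

// Denoise a 16 kHz mono buffer by resampling through RNNoise's 48 kHz domain.
// Caller owns st, e.g. DenoiseState* st = rnnoise_create(NULL);
std::vector<float> denoise16k(const std::vector<float>& in16k, DenoiseState* st) {
    // Upsample x3 by linear interpolation (16 kHz -> 48 kHz).
    std::vector<float> up(in16k.size() * 3, 0.0f);
    for (size_t i = 0; i + 1 < in16k.size(); ++i)
        for (int k = 0; k < 3; ++k)
            up[i * 3 + k] = in16k[i] + (in16k[i + 1] - in16k[i]) * k / 3.0f;

    // RNNoise processes 480-sample frames and expects 16-bit-range floats.
    for (size_t off = 0; off + 480 <= up.size(); off += 480) {
        float frame[480];
        for (int i = 0; i < 480; ++i) frame[i] = up[off + i] * 32768.0f;
        rnnoise_process_frame(st, frame, frame);
        for (int i = 0; i < 480; ++i) up[off + i] = frame[i] / 32768.0f;
    }

    // Decimate x3 back to 16 kHz.
    std::vector<float> out(up.size() / 3);
    for (size_t i = 0; i < out.size(); ++i) out[i] = up[i * 3];
    return out;
}
```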
+
+**API Clients**:
+- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data; see the request sketch below)
+- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
+- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
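`WhisperClient` talks to OpenAI's `/v1/audio/transcriptions` endpoint, which expects `multipart/form-data`. Here is a sketch of how such a body is framed (the field names match the OpenAI API; the helper itself is hypothetical, not the actual `WhisperClient.cpp` code):

```cpp
#include <string>
#include <vector>

// Assemble a multipart/form-data body: one binary file part plus text fields.
std::string buildMultipartBody(const std::vector<char>& ogg_bytes,
                               const std::string& model,
                               const std::string& language,
                               const std::string& prompt,
                               const std::string& boundary) {
    auto field = [&](const std::string& name, const std::string& value) {
        return "--" + boundary + "\r\n"
               "Content-Disposition: form-data; name=\"" + name + "\"\r\n\r\n"
               + value + "\r\n";
    };
    std::string body;
    body += "--" + boundary + "\r\n"
            "Content-Disposition: form-data; name=\"file\"; filename=\"segment.ogg\"\r\n"
            "Content-Type: audio/ogg\r\n\r\n";
    body.append(ogg_bytes.begin(), ogg_bytes.end());
    body += "\r\n";
    body += field("model", model);        // e.g. "gpt-4o-mini-transcribe"
    body += field("language", language);  // "zh"
    body += field("prompt", prompt);      // dynamic context prompt
    body += field("response_format", "text");
    body += "--" + boundary + "--\r\n";   // closing boundary
    return body;
}
```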
+
+**Core Logic**:
+- `Pipeline.cpp` - Orchestrates the audio → transcription → translation flow
+- `TranslationUI.cpp` - ImGui interface with VAD controls
+
+**Utilities**:
+- `Config.cpp` - Loads config.json + .env
+- `ThreadSafeQueue.h` - Lock-free queue for audio chunks
+
+## Known Issues & Active Debugging
+
+**Status**: Real-world testing has identified issues under degraded audio conditions (see `PLAN_DEBUG.md` for details).
+
+### Current Problems
+
+Based on transcript analysis from actual meetings (November 2025):
+
+1. **VAD cutting speech too early**
+   - Voice Activity Detection triggers end-of-segment prematurely
+   - Results in fragmented phrases ("我很。" → "Je suis.")
+   - **Hypothesis**: silence threshold too aggressive for multi-speaker scenarios
+
+2. **Segments too short for context**
+   - Whisper receives insufficient audio context for accurate Chinese transcription
+   - Single-word or two-word segments lack conversational context
+   - **Impact**: lower accuracy, especially with homonyms
+
+3. **Ambient noise interpreted as speech**
+   - Background sounds trigger false VAD positives
+   - A test transcript shows "太多声音了" ("too much noise") being captured
+   - **Mitigation**: RNNoise helps but is not sufficient for very noisy environments
+
+4. **Loss of inter-segment context**
+   - Each audio chunk is processed independently
+   - Whisper cannot use previous context for better transcription
+   - **Potential solution**: pass the previous 2-3 transcriptions in the prompt
+
+### Test Conditions
+
+Testing has been performed under **deliberately degraded conditions** to ensure robustness:
+- Multiple simultaneous speakers
+- Variable microphone distance
+- Variable volume levels
+- Fast-paced conversations
+- Low-quality microphone
+
+These conditions are intentionally harsh to validate real-world meeting scenarios.
+
+### Debug Plan
+
+See `PLAN_DEBUG.md` for:
+- Detailed session logging implementation (JSON per segment + metadata)
+- Improved Whisper prompt engineering
+- VAD threshold tuning recommendations
+- Context propagation strategies
+
+## Session Logging
+
+### Structure
+
+```
+sessions/
+└── YYYY-MM-DD_HHMMSS/
+    ├── session.json      # Session metadata
+    ├── segments/
+    │   ├── 001.json      # Segment: Chinese + French + metadata
+    │   ├── 002.json
+    │   └── ...
+    └── transcript.txt    # Final export
+```
+
+### Segment Format
+
+```json
+{
+  "id": 1,
+  "chinese": "两个老鼠求我",
+  "french": "Deux souris me supplient"
+}
+```
+
+**Future enhancements**: audio duration, RMS levels, timestamps, and per-segment Whisper/Claude latencies. A small offline reader for these logs is sketched below.
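Since each segment lands in its own JSON file, sessions are easy to post-process offline. A minimal reader that prints segment pairs and the average Whisper latency (a hypothetical companion tool, not part of the app; it relies on the fields `SessionLogger` writes, and skips filtered segments by their non-empty `filter_reason`):

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: session_stats <session_dir>\n"; return 1; }
    namespace fs = std::filesystem;
    int count = 0;
    int64_t whisper_ms = 0;
    for (const auto& entry : fs::directory_iterator(fs::path(argv[1]) / "segments")) {
        std::ifstream in(entry.path());
        if (!in) continue;
        nlohmann::json j;
        in >> j;
        if (!j.value("filter_reason", std::string()).empty()) continue;  // filtered segment
        whisper_ms += j.value("whisper_latency_ms", int64_t{0});
        ++count;
        std::cout << j.value("chinese", "") << " -> " << j.value("french", "") << "\n";
    }
    if (count > 0)
        std::cout << "avg whisper latency: " << whisper_ms / count
                  << " ms over " << count << " segments\n";
    return 0;
}
```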
## Configuration
@@ -143,8 +295,9 @@ ImGui UI (Display)
     "chunk_duration_seconds": 10
   },
   "whisper": {
-    "model": "whisper-1",
-    "language": "zh"
+    "model": "gpt-4o-mini-transcribe",
+    "language": "zh",
+    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
   },
   "claude": {
     "model": "claude-haiku-4-20250514",
@@ -166,23 +319,33 @@ ANTHROPIC_API_KEY=sk-ant-...
 - **Claude Haiku**: ~$0.03-0.05/hour
 - **Total**: ~$0.40/hour of recording
 
-## Project Structure
-
-```
-secondvoice/
-├── src/
-│   ├── main.cpp        # Entry point
-│   ├── audio/          # Audio capture & buffer
-│   ├── api/            # Whisper & Claude clients
-│   ├── ui/             # ImGui interface
-│   ├── utils/          # Config & thread-safe queue
-│   └── core/           # Pipeline orchestration
-├── docs/               # Documentation
-├── recordings/         # Output recordings
-├── config.json         # Runtime configuration
-├── .env                # API keys (not committed)
-└── CMakeLists.txt      # Build configuration
-```
+## Advanced Features
+
+### GPU Forcing (Hybrid Graphics Systems)
+
+`main.cpp` exports symbols to force the dedicated GPU on Optimus/PowerXpress systems:
+- `NvOptimusEnablement` - forces the NVIDIA GPU
+- `AmdPowerXpressRequestHighPerformance` - forces the AMD GPU
+
+This is critical for laptops with both integrated and dedicated GPUs.
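For reference, these exports conventionally look like the snippet below (a sketch; the exact declarations in `main.cpp` may differ):

```cpp
// Exported with C linkage so the GPU drivers can locate them by name.
// NVIDIA documents NvOptimusEnablement as a DWORD set to 0x00000001;
// AmdPowerXpressRequestHighPerformance is AMD's equivalent toggle.
extern "C" {
    __declspec(dllexport) unsigned long NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
```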
+### Hallucination Filtering
+
+`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
+- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
+- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
+- Music symbols: "♪♪", "🎵"
+- Silence markers: "...", "silence", "inaudible"
+
+These are automatically filtered before translation to avoid wasting API calls.
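A minimal sketch of such a filter (the pattern list and the matching rule here are illustrative; the real ~65-pattern list lives in `Pipeline.cpp` and may match exactly rather than by substring):

```cpp
#include <array>
#include <string>
#include <string_view>

bool isLikelyHallucination(const std::string& text) {
    // A handful of the known artifacts named above.
    static constexpr std::array<std::string_view, 8> patterns = {
        "Thank you for watching", "Subscribe", "Like and comment",
        "谢谢观看", "再见", "订阅我的频道", "♪♪", "inaudible",
    };
    for (auto p : patterns) {
        if (text.find(p) != std::string::npos) return true;  // substring match
    }
    return text == "...";  // bare ellipsis counts as a silence marker
}
```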
+### Console-Only Build
+
+A `SecondVoice_Console` target exists for headless testing:
+- Uses `main_console.cpp`
+- No ImGui/GLFW dependencies
+- Outputs transcriptions to stdout
+- Useful for debugging and automated testing
 
 ## Development
@@ -219,30 +382,101 @@ cmake --build build
 - Check all system dependencies are installed
 - Try `cmake --build build --clean-first`
 
+## Project Structure
+
+```
+secondvoice/
+├── src/
+│   ├── main.cpp               # Entry point, forces NVIDIA GPU
+│   ├── core/
+│   │   └── Pipeline.cpp       # Audio→Transcription→Translation orchestration
+│   ├── audio/
+│   │   ├── AudioCapture.cpp   # PortAudio + VAD segmentation
+│   │   ├── AudioBuffer.cpp    # Sample accumulation, WAV/Opus export
+│   │   └── NoiseReducer.cpp   # RNNoise (16→48→16 kHz)
+│   ├── api/
+│   │   ├── WhisperClient.cpp  # OpenAI Whisper (multipart/form-data)
+│   │   ├── ClaudeClient.cpp   # Anthropic Claude (JSON)
+│   │   └── WinHttpClient.cpp  # Native Windows HTTP
+│   ├── ui/
+│   │   └── TranslationUI.cpp  # ImGui interface + VAD controls
+│   └── utils/
+│       ├── Config.cpp         # config.json + .env loader
+│       └── ThreadSafeQueue.h  # Lock-free audio queue
+├── docs/                      # Build guides
+├── sessions/                  # Session recordings + logs
+├── recordings/                # Legacy recordings directory
+├── denoised/                  # Denoised audio outputs
+├── config.json                # Runtime configuration
+├── .env                       # API keys (not committed)
+├── CLAUDE.md                  # Development guide for Claude Code
+├── PLAN_DEBUG.md              # Active debugging plan
+└── CMakeLists.txt             # Build configuration
+```
+
+### External Dependencies
+
+**Fetched via CMake FetchContent**:
+- ImGui v1.90.1 - UI framework
+- Opus v1.5.2 - audio encoding
+- Ogg v1.3.6 - container format
+- RNNoise v0.1.1 - neural-network noise reduction
+
+**vcpkg dependencies** (x64-mingw-static triplet):
+- portaudio - cross-platform audio I/O
+- nlohmann_json - JSON parsing
+- glfw3 - windowing/input
+- glad - OpenGL loader
 
 ## Roadmap
 
-### Phase 1 - MVP (Current)
-- ✅ Audio capture
-- ✅ Whisper integration
-- ✅ Claude integration
-- ✅ ImGui UI
-- ✅ Stop button
+### Phase 1 - MVP ✅ (Complete)
+- ✅ Audio capture with VAD
+- ✅ Noise reduction (RNNoise)
+- ✅ Whisper API integration
+- ✅ Claude API integration
+- ✅ ImGui UI with runtime VAD adjustment
+- ✅ Opus compression
+- ✅ Hallucination filtering
+- ✅ Session recording
 
-### Phase 2 - Enhancement
-- ⬜ Auto-summary post-meeting
-- ⬜ Export transcripts
-- ⬜ Search functionality
+### Phase 2 - Debugging 🔄 (Current)
+- 🔄 Session logging (JSON per segment)
+- 🔄 Improved Whisper prompt engineering
+- 🔄 VAD threshold optimization
+- 🔄 Context propagation between segments
+- ⬜ Automated testing with sample audio
+
+### Phase 3 - Enhancement
+- ⬜ Auto-summary post-meeting (Claude analysis)
+- ⬜ Full-text search (SQLite FTS5)
+- ⬜ Semantic search (embeddings)
+- ⬜ Speaker diarization
-- ⬜ Replay mode
+- ⬜ Replay mode with synced transcripts
+- ⬜ Multi-language support extension
+
+## Development Documentation
+
+- **CLAUDE.md** - development guide for the Claude Code AI assistant
+- **PLAN_DEBUG.md** - active debugging plan with identified issues and solutions
+- **WINDOWS_BUILD.md** - detailed Windows build instructions
+- **WINDOWS_MINGW.md** - MinGW-specific build guide
+- **WINDOWS_QUICK_START.md** - quick start for Windows users
+
+## Contributing
+
+This is a personal project built to solve a real need. Bug reports and suggestions are welcome:
+
+- **Known issues**: see `PLAN_DEBUG.md` for current debugging efforts
+- **Architecture**: see `CLAUDE.md` for the detailed system design
 
 ## License
 
 See LICENSE file.
 
-## Contributing
+## Acknowledgments
 
-This is a personal project, but suggestions and bug reports are welcome via issues.
-
-## Contact
-
-See docs/SecondVoice.md for project context and motivation.
+- OpenAI Whisper for excellent Chinese transcription
+- Anthropic Claude for context-aware translation
+- RNNoise for neural-network-based noise reduction
+- ImGui for clean, immediate-mode UI
config.json

@@ -6,11 +6,16 @@
   "chunk_step_seconds": 5,
   "format": "ogg"
 },
+"vad": {
+  "silence_duration_ms": 700,
+  "min_speech_duration_ms": 2000,
+  "max_speech_duration_ms": 30000
+},
 "whisper": {
   "model": "gpt-4o-mini-transcribe",
   "language": "zh",
   "temperature": 0.0,
-  "prompt": "The following is a conversation in Mandarin Chinese about business, family, and daily life. Common names: Tingting, Alexis.",
+  "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. Si un contexte de phrases précédentes est fourni, utilise-le pour maintenir la cohérence (noms propres, sujets, terminologie). RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
   "stream": false,
   "response_format": "text"
 },
src/audio/AudioCapture.cpp

@@ -4,9 +4,15 @@
 
 namespace secondvoice {
 
-AudioCapture::AudioCapture(int sample_rate, int channels)
+AudioCapture::AudioCapture(int sample_rate, int channels,
+                           int silence_duration_ms,
+                           int min_speech_duration_ms,
+                           int max_speech_duration_ms)
     : sample_rate_(sample_rate)
     , channels_(channels)
+    , silence_duration_ms_(silence_duration_ms)
+    , min_speech_duration_ms_(min_speech_duration_ms)
+    , max_speech_duration_ms_(max_speech_duration_ms)
     , noise_reducer_(std::make_unique<NoiseReducer>()) {
     std::cout << "[Audio] Noise reduction enabled (RNNoise)" << std::endl;
 }
@@ -135,16 +141,12 @@ int AudioCapture::audioCallback(const void* input, void* output,
     // Speech = energy OK AND (ZCR OK or very high energy)
     bool frame_has_speech = energy_ok && (zcr_ok || denoised_rms > adaptive_rms_thresh * 3.0f);
 
-    // Hang time logic: don't immediately cut on silence
+    // Reset trailing silence counter when speech detected
     if (frame_has_speech) {
-        self->hang_frames_ = self->hang_frames_threshold_; // Reset hang counter
-    } else if (self->hang_frames_ > 0) {
-        self->hang_frames_--;
-        frame_has_speech = true; // Keep "speaking" during hang time
+        self->consecutive_silence_frames_ = 0;
     }
 
     // Calculate durations in samples
     int silence_samples_threshold = (self->silence_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int min_speech_samples = (self->min_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int max_speech_samples = (self->max_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
@@ -170,6 +172,11 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 std::cout << "[VAD] Max duration reached, forcing flush ("
                           << self->speech_samples_count_ / (self->sample_rate_ * self->channels_) << "s)" << std::endl;
 
+                // Calculate metrics BEFORE flushing
+                self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
+                self->last_silence_duration_ms_ = 0; // No trailing silence in a forced flush
+                self->last_flush_reason_ = "max_duration";
+
                 if (self->callback_ && self->speech_buffer_.size() >= static_cast<size_t>(min_speech_samples)) {
                     // Flush any remaining samples from the denoiser
                     if (self->noise_reducer_ && self->noise_reducer_->isEnabled()) {
@@ -183,16 +190,17 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 }
                 self->speech_buffer_.clear();
                 self->speech_samples_count_ = 0;
+                self->consecutive_silence_frames_ = 0; // Reset after forced flush
                 // Reset stream for next segment
                 if (self->noise_reducer_) {
                     self->noise_reducer_->resetStream();
                 }
             }
         } else {
-            // True silence (after hang time expired)
+            // Silence detected
             self->silence_samples_count_ += sample_count;
 
-            // If we were speaking and now have enough silence, flush
+            // If we were speaking and now have silence, track consecutive silence frames
             if (self->speech_buffer_.size() > 0) {
                 // Add trailing silence (denoised)
                 if (!denoised_samples.empty()) {
@@ -204,9 +212,23 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 }
             }
 
-            if (self->silence_samples_count_ >= silence_samples_threshold) {
+            // Increment consecutive silence frame counter
+            self->consecutive_silence_frames_++;
+
+            // Calculate threshold in frames (callbacks)
+            // frames_per_buffer = frame_count from callback
+            int frames_per_buffer = static_cast<int>(frame_count);
+            int silence_threshold_frames = (self->silence_duration_ms_ * self->sample_rate_) / (1000 * frames_per_buffer);
+
+            // Flush when consecutive silence exceeds threshold
+            if (self->consecutive_silence_frames_ >= silence_threshold_frames) {
                 self->is_speech_active_.store(false, std::memory_order_relaxed);
 
+                // Calculate metrics BEFORE flushing
+                self->last_speech_duration_ms_ = (self->speech_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
+                self->last_silence_duration_ms_ = (self->silence_samples_count_ * 1000) / (self->sample_rate_ * self->channels_);
+                self->last_flush_reason_ = "silence_threshold";
+
                 // Flush if we have enough speech
                 if (self->speech_samples_count_ >= min_speech_samples) {
                     // Flush any remaining samples from the denoiser
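For intuition on the new frame-based threshold: with the 700 ms default from config.json, 16 kHz audio, and a hypothetical 512-frame PortAudio buffer, each callback covers 32 ms, so the flush fires after 21 consecutive silent callbacks. A minimal check of the arithmetic (the buffer size is an assumption; PortAudio supplies the real `frame_count`):

```cpp
#include <iostream>

int main() {
    int silence_duration_ms = 700;  // from config.json "vad"
    int sample_rate = 16000;
    int frames_per_buffer = 512;    // hypothetical callback size
    // 512 / 16000 = 32 ms of audio per callback, so 700 ms of silence
    // is roughly 21 consecutive "silent" callbacks.
    int silence_threshold_frames =
        (silence_duration_ms * sample_rate) / (1000 * frames_per_buffer);
    std::cout << silence_threshold_frames << "\n";  // prints 21
    return 0;
}
```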
@@ -220,7 +242,9 @@ int AudioCapture::audioCallback(const void* input, void* output,
 
                     float duration = static_cast<float>(self->speech_buffer_.size()) /
                                      (self->sample_rate_ * self->channels_);
-                    std::cout << "[VAD] Speech ended (noise_floor=" << self->noise_floor_
+                    std::cout << "[VAD] Speech ended (trailing silence detected, "
+                              << self->consecutive_silence_frames_ << " frames, "
+                              << "noise_floor=" << self->noise_floor_
                               << "), flushing " << duration << "s (denoised)" << std::endl;
 
                     if (self->callback_) {
@@ -233,6 +257,7 @@ int AudioCapture::audioCallback(const void* input, void* output,
 
                     self->speech_buffer_.clear();
                     self->speech_samples_count_ = 0;
+                    self->consecutive_silence_frames_ = 0; // Reset after flush
                     // Reset stream for next segment
                     if (self->noise_reducer_) {
                         self->noise_reducer_->resetStream();
src/audio/AudioCapture.h

@@ -16,7 +16,10 @@ class AudioCapture {
 public:
     using AudioCallback = std::function<void(const std::vector<float>&)>;
 
-    AudioCapture(int sample_rate, int channels);
+    AudioCapture(int sample_rate, int channels,
+                 int silence_duration_ms = 700,
+                 int min_speech_duration_ms = 2000,
+                 int max_speech_duration_ms = 30000);
     ~AudioCapture();
 
     bool initialize();
@@ -44,6 +47,11 @@ public:
     void setDenoiseEnabled(bool enabled);
     bool isDenoiseEnabled() const;
 
+    // Get metrics from last flushed segment
+    int getLastSpeechDuration() const { return last_speech_duration_ms_; }
+    int getLastSilenceDuration() const { return last_silence_duration_ms_; }
+    std::string getLastFlushReason() const { return last_flush_reason_; }
+
 private:
     static int audioCallback(const void* input, void* output,
                              unsigned long frame_count,
@@ -69,17 +77,21 @@ private:
     // VAD parameters - Higher threshold to avoid false triggers on filtered noise
     std::atomic<float> vad_rms_threshold_{0.02f};   // Was 0.01f
     std::atomic<float> vad_peak_threshold_{0.08f};  // Was 0.04f
-    int silence_duration_ms_ = 400;      // Wait 400ms of silence before cutting
-    int min_speech_duration_ms_ = 300;   // Minimum speech to send
-    int max_speech_duration_ms_ = 25000; // 25s max before forced flush
+    int silence_duration_ms_;    // Wait 700ms of silence before cutting (was 400)
+    int min_speech_duration_ms_; // Minimum 2s speech to send (was 1000)
+    int max_speech_duration_ms_; // 30s max before forced flush (was 25000)
 
     // Adaptive noise floor
     float noise_floor_ = 0.005f;       // Estimated background noise level
     float noise_floor_alpha_ = 0.001f; // Slower adaptation
 
-    // Hang time - wait before cutting to avoid mid-sentence cuts
-    int hang_frames_ = 0;
-    int hang_frames_threshold_ = 20; // ~200ms tolerance for pauses
+    // Trailing silence detection - count consecutive silence frames after speech
+    int consecutive_silence_frames_ = 0;
+
+    // Metrics for last flushed segment (set in callback, read in processing thread)
+    int last_speech_duration_ms_ = 0;
+    int last_silence_duration_ms_ = 0;
+    std::string last_flush_reason_;
 
     // Zero-crossing rate for speech vs noise discrimination
     float last_zcr_ = 0.0f;
src/core/Pipeline.cpp

@@ -24,12 +24,23 @@ Pipeline::~Pipeline() {
 bool Pipeline::initialize() {
     auto& config = Config::getInstance();
 
+    // Load VAD parameters from config (with fallbacks if missing)
+    int silence_duration = config.getVadSilenceDurationMs();
+    int min_speech = config.getVadMinSpeechDurationMs();
+    int max_speech = config.getVadMaxSpeechDurationMs();
+
     // Initialize audio capture with VAD-based segmentation
     audio_capture_ = std::make_unique<AudioCapture>(
         config.getAudioConfig().sample_rate,
-        config.getAudioConfig().channels
+        config.getAudioConfig().channels,
+        silence_duration,
+        min_speech,
+        max_speech
     );
 
+    std::cout << "[Pipeline] VAD configured: silence=" << silence_duration
+              << "ms, min_speech=" << min_speech
+              << "ms, max_speech=" << max_speech << "ms" << std::endl;
     std::cout << "[Pipeline] VAD-based audio segmentation enabled" << std::endl;
 
     if (!audio_capture_->initialize()) {
@@ -70,6 +81,10 @@ bool Pipeline::start() {
     }
 
     running_ = true;
+    segment_id_ = 0;
+
+    // Start session logging
+    session_logger_.startSession();
 
     // Start background threads
     audio_thread_ = std::thread(&Pipeline::audioThread, this);
@@ -126,6 +141,9 @@ void Pipeline::stop() {
         transcript_ss << "transcripts/transcript_" << timestamp.str() << ".txt";
         ui_->exportTranscript(transcript_ss.str());
     }
+
+    // End session logging
+    session_logger_.endSession();
 }
 
 void Pipeline::audioThread() {
@@ -143,6 +161,8 @@ void Pipeline::audioThread() {
         chunk.sample_rate = config.getAudioConfig().sample_rate;
         chunk.channels = config.getAudioConfig().channels;
 
+        float push_duration = static_cast<float>(audio_data.size()) / (chunk.sample_rate * chunk.channels);
+        std::cout << "[Queue] Pushing " << push_duration << "s chunk, queue size: " << audio_queue_.size() << std::endl;
         audio_queue_.push(std::move(chunk));
     });
@@ -159,6 +179,7 @@ void Pipeline::audioThread() {
 
 void Pipeline::processingThread() {
     auto& config = Config::getInstance();
+    int audio_segment_id = 0;
 
     while (running_) {
         auto chunk_opt = audio_queue_.wait_and_pop();
@@ -168,7 +189,42 @@ void Pipeline::processingThread() {
 
         auto& chunk = chunk_opt.value();
         float duration = static_cast<float>(chunk.data.size()) / (chunk.sample_rate * chunk.channels);
-        std::cout << "[Processing] Speech segment: " << duration << "s" << std::endl;
+
+        // Debug: log queue size to detect double-push
+        std::cout << "[Queue] Processing chunk, " << audio_queue_.size() << " remaining" << std::endl;
+
+        // Save audio segment to session directory for debugging
+        audio_segment_id++;
+        if (session_logger_.isActive()) {
+            std::stringstream audio_path;
+            audio_path << session_logger_.getSessionPath() << "/audio_"
+                       << std::setfill('0') << std::setw(3) << audio_segment_id << ".ogg";
+
+            AudioBuffer segment_buffer(chunk.sample_rate, chunk.channels);
+            segment_buffer.addSamples(chunk.data);
+            if (segment_buffer.saveToOpus(audio_path.str())) {
+                std::cout << "[Session] Saved audio segment: " << audio_path.str() << std::endl;
+            }
+        }
+
+        // Calculate audio RMS for logging
+        float audio_rms = 0.0f;
+        if (!chunk.data.empty()) {
+            float sum_sq = 0.0f;
+            for (float s : chunk.data) sum_sq += s * s;
+            audio_rms = std::sqrt(sum_sq / chunk.data.size());
+        }
+
+        std::cout << "[Processing] Speech segment: " << duration << "s (RMS=" << audio_rms << ")" << std::endl;
+
+        // Time Whisper
+        auto whisper_start = std::chrono::steady_clock::now();
+
+        // Build dynamic prompt with recent context
+        std::string dynamic_prompt = buildDynamicPrompt();
+        if (!recent_transcriptions_.empty()) {
+            std::cout << "[Context] Using " << recent_transcriptions_.size() << " previous segments" << std::endl;
+        }
 
         // Transcribe with Whisper
         auto whisper_result = whisper_client_->transcribe(
@@ -178,12 +234,17 @@ void Pipeline::processingThread() {
             config.getWhisperConfig().model,
             config.getWhisperConfig().language,
             config.getWhisperConfig().temperature,
-            config.getWhisperConfig().prompt,
+            dynamic_prompt,
             config.getWhisperConfig().response_format
         );
 
+        auto whisper_end = std::chrono::steady_clock::now();
+        int64_t whisper_latency = std::chrono::duration_cast<std::chrono::milliseconds>(
+            whisper_end - whisper_start).count();
+
         if (!whisper_result.has_value()) {
             std::cerr << "Whisper transcription failed" << std::endl;
+            session_logger_.logFilteredSegment("", "whisper_failed", duration, audio_rms);
             continue;
         }
@@ -195,6 +256,7 @@ void Pipeline::processingThread() {
         size_t end = text.find_last_not_of(" \t\n\r");
         if (start == std::string::npos) {
             std::cout << "[Skip] Empty transcription" << std::endl;
+            session_logger_.logFilteredSegment("", "empty", duration, audio_rms);
             continue;
         }
         text = text.substr(start, end - start + 1);
@@ -267,14 +329,32 @@ void Pipeline::processingThread() {
 
         if (is_garbage) {
             std::cout << "[Skip] Filtered: " << text << std::endl;
+            session_logger_.logFilteredSegment(text, "hallucination", duration, audio_rms);
             continue;
         }
 
+        // Deduplication: skip if exact same as last transcription
+        if (text == last_transcription_) {
+            std::cout << "[Skip] Duplicate: " << text << std::endl;
+            session_logger_.logFilteredSegment(text, "duplicate", duration, audio_rms);
+            continue;
+        }
+        last_transcription_ = text;
+
+        // Update dynamic context for next Whisper call
+        recent_transcriptions_.push_back(text);
+        if (recent_transcriptions_.size() > MAX_CONTEXT_SEGMENTS) {
+            recent_transcriptions_.erase(recent_transcriptions_.begin());
+        }
+
         // Track audio cost
         if (ui_) {
             ui_->addAudioCost(duration);
         }
 
+        // Time Claude
+        auto claude_start = std::chrono::steady_clock::now();
+
         // Translate with Claude
         auto claude_result = claude_client_->translate(
             text,
@@ -283,8 +363,13 @@ void Pipeline::processingThread() {
             config.getClaudeConfig().temperature
         );
 
+        auto claude_end = std::chrono::steady_clock::now();
+        int64_t claude_latency = std::chrono::duration_cast<std::chrono::milliseconds>(
+            claude_end - claude_start).count();
+
         if (!claude_result.has_value()) {
             std::cerr << "Claude translation failed" << std::endl;
+            session_logger_.logFilteredSegment(text, "claude_failed", duration, audio_rms);
             continue;
         }
@@ -308,8 +393,28 @@ void Pipeline::processingThread() {
             ui_->setAccumulatedText(accumulated_chinese_, accumulated_french_);
             ui_->addTranslation(text, claude_result->text);
 
+            // Log successful segment
+            segment_id_++;
+            SegmentLog seg;
+            seg.id = segment_id_;
+            seg.chinese = text;
+            seg.french = claude_result->text;
+            seg.audio_duration_sec = duration;
+            seg.audio_rms = audio_rms;
+            seg.whisper_latency_ms = whisper_latency;
+            seg.claude_latency_ms = claude_latency;
+            seg.was_filtered = false;
+            seg.filter_reason = "";
+            seg.timestamp = ""; // Will be set by logger
+            // Add VAD metrics from AudioCapture
+            seg.speech_duration_ms = audio_capture_->getLastSpeechDuration();
+            seg.silence_duration_ms = audio_capture_->getLastSilenceDuration();
+            seg.flush_reason = audio_capture_->getLastFlushReason();
+            session_logger_.logSegment(seg);
+
             std::cout << "CN: " << text << std::endl;
             std::cout << "FR: " << claude_result->text << std::endl;
+            std::cout << "[Latency] Whisper: " << whisper_latency << "ms, Claude: " << claude_latency << "ms" << std::endl;
             std::cout << "---" << std::endl;
         }
     }
@@ -358,10 +463,34 @@ bool Pipeline::shouldClose() const {
 void Pipeline::clearAccumulated() {
     accumulated_chinese_.clear();
     accumulated_french_.clear();
+    recent_transcriptions_.clear();
+    last_transcription_.clear();
     if (ui_) {
         ui_->setAccumulatedText("", "");
     }
-    std::cout << "[Pipeline] Cleared accumulated text" << std::endl;
+    std::cout << "[Pipeline] Cleared accumulated text and context" << std::endl;
 }
 
+std::string Pipeline::buildDynamicPrompt() const {
+    auto& config = Config::getInstance();
+    std::string base_prompt = config.getWhisperConfig().prompt;
+
+    // If no recent transcriptions, just return base prompt
+    if (recent_transcriptions_.empty()) {
+        return base_prompt;
+    }
+
+    // Build context from recent transcriptions
+    std::stringstream context;
+    context << base_prompt;
+    context << "\n\nContexte des phrases précédentes:\n";
+
+    for (size_t i = 0; i < recent_transcriptions_.size(); ++i) {
+        context << std::to_string(i + 1) << ". "
+                << recent_transcriptions_[i] << "\n";
+    }
+
+    return context.str();
+}
+
 } // namespace secondvoice
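To make the context mechanism concrete: with the base prompt from config.json and two recent segments in `recent_transcriptions_`, `buildDynamicPrompt()` returns a string shaped like this (the base prompt is elided with `[...]`; the segment texts are examples taken from elsewhere in this document):

```
Transcription en direct d'une conversation en chinois mandarin. [...]

Contexte des phrases précédentes:
1. 两个老鼠求我
2. 太多声音了
```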
src/core/Pipeline.h

@@ -6,6 +6,7 @@
 #include <string>
 #include <vector>
 #include "../utils/ThreadSafeQueue.h"
+#include "../utils/SessionLogger.h"
 
 namespace secondvoice {
@@ -60,6 +61,20 @@ private:
     // Simple accumulation
     std::string accumulated_chinese_;
     std::string accumulated_french_;
 
+    // Dynamic context for Whisper (last N transcriptions)
+    std::vector<std::string> recent_transcriptions_;
+    static constexpr size_t MAX_CONTEXT_SEGMENTS = 3;
+
+    // Deduplication: skip if same as last transcription
+    std::string last_transcription_;
+
+    // Build dynamic prompt with recent context
+    std::string buildDynamicPrompt() const;
+
+    // Session logging
+    SessionLogger session_logger_;
+    int segment_id_ = 0;
 };
 
 } // namespace secondvoice
src/utils/Config.cpp

@@ -52,10 +52,9 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
     std::cerr << "[Config] File opened successfully" << std::endl;
 
-    json config_json;
     try {
         std::cerr << "[Config] About to parse JSON..." << std::endl;
-        config_file >> config_json;
+        config_file >> config_;
         std::cerr << "[Config] JSON parsed successfully" << std::endl;
     } catch (const json::parse_error& e) {
         std::cerr << "Error parsing config.json: " << e.what() << std::endl;
@@ -66,8 +65,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
 
     // Parse audio config
-    if (config_json.contains("audio")) {
-        auto& audio = config_json["audio"];
+    if (config_.contains("audio")) {
+        auto& audio = config_["audio"];
         audio_config_.sample_rate = audio.value("sample_rate", 16000);
         audio_config_.channels = audio.value("channels", 1);
         audio_config_.chunk_duration_seconds = audio.value("chunk_duration_seconds", 10);
@@ -76,8 +75,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
 
     // Parse whisper config
-    if (config_json.contains("whisper")) {
-        auto& whisper = config_json["whisper"];
+    if (config_.contains("whisper")) {
+        auto& whisper = config_["whisper"];
         whisper_config_.model = whisper.value("model", "whisper-1");
         whisper_config_.language = whisper.value("language", "zh");
         whisper_config_.temperature = whisper.value("temperature", 0.0f);
@@ -87,8 +86,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
 
     // Parse claude config
-    if (config_json.contains("claude")) {
-        auto& claude = config_json["claude"];
+    if (config_.contains("claude")) {
+        auto& claude = config_["claude"];
         claude_config_.model = claude.value("model", "claude-haiku-4-20250514");
         claude_config_.max_tokens = claude.value("max_tokens", 1024);
         claude_config_.temperature = claude.value("temperature", 0.3f);
@@ -96,8 +95,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
 
     // Parse UI config
-    if (config_json.contains("ui")) {
-        auto& ui = config_json["ui"];
+    if (config_.contains("ui")) {
+        auto& ui = config_["ui"];
         ui_config_.window_width = ui.value("window_width", 800);
         ui_config_.window_height = ui.value("window_height", 600);
         ui_config_.font_size = ui.value("font_size", 16);
@@ -105,8 +104,8 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
     }
 
     // Parse recording config
-    if (config_json.contains("recording")) {
-        auto& recording = config_json["recording"];
+    if (config_.contains("recording")) {
+        auto& recording = config_["recording"];
         recording_config_.save_audio = recording.value("save_audio", true);
         recording_config_.output_directory = recording.value("output_directory", "./recordings");
     }
@@ -114,4 +113,25 @@ bool Config::load(const std::string& config_path, const std::string& env_path) {
 
     return true;
 }
 
+int Config::getVadSilenceDurationMs() const {
+    if (config_.contains("vad") && config_["vad"].contains("silence_duration_ms")) {
+        return config_["vad"]["silence_duration_ms"].get<int>();
+    }
+    return 700; // Default from AudioCapture.h:72 (unchanged)
+}
+
+int Config::getVadMinSpeechDurationMs() const {
+    if (config_.contains("vad") && config_["vad"].contains("min_speech_duration_ms")) {
+        return config_["vad"]["min_speech_duration_ms"].get<int>();
+    }
+    return 2000; // Default from AudioCapture.h:73 (updated in TASK2)
+}
+
+int Config::getVadMaxSpeechDurationMs() const {
+    if (config_.contains("vad") && config_["vad"].contains("max_speech_duration_ms")) {
+        return config_["vad"]["max_speech_duration_ms"].get<int>();
+    }
+    return 30000; // Default from AudioCapture.h:74 (updated in TASK2)
+}
+
 } // namespace secondvoice
src/utils/Config.h

@@ -1,6 +1,7 @@
 #pragma once
 
 #include <string>
+#include <nlohmann/json.hpp>
 
 namespace secondvoice {
@@ -55,6 +56,10 @@ public:
     const std::string& getOpenAIKey() const { return openai_key_; }
     const std::string& getAnthropicKey() const { return anthropic_key_; }
 
+    int getVadSilenceDurationMs() const;
+    int getVadMinSpeechDurationMs() const;
+    int getVadMaxSpeechDurationMs() const;
+
 private:
     Config() = default;
    Config(const Config&) = delete;
@@ -68,6 +73,7 @@ private:
 
     std::string openai_key_;
     std::string anthropic_key_;
+    nlohmann::json config_;
 };
 
 } // namespace secondvoice
src/utils/SessionLogger.cpp (new file, 201 lines)

@@ -0,0 +1,201 @@
#include "SessionLogger.h"
#include <nlohmann/json.hpp>
#include <filesystem>
#include <iostream>
#include <iomanip>
#include <sstream>

namespace secondvoice {

using json = nlohmann::json;

SessionLogger::SessionLogger() = default;

SessionLogger::~SessionLogger() {
    if (is_active_) {
        endSession();
    }
}

std::string SessionLogger::getCurrentTimestamp() const {
    auto now = std::chrono::system_clock::now();
    auto time_t = std::chrono::system_clock::to_time_t(now);
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        now.time_since_epoch()) % 1000;

    std::stringstream ss;
    ss << std::put_time(std::localtime(&time_t), "%Y-%m-%d_%H%M%S");
    return ss.str();
}

void SessionLogger::startSession() {
    if (is_active_) {
        endSession();
    }

    session_start_time_ = getCurrentTimestamp();
    session_path_ = "./sessions/" + session_start_time_;

    // Create directories
    std::filesystem::create_directories(session_path_ + "/segments");

    is_active_ = true;
    segment_count_ = 0;
    filtered_count_ = 0;
    total_audio_sec_ = 0.0f;
    total_whisper_ms_ = 0;
    total_claude_ms_ = 0;
    segments_.clear();

    std::cout << "[Session] Started: " << session_path_ << std::endl;
}

void SessionLogger::endSession() {
    if (!is_active_) return;

    writeSessionJson();
    is_active_ = false;

    std::cout << "[Session] Ended: " << segment_count_ << " segments, "
              << filtered_count_ << " filtered, "
              << total_audio_sec_ << "s audio" << std::endl;
}

void SessionLogger::logSegment(const SegmentLog& segment) {
    if (!is_active_) return;

    // Update counters
    segment_count_++;
    total_audio_sec_ += segment.audio_duration_sec;
    total_whisper_ms_ += segment.whisper_latency_ms;
    total_claude_ms_ += segment.claude_latency_ms;

    // Store segment
    segments_.push_back(segment);

    // Write individual segment JSON
    std::stringstream filename;
    filename << session_path_ << "/segments/"
             << std::setfill('0') << std::setw(3) << segment.id << ".json";

    json j;
    j["id"] = segment.id;
    j["chinese"] = segment.chinese;
    j["french"] = segment.french;
    j["audio_duration_sec"] = segment.audio_duration_sec;
    j["audio_rms"] = segment.audio_rms;
    j["whisper_latency_ms"] = segment.whisper_latency_ms;
    j["claude_latency_ms"] = segment.claude_latency_ms;
    j["was_filtered"] = segment.was_filtered;
    j["filter_reason"] = segment.filter_reason;
    j["timestamp"] = segment.timestamp;
    j["vad_metrics"] = {
        {"speech_duration_ms", segment.speech_duration_ms},
        {"silence_duration_ms", segment.silence_duration_ms},
        {"flush_reason", segment.flush_reason}
    };

    std::ofstream file(filename.str());
    if (file.is_open()) {
        file << j.dump(2);
        file.close();
    }

    std::cout << "[Session] Logged segment #" << segment.id
              << " (" << segment.audio_duration_sec << "s)" << std::endl;
}

void SessionLogger::logFilteredSegment(const std::string& chinese, const std::string& reason,
                                       float audio_duration, float audio_rms) {
    if (!is_active_) return;

    filtered_count_++;
    total_audio_sec_ += audio_duration;

    // Log filtered segment with special marker
    SegmentLog seg;
    seg.id = segment_count_ + filtered_count_;
    seg.chinese = chinese;
    seg.french = "[FILTERED]";
    seg.audio_duration_sec = audio_duration;
    seg.audio_rms = audio_rms;
    seg.whisper_latency_ms = 0;
    seg.claude_latency_ms = 0;
    seg.was_filtered = true;
    seg.filter_reason = reason;
    seg.timestamp = getCurrentTimestamp();

    segments_.push_back(seg);

    // Write filtered segment JSON
    std::stringstream filename;
    filename << session_path_ << "/segments/"
             << std::setfill('0') << std::setw(3) << seg.id << "_filtered.json";

    json j;
    j["id"] = seg.id;
    j["chinese"] = seg.chinese;
    j["filter_reason"] = reason;
    j["audio_duration_sec"] = audio_duration;
    j["audio_rms"] = audio_rms;
    j["timestamp"] = seg.timestamp;

    std::ofstream file(filename.str());
    if (file.is_open()) {
        file << j.dump(2);
        file.close();
    }
}

void SessionLogger::writeSessionJson() {
    json session;
    session["start_time"] = session_start_time_;
    session["end_time"] = getCurrentTimestamp();
    session["total_segments"] = segment_count_;
    session["filtered_segments"] = filtered_count_;
    session["total_audio_seconds"] = total_audio_sec_;
    session["avg_whisper_latency_ms"] = segment_count_ > 0 ?
        total_whisper_ms_ / segment_count_ : 0;
    session["avg_claude_latency_ms"] = segment_count_ > 0 ?
        total_claude_ms_ / segment_count_ : 0;

    // Summary of all segments
    json segments_summary = json::array();
    for (const auto& seg : segments_) {
        json s;
        s["id"] = seg.id;
        s["chinese"] = seg.chinese;
        s["french"] = seg.french;
        s["duration"] = seg.audio_duration_sec;
        s["filtered"] = seg.was_filtered;
        if (seg.was_filtered) {
            s["filter_reason"] = seg.filter_reason;
        }
        segments_summary.push_back(s);
    }
    session["segments"] = segments_summary;

    std::string filepath = session_path_ + "/session.json";
    std::ofstream file(filepath);
    if (file.is_open()) {
        file << session.dump(2);
        file.close();
        std::cout << "[Session] Wrote " << filepath << std::endl;
    }

    // Also write plain text transcript
    std::string transcript_path = session_path_ + "/transcript.txt";
    std::ofstream transcript(transcript_path);
    if (transcript.is_open()) {
        transcript << "=== SecondVoice Session " << session_start_time_ << " ===\n\n";
        for (const auto& seg : segments_) {
            if (!seg.was_filtered) {
                transcript << "CN: " << seg.chinese << "\n";
                transcript << "FR: " << seg.french << "\n\n";
            }
        }
        transcript.close();
    }
}

} // namespace secondvoice
src/utils/SessionLogger.h (new file, 68 lines)

@@ -0,0 +1,68 @@
#pragma once

#include <string>
#include <vector>
#include <chrono>
#include <fstream>

namespace secondvoice {

struct SegmentLog {
    int id;
    std::string chinese;
    std::string french;
    float audio_duration_sec;
    float audio_rms;
    int64_t whisper_latency_ms;
    int64_t claude_latency_ms;
    bool was_filtered;
    std::string filter_reason;
    std::string timestamp;

    // VAD metrics (added for TASK8)
    int speech_duration_ms = 0;
    int silence_duration_ms = 0;
    std::string flush_reason = "";
};

class SessionLogger {
public:
    SessionLogger();
    ~SessionLogger();

    // Start a new session (creates directory)
    void startSession();

    // End session (writes session.json summary)
    void endSession();

    // Log a segment
    void logSegment(const SegmentLog& segment);

    // Log a filtered/skipped segment
    void logFilteredSegment(const std::string& chinese, const std::string& reason,
                            float audio_duration, float audio_rms);

    // Get current session path
    std::string getSessionPath() const { return session_path_; }

    // Check if session is active
    bool isActive() const { return is_active_; }

private:
    std::string getCurrentTimestamp() const;
    void writeSessionJson();

    bool is_active_ = false;
    std::string session_path_;
    std::string session_start_time_;
    int segment_count_ = 0;
    int filtered_count_ = 0;
    float total_audio_sec_ = 0.0f;
    int total_whisper_ms_ = 0;
    int total_claude_ms_ = 0;

    std::vector<SegmentLog> segments_;
};

} // namespace secondvoice