refactor: Improve VAD trailing silence detection and update docs
- Replace hang time logic with a consecutive silence frame counter for more precise speech-end detection
- Update the Whisper prompt to use previous context for better transcription coherence
- Expand README with a comprehensive feature list, architecture details, debugging status, and session logging structure
- Add a troubleshooting section for real-world testing conditions and known issues
parent a28bb89913
commit db0f8e5990

README.md: 334 changes
@@ -4,16 +4,50 @@ Real-time Chinese to French translation system for live meetings.
 
 ## Overview
 
-SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API, and translates it to French using Claude AI in real-time. Perfect for understanding Chinese meetings on the fly.
+SecondVoice captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI in real-time. Designed for understanding Chinese meetings, calls, and conversations on the fly.
+
+### Why This Project?
+
+Built to solve a real need: understanding Chinese meetings in real-time without constant reliance on bilingual support. Perfect for:
+
+- Business meetings with Chinese speakers
+- Family/administrative calls
+- Professional conferences
+- Any live Chinese conversation where real-time comprehension is needed
+
+**Status**: MVP complete, actively being debugged and improved based on real-world usage.
+
+## Quick Start
+
+### Windows (MinGW) - Recommended
+
+```batch
+# First-time setup
+.\setup_mingw.bat
+
+# Build
+.\build_mingw.bat
+
+# Run
+cd build\mingw-Release
+SecondVoice.exe
+```
+
+**Requirements**: `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, plus a working microphone.
+
+See full setup instructions below for other platforms.
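For reference, the `.env` file the Quick Start expects would look like this sketch (placeholder values, matching the two key names the README lists):

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```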
 
 ## Features
 
-- 🎤 Real-time audio capture
-- 🗣️ Chinese speech-to-text (Whisper API)
-- 🌐 Chinese to French translation (Claude API)
-- 🖥️ Clean ImGui interface
-- 💾 Full recording saved to disk
-- ⚙️ Configurable chunk sizes and settings
+- 🎤 **Real-time audio capture** with Voice Activity Detection (VAD)
+- 🔇 **Noise reduction** using RNNoise neural network
+- 🗣️ **Chinese speech-to-text** via Whisper API (gpt-4o-mini-transcribe)
+- 🧠 **Hallucination filtering** - removes known Whisper artifacts
+- 🌐 **Chinese to French translation** via Claude AI (claude-haiku-4-20250514)
+- 🖥️ **Clean ImGui interface** with adjustable VAD thresholds
+- 💾 **Full session recording** with structured logging
+- 📊 **Session archival** - audio, transcripts, translations, and metadata
+- ⚡ **Opus compression** - 46x bandwidth reduction (16kHz PCM → 24kbps Opus)
+- ⚙️ **Configurable settings** via config.json
 
 ## Requirements
@@ -116,20 +150,138 @@ The application will:
 
 ## Architecture
 
 ```
-Audio Capture (PortAudio)
-    ↓
-Whisper API (Speech-to-Text)
-    ↓
-Claude API (Translation)
-    ↓
-ImGui UI (Display)
+Audio Input (16kHz mono)
+    ↓
+Voice Activity Detection (VAD) - RMS + Peak thresholds
+    ↓
+Noise Reduction (RNNoise) - 16→48→16 kHz resampling
+    ↓
+Opus Encoding (24kbps OGG) - 46x compression
+    ↓
+Whisper API (gpt-4o-mini-transcribe) - Chinese STT
+    ↓
+Hallucination Filter - Remove known artifacts
+    ↓
+Claude API (claude-haiku-4) - Chinese → French translation
+    ↓
+ImGui UI Display + Session Logging
 ```
 
-### Threading Model
+### Threading Model (3 threads)
 
-- **Thread 1**: Audio capture (PortAudio callback)
-- **Thread 2**: AI processing (Whisper + Claude API calls)
-- **Thread 3**: UI rendering (ImGui + OpenGL)
+1. **Audio Thread** (`Pipeline::audioThread`)
+   - PortAudio callback captures 16kHz mono audio
+   - Applies VAD (Voice Activity Detection) using RMS + Peak thresholds
+   - Pushes speech chunks to the processing queue
+
+2. **Processing Thread** (`Pipeline::processingThread`)
+   - Consumes audio chunks from the queue
+   - Applies RNNoise denoising (upsampled to 48kHz → denoised → downsampled to 16kHz)
+   - Encodes to Opus/OGG for bandwidth efficiency
+   - Calls the Whisper API for Chinese transcription
+   - Filters known hallucinations (YouTube phrases, music markers, etc.)
+   - Calls the Claude API for French translation
+   - Logs to session files
+
+3. **UI Thread** (main)
+   - GLFW/ImGui rendering loop (must run on the main thread)
+   - Displays real-time transcription and translation
+   - Allows runtime VAD threshold adjustment
+   - Handles user controls (stop recording, etc.)
+
+### Core Components
+
+**Audio Processing**:
+- `AudioCapture.cpp` - PortAudio wrapper with VAD-based segmentation
+- `AudioBuffer.cpp` - Accumulates samples, exports WAV/Opus
+- `NoiseReducer.cpp` - RNNoise denoising with resampling
+
+**API Clients**:
+- `WhisperClient.cpp` - OpenAI Whisper API (multipart/form-data)
+- `ClaudeClient.cpp` - Anthropic Claude API (JSON)
+- `WinHttpClient.cpp` - Native Windows HTTP client (replaced libcurl)
+
+**Core Logic**:
+- `Pipeline.cpp` - Orchestrates the audio → transcription → translation flow
+- `TranslationUI.cpp` - ImGui interface with VAD controls
+
+**Utilities**:
+- `Config.cpp` - Loads config.json + .env
+- `ThreadSafeQueue.h` - Lock-free queue for audio chunks
+## Known Issues & Active Debugging
+
+**Status**: Real-world testing has identified issues with degraded audio conditions (see `PLAN_DEBUG.md` for details).
+
+### Current Problems
+
+Based on transcript analysis from actual meetings (November 2025):
+
+1. **VAD cutting speech too early**
+   - Voice Activity Detection triggers end-of-segment prematurely
+   - Results in fragmented phrases ("我很。" → "Je suis.")
+   - **Hypothesis**: Silence threshold too aggressive for multi-speaker scenarios
+
+2. **Segments too short for context**
+   - Whisper receives insufficient audio context for accurate Chinese transcription
+   - Single-word or two-word segments lack conversational context
+   - **Impact**: Lower accuracy, especially with homonyms
+
+3. **Ambient noise interpreted as speech**
+   - Background sounds trigger false VAD positives
+   - Test transcript shows "太多声音了" (too much noise) being captured
+   - **Mitigation**: RNNoise helps but is not sufficient for very noisy environments
+
+4. **Loss of inter-segment context**
+   - Each audio chunk is processed independently
+   - Whisper cannot use previous context for better transcription
+   - **Potential solution**: Pass the previous 2-3 transcriptions in the prompt
+
+### Test Conditions
+
+Testing has been performed under **deliberately degraded conditions** to ensure robustness:
+- Multiple simultaneous speakers
+- Variable microphone distance
+- Variable volume levels
+- Fast-paced conversations
+- Low-quality microphone
+
+These conditions are intentionally harsh to validate real-world meeting scenarios.
+
+### Debug Plan
+
+See `PLAN_DEBUG.md` for:
+- Detailed session logging implementation (JSON per segment + metadata)
+- Improved Whisper prompt engineering
+- VAD threshold tuning recommendations
+- Context propagation strategies
+## Session Logging
+
+### Structure
+
+```
+sessions/
+└── YYYY-MM-DD_HHMMSS/
+    ├── session.json      # Session metadata
+    ├── segments/
+    │   ├── 001.json      # Segment: Chinese + French + metadata
+    │   ├── 002.json
+    │   └── ...
+    └── transcript.txt    # Final export
+```
+
+### Segment Format
+
+```json
+{
+  "id": 1,
+  "chinese": "两个老鼠求我",
+  "french": "Deux souris me supplient"
+}
+```
+
+**Future enhancements**: Audio duration, RMS levels, timestamps, Whisper/Claude latencies per segment.
 
 ## Configuration
@@ -143,8 +295,9 @@ ImGui UI (Display)
     "chunk_duration_seconds": 10
   },
   "whisper": {
-    "model": "whisper-1",
-    "language": "zh"
+    "model": "gpt-4o-mini-transcribe",
+    "language": "zh",
+    "prompt": "Transcription d'une réunion en chinois mandarin. Plusieurs interlocuteurs. Ne transcris PAS : musique, silence, bruits de fond. Si l'audio est inaudible, renvoie une chaîne vide. Noms possibles: Tingting, Alexis."
   },
   "claude": {
     "model": "claude-haiku-4-20250514",
@@ -166,23 +319,33 @@ ANTHROPIC_API_KEY=sk-ant-...
 
 - **Claude Haiku**: ~$0.03-0.05/hour
 - **Total**: ~$0.40/hour of recording
 
-## Project Structure
+## Advanced Features
 
-```
-secondvoice/
-├── src/
-│   ├── main.cpp        # Entry point
-│   ├── audio/          # Audio capture & buffer
-│   ├── api/            # Whisper & Claude clients
-│   ├── ui/             # ImGui interface
-│   ├── utils/          # Config & thread-safe queue
-│   └── core/           # Pipeline orchestration
-├── docs/               # Documentation
-├── recordings/         # Output recordings
-├── config.json         # Runtime configuration
-├── .env                # API keys (not committed)
-└── CMakeLists.txt      # Build configuration
-```
+### GPU Forcing (Hybrid Graphics Systems)
+
+`main.cpp` exports symbols to force the dedicated GPU on Optimus/PowerXpress systems:
+- `NvOptimusEnablement` - Forces NVIDIA GPU
+- `AmdPowerXpressRequestHighPerformance` - Forces AMD GPU
+
+Critical for laptops with both integrated and dedicated GPUs.
+
+### Hallucination Filtering
+
+`Pipeline.cpp` maintains an extensive list (~65 patterns) of known Whisper hallucinations:
+- YouTube phrases: "Thank you for watching", "Subscribe", "Like and comment"
+- Chinese video endings: "谢谢观看", "再见", "订阅我的频道"
+- Music symbols: "♪♪", "🎵"
+- Silence markers: "...", "silence", "inaudible"
+
+These are automatically filtered before translation to avoid wasting API calls.
+### Console-Only Build
+
+A `SecondVoice_Console` target exists for headless testing:
+- Uses `main_console.cpp`
+- No ImGui/GLFW dependencies
+- Outputs transcriptions to stdout
+- Useful for debugging and automated testing
 
 ## Development
@@ -219,30 +382,101 @@ cmake --build build
 
 - Check all system dependencies are installed
 - Try `cmake --build build --clean-first`
 
+## Project Structure
+
+```
+secondvoice/
+├── src/
+│   ├── main.cpp                # Entry point, forces NVIDIA GPU
+│   ├── core/
+│   │   └── Pipeline.cpp        # Audio→Transcription→Translation orchestration
+│   ├── audio/
+│   │   ├── AudioCapture.cpp    # PortAudio + VAD segmentation
+│   │   ├── AudioBuffer.cpp     # Sample accumulation, WAV/Opus export
+│   │   └── NoiseReducer.cpp    # RNNoise (16→48→16 kHz)
+│   ├── api/
+│   │   ├── WhisperClient.cpp   # OpenAI Whisper (multipart/form-data)
+│   │   ├── ClaudeClient.cpp    # Anthropic Claude (JSON)
+│   │   └── WinHttpClient.cpp   # Native Windows HTTP
+│   ├── ui/
+│   │   └── TranslationUI.cpp   # ImGui interface + VAD controls
+│   └── utils/
+│       ├── Config.cpp          # config.json + .env loader
+│       └── ThreadSafeQueue.h   # Lock-free audio queue
+├── docs/                       # Build guides
+├── sessions/                   # Session recordings + logs
+├── recordings/                 # Legacy recordings directory
+├── denoised/                   # Denoised audio outputs
+├── config.json                 # Runtime configuration
+├── .env                        # API keys (not committed)
+├── CLAUDE.md                   # Development guide for Claude Code
+├── PLAN_DEBUG.md               # Active debugging plan
+└── CMakeLists.txt              # Build configuration
+```
+
+### External Dependencies
+
+**Fetched via CMake FetchContent**:
+- ImGui v1.90.1 - UI framework
+- Opus v1.5.2 - Audio encoding
+- Ogg v1.3.6 - Container format
+- RNNoise v0.1.1 - Neural network noise reduction
+
+**vcpkg Dependencies** (x64-mingw-static triplet):
+- portaudio - Cross-platform audio I/O
+- nlohmann_json - JSON parsing
+- glfw3 - Windowing/input
+- glad - OpenGL loader
 
 ## Roadmap
 
-### Phase 1 - MVP (Current)
+### Phase 1 - MVP ✅ (Complete)
-- ✅ Audio capture
-- ✅ Whisper integration
-- ✅ Claude integration
-- ✅ ImGui UI
-- ✅ Stop button
+- ✅ Audio capture with VAD
+- ✅ Noise reduction (RNNoise)
+- ✅ Whisper API integration
+- ✅ Claude API integration
+- ✅ ImGui UI with runtime VAD adjustment
+- ✅ Opus compression
+- ✅ Hallucination filtering
+- ✅ Session recording
 
-### Phase 2 - Enhancement
+### Phase 2 - Debugging 🔄 (Current)
-- ⬜ Auto-summary post-meeting
-- ⬜ Export transcripts
-- ⬜ Search functionality
+- 🔄 Session logging (JSON per segment)
+- 🔄 Improved Whisper prompt engineering
+- 🔄 VAD threshold optimization
+- 🔄 Context propagation between segments
+- ⬜ Automated testing with sample audio
+
+### Phase 3 - Enhancement
+- ⬜ Auto-summary post-meeting (Claude analysis)
+- ⬜ Full-text search (SQLite FTS5)
+- ⬜ Semantic search (embeddings)
 - ⬜ Speaker diarization
-- ⬜ Replay mode
+- ⬜ Replay mode with synced transcripts
+- ⬜ Multi-language support extension
+
+## Development Documentation
+
+- **CLAUDE.md** - Development guide for Claude Code AI assistant
+- **PLAN_DEBUG.md** - Active debugging plan with identified issues and solutions
+- **WINDOWS_BUILD.md** - Detailed Windows build instructions
+- **WINDOWS_MINGW.md** - MinGW-specific build guide
+- **WINDOWS_QUICK_START.md** - Quick start for Windows users
+
+## Contributing
+
+This is a personal project built to solve a real need. Bug reports and suggestions welcome:
+
+**Known issues**: See `PLAN_DEBUG.md` for current debugging efforts
+**Architecture**: See `CLAUDE.md` for detailed system design
 
 ## License
 
 See LICENSE file.
 
-## Contributing
+## Acknowledgments
 
-This is a personal project, but suggestions and bug reports are welcome via issues.
+- OpenAI Whisper for excellent Chinese transcription
+- Anthropic Claude for context-aware translation
+- RNNoise for neural network-based noise reduction
+- ImGui for clean, immediate-mode UI
-
-## Contact
-
-See docs/SecondVoice.md for project context and motivation.
@@ -10,7 +10,7 @@
     "model": "gpt-4o-mini-transcribe",
     "language": "zh",
     "temperature": 0.0,
-    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
+    "prompt": "Transcription en direct d'une conversation en chinois mandarin. Plusieurs interlocuteurs parlent, parfois en même temps. Si un contexte de phrases précédentes est fourni, utilise-le pour maintenir la cohérence (noms propres, sujets, terminologie). RÈGLES STRICTES: (1) Ne transcris QUE les paroles audibles en chinois. (2) Si l'audio est inaudible, du bruit, ou du silence, renvoie une chaîne vide. (3) NE GÉNÈRE JAMAIS ces phrases: 谢谢观看, 感谢收看, 订阅, 请订阅, 下期再见, Thank you, Subscribe, 字幕. (4) Ignore: musique, applaudissements, rires, bruits de fond, respirations.",
     "stream": false,
     "response_format": "text"
   },
@@ -135,16 +135,12 @@ int AudioCapture::audioCallback(const void* input, void* output,
     // Speech = energy OK AND (ZCR OK or very high energy)
     bool frame_has_speech = energy_ok && (zcr_ok || denoised_rms > adaptive_rms_thresh * 3.0f);
 
-    // Hang time logic: don't immediately cut on silence
+    // Reset trailing silence counter when speech detected
     if (frame_has_speech) {
-        self->hang_frames_ = self->hang_frames_threshold_; // Reset hang counter
-    } else if (self->hang_frames_ > 0) {
-        self->hang_frames_--;
-        frame_has_speech = true; // Keep "speaking" during hang time
+        self->consecutive_silence_frames_ = 0;
     }
 
     // Calculate durations in samples
-    int silence_samples_threshold = (self->silence_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int min_speech_samples = (self->min_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
     int max_speech_samples = (self->max_speech_duration_ms_ * self->sample_rate_ * self->channels_) / 1000;
@@ -183,16 +179,17 @@ int AudioCapture::audioCallback(const void* input, void* output,
             }
             self->speech_buffer_.clear();
             self->speech_samples_count_ = 0;
+            self->consecutive_silence_frames_ = 0; // Reset after forced flush
             // Reset stream for next segment
             if (self->noise_reducer_) {
                 self->noise_reducer_->resetStream();
             }
         }
     } else {
-        // True silence (after hang time expired)
+        // Silence detected
         self->silence_samples_count_ += sample_count;
 
-        // If we were speaking and now have enough silence, flush
+        // If we were speaking and now have silence, track consecutive silence frames
         if (self->speech_buffer_.size() > 0) {
             // Add trailing silence (denoised)
             if (!denoised_samples.empty()) {
@@ -204,7 +201,16 @@ int AudioCapture::audioCallback(const void* input, void* output,
                 }
             }
 
-            if (self->silence_samples_count_ >= silence_samples_threshold) {
+            // Increment consecutive silence frame counter
+            self->consecutive_silence_frames_++;
+
+            // Calculate threshold in frames (callbacks)
+            // frames_per_buffer = frame_count from callback
+            int frames_per_buffer = static_cast<int>(frame_count);
+            int silence_threshold_frames = (self->silence_duration_ms_ * self->sample_rate_) / (1000 * frames_per_buffer);
+
+            // Flush when consecutive silence exceeds threshold
+            if (self->consecutive_silence_frames_ >= silence_threshold_frames) {
                 self->is_speech_active_.store(false, std::memory_order_relaxed);
 
                 // Flush if we have enough speech
@@ -220,7 +226,9 @@ int AudioCapture::audioCallback(const void* input, void* output,
 
                 float duration = static_cast<float>(self->speech_buffer_.size()) /
                                  (self->sample_rate_ * self->channels_);
-                std::cout << "[VAD] Speech ended (noise_floor=" << self->noise_floor_
+                std::cout << "[VAD] Speech ended (trailing silence detected, "
+                          << self->consecutive_silence_frames_ << " frames, "
+                          << "noise_floor=" << self->noise_floor_
                           << "), flushing " << duration << "s (denoised)" << std::endl;
 
                 if (self->callback_) {
@@ -233,6 +241,7 @@ int AudioCapture::audioCallback(const void* input, void* output,
 
                 self->speech_buffer_.clear();
                 self->speech_samples_count_ = 0;
+                self->consecutive_silence_frames_ = 0; // Reset after flush
                 // Reset stream for next segment
                 if (self->noise_reducer_) {
                     self->noise_reducer_->resetStream();
@@ -77,9 +77,8 @@ private:
     float noise_floor_ = 0.005f;       // Estimated background noise level
     float noise_floor_alpha_ = 0.001f; // Slower adaptation
 
-    // Hang time - wait before cutting to avoid mid-sentence cuts
-    int hang_frames_ = 0;
-    int hang_frames_threshold_ = 35; // ~350ms tolerance for pauses (was 20)
+    // Trailing silence detection - count consecutive silence frames after speech
+    int consecutive_silence_frames_ = 0;
 
     // Zero-crossing rate for speech vs noise discrimination
    float last_zcr_ = 0.0f;