feat: Add transcript export and debug planning docs

- Add CLAUDE.md with project documentation for AI assistance
- Add PLAN_DEBUG.md with debugging hypotheses and logging plan
- Update Pipeline and TranslationUI with transcript export functionality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 371e86d0b7
commit 21bcc9ed71
CLAUDE.md (new file, 119 lines)

@@ -0,0 +1,119 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI.
## Build Commands

### Windows (MinGW) - Primary Build
```batch
# First-time setup
.\setup_mingw.bat

# Build (Release)
.\build_mingw.bat

# Build (Debug)
.\build_mingw.bat --debug

# Clean rebuild
.\build_mingw.bat --clean
```
### Running the Application
```batch
cd build\mingw-Release
SecondVoice.exe
```

Requires:
- `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`
- `config.json` (copied automatically during build)
- A microphone
## Architecture

### Threading Model (3 threads)
1. **Audio Thread** (`Pipeline::audioThread`) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), pushes chunks to a queue
2. **Processing Thread** (`Pipeline::processingThread`) - Consumes audio chunks, calls the Whisper API for transcription, then the Claude API for translation
3. **UI Thread** (main) - GLFW/ImGui rendering loop, must run on the main thread
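
The `ThreadSafeQueue.h` implementation is not included in this commit; the sketch below uses a simplified mutex-based stand-in purely to illustrate the audio → processing handoff, with the render loop kept on the main thread.

```cpp
// Minimal sketch of the producer/consumer handoff between the audio and
// processing threads. ThreadSafeQueue.h is described as lock-free; the
// mutex-based queue below is only a simplified stand-in for illustration.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct AudioChunk {
    std::vector<float> samples;  // 16 kHz mono PCM
};

class ChunkQueue {
public:
    void push(AudioChunk chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(chunk));
        }
        cv_.notify_one();
    }
    AudioChunk pop() {  // blocks until a chunk is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        AudioChunk chunk = std::move(queue_.front());
        queue_.pop();
        return chunk;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<AudioChunk> queue_;
};

int main() {
    ChunkQueue queue;

    // Audio thread: capture + VAD would push complete speech segments here.
    std::thread audio([&] {
        queue.push(AudioChunk{std::vector<float>(16000, 0.0f)});  // 1 s of silence
    });

    // Processing thread: consume chunks, then call Whisper and Claude.
    std::thread processing([&] {
        AudioChunk chunk = queue.pop();
        std::cout << "Got " << chunk.samples.size() << " samples\n";
    });

    audio.join();
    processing.join();
    // The GLFW/ImGui render loop stays on the main thread.
}
```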
### Core Components

```
src/
├── main.cpp                 # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp        # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp     # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp      # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp     # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp    # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp     # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp    # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp     # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp           # Loads config.json + .env
    └── ThreadSafeQueue.h    # Lock-free queue for audio chunks
```
### Key Data Flow
1. `AudioCapture` detects speech via VAD thresholds (RMS + Peak)
2. Speech segments are sent to `NoiseReducer` (RNNoise) for denoising
3. Denoised audio is encoded to Opus/OGG for bandwidth efficiency (46x reduction, sketched below)
4. `WhisperClient` sends the audio to gpt-4o-mini-transcribe
5. `Pipeline` filters Whisper hallucinations (known garbage phrases)
6. `ClaudeClient` translates the Chinese text to French
7. `TranslationUI` displays the accumulated transcription/translation
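
Step 3 uses libopus; the Ogg packaging and the exact encoder settings used by `AudioBuffer` are not visible in this commit, so treat the following as a minimal sketch at 16 kHz mono / 24 kbps only.

```cpp
// Sketch of Opus-encoding one 20 ms frame of 16 kHz mono audio at 24 kbps
// (plain libopus API). Wrapping packets into an Ogg container (libogg) and
// the settings actually used by AudioBuffer are not shown in this diff.
#include <opus/opus.h>
#include <cstdio>
#include <vector>

int main() {
    int err = 0;
    OpusEncoder* enc = opus_encoder_create(16000, 1, OPUS_APPLICATION_VOIP, &err);
    if (err != OPUS_OK) return 1;
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));

    std::vector<float> pcm(320, 0.0f);   // 20 ms at 16 kHz (silence here)
    unsigned char packet[4000];          // output packet buffer
    opus_int32 n = opus_encode_float(enc, pcm.data(), 320, packet,
                                     static_cast<opus_int32>(sizeof(packet)));
    if (n > 0) std::printf("encoded 320 samples into %d bytes\n", n);

    opus_encoder_destroy(enc);
}
```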
### External Dependencies (fetched via CMake FetchContent)
- **ImGui** v1.90.1 - UI framework
- **Opus** v1.5.2 - Audio encoding
- **Ogg** v1.3.6 - Container format
- **RNNoise** v0.1.1 - Neural network noise reduction

### vcpkg Dependencies (x64-mingw-static triplet)
- portaudio, nlohmann_json, glfw3, glad
## Configuration

### config.json
- `audio.sample_rate`: 16000 Hz (required for Whisper)
- `whisper.model`: "gpt-4o-mini-transcribe"
- `whisper.language`: "zh" (Chinese)
- `claude.model`: "claude-3-5-haiku-20241022"
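
`Config.cpp` is not part of this commit, so the snippet below is only a sketch of reading these keys with nlohmann::json (a vcpkg dependency); the nested layout of config.json is assumed from the dotted key names and may differ from the real file.

```cpp
// Hedged sketch: reading the documented config.json keys with nlohmann::json.
// The nested layout is assumed from the dotted key names above; the real
// Config.cpp may structure this differently.
#include <fstream>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    std::ifstream in("config.json");
    nlohmann::json cfg = nlohmann::json::parse(in);

    int sample_rate      = cfg["audio"]["sample_rate"];   // 16000, required for Whisper
    std::string w_model  = cfg["whisper"]["model"];       // "gpt-4o-mini-transcribe"
    std::string language = cfg["whisper"]["language"];    // "zh"
    std::string c_model  = cfg["claude"]["model"];        // "claude-3-5-haiku-20241022"

    std::cout << w_model << " @ " << sample_rate << " Hz, lang=" << language
              << ", translate with " << c_model << "\n";
}
```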
### VAD Tuning
VAD thresholds are adjustable in the UI at runtime:
- RMS threshold: speech detection sensitivity
- Peak threshold: transient/click rejection
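
The actual `AudioCapture` VAD logic is not shown in this commit; the sketch below only illustrates how an RMS gate plus a peak gate can classify a single frame, assuming samples normalized to [-1, 1].

```cpp
// Minimal sketch of RMS + peak thresholding on one audio frame, assuming
// samples normalized to [-1, 1]. The real AudioCapture logic (hangover,
// segment start/stop, runtime-adjustable thresholds) is not shown here.
#include <algorithm>
#include <cmath>
#include <cstddef>

bool isSpeechFrame(const float* samples, size_t count,
                   float rms_threshold, float peak_threshold) {
    float sum_sq = 0.0f;
    float peak = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        sum_sq += samples[i] * samples[i];
        peak = std::max(peak, std::fabs(samples[i]));
    }
    float rms = std::sqrt(sum_sq / static_cast<float>(count));
    // RMS gate: overall energy must look like speech.
    // Peak gate (one possible interpretation): reject short transients such
    // as claps and clicks by requiring the peak to stay below a ceiling.
    return rms > rms_threshold && peak < peak_threshold;
}
```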
## Important Implementation Details

### Whisper Hallucination Filtering
`Pipeline.cpp` contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out:
- "Thank you for watching", "Subscribe", YouTube phrases
- Chinese video endings: "谢谢观看", "再见", "订阅"
- Music symbols, silence markers
- Single-word interjections
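
The full phrase list lives in `Pipeline.cpp` and is not reproduced here; below is a minimal sketch of the filtering idea using a few of the phrases above. Whether the real code matches exactly, by substring, or with extra heuristics is not visible in this diff.

```cpp
// Minimal sketch of hallucination filtering with a few of the phrases listed
// above. The real Pipeline.cpp list is much longer, and the exact matching
// strategy (exact vs. substring vs. heuristics) is not shown in this diff.
#include <array>
#include <string>

bool isKnownHallucination(const std::string& text) {
    static const std::array<const char*, 5> kGarbage = {
        "Thank you for watching", "Subscribe",
        "谢谢观看", "再见", "订阅"
    };
    for (const char* phrase : kGarbage) {
        if (text == phrase) return true;
    }
    return false;
}
```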
### GPU Forcing (Optimus/PowerXpress)
`main.cpp` exports `NvOptimusEnablement` and `AmdPowerXpressRequestHighPerformance` symbols to force dedicated GPU usage on hybrid graphics systems.
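
`main.cpp` itself is not in this diff; on Windows these symbols are conventionally exported from the executable as shown below (a sketch — the actual declarations in main.cpp may differ).

```cpp
// Sketch of the conventional way to export these symbols from an .exe so the
// NVIDIA/AMD drivers pick the dedicated GPU on hybrid-graphics laptops.
// main.cpp is not in this diff, so its exact declarations may differ.
extern "C" {
    __declspec(dllexport) unsigned long NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
```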
### Audio Processing Pipeline
1. 16kHz mono input → Upsampled to 48kHz for RNNoise
2. RNNoise denoising (480-sample frames at 48kHz)
3. Transient suppression (claps, clicks, pops)
4. Downsampled back to 16kHz
5. Opus encoding at 24kbps for API transmission
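
`NoiseReducer.cpp` is not part of this commit; the sketch below only illustrates the resample → denoise → resample round trip with a naive repeat/decimate resampler, and it assumes the no-argument form of `rnnoise_create()` (some RNNoise versions take a model pointer instead).

```cpp
// Hedged sketch of the 16 kHz -> 48 kHz -> 16 kHz RNNoise round trip
// (steps 1-4 above). The real NoiseReducer presumably uses a proper
// resampler; the naive repeat/decimate below is only for illustration.
// Depending on the RNNoise version, rnnoise_create() takes either no
// argument or a model pointer; the no-argument form is assumed here.
#include <rnnoise.h>
#include <vector>

std::vector<float> denoise16k(const std::vector<float>& in16k) {
    // 1. Upsample 16 kHz -> 48 kHz (x3). Naive sample repetition only.
    std::vector<float> up(in16k.size() * 3);
    for (size_t i = 0; i < in16k.size(); ++i)
        up[3 * i] = up[3 * i + 1] = up[3 * i + 2] =
            in16k[i] * 32768.0f;  // RNNoise expects 16-bit-range floats

    // 2. RNNoise denoising on 480-sample frames at 48 kHz (10 ms per frame).
    DenoiseState* st = rnnoise_create();
    for (size_t off = 0; off + 480 <= up.size(); off += 480)
        rnnoise_process_frame(st, up.data() + off, up.data() + off);  // in-place; VAD probability ignored
    rnnoise_destroy(st);

    // 3. (Transient suppression would go here; not sketched.)

    // 4. Downsample 48 kHz -> 16 kHz (take every 3rd sample) and rescale.
    std::vector<float> out16k(up.size() / 3);
    for (size_t i = 0; i < out16k.size(); ++i)
        out16k[i] = up[3 * i] / 32768.0f;
    return out16k;
}
```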
## Console-Only Build

A `SecondVoice_Console` target exists for testing without UI:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout
PLAN_DEBUG.md (new file, 60 lines)

@@ -0,0 +1,60 @@
# SecondVoice Debug Plan

## Observed problem

The transcript from 2025-11-23 (5:31 min, 75 segments) shows:
- Fragmented sentences ("我很。" → "Je suis.")
- Transcription errors ("两个老鼠求我" - "two mice are begging me")
- One- or two-word segments with no context
- Whisper hallucinations ("汪汪汪汪")
## Hypotheses (to validate)

1. **VAD cuts off too early** - Voice Activity Detection triggers the end of a segment too quickly, cutting sentences off mid-way

2. **Segments too short** - Whisper does not have enough audio context to transcribe the Chinese correctly

3. **Ambient noise** - Background noise is being interpreted as speech (segment 22 mentions "太多声音了")

4. **Loss of context between segments** - Each segment is processed in isolation, so Whisper cannot use the context of previous sentences
## Plan: per-session logging system

### Goal
Collect actionable data to identify the source of the problems.

### Structure
```
sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json      # Global metadata
    ├── segments/
    │   ├── 001.json
    │   ├── 002.json
    │   └── ...
    └── transcript.txt    # Final export (already exists)
```
### Segment JSON format

```json
{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}
```
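
A sketch of what the basic logger could look like, using nlohmann_json (already a vcpkg dependency) and `std::filesystem`; the function name and session layout follow the plan above, but nothing here exists in the codebase yet.

```cpp
// Hypothetical sketch of the basic per-segment logger described above, using
// nlohmann::json and std::filesystem. The function name and layout follow
// the plan; none of this exists in the codebase yet.
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <nlohmann/json.hpp>

void logSegment(const std::string& session_dir, int id,
                const std::string& chinese, const std::string& french) {
    std::filesystem::create_directories(session_dir + "/segments");

    nlohmann::json seg = {
        {"id", id},
        {"chinese", chinese},
        {"french", french}
    };

    std::ostringstream name;
    name << session_dir << "/segments/"
         << std::setfill('0') << std::setw(3) << id << ".json";  // e.g. 001.json

    std::ofstream out(name.str());
    out << seg.dump(2) << "\n";  // pretty-printed, 2-space indent
}
```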
### To be defined

- [ ] Which audio metadata to add? (duration, RMS, timestamps)
- [ ] Save the per-segment .opus audio files?
- [ ] Whisper info? (latency, model, filtered)
- [ ] Claude info? (latency, model)
## Next steps

1. Implement the basic logging system (Chinese/French JSON)
2. Analyze patterns in the data
3. Enrich with more metadata if needed
`src/core/Pipeline.cpp`:

```diff
@@ -103,13 +103,15 @@ void Pipeline::stop() {
 
     // Save full recording
     auto& config = Config::getInstance();
+    auto now = std::chrono::system_clock::now();
+    auto time_t = std::chrono::system_clock::to_time_t(now);
+    std::stringstream timestamp;
+    timestamp << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S");
+
     if (config.getRecordingConfig().save_audio && full_recording_) {
-        auto now = std::chrono::system_clock::now();
-        auto time_t = std::chrono::system_clock::to_time_t(now);
         std::stringstream ss;
         ss << config.getRecordingConfig().output_directory << "/"
-           << "recording_" << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S")
-           << ".wav";
+           << "recording_" << timestamp.str() << ".wav";
 
         if (full_recording_->saveToWav(ss.str())) {
             std::cout << "Recording saved to: " << ss.str() << std::endl;
@@ -117,6 +119,13 @@ void Pipeline::stop() {
             std::cerr << "Failed to save recording" << std::endl;
         }
     }
+
+    // Auto-export transcript when stopping
+    if (ui_) {
+        std::stringstream transcript_ss;
+        transcript_ss << "transcripts/transcript_" << timestamp.str() << ".txt";
+        ui_->exportTranscript(transcript_ss.str());
+    }
 }
 
 void Pipeline::audioThread() {
@@ -334,6 +343,12 @@ void Pipeline::update() {
         clearAccumulated();
         ui_->resetClearRequest();
     }
+
+    // Check if export was requested
+    if (ui_->isExportRequested()) {
+        ui_->exportTranscript();
+        ui_->resetExportRequest();
+    }
 }
 
 bool Pipeline::shouldClose() const {
```
`src/ui/TranslationUI.cpp`:

```diff
@@ -7,6 +7,11 @@
 #include <imgui_impl_opengl3.h>
 #include <iostream>
 #include <thread>
+#include <fstream>
+#include <chrono>
+#include <iomanip>
+#include <sstream>
+#include <filesystem>
 
 namespace secondvoice {
 
@@ -314,14 +319,30 @@ void TranslationUI::renderTranslations() {
 void TranslationUI::renderControls() {
     ImGui::Spacing();
 
-    // Center the stop button
-    float button_width = 200.0f;
+    // Center buttons
+    float button_width = 150.0f;
+    float spacing = 20.0f;
+    float total_width = button_width * 2 + spacing;
     float window_width = ImGui::GetWindowWidth();
-    ImGui::SetCursorPosX((window_width - button_width) * 0.5f);
+    ImGui::SetCursorPosX((window_width - total_width) * 0.5f);
 
-    if (ImGui::Button("STOP RECORDING", ImVec2(button_width, 40))) {
+    // Stop button (red)
+    ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.7f, 0.2f, 0.2f, 1.0f));
+    ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.8f, 0.3f, 0.3f, 1.0f));
+    if (ImGui::Button("STOP", ImVec2(button_width, 40))) {
         stop_requested_ = true;
     }
+    ImGui::PopStyleColor(2);
+
+    ImGui::SameLine(0, spacing);
+
+    // Export button (blue)
+    ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.2f, 0.4f, 0.7f, 1.0f));
+    ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.3f, 0.5f, 0.8f, 1.0f));
+    if (ImGui::Button("EXPORT", ImVec2(button_width, 40))) {
+        export_requested_ = true;
+    }
+    ImGui::PopStyleColor(2);
 
     ImGui::Spacing();
 }
@@ -437,4 +458,86 @@ void TranslationUI::renderAudioPanel() {
     ImGui::PopStyleColor();
 }
+
+bool TranslationUI::exportTranscript(const std::string& filename) const {
+    // Generate filename if not provided
+    std::string output_file = filename;
+    if (output_file.empty()) {
+        auto now = std::chrono::system_clock::now();
+        auto time_t = std::chrono::system_clock::to_time_t(now);
+        std::stringstream ss;
+        ss << "transcripts/transcript_"
+           << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S")
+           << ".txt";
+        output_file = ss.str();
+    }
+
+    // Create directory if needed
+    std::filesystem::path filepath(output_file);
+    if (filepath.has_parent_path()) {
+        std::filesystem::create_directories(filepath.parent_path());
+    }
+
+    std::ofstream file(output_file, std::ios::out | std::ios::binary);
+    if (!file.is_open()) {
+        std::cerr << "[Export] Failed to open file: " << output_file << std::endl;
+        return false;
+    }
+
+    // Write UTF-8 BOM for Windows compatibility
+    file << "\xEF\xBB\xBF";
+
+    // Header
+    auto now = std::chrono::system_clock::now();
+    auto time_t = std::chrono::system_clock::to_time_t(now);
+    file << "═══════════════════════════════════════════════════════════════\n";
+    file << " SecondVoice - Transcript Export\n";
+    file << " Date: " << std::put_time(std::localtime(&time_t), "%Y-%m-%d %H:%M:%S") << "\n";
+    file << " Duration: " << (recording_duration_ / 60) << ":"
+         << std::setfill('0') << std::setw(2) << (recording_duration_ % 60) << "\n";
+    file << " Segments: " << messages_.size() << "\n";
+    file << "═══════════════════════════════════════════════════════════════\n\n";
+
+    // Accumulated text (full transcript)
+    if (!accumulated_chinese_.empty() || !accumulated_french_.empty()) {
+        file << "───────────────────────────────────────────────────────────────\n";
+        file << " TEXTE COMPLET / FULL TEXT\n";
+        file << "───────────────────────────────────────────────────────────────\n\n";
+
+        file << "[中文 / Chinese]\n";
+        file << accumulated_chinese_ << "\n\n";
+
+        file << "[Français / French]\n";
+        file << accumulated_french_ << "\n\n";
+    }
+
+    // Individual segments
+    if (!messages_.empty()) {
+        file << "───────────────────────────────────────────────────────────────\n";
+        file << " SEGMENTS DÉTAILLÉS / DETAILED SEGMENTS\n";
+        file << "───────────────────────────────────────────────────────────────\n\n";
+
+        int segment_num = 1;
+        for (const auto& msg : messages_) {
+            file << "[Segment " << segment_num++ << "]\n";
+            file << "中文: " << msg.chinese << "\n";
+            file << "FR: " << msg.french << "\n\n";
+        }
+    }
+
+    // Footer with stats
+    file << "───────────────────────────────────────────────────────────────\n";
+    file << " STATISTIQUES / STATISTICS\n";
+    file << "───────────────────────────────────────────────────────────────\n";
+    file << " Audio processed: " << static_cast<int>(total_audio_seconds_) << " seconds\n";
+    file << " Whisper API calls: " << whisper_calls_ << "\n";
+    file << " Claude API calls: " << claude_calls_ << "\n";
+    file << " Estimated cost: $" << std::fixed << std::setprecision(4) << getEstimatedCost() << "\n";
+    file << "═══════════════════════════════════════════════════════════════\n";
+
+    file.close();
+
+    std::cout << "[Export] Transcript saved to: " << output_file << std::endl;
+    return true;
+}
 
 } // namespace secondvoice
```
`TranslationUI.h`:

```diff
@@ -30,6 +30,11 @@ public:
     bool isClearRequested() const { return clear_requested_; }
     void resetClearRequest() { clear_requested_ = false; }
 
+    // Export transcript to file
+    bool exportTranscript(const std::string& filename = "") const;
+    bool isExportRequested() const { return export_requested_; }
+    void resetExportRequest() { export_requested_ = false; }
+
     void setRecordingDuration(int seconds) { recording_duration_ = seconds; }
     void setProcessingStatus(const std::string& status) { processing_status_ = status; }
@@ -55,6 +60,7 @@ private:
     std::string accumulated_french_;
     bool stop_requested_ = false;
     bool clear_requested_ = false;
+    bool export_requested_ = false;
     bool auto_scroll_ = true;
 
     int recording_duration_ = 0;
```