feat: Add transcript export and debug planning docs

- Add CLAUDE.md with project documentation for AI assistance
- Add PLAN_DEBUG.md with debugging hypotheses and logging plan
- Update Pipeline and TranslationUI with transcript export functionality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: StillHammer
Date: 2025-11-23 19:59:29 +08:00
Commit: 21bcc9ed71 (parent: 371e86d0b7)

5 changed files with 311 additions and 8 deletions

CLAUDE.md (new file, 119 lines)

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI.
## Build Commands
### Windows (MinGW) - Primary Build
```batch
:: First-time setup
.\setup_mingw.bat

:: Build (Release)
.\build_mingw.bat

:: Build (Debug)
.\build_mingw.bat --debug

:: Clean rebuild
.\build_mingw.bat --clean
```
### Running the Application
```batch
cd build\mingw-Release
SecondVoice.exe
```
Requires:
- `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`
- `config.json` (copied automatically during build)
- A microphone
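For reference, a `.env` of the expected shape (placeholder values, not real keys):
```
OPENAI_API_KEY=sk-...your-openai-key...
ANTHROPIC_API_KEY=sk-ant-...your-anthropic-key...
```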
## Architecture
### Threading Model (3 threads)
1. **Audio Thread** (`Pipeline::audioThread`) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), pushes chunks to queue
2. **Processing Thread** (`Pipeline::processingThread`) - Consumes audio chunks, calls Whisper API for transcription, then Claude API for translation
3. **UI Thread** (main) - GLFW/ImGui rendering loop, must run on main thread
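A minimal, self-contained sketch of the audio→processing handoff described above, with a simplified stand-in for `ThreadSafeQueue` (the real interfaces may differ):
```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct AudioChunk { std::vector<float> samples; };  // one VAD-segmented utterance

// Simplified stand-in for utils/ThreadSafeQueue.h.
template <typename T>
class SimpleQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    T pop() {  // blocks until an item arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    SimpleQueue<AudioChunk> queue;

    // Audio thread: in the real app, the PortAudio callback applies VAD and
    // pushes one chunk per detected speech segment.
    std::thread audio([&] { queue.push({std::vector<float>(16000, 0.0f)}); });

    // Processing thread: in the real app, this calls WhisperClient, then
    // ClaudeClient, and hands the text pair to the UI.
    std::thread processing([&] {
        AudioChunk chunk = queue.pop();
        std::cout << "processing " << chunk.samples.size() << " samples\n";
    });

    audio.join();
    processing.join();
}
```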
### Core Components
```
src/
├── main.cpp              # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp     # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp  # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp   # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp  # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp  # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp  # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp        # Loads config.json + .env
    └── ThreadSafeQueue.h # Lock-free queue for audio chunks
```
### Key Data Flow
1. `AudioCapture` detects speech via VAD thresholds (RMS + Peak)
2. Speech segments sent to `NoiseReducer` (RNNoise) for denoising
3. Denoised audio encoded to Opus/OGG for bandwidth efficiency (46x reduction)
4. `WhisperClient` sends audio to gpt-4o-mini-transcribe
5. `Pipeline` filters Whisper hallucinations (known garbage phrases)
6. `ClaudeClient` translates Chinese text to French
7. `TranslationUI` displays accumulated transcription/translation
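Step 3 maps onto the public libopus API roughly as follows (a sketch, not the actual `AudioBuffer` code; OGG packaging omitted):
```cpp
#include <opus.h>
#include <cstdio>
#include <vector>

int main() {
    int err = 0;
    // 16 kHz mono, VoIP tuning; matches the pipeline's capture format.
    OpusEncoder* enc = opus_encoder_create(16000, 1, OPUS_APPLICATION_VOIP, &err);
    if (err != OPUS_OK || !enc) return 1;
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));  // 24 kbps, as used for API upload

    std::vector<float> pcm(320, 0.0f);               // one 20 ms frame at 16 kHz
    unsigned char packet[4000];
    opus_int32 n = opus_encode_float(enc, pcm.data(), 320, packet, sizeof(packet));
    if (n > 0) std::printf("encoded %d bytes\n", n);

    opus_encoder_destroy(enc);
    return 0;
}
```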
### External Dependencies (fetched via CMake FetchContent)
- **ImGui** v1.90.1 - UI framework
- **Opus** v1.5.2 - Audio encoding
- **Ogg** v1.3.6 - Container format
- **RNNoise** v0.1.1 - Neural network noise reduction
### vcpkg Dependencies (x64-mingw-static triplet)
- portaudio, nlohmann_json, glfw3, glad
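If these ever need to be installed by hand (the setup script normally handles this), the equivalent invocation should be roughly:
```batch
vcpkg install portaudio nlohmann-json glfw3 glad --triplet=x64-mingw-static
```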
## Configuration
### config.json
- `audio.sample_rate`: 16000 Hz (required for Whisper)
- `whisper.model`: "gpt-4o-mini-transcribe"
- `whisper.language`: "zh" (Chinese)
- `claude.model`: "claude-3-5-haiku-20241022"
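Put together, a minimal config.json consistent with the keys above (the real file may carry more fields, e.g. the recording options read via `Config::getRecordingConfig`):
```json
{
  "audio": { "sample_rate": 16000 },
  "whisper": { "model": "gpt-4o-mini-transcribe", "language": "zh" },
  "claude": { "model": "claude-3-5-haiku-20241022" }
}
```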
### VAD Tuning
VAD thresholds are adjustable in the UI at runtime:
- RMS threshold: speech detection sensitivity
- Peak threshold: transient/click rejection
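One plausible way the two thresholds combine, as a sketch (not the actual `AudioCapture` logic): a frame counts as speech when its RMS clears the sensitivity threshold while its peak stays below the transient ceiling.
```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Sketch: returns true when a frame looks like speech.
// rms_threshold sets the sensitivity; peak_threshold rejects clicks/claps
// whose peak is high even though they are not sustained speech.
bool isSpeechFrame(const float* samples, std::size_t n,
                   float rms_threshold, float peak_threshold) {
    float sum_sq = 0.0f, peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += samples[i] * samples[i];
        peak = std::max(peak, std::fabs(samples[i]));
    }
    const float rms = std::sqrt(sum_sq / static_cast<float>(n));
    return rms > rms_threshold && peak < peak_threshold;
}
```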
## Important Implementation Details
### Whisper Hallucination Filtering
`Pipeline.cpp` contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out:
- "Thank you for watching", "Subscribe", YouTube phrases
- Chinese video endings: "谢谢观看", "再见", "订阅"
- Music symbols, silence markers
- Single-word interjections
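In effect the filter is a substring match against a blocklist; a reduced sketch (the real list in `Pipeline.cpp` is far longer):
```cpp
#include <array>
#include <string>
#include <string_view>

// Reduced sketch of the blocklist check; the real list in Pipeline.cpp
// covers many more phrases, music symbols, and silence markers.
bool isLikelyHallucination(const std::string& text) {
    static constexpr std::array<std::string_view, 5> kBlocklist = {
        "Thank you for watching", "Subscribe", "谢谢观看", "再见", "订阅",
    };
    for (std::string_view phrase : kBlocklist) {
        if (text.find(phrase) != std::string::npos) return true;
    }
    return false;
}
```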
### GPU Forcing (Optimus/PowerXpress)
`main.cpp` exports `NvOptimusEnablement` and `AmdPowerXpressRequestHighPerformance` symbols to force dedicated GPU usage on hybrid graphics systems.
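The exports follow the vendor-documented pattern:
```cpp
#include <windows.h>

// Exported symbols read by the NVIDIA Optimus and AMD PowerXpress drivers;
// nonzero values request the dedicated GPU for this executable.
extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
```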
### Audio Processing Pipeline
1. 16kHz mono input → Upsampled to 48kHz for RNNoise
2. RNNoise denoising (480-sample frames at 48kHz)
3. Transient suppression (claps, clicks, pops)
4. Downsampled back to 16kHz
5. Opus encoding at 24kbps for API transmission
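Step 2 maps onto the RNNoise C API roughly like this (a sketch; resampling and transient handling omitted, and note that newer RNNoise versions pass an optional model argument to `rnnoise_create`):
```cpp
#include <rnnoise.h>
#include <cstddef>
#include <vector>

// Denoise 48 kHz mono audio in place, one 480-sample (10 ms) frame at a time.
// RNNoise expects float samples scaled to the 16-bit range, not [-1, 1].
void denoise48k(std::vector<float>& samples) {
    DenoiseState* st = rnnoise_create();   // newer versions: rnnoise_create(NULL)
    const std::size_t kFrame = 480;
    for (std::size_t i = 0; i + kFrame <= samples.size(); i += kFrame) {
        rnnoise_process_frame(st, &samples[i], &samples[i]);
    }
    rnnoise_destroy(st);
}
```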
## Console-Only Build
A `SecondVoice_Console` target exists for testing without UI:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout

PLAN_DEBUG.md (new file, 60 lines)

# SecondVoice Debug Plan
## Observed Problem
The transcript from 2025-11-23 (5:31 min, 75 segments) shows:
- Fragmented sentences ("我很。" → "Je suis.")
- Transcription errors ("两个老鼠求我" - "two mice beg me")
- One- or two-word segments with no context
- Whisper hallucinations ("汪汪汪汪")
## Hypotheses (to validate)
1. **VAD cuts off too early** - Voice Activity Detection triggers end-of-segment too quickly, cutting sentences mid-stream
2. **Segments too short** - Whisper lacks the audio context needed to transcribe the Chinese correctly
3. **Ambient noise** - Noise is interpreted as speech (segment 22 mentions "太多声音了")
4. **Inter-segment context loss** - Each segment is processed in isolation, so Whisper cannot use the context of preceding sentences
## Plan: Per-Session Logging System
### Goal
Collect actionable data to pinpoint the source of the problems.
### Structure
```
sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json       # Global session metadata
    ├── segments/
    │   ├── 001.json
    │   ├── 002.json
    │   └── ...
    └── transcript.txt     # Final export (already exists)
```
### Segment JSON Format
```json
{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}
```
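Since nlohmann_json is already a dependency, the writer could be a few lines; a minimal sketch (hypothetical `writeSegment` helper, names not final):
```cpp
#include <nlohmann/json.hpp>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: writes sessions/<ts>/segments/NNN.json in the format above.
void writeSegment(const std::filesystem::path& session_dir, int id,
                  const std::string& chinese, const std::string& french) {
    nlohmann::json j = {{"id", id}, {"chinese", chinese}, {"french", french}};

    std::ostringstream name;
    name << std::setw(3) << std::setfill('0') << id << ".json";  // 001.json, 002.json, ...

    std::filesystem::create_directories(session_dir / "segments");
    std::ofstream(session_dir / "segments" / name.str()) << j.dump(2) << "\n";
}
```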
### To Be Defined
- [ ] Which audio metadata to add? (duration, RMS, timestamps)
- [ ] Save the per-segment .opus audio files?
- [ ] Whisper info? (latency, model, filtered flag)
- [ ] Claude info? (latency, model)
## Next Steps
1. Implement the basic logging system (Chinese/French JSON)
2. Analyze patterns in the data
3. Enrich with more metadata if needed

src/core/Pipeline.cpp

```diff
@@ -103,13 +103,15 @@ void Pipeline::stop() {
     // Save full recording
     auto& config = Config::getInstance();
+    auto now = std::chrono::system_clock::now();
+    auto time_t = std::chrono::system_clock::to_time_t(now);
+    std::stringstream timestamp;
+    timestamp << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S");
+
     if (config.getRecordingConfig().save_audio && full_recording_) {
-        auto now = std::chrono::system_clock::now();
-        auto time_t = std::chrono::system_clock::to_time_t(now);
         std::stringstream ss;
         ss << config.getRecordingConfig().output_directory << "/"
-           << "recording_" << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S")
-           << ".wav";
+           << "recording_" << timestamp.str() << ".wav";
 
         if (full_recording_->saveToWav(ss.str())) {
             std::cout << "Recording saved to: " << ss.str() << std::endl;
@@ -117,6 +119,13 @@ void Pipeline::stop() {
             std::cerr << "Failed to save recording" << std::endl;
         }
     }
+
+    // Auto-export transcript when stopping
+    if (ui_) {
+        std::stringstream transcript_ss;
+        transcript_ss << "transcripts/transcript_" << timestamp.str() << ".txt";
+        ui_->exportTranscript(transcript_ss.str());
+    }
 }
 
 void Pipeline::audioThread() {
@@ -334,6 +343,12 @@ void Pipeline::update() {
         clearAccumulated();
         ui_->resetClearRequest();
     }
+
+    // Check if export was requested
+    if (ui_->isExportRequested()) {
+        ui_->exportTranscript();
+        ui_->resetExportRequest();
+    }
 }
 
 bool Pipeline::shouldClose() const {
```

src/ui/TranslationUI.cpp

```diff
@@ -7,6 +7,11 @@
 #include <imgui_impl_opengl3.h>
 #include <iostream>
 #include <thread>
+#include <fstream>
+#include <chrono>
+#include <iomanip>
+#include <sstream>
+#include <filesystem>
 
 namespace secondvoice {
@@ -314,14 +319,30 @@ void TranslationUI::renderTranslations() {
 void TranslationUI::renderControls() {
     ImGui::Spacing();
 
-    // Center the stop button
-    float button_width = 200.0f;
+    // Center buttons
+    float button_width = 150.0f;
+    float spacing = 20.0f;
+    float total_width = button_width * 2 + spacing;
     float window_width = ImGui::GetWindowWidth();
-    ImGui::SetCursorPosX((window_width - button_width) * 0.5f);
+    ImGui::SetCursorPosX((window_width - total_width) * 0.5f);
 
-    if (ImGui::Button("STOP RECORDING", ImVec2(button_width, 40))) {
+    // Stop button (red)
+    ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.7f, 0.2f, 0.2f, 1.0f));
+    ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.8f, 0.3f, 0.3f, 1.0f));
+    if (ImGui::Button("STOP", ImVec2(button_width, 40))) {
         stop_requested_ = true;
     }
+    ImGui::PopStyleColor(2);
+
+    ImGui::SameLine(0, spacing);
+
+    // Export button (blue)
+    ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.2f, 0.4f, 0.7f, 1.0f));
+    ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.3f, 0.5f, 0.8f, 1.0f));
+    if (ImGui::Button("EXPORT", ImVec2(button_width, 40))) {
+        export_requested_ = true;
+    }
+    ImGui::PopStyleColor(2);
 
     ImGui::Spacing();
 }
@@ -437,4 +458,86 @@ void TranslationUI::renderAudioPanel() {
     ImGui::PopStyleColor();
 }
 
+bool TranslationUI::exportTranscript(const std::string& filename) const {
+    // Generate filename if not provided
+    std::string output_file = filename;
+    if (output_file.empty()) {
+        auto now = std::chrono::system_clock::now();
+        auto time_t = std::chrono::system_clock::to_time_t(now);
+        std::stringstream ss;
+        ss << "transcripts/transcript_"
+           << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S")
+           << ".txt";
+        output_file = ss.str();
+    }
+
+    // Create directory if needed
+    std::filesystem::path filepath(output_file);
+    if (filepath.has_parent_path()) {
+        std::filesystem::create_directories(filepath.parent_path());
+    }
+
+    std::ofstream file(output_file, std::ios::out | std::ios::binary);
+    if (!file.is_open()) {
+        std::cerr << "[Export] Failed to open file: " << output_file << std::endl;
+        return false;
+    }
+
+    // Write UTF-8 BOM for Windows compatibility
+    file << "\xEF\xBB\xBF";
+
+    // Header
+    auto now = std::chrono::system_clock::now();
+    auto time_t = std::chrono::system_clock::to_time_t(now);
+    file << "═══════════════════════════════════════════════════════════════\n";
+    file << " SecondVoice - Transcript Export\n";
+    file << " Date: " << std::put_time(std::localtime(&time_t), "%Y-%m-%d %H:%M:%S") << "\n";
+    file << " Duration: " << (recording_duration_ / 60) << ":"
+         << std::setfill('0') << std::setw(2) << (recording_duration_ % 60) << "\n";
+    file << " Segments: " << messages_.size() << "\n";
+    file << "═══════════════════════════════════════════════════════════════\n\n";
+
+    // Accumulated text (full transcript)
+    if (!accumulated_chinese_.empty() || !accumulated_french_.empty()) {
+        file << "───────────────────────────────────────────────────────────────\n";
+        file << " TEXTE COMPLET / FULL TEXT\n";
+        file << "───────────────────────────────────────────────────────────────\n\n";
+        file << "[中文 / Chinese]\n";
+        file << accumulated_chinese_ << "\n\n";
+        file << "[Français / French]\n";
+        file << accumulated_french_ << "\n\n";
+    }
+
+    // Individual segments
+    if (!messages_.empty()) {
+        file << "───────────────────────────────────────────────────────────────\n";
+        file << " SEGMENTS DÉTAILLÉS / DETAILED SEGMENTS\n";
+        file << "───────────────────────────────────────────────────────────────\n\n";
+        int segment_num = 1;
+        for (const auto& msg : messages_) {
+            file << "[Segment " << segment_num++ << "]\n";
+            file << "中文: " << msg.chinese << "\n";
+            file << "FR: " << msg.french << "\n\n";
+        }
+    }
+
+    // Footer with stats
+    file << "───────────────────────────────────────────────────────────────\n";
+    file << " STATISTIQUES / STATISTICS\n";
+    file << "───────────────────────────────────────────────────────────────\n";
+    file << " Audio processed: " << static_cast<int>(total_audio_seconds_) << " seconds\n";
+    file << " Whisper API calls: " << whisper_calls_ << "\n";
+    file << " Claude API calls: " << claude_calls_ << "\n";
+    file << " Estimated cost: $" << std::fixed << std::setprecision(4) << getEstimatedCost() << "\n";
+    file << "═══════════════════════════════════════════════════════════════\n";
+
+    file.close();
+    std::cout << "[Export] Transcript saved to: " << output_file << std::endl;
+    return true;
+}
+
 } // namespace secondvoice
```

src/ui/TranslationUI.h

```diff
@@ -30,6 +30,11 @@ public:
     bool isClearRequested() const { return clear_requested_; }
     void resetClearRequest() { clear_requested_ = false; }
 
+    // Export transcript to file
+    bool exportTranscript(const std::string& filename = "") const;
+    bool isExportRequested() const { return export_requested_; }
+    void resetExportRequest() { export_requested_ = false; }
+
     void setRecordingDuration(int seconds) { recording_duration_ = seconds; }
     void setProcessingStatus(const std::string& status) { processing_status_ = status; }
@@ -55,6 +60,7 @@ private:
     std::string accumulated_french_;
     bool stop_requested_ = false;
     bool clear_requested_ = false;
+    bool export_requested_ = false;
     bool auto_scroll_ = true;
     int recording_duration_ = 0;
```