feat: Add transcript export and debug planning docs

- Add CLAUDE.md with project documentation for AI assistance
- Add PLAN_DEBUG.md with debugging hypotheses and logging plan
- Update Pipeline and TranslationUI with transcript export functionality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: StillHammer
Date: 2025-11-23 19:59:29 +08:00
Commit: 21bcc9ed71 (parent: 371e86d0b7)

5 changed files with 311 additions and 8 deletions

CLAUDE.md (new file, 119 lines)

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SecondVoice is a real-time Chinese-to-French translation system for live meetings. It captures audio, transcribes Chinese speech using OpenAI's Whisper API (gpt-4o-mini-transcribe), and translates it to French using Claude AI.
## Build Commands
### Windows (MinGW) - Primary Build
```batch
:: First-time setup
.\setup_mingw.bat

:: Build (Release)
.\build_mingw.bat

:: Build (Debug)
.\build_mingw.bat --debug

:: Clean rebuild
.\build_mingw.bat --clean
```
### Running the Application
```batch
cd build\mingw-Release
SecondVoice.exe
```
Requires:
- `.env` file with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`
- `config.json` (copied automatically during build)
- A microphone
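For reference, a `.env` of the expected shape (placeholder values, not real keys):
```
OPENAI_API_KEY=sk-...your-openai-key...
ANTHROPIC_API_KEY=sk-ant-...your-anthropic-key...
```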
## Architecture
### Threading Model (3 threads)
1. **Audio Thread** (`Pipeline::audioThread`) - PortAudio callback captures audio, applies VAD (Voice Activity Detection), pushes chunks to queue
2. **Processing Thread** (`Pipeline::processingThread`) - Consumes audio chunks, calls Whisper API for transcription, then Claude API for translation
3. **UI Thread** (main) - GLFW/ImGui rendering loop, must run on main thread
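A minimal, self-contained sketch of the audio→processing handoff described above, with a simplified stand-in for `ThreadSafeQueue` (the real interfaces may differ):
```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct AudioChunk { std::vector<float> samples; };  // one VAD-segmented utterance

// Simplified stand-in for utils/ThreadSafeQueue.h.
template <typename T>
class SimpleQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    T pop() {  // blocks until an item arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    SimpleQueue<AudioChunk> queue;

    // Audio thread: in the real app, the PortAudio callback applies VAD and
    // pushes one chunk per detected speech segment.
    std::thread audio([&] { queue.push({std::vector<float>(16000, 0.0f)}); });

    // Processing thread: in the real app, this calls WhisperClient, then
    // ClaudeClient, and hands the text pair to the UI.
    std::thread processing([&] {
        AudioChunk chunk = queue.pop();
        std::cout << "processing " << chunk.samples.size() << " samples\n";
    });

    audio.join();
    processing.join();
}
```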
### Core Components
```
src/
├── main.cpp              # Entry point, forces NVIDIA GPU
├── core/Pipeline.cpp     # Orchestrates audio→transcription→translation flow
├── audio/
│   ├── AudioCapture.cpp  # PortAudio wrapper with VAD-based segmentation
│   ├── AudioBuffer.cpp   # Accumulates samples, exports WAV/Opus
│   └── NoiseReducer.cpp  # RNNoise denoising (16kHz→48kHz→16kHz resampling)
├── api/
│   ├── WhisperClient.cpp # OpenAI Whisper API (multipart/form-data)
│   ├── ClaudeClient.cpp  # Anthropic Claude API (JSON)
│   └── WinHttpClient.cpp # Native Windows HTTP client (replaced libcurl)
├── ui/TranslationUI.cpp  # ImGui interface with VAD threshold controls
└── utils/
    ├── Config.cpp        # Loads config.json + .env
    └── ThreadSafeQueue.h # Lock-free queue for audio chunks
```
### Key Data Flow
1. `AudioCapture` detects speech via VAD thresholds (RMS + Peak)
2. Speech segments sent to `NoiseReducer` (RNNoise) for denoising
3. Denoised audio encoded to Opus/OGG for bandwidth efficiency (46x reduction)
4. `WhisperClient` sends audio to gpt-4o-mini-transcribe
5. `Pipeline` filters Whisper hallucinations (known garbage phrases)
6. `ClaudeClient` translates Chinese text to French
7. `TranslationUI` displays accumulated transcription/translation
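Step 3 maps onto the public libopus API roughly as follows (a sketch, not the actual `AudioBuffer` code; OGG packaging omitted):
```cpp
#include <opus.h>
#include <cstdio>
#include <vector>

int main() {
    int err = 0;
    // 16 kHz mono, VoIP tuning; matches the pipeline's capture format.
    OpusEncoder* enc = opus_encoder_create(16000, 1, OPUS_APPLICATION_VOIP, &err);
    if (err != OPUS_OK || !enc) return 1;
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));  // 24 kbps, as used for API upload

    std::vector<float> pcm(320, 0.0f);               // one 20 ms frame at 16 kHz
    unsigned char packet[4000];
    opus_int32 n = opus_encode_float(enc, pcm.data(), 320, packet, sizeof(packet));
    if (n > 0) std::printf("encoded %d bytes\n", n);

    opus_encoder_destroy(enc);
    return 0;
}
```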
### External Dependencies (fetched via CMake FetchContent)
- **ImGui** v1.90.1 - UI framework
- **Opus** v1.5.2 - Audio encoding
- **Ogg** v1.3.6 - Container format
- **RNNoise** v0.1.1 - Neural network noise reduction
### vcpkg Dependencies (x64-mingw-static triplet)
- portaudio, nlohmann_json, glfw3, glad
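If these ever need to be installed by hand (the setup script normally handles this), the equivalent invocation should be roughly:
```batch
vcpkg install portaudio nlohmann-json glfw3 glad --triplet=x64-mingw-static
```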
## Configuration
### config.json
- `audio.sample_rate`: 16000 Hz (required for Whisper)
- `whisper.model`: "gpt-4o-mini-transcribe"
- `whisper.language`: "zh" (Chinese)
- `claude.model`: "claude-3-5-haiku-20241022"
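Put together, a minimal config.json consistent with the keys above (the real file may carry more fields, e.g. the recording options read via `Config::getRecordingConfig`):
```json
{
  "audio": { "sample_rate": 16000 },
  "whisper": { "model": "gpt-4o-mini-transcribe", "language": "zh" },
  "claude": { "model": "claude-3-5-haiku-20241022" }
}
```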
### VAD Tuning
VAD thresholds are adjustable in the UI at runtime:
- RMS threshold: speech detection sensitivity
- Peak threshold: transient/click rejection
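One plausible way the two thresholds combine, as a sketch (not the actual `AudioCapture` logic): a frame counts as speech when its RMS clears the sensitivity threshold while its peak stays below the transient ceiling.
```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Sketch: returns true when a frame looks like speech.
// rms_threshold sets the sensitivity; peak_threshold rejects clicks/claps
// whose peak is high even though they are not sustained speech.
bool isSpeechFrame(const float* samples, std::size_t n,
                   float rms_threshold, float peak_threshold) {
    float sum_sq = 0.0f, peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += samples[i] * samples[i];
        peak = std::max(peak, std::fabs(samples[i]));
    }
    const float rms = std::sqrt(sum_sq / static_cast<float>(n));
    return rms > rms_threshold && peak < peak_threshold;
}
```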
## Important Implementation Details
### Whisper Hallucination Filtering
`Pipeline.cpp` contains an extensive list of known Whisper hallucinations (lines ~195-260) that are filtered out:
- "Thank you for watching", "Subscribe", YouTube phrases
- Chinese video endings: "谢谢观看", "再见", "订阅"
- Music symbols, silence markers
- Single-word interjections
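In effect the filter is a substring match against a blocklist; a reduced sketch (the real list in `Pipeline.cpp` is far longer):
```cpp
#include <array>
#include <string>
#include <string_view>

// Reduced sketch of the blocklist check; the real list in Pipeline.cpp
// covers many more phrases, music symbols, and silence markers.
bool isLikelyHallucination(const std::string& text) {
    static constexpr std::array<std::string_view, 5> kBlocklist = {
        "Thank you for watching", "Subscribe", "谢谢观看", "再见", "订阅",
    };
    for (std::string_view phrase : kBlocklist) {
        if (text.find(phrase) != std::string::npos) return true;
    }
    return false;
}
```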
### GPU Forcing (Optimus/PowerXpress)
`main.cpp` exports `NvOptimusEnablement` and `AmdPowerXpressRequestHighPerformance` symbols to force dedicated GPU usage on hybrid graphics systems.
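The exports follow the vendor-documented pattern:
```cpp
#include <windows.h>

// Exported symbols read by the NVIDIA Optimus and AMD PowerXpress drivers;
// nonzero values request the dedicated GPU for this executable.
extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}
```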
### Audio Processing Pipeline
1. 16kHz mono input → Upsampled to 48kHz for RNNoise
2. RNNoise denoising (480-sample frames at 48kHz)
3. Transient suppression (claps, clicks, pops)
4. Downsampled back to 16kHz
5. Opus encoding at 24kbps for API transmission
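Step 2 maps onto the RNNoise C API roughly like this (a sketch; resampling and transient handling omitted, and note that newer RNNoise versions pass an optional model argument to `rnnoise_create`):
```cpp
#include <rnnoise.h>
#include <cstddef>
#include <vector>

// Denoise 48 kHz mono audio in place, one 480-sample (10 ms) frame at a time.
// RNNoise expects float samples scaled to the 16-bit range, not [-1, 1].
void denoise48k(std::vector<float>& samples) {
    DenoiseState* st = rnnoise_create();   // newer versions: rnnoise_create(NULL)
    const std::size_t kFrame = 480;
    for (std::size_t i = 0; i + kFrame <= samples.size(); i += kFrame) {
        rnnoise_process_frame(st, &samples[i], &samples[i]);
    }
    rnnoise_destroy(st);
}
```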
## Console-Only Build
A `SecondVoice_Console` target exists for testing without UI:
- Uses `main_console.cpp`
- No ImGui/GLFW dependencies
- Outputs transcriptions to stdout

PLAN_DEBUG.md (new file, 60 lines)

# SecondVoice Debug Plan
## Observed Problem
The transcript from 2025-11-23 (5:31 min, 75 segments) shows:
- Fragmented sentences ("我很。" → "Je suis.")
- Transcription errors ("两个老鼠求我" - "two mice beg me")
- One- or two-word segments with no context
- Whisper hallucinations ("汪汪汪汪")
## Hypotheses (to validate)
1. **VAD cuts off too early** - Voice Activity Detection triggers end-of-segment too quickly, cutting sentences mid-stream
2. **Segments too short** - Whisper lacks the audio context needed to transcribe the Chinese correctly
3. **Ambient noise** - Noise is interpreted as speech (segment 22 mentions "太多声音了")
4. **Inter-segment context loss** - Each segment is processed in isolation, so Whisper cannot use the context of preceding sentences
## Plan: Per-Session Logging System
### Goal
Collect actionable data to pinpoint the source of the problems.
### Structure
```
sessions/
└── YYYY-MM-DD_HHMMSS/
    ├── session.json       # Global session metadata
    ├── segments/
    │   ├── 001.json
    │   ├── 002.json
    │   └── ...
    └── transcript.txt     # Final export (already exists)
```
### Segment JSON Format
```json
{
  "id": 1,
  "chinese": "两个老鼠求我",
  "french": "Deux souris me supplient"
}
```
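Since nlohmann_json is already a dependency, the writer could be a few lines; a minimal sketch (hypothetical `writeSegment` helper, names not final):
```cpp
#include <nlohmann/json.hpp>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: writes sessions/<ts>/segments/NNN.json in the format above.
void writeSegment(const std::filesystem::path& session_dir, int id,
                  const std::string& chinese, const std::string& french) {
    nlohmann::json j = {{"id", id}, {"chinese", chinese}, {"french", french}};

    std::ostringstream name;
    name << std::setw(3) << std::setfill('0') << id << ".json";  // 001.json, 002.json, ...

    std::filesystem::create_directories(session_dir / "segments");
    std::ofstream(session_dir / "segments" / name.str()) << j.dump(2) << "\n";
}
```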
### To Be Defined
- [ ] Which audio metadata to add? (duration, RMS, timestamps)
- [ ] Save the per-segment .opus audio files?
- [ ] Whisper info? (latency, model, filtered flag)
- [ ] Claude info? (latency, model)
## Next Steps
1. Implement the basic logging system (Chinese/French JSON)
2. Analyze patterns in the data
3. Enrich with more metadata if needed

src/core/Pipeline.cpp

```diff
@@ -103,13 +103,15 @@ void Pipeline::stop() {
     // Save full recording
     auto& config = Config::getInstance();
+    auto now = std::chrono::system_clock::now();
+    auto time_t = std::chrono::system_clock::to_time_t(now);
+    std::stringstream timestamp;
+    timestamp << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S");
+
     if (config.getRecordingConfig().save_audio && full_recording_) {
-        auto now = std::chrono::system_clock::now();
-        auto time_t = std::chrono::system_clock::to_time_t(now);
         std::stringstream ss;
         ss << config.getRecordingConfig().output_directory << "/"
-           << "recording_" << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S")
-           << ".wav";
+           << "recording_" << timestamp.str() << ".wav";
 
         if (full_recording_->saveToWav(ss.str())) {
             std::cout << "Recording saved to: " << ss.str() << std::endl;
@@ -117,6 +119,13 @@ void Pipeline::stop() {
             std::cerr << "Failed to save recording" << std::endl;
         }
     }
+
+    // Auto-export transcript when stopping
+    if (ui_) {
+        std::stringstream transcript_ss;
+        transcript_ss << "transcripts/transcript_" << timestamp.str() << ".txt";
+        ui_->exportTranscript(transcript_ss.str());
+    }
 }
 
 void Pipeline::audioThread() {
@@ -334,6 +343,12 @@ void Pipeline::update() {
         clearAccumulated();
         ui_->resetClearRequest();
     }
+
+    // Check if export was requested
+    if (ui_->isExportRequested()) {
+        ui_->exportTranscript();
+        ui_->resetExportRequest();
+    }
 }
 
 bool Pipeline::shouldClose() const {
```

src/ui/TranslationUI.cpp

```diff
@@ -7,6 +7,11 @@
 #include <imgui_impl_opengl3.h>
 #include <iostream>
 #include <thread>
+#include <fstream>
+#include <chrono>
+#include <iomanip>
+#include <sstream>
+#include <filesystem>
 
 namespace secondvoice {
@@ -314,14 +319,30 @@ void TranslationUI::renderTranslations() {
 void TranslationUI::renderControls() {
     ImGui::Spacing();
 
-    // Center the stop button
-    float button_width = 200.0f;
+    // Center buttons
+    float button_width = 150.0f;
+    float spacing = 20.0f;
+    float total_width = button_width * 2 + spacing;
     float window_width = ImGui::GetWindowWidth();
-    ImGui::SetCursorPosX((window_width - button_width) * 0.5f);
+    ImGui::SetCursorPosX((window_width - total_width) * 0.5f);
 
-    if (ImGui::Button("STOP RECORDING", ImVec2(button_width, 40))) {
+    // Stop button (red)
+    ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.7f, 0.2f, 0.2f, 1.0f));
+    ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.8f, 0.3f, 0.3f, 1.0f));
+    if (ImGui::Button("STOP", ImVec2(button_width, 40))) {
         stop_requested_ = true;
     }
+    ImGui::PopStyleColor(2);
+
+    ImGui::SameLine(0, spacing);
+
+    // Export button (blue)
+    ImGui::PushStyleColor(ImGuiCol_Button, ImVec4(0.2f, 0.4f, 0.7f, 1.0f));
+    ImGui::PushStyleColor(ImGuiCol_ButtonHovered, ImVec4(0.3f, 0.5f, 0.8f, 1.0f));
+    if (ImGui::Button("EXPORT", ImVec2(button_width, 40))) {
+        export_requested_ = true;
+    }
+    ImGui::PopStyleColor(2);
 
     ImGui::Spacing();
 }
@@ -437,4 +458,86 @@ void TranslationUI::renderAudioPanel() {
     ImGui::PopStyleColor();
 }
 
+bool TranslationUI::exportTranscript(const std::string& filename) const {
+    // Generate filename if not provided
+    std::string output_file = filename;
+    if (output_file.empty()) {
+        auto now = std::chrono::system_clock::now();
+        auto time_t = std::chrono::system_clock::to_time_t(now);
+        std::stringstream ss;
+        ss << "transcripts/transcript_"
+           << std::put_time(std::localtime(&time_t), "%Y%m%d_%H%M%S")
+           << ".txt";
+        output_file = ss.str();
+    }
+
+    // Create directory if needed
+    std::filesystem::path filepath(output_file);
+    if (filepath.has_parent_path()) {
+        std::filesystem::create_directories(filepath.parent_path());
+    }
+
+    std::ofstream file(output_file, std::ios::out | std::ios::binary);
+    if (!file.is_open()) {
+        std::cerr << "[Export] Failed to open file: " << output_file << std::endl;
+        return false;
+    }
+
+    // Write UTF-8 BOM for Windows compatibility
+    file << "\xEF\xBB\xBF";
+
+    // Header
+    auto now = std::chrono::system_clock::now();
+    auto time_t = std::chrono::system_clock::to_time_t(now);
+    file << "═══════════════════════════════════════════════════════════════\n";
+    file << " SecondVoice - Transcript Export\n";
+    file << " Date: " << std::put_time(std::localtime(&time_t), "%Y-%m-%d %H:%M:%S") << "\n";
+    file << " Duration: " << (recording_duration_ / 60) << ":"
+         << std::setfill('0') << std::setw(2) << (recording_duration_ % 60) << "\n";
+    file << " Segments: " << messages_.size() << "\n";
+    file << "═══════════════════════════════════════════════════════════════\n\n";
+
+    // Accumulated text (full transcript)
+    if (!accumulated_chinese_.empty() || !accumulated_french_.empty()) {
+        file << "───────────────────────────────────────────────────────────────\n";
+        file << " TEXTE COMPLET / FULL TEXT\n";
+        file << "───────────────────────────────────────────────────────────────\n\n";
+        file << "[中文 / Chinese]\n";
+        file << accumulated_chinese_ << "\n\n";
+        file << "[Français / French]\n";
+        file << accumulated_french_ << "\n\n";
+    }
+
+    // Individual segments
+    if (!messages_.empty()) {
+        file << "───────────────────────────────────────────────────────────────\n";
+        file << " SEGMENTS DÉTAILLÉS / DETAILED SEGMENTS\n";
+        file << "───────────────────────────────────────────────────────────────\n\n";
+        int segment_num = 1;
+        for (const auto& msg : messages_) {
+            file << "[Segment " << segment_num++ << "]\n";
+            file << "中文: " << msg.chinese << "\n";
+            file << "FR: " << msg.french << "\n\n";
+        }
+    }
+
+    // Footer with stats
+    file << "───────────────────────────────────────────────────────────────\n";
+    file << " STATISTIQUES / STATISTICS\n";
+    file << "───────────────────────────────────────────────────────────────\n";
+    file << " Audio processed: " << static_cast<int>(total_audio_seconds_) << " seconds\n";
+    file << " Whisper API calls: " << whisper_calls_ << "\n";
+    file << " Claude API calls: " << claude_calls_ << "\n";
+    file << " Estimated cost: $" << std::fixed << std::setprecision(4) << getEstimatedCost() << "\n";
+    file << "═══════════════════════════════════════════════════════════════\n";
+
+    file.close();
+    std::cout << "[Export] Transcript saved to: " << output_file << std::endl;
+    return true;
+}
+
 } // namespace secondvoice
```

src/ui/TranslationUI.h

```diff
@@ -30,6 +30,11 @@ public:
     bool isClearRequested() const { return clear_requested_; }
     void resetClearRequest() { clear_requested_ = false; }
 
+    // Export transcript to file
+    bool exportTranscript(const std::string& filename = "") const;
+    bool isExportRequested() const { return export_requested_; }
+    void resetExportRequest() { export_requested_ = false; }
+
     void setRecordingDuration(int seconds) { recording_duration_ = seconds; }
     void setProcessingStatus(const std::string& status) { processing_status_ = status; }
@@ -55,6 +60,7 @@ private:
     std::string accumulated_french_;
     bool stop_requested_ = false;
     bool clear_requested_ = false;
+    bool export_requested_ = false;
     bool auto_scroll_ = true;
     int recording_duration_ = 0;
```