# Phase 7 - Implémentation STT Modulaire **Date de création** : 2025-11-29 **Objectif** : Architecture STT complète avec support multi-engines (Vosk, PocketSphinx, Whisper) **Nom de l'assistant** : Celuna (anciennement AISSIA) --- ## Vue d'Ensemble ### Objectifs 1. **Architecture modulaire** : Interface `ISTTEngine` avec 4 implémentations 2. **Service STT** : Layer `ISTTService` pour abstraction business logic 3. **Dual Mode** : Passive (keyword spotting) + Active (transcription complète) 4. **Coût optimisé** : Local par défaut, Whisper API en fallback optionnel --- ## Architecture Cible ``` ┌─────────────────────────────────────────────────────────┐ │ VoiceService │ │ - Gère TTS (EspeakTTSEngine) │ │ - Gère STT via ISTTService │ │ - Pub/sub IIO (voice:speak, voice:listen, etc.) │ └─────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────┐ │ ISTTService │ │ - Interface service STT │ │ - Gère mode passive/active │ │ - Switch engines selon config │ │ - Fallback automatique │ └─────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────┐ │ STTEngineFactory │ │ - create(type, config) → unique_ptr │ └─────────────────────────────────────────────────────────┘ │ ┌────────────────┼────────────────┬──────────────┐ ▼ ▼ ▼ ▼ ┌──────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐ │ Vosk │ │ PocketSphinx │ │ WhisperCpp │ │ WhisperAPI │ │ Engine │ │ Engine │ │ Engine │ │ Engine │ └──────────┘ └──────────────┘ └─────────────┘ └──────────────┘ Local Local (keywords) Local (précis) Remote (payant) 50MB model Léger ~10MB 75-142MB API OpenAI ``` --- ## Phase 7.1 - Service Layer (ISTTService) ### Objectif Créer une couche service qui abstrait la complexité des engines STT et gère : - Mode passive/active - Switching d'engines - Fallback automatique - Gestion erreurs ### Fichiers à créer #### 1. `src/services/ISTTService.hpp` **Interface du service STT** ```cpp #pragma once #include #include #include #include namespace aissia { enum class STTMode { PASSIVE, // Keyword spotting (économe) ACTIVE // Full transcription }; enum class STTEngineType { VOSK, POCKETSPHINX, WHISPER_CPP, WHISPER_API, AUTO // Factory choisit }; /** * @brief Callback pour résultats transcription */ using TranscriptionCallback = std::function; /** * @brief Callback pour détection keyword */ using KeywordCallback = std::function; /** * @brief Interface service STT */ class ISTTService { public: virtual ~ISTTService() = default; /** * @brief Démarre le service STT */ virtual bool start() = 0; /** * @brief Arrête le service STT */ virtual void stop() = 0; /** * @brief Change le mode STT */ virtual void setMode(STTMode mode) = 0; /** * @brief Obtient le mode actuel */ virtual STTMode getMode() const = 0; /** * @brief Transcrit un fichier audio */ virtual std::string transcribeFile(const std::string& filePath) = 0; /** * @brief Transcrit des données audio PCM */ virtual std::string transcribe(const std::vector& audioData) = 0; /** * @brief Active l'écoute en streaming (temps réel) */ virtual void startListening(TranscriptionCallback onTranscription, KeywordCallback onKeyword) = 0; /** * @brief Arrête l'écoute streaming */ virtual void stopListening() = 0; /** * @brief Configure la langue */ virtual void setLanguage(const std::string& language) = 0; /** * @brief Vérifie si le service est disponible */ virtual bool isAvailable() const = 0; /** * @brief Obtient le nom de l'engine actuel */ virtual std::string getCurrentEngine() const = 0; }; } // namespace aissia ``` **Estimation** : 50 lignes --- #### 2. `src/services/STTService.hpp` + `.cpp` **Implémentation du service STT** **Features** : - Gère 2 engines : 1 pour passive (PocketSphinx), 1 pour active (Vosk/Whisper) - Switch automatique passive → active sur keyword - Timeout active → passive (30s sans parole) - Fallback vers Whisper API si engine local fail - Thread d'écoute microphone (via PortAudio ou ALSA) **Pseudo-code** : ```cpp class STTService : public ISTTService { private: std::unique_ptr m_passiveEngine; // PocketSphinx std::unique_ptr m_activeEngine; // Vosk/Whisper std::unique_ptr m_fallbackEngine; // WhisperAPI STTMode m_currentMode = STTMode::PASSIVE; std::thread m_listenThread; std::atomic m_listening{false}; TranscriptionCallback m_onTranscription; KeywordCallback m_onKeyword; std::chrono::steady_clock::time_point m_lastActivity; public: bool start() override { // Load engines from config m_passiveEngine = STTEngineFactory::create("pocketsphinx", config); m_activeEngine = STTEngineFactory::create("vosk", config); m_fallbackEngine = STTEngineFactory::create("whisper-api", config); return m_passiveEngine && m_activeEngine; } void startListening(TranscriptionCallback onTranscription, KeywordCallback onKeyword) override { m_onTranscription = onTranscription; m_onKeyword = onKeyword; m_listening = true; m_listenThread = std::thread([this]() { listenLoop(); }); } private: void listenLoop() { // Ouvrir microphone (PortAudio) // Boucle infinie : // - Si PASSIVE : use m_passiveEngine (keywords only) // - Si keyword détecté → setMode(ACTIVE) + callback // - Si ACTIVE : use m_activeEngine (full transcription) // - Transcrit en temps réel // - Si timeout 30s → setMode(PASSIVE) } }; ``` **Estimation** : 300 lignes (service + thread microphone) --- ## Phase 7.2 - Engines STT ### Fichiers à modifier/créer #### 1. `src/shared/audio/ISTTEngine.hpp` ✅ Existe **Modifications** : Aucune (interface déjà bonne) --- #### 2. `src/shared/audio/WhisperAPIEngine.hpp` ✅ Existe **Modifications** : Aucune (déjà implémenté, sera utilisé comme fallback) --- #### 3. `src/shared/audio/VoskSTTEngine.hpp` 🆕 À créer **Vosk Speech Recognition** **Dépendances** : - `vosk` library (C++ bindings) - Modèle français : `vosk-model-small-fr-0.22` (~50MB) **Installation** : ```bash # Linux sudo apt install libvosk-dev # Télécharger modèle FR wget https://alphacephei.com/vosk/models/vosk-model-small-fr-0.22.zip unzip vosk-model-small-fr-0.22.zip -d models/ ``` **Implémentation** : ```cpp #pragma once #include "ISTTEngine.hpp" #include #include namespace aissia { class VoskSTTEngine : public ISTTEngine { public: explicit VoskSTTEngine(const std::string& modelPath) { m_logger = spdlog::get("VoskSTT"); if (!m_logger) { m_logger = spdlog::stdout_color_mt("VoskSTT"); } // Load Vosk model m_model = vosk_model_new(modelPath.c_str()); if (!m_model) { m_logger->error("Failed to load Vosk model: {}", modelPath); m_available = false; return; } // Create recognizer (16kHz, mono) m_recognizer = vosk_recognizer_new(m_model, 16000.0); m_available = true; m_logger->info("Vosk STT initialized: {}", modelPath); } ~VoskSTTEngine() override { if (m_recognizer) vosk_recognizer_free(m_recognizer); if (m_model) vosk_model_free(m_model); } std::string transcribe(const std::vector& audioData) override { if (!m_available || audioData.empty()) return ""; // Convert float to int16 std::vector samples(audioData.size()); for (size_t i = 0; i < audioData.size(); ++i) { samples[i] = static_cast(audioData[i] * 32767.0f); } // Feed audio to recognizer vosk_recognizer_accept_waveform(m_recognizer, reinterpret_cast(samples.data()), samples.size() * sizeof(int16_t)); // Get final result const char* result = vosk_recognizer_final_result(m_recognizer); // Parse JSON result: {"text": "transcription"} std::string text = parseVoskResult(result); m_logger->debug("Transcribed: {}", text); return text; } std::string transcribeFile(const std::string& filePath) override { // Load WAV file, convert to PCM, call transcribe() // (Implementation omitted for brevity) } void setLanguage(const std::string& language) override { // Vosk model is language-specific, can't change at runtime } bool isAvailable() const override { return m_available; } std::string getEngineName() const override { return "vosk"; } private: VoskModel* m_model = nullptr; VoskRecognizer* m_recognizer = nullptr; bool m_available = false; std::shared_ptr m_logger; std::string parseVoskResult(const char* json) { // Parse JSON: {"text": "bonjour"} → "bonjour" // Use nlohmann::json } }; } // namespace aissia ``` **Estimation** : 200 lignes --- #### 4. `src/shared/audio/PocketSphinxEngine.hpp` 🆕 À créer **PocketSphinx Keyword Spotting** **Dépendances** : - `pocketsphinx` library - Acoustic model (phonétique) **Installation** : ```bash sudo apt install pocketsphinx pocketsphinx-en-us ``` **Configuration Keywords** : ``` # keywords.txt celuna /1e-40/ hey celuna /1e-50/ ``` **Implémentation** : ```cpp #pragma once #include "ISTTEngine.hpp" #include #include namespace aissia { class PocketSphinxEngine : public ISTTEngine { public: explicit PocketSphinxEngine(const std::vector& keywords, const std::string& modelPath) { m_logger = spdlog::get("PocketSphinx"); if (!m_logger) { m_logger = spdlog::stdout_color_mt("PocketSphinx"); } // Create keyword file createKeywordFile(keywords); // Initialize PocketSphinx ps_config_t* config = ps_config_init(NULL); ps_config_set_str(config, "hmm", modelPath.c_str()); ps_config_set_str(config, "kws", "/tmp/celuna_keywords.txt"); ps_config_set_float(config, "kws_threshold", 1e-40); m_decoder = ps_init(config); m_available = (m_decoder != nullptr); if (m_available) { m_logger->info("PocketSphinx initialized for keyword spotting"); } } ~PocketSphinxEngine() override { if (m_decoder) ps_free(m_decoder); } std::string transcribe(const std::vector& audioData) override { if (!m_available || audioData.empty()) return ""; // Convert to int16 std::vector samples(audioData.size()); for (size_t i = 0; i < audioData.size(); ++i) { samples[i] = static_cast(audioData[i] * 32767.0f); } // Process audio ps_start_utt(m_decoder); ps_process_raw(m_decoder, samples.data(), samples.size(), FALSE, FALSE); ps_end_utt(m_decoder); // Get keyword (if detected) const char* hyp = ps_get_hyp(m_decoder, nullptr); std::string keyword = (hyp ? hyp : ""); if (!keyword.empty()) { m_logger->info("Keyword detected: {}", keyword); } return keyword; } std::string transcribeFile(const std::string& filePath) override { // Not used for keyword spotting (streaming only) return ""; } void setLanguage(const std::string& language) override {} bool isAvailable() const override { return m_available; } std::string getEngineName() const override { return "pocketsphinx"; } private: ps_decoder_t* m_decoder = nullptr; bool m_available = false; std::shared_ptr m_logger; void createKeywordFile(const std::vector& keywords) { std::ofstream file("/tmp/celuna_keywords.txt"); for (const auto& kw : keywords) { file << kw << " /1e-40/\n"; } } }; } // namespace aissia ``` **Estimation** : 180 lignes --- #### 5. `src/shared/audio/WhisperCppEngine.hpp` 🆕 À créer (OPTIONNEL) **whisper.cpp - Local Whisper** **Dépendances** : - `whisper.cpp` (ggerganov) - Modèle : `ggml-tiny.bin` (75MB) ou `ggml-base.bin` (142MB) **Installation** : ```bash git clone https://github.com/ggerganov/whisper.cpp external/whisper.cpp cd external/whisper.cpp make ./models/download-ggml-model.sh tiny ``` **Implémentation** : Similar à Vosk mais avec API whisper.cpp **Estimation** : 250 lignes **⚠️ Note** : Optionnel, à implémenter seulement si besoin haute précision locale --- #### 6. `src/shared/audio/STTEngineFactory.cpp` 📝 Modifier **Factory pattern pour créer engines** ```cpp #include "STTEngineFactory.hpp" #include "VoskSTTEngine.hpp" #include "PocketSphinxEngine.hpp" #include "WhisperCppEngine.hpp" #include "WhisperAPIEngine.hpp" namespace aissia { std::unique_ptr STTEngineFactory::create( const std::string& type, const nlohmann::json& config) { if (type == "vosk" || type == "auto") { std::string modelPath = config.value("model_path", "./models/vosk-model-small-fr-0.22"); auto engine = std::make_unique(modelPath); if (engine->isAvailable()) return engine; } if (type == "pocketsphinx") { std::vector keywords = config.value("keywords", std::vector{"celuna"}); std::string modelPath = config.value("model_path", "/usr/share/pocketsphinx/model/en-us"); auto engine = std::make_unique(keywords, modelPath); if (engine->isAvailable()) return engine; } if (type == "whisper-cpp") { std::string modelPath = config.value("model_path", "./models/ggml-tiny.bin"); auto engine = std::make_unique(modelPath); if (engine->isAvailable()) return engine; } if (type == "whisper-api") { std::string apiKey = std::getenv(config.value("api_key_env", "OPENAI_API_KEY").c_str()); if (!apiKey.empty()) { return std::make_unique(apiKey); } } // Fallback: stub engine (no-op) return std::make_unique(); } } // namespace aissia ``` **Estimation** : 80 lignes --- ## Phase 7.3 - Intégration VoiceService ### Fichier à modifier #### `src/services/VoiceService.cpp` **Modifications** : 1. **Remplacer implémentation directe par ISTTService** **Avant** : ```cpp // VoiceService gère directement WhisperAPIEngine std::unique_ptr m_sttEngine; ``` **Après** : ```cpp // VoiceService délègue à ISTTService std::unique_ptr m_sttService; ``` 2. **Initialisation** : ```cpp void VoiceService::initialize(const nlohmann::json& config) { // TTS (unchanged) m_ttsEngine = TTSEngineFactory::create(); // STT (new) m_sttService = std::make_unique(config["stt"]); m_sttService->start(); // Setup callbacks m_sttService->startListening( [this](const std::string& text, STTMode mode) { handleTranscription(text, mode); }, [this](const std::string& keyword) { handleKeyword(keyword); } ); } ``` 3. **Handlers** : ```cpp void VoiceService::handleKeyword(const std::string& keyword) { m_logger->info("Keyword detected: {}", keyword); // Publish keyword detection nlohmann::json event = { {"type", "keyword_detected"}, {"keyword", keyword}, {"timestamp", std::time(nullptr)} }; m_io->publish("voice:keyword_detected", event); // Auto-switch to active mode m_sttService->setMode(STTMode::ACTIVE); } void VoiceService::handleTranscription(const std::string& text, STTMode mode) { m_logger->info("Transcription ({}): {}", mode == STTMode::PASSIVE ? "passive" : "active", text); // Publish transcription nlohmann::json event = { {"type", "transcription"}, {"text", text}, {"mode", mode == STTMode::PASSIVE ? "passive" : "active"}, {"timestamp", std::time(nullptr)} }; m_io->publish("voice:transcription", event); } ``` **Estimation modifications** : +150 lignes --- ## Phase 7.4 - Configuration ### Fichier à modifier #### `config/voice.json` **Configuration complète** : ```json { "tts": { "enabled": true, "engine": "auto", "rate": 0, "volume": 80, "voice": "fr-fr" }, "stt": { "passive_mode": { "enabled": true, "engine": "pocketsphinx", "keywords": ["celuna", "hey celuna", "ok celuna"], "threshold": 0.8, "model_path": "/usr/share/pocketsphinx/model/en-us" }, "active_mode": { "enabled": true, "engine": "vosk", "model_path": "./models/vosk-model-small-fr-0.22", "language": "fr", "timeout_seconds": 30, "fallback_engine": "whisper-api" }, "whisper_api": { "api_key_env": "OPENAI_API_KEY", "model": "whisper-1" }, "microphone": { "device_id": -1, "sample_rate": 16000, "channels": 1, "buffer_size": 1024 } } } ``` --- ## Phase 7.5 - Tests ### Fichiers à créer #### `tests/services/STTServiceTests.cpp` **Tests unitaires** : - ✅ Création service - ✅ Start/stop - ✅ Switch passive/active - ✅ Keyword detection - ✅ Transcription - ✅ Fallback engine - ✅ Timeout active → passive **Estimation** : 200 lignes --- #### `tests/integration/IT_014_VoicePassiveMode.cpp` **Test d'intégration passive mode** : ```cpp // Simulate audio avec keyword "celuna" // Vérifie : // 1. PocketSphinx détecte keyword // 2. Event "voice:keyword_detected" publié // 3. Switch vers ACTIVE mode // 4. Timeout 30s → retour PASSIVE ``` **Estimation** : 150 lignes --- #### `tests/integration/IT_015_VoiceActiveTranscription.cpp` **Test d'intégration active mode** : ```cpp // Simulate conversation complète : // 1. User: "celuna" → keyword detected // 2. User: "quelle heure est-il ?" → transcription via Vosk // 3. AI responds → TTS // 4. Timeout → retour passive ``` **Estimation** : 200 lignes --- ## Phase 7.6 - Documentation ### Fichiers à créer/modifier #### `docs/STT_ARCHITECTURE.md` **Documentation technique** : - Architecture STT - Choix engines - Configuration - Troubleshooting **Estimation** : 400 lignes --- #### `README.md` **Mise à jour roadmap** : ```markdown ### Completed ✅ - [x] STT multi-engine (Vosk, PocketSphinx, Whisper) - [x] Passive/Active mode (keyword "Celuna") - [x] Local STT (coût zéro) ``` --- ## Récapitulatif Estimation | Tâche | Fichiers | Lignes | Priorité | |-------|----------|--------|----------| | **7.1 Service Layer** | `ISTTService.hpp`, `STTService.{h,cpp}` | 350 | P0 | | **7.2 Vosk Engine** | `VoskSTTEngine.hpp` | 200 | P0 | | **7.2 PocketSphinx** | `PocketSphinxEngine.hpp` | 180 | P1 | | **7.2 WhisperCpp** | `WhisperCppEngine.hpp` | 250 | P2 (optionnel) | | **7.2 Factory** | `STTEngineFactory.cpp` | 80 | P0 | | **7.3 VoiceService** | `VoiceService.cpp` (modifs) | +150 | P0 | | **7.4 Config** | `voice.json` | +30 | P0 | | **7.5 Tests unitaires** | `STTServiceTests.cpp` | 200 | P1 | | **7.5 Tests intégration** | `IT_014`, `IT_015` | 350 | P1 | | **7.6 Documentation** | `STT_ARCHITECTURE.md`, README | 450 | P2 | | **TOTAL** | 14 fichiers | **~2240 lignes** | | --- ## Plan d'Exécution ### Milestone 1 : MVP STT Local (Vosk seul) ⚡ **Objectif** : STT fonctionnel sans keyword detection **Tâches** : 1. ✅ Créer `ISTTService.hpp` 2. ✅ Créer `STTService` (simple, sans passive mode) 3. ✅ Créer `VoskSTTEngine` 4. ✅ Modifier `STTEngineFactory` 5. ✅ Intégrer dans `VoiceService` 6. ✅ Config `voice.json` 7. ✅ Test manuel transcription **Durée estimée** : 3-4h **Lignes** : ~600 --- ### Milestone 2 : Passive Mode (Keyword Detection) 🎧 **Objectif** : Détection "Celuna" + switch auto **Tâches** : 1. ✅ Créer `PocketSphinxEngine` 2. ✅ Étendre `STTService` (dual mode) 3. ✅ Callbacks keyword/transcription 4. ✅ Timeout active → passive 5. ✅ Config passive/active 6. ✅ Tests IT_014, IT_015 **Durée estimée** : 4-5h **Lignes** : ~700 --- ### Milestone 3 : Fallback Whisper API 🔄 **Objectif** : Robustesse avec fallback cloud **Tâches** : 1. ✅ Intégrer `WhisperAPIEngine` existant 2. ✅ Logique fallback dans `STTService` 3. ✅ Config fallback 4. ✅ Tests fallback **Durée estimée** : 2h **Lignes** : ~200 --- ### Milestone 4 : Polish & Documentation 📝 **Tâches** : 1. ✅ Documentation complète 2. ✅ Tests unitaires STTService 3. ✅ Troubleshooting guide 4. ✅ Mise à jour README **Durée estimée** : 3h **Lignes** : ~700 --- ## Dépendances Externes ### À installer ```bash # Vosk sudo apt install libvosk-dev wget https://alphacephei.com/vosk/models/vosk-model-small-fr-0.22.zip unzip vosk-model-small-fr-0.22.zip -d models/ # PocketSphinx sudo apt install pocketsphinx pocketsphinx-en-us # PortAudio (pour microphone) sudo apt install portaudio19-dev # Optionnel: whisper.cpp git clone https://github.com/ggerganov/whisper.cpp external/whisper.cpp cd external/whisper.cpp && make ``` ### CMakeLists.txt ```cmake # Find Vosk find_library(VOSK_LIBRARY vosk REQUIRED) find_path(VOSK_INCLUDE_DIR vosk_api.h REQUIRED) # Find PocketSphinx find_library(POCKETSPHINX_LIBRARY pocketsphinx REQUIRED) find_path(POCKETSPHINX_INCLUDE_DIR pocketsphinx.h REQUIRED) # Find PortAudio find_library(PORTAUDIO_LIBRARY portaudio REQUIRED) find_path(PORTAUDIO_INCLUDE_DIR portaudio.h REQUIRED) # Link target_link_libraries(VoiceService ${VOSK_LIBRARY} ${POCKETSPHINX_LIBRARY} ${PORTAUDIO_LIBRARY} ) ``` --- ## Risques & Mitigation | Risque | Impact | Mitigation | |--------|--------|------------| | **Vosk model trop lourd** | RAM (50MB) | Utiliser `vosk-model-small` au lieu de `base` | | **PocketSphinx faux positifs** | UX | Ajuster threshold (1e-40 → 1e-50) | | **Microphone permissions** | Bloquant | Guide installation PortAudio + permissions | | **Latence transcription** | UX | Buffer 1-2s audio avant transcription | | **Whisper API coût** | Budget | Utiliser seulement en fallback (rare) | --- ## Prochaines Étapes **Après validation de ce plan** : 1. **Installer dépendances** (Vosk, PocketSphinx, PortAudio) 2. **Milestone 1** : Vosk STT basique 3. **Tester** : Transcription fichier audio FR 4. **Milestone 2** : Keyword "Celuna" 5. **Tester** : Conversation complète passive → active 6. **Commit + Push** : Phase 7 complète --- ## Validation Plan **Questions avant implémentation** : 1. ✅ Architecture service layer approuvée ? 2. ✅ Choix engines (Vosk + PocketSphinx) OK ? 3. ❓ Besoin WhisperCpp ou Vosk suffit ? 4. ✅ Nom "Celuna" confirmé ? 5. ❓ Autres keywords à détecter ("hey celuna", "ok celuna") ? --- **Auteur** : Claude Code **Date** : 2025-11-29 **Phase** : 7 - STT Implementation **Status** : 📋 Plan - En attente validation