aissia/src/shared/audio/WhisperCppEngine.hpp
StillHammer a712988584 feat: Phase 7 STT - Complete implementation with 4 engines
Implemented a complete STT (speech-to-text) system with 4 engines:

1. **PocketSphinxEngine** (new)
   - Lightweight keyword spotting
   - Perfect for passive wake word detection
   - ~10MB model, very low CPU/RAM usage
   - Keywords: "celuna", "hey celuna", etc.

2. **VoskSTTEngine** (existing)
   - Balanced local STT for full transcription
   - 50MB models, good accuracy
   - Already working

3. **WhisperCppEngine** (new)
   - High-quality offline STT using whisper.cpp
   - 75MB-2.9GB models depending on quality
   - Excellent accuracy, runs entirely local

4. **WhisperAPIEngine** (existing)
   - Cloud STT via OpenAI Whisper API
   - Best accuracy, requires internet + API key
   - Already working

Features:
- Full JSON configuration via config/voice.json
- Auto-selection mode tries engines in order
- Dual mode support (passive + active)
- Fallback chain for reliability
- All engines use ISTTEngine interface
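
Since all four engines share one interface, a plausible shape for `ISTTEngine` can be inferred from the `override` methods in `WhisperCppEngine.hpp` below (this is a sketch; the actual `ISTTEngine.hpp` may differ). A trivial stub engine is included to illustrate the contract:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the ISTTEngine interface, inferred from the overrides
// declared in WhisperCppEngine.hpp; the real header may differ.
class ISTTEngine {
public:
    virtual ~ISTTEngine() = default;

    // Transcribe raw PCM samples (mono float).
    virtual std::string transcribe(const std::vector<float>& audioData) = 0;

    // Transcribe an audio file on disk.
    virtual std::string transcribeFile(const std::string& filePath) = 0;

    // Set recognition language ("auto", "en", ...).
    virtual void setLanguage(const std::string& language) = 0;

    // True once the engine's model is loaded and ready.
    virtual bool isAvailable() const = 0;

    // Human-readable engine name for logs and config.
    virtual std::string getEngineName() const = 0;
};

// Minimal stub used only to illustrate the contract.
class StubEngine : public ISTTEngine {
public:
    std::string transcribe(const std::vector<float>&) override { return "hello"; }
    std::string transcribeFile(const std::string&) override { return "hello"; }
    void setLanguage(const std::string& lang) override { m_lang = lang; }
    bool isAvailable() const override { return true; }
    std::string getEngineName() const override { return "stub"; }
private:
    std::string m_lang = "auto";
};
```

Any class implementing these five methods can be returned by the factory, which is what lets passive and active modes swap engines freely.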

Updated:
- STTEngineFactory: Added support for all 4 engines
- CMakeLists.txt: Added new source files
- docs/STT_CONFIGURATION.md: Complete config guide

Config example (voice.json):
{
  "passive_mode": { "engine": "pocketsphinx" },
  "active_mode": { "engine": "vosk", "fallback": "whisper-api" }
}
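
The fallback chain implied by this config can be sketched as "first available engine wins" (a simplified illustration, not the actual `STTEngineFactory` code; the `Engine` struct here is a stand-in for real `ISTTEngine` instances):

```cpp
#include <string>
#include <vector>

// Stand-in for a constructed ISTTEngine; "available" mirrors
// ISTTEngine::isAvailable() (e.g., model file loaded successfully).
struct Engine {
    std::string name;
    bool available;
};

// Walk the chain in config order (primary first, fallbacks after)
// and return the first engine that is ready, or nullptr if none is.
const Engine* selectEngine(const std::vector<Engine>& chain) {
    for (const auto& e : chain) {
        if (e.available) {
            return &e;  // first ready engine wins
        }
    }
    return nullptr;  // no usable engine; caller must handle this
}
```

With the config above, active mode would try "vosk" first and fall back to "whisper-api" only if Vosk fails to initialize.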

Architecture: ISTTService → STTEngineFactory → 4 engines
Build: compiles successfully
Status: Phase 7 complete, ready for testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 17:27:47 +08:00


#pragma once

#include "ISTTEngine.hpp"
#include <spdlog/spdlog.h>
#include <memory>
#include <vector>
#include <string>

// whisper.cpp forward declarations (to avoid including full headers)
struct whisper_context;
struct whisper_full_params;

namespace aissia {

/**
 * @brief Whisper.cpp Speech-to-Text engine
 *
 * Local high-quality STT using OpenAI's Whisper model via whisper.cpp.
 * Runs entirely offline with excellent accuracy.
 *
 * Features:
 * - High accuracy (OpenAI Whisper quality)
 * - Completely offline (no internet required)
 * - Multiple model sizes (tiny, base, small, medium, large)
 * - Multilingual support
 *
 * Model sizes:
 * - tiny:   ~75MB,  fastest, less accurate
 * - base:   ~142MB, balanced
 * - small:  ~466MB, good quality
 * - medium: ~1.5GB, very good
 * - large:  ~2.9GB, best quality
 *
 * Recommended: base or small for most use cases
 */
class WhisperCppEngine : public ISTTEngine {
public:
    /**
     * @brief Construct Whisper.cpp engine
     * @param modelPath Path to Whisper GGML model file (e.g., "models/ggml-base.bin")
     */
    explicit WhisperCppEngine(const std::string& modelPath);
    ~WhisperCppEngine() override;

    // Disable copy
    WhisperCppEngine(const WhisperCppEngine&) = delete;
    WhisperCppEngine& operator=(const WhisperCppEngine&) = delete;

    std::string transcribe(const std::vector<float>& audioData) override;
    std::string transcribeFile(const std::string& filePath) override;
    void setLanguage(const std::string& language) override;
    bool isAvailable() const override;
    std::string getEngineName() const override;

    /**
     * @brief Set transcription parameters
     * @param threads Number of threads to use (default: 4)
     * @param translate Translate to English (default: false)
     */
    void setParameters(int threads = 4, bool translate = false);

private:
    bool initialize();
    void cleanup();
    std::string processAudioData(const float* audioData, size_t numSamples);

    std::shared_ptr<spdlog::logger> m_logger;
    std::string m_modelPath;
    std::string m_language = "auto";
    bool m_available = false;
    int m_threads = 4;
    bool m_translate = false;

    // whisper.cpp context (opaque pointer to avoid header dependency)
    whisper_context* m_ctx = nullptr;
};

} // namespace aissia
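
`transcribe()` takes mono float samples, while capture pipelines typically deliver 16-bit signed PCM; whisper.cpp expects floats normalized to [-1, 1]. A minimal conversion sketch (assuming the capture path produces `int16_t` samples; the aissia audio layer may already do this):

```cpp
#include <cstdint>
#include <vector>

// Convert 16-bit signed PCM to the normalized float samples that
// whisper.cpp consumes. Divides by 32768 so the int16 range
// [-32768, 32767] maps into [-1.0, ~0.99997].
std::vector<float> pcm16ToFloat(const std::vector<int16_t>& in) {
    std::vector<float> out;
    out.reserve(in.size());
    for (int16_t s : in) {
        out.push_back(static_cast<float>(s) / 32768.0f);
    }
    return out;
}
```

The resulting vector can be passed directly to `WhisperCppEngine::transcribe()`.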