Layer 2: ASR / LLM / TTS
The speech-recognition, language-model, and speech-synthesis backends, all swappable.
Each stage of the voice pipeline (STT → LLM → TTS) is swappable via environment variables.
Source: INVENTORY.md §1.4 (the AEGIS_ settings in config.py).
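As a rough illustration of environment-driven selection, the sketch below reads the AEGIS_ variables named in this section and falls back to the defaults it lists. The `VoiceSettings` class, its field names, and `load_settings` are hypothetical; the real config.py may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container; field names are assumptions, not the real config.py."""
    stt_backend: str = "stub"      # documented default for STT
    llm_mode: str = "stub"         # documented default for the LLM
    tts_backend: str = "edge_tts"  # documented default for TTS
    stt_url: Optional[str] = None  # only relevant to HTTP-based STT
    llm_url: Optional[str] = None  # only relevant to openai_compat


def load_settings(env: Optional[Mapping[str, str]] = None) -> VoiceSettings:
    """Build settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    return VoiceSettings(
        stt_backend=env.get("AEGIS_STT_BACKEND", "stub"),
        llm_mode=env.get("AEGIS_LLM_MODE", "stub"),
        tts_backend=env.get("AEGIS_TTS_BACKEND", "edge_tts"),
        stt_url=env.get("AEGIS_STT_URL"),
        llm_url=env.get("AEGIS_LLM_URL"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_settings({"AEGIS_TTS_BACKEND": "piper"})`.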
STT (speech recognition) backends
stub (default): AEGIS_STT_BACKEND=stub
whisper: AEGIS_STT_BACKEND=whisper
viibevoice (HTTP): AEGIS_STT_URL=...
elevenlabs: scribe_v2

LLM (language model) modes
stub (fixed responses, default): AEGIS_LLM_MODE=stub
openai_compat (ollama / vLLM / LM Studio): AEGIS_LLM_URL=http://localhost:11434/v1, model=gemma2:9b
OpenAI Realtime (Route C): gpt-realtime (direct audio)

TTS (speech synthesis) backends
edge_tts (default, +28% speed): AEGIS_TTS_BACKEND=edge_tts
kokoro: AEGIS_TTS_BACKEND=kokoro
piper: AEGIS_TTS_BACKEND=piper
elevenlabs (needs VOICE_ID): eleven_multilingual_v2

The design intent
Because each stage can be switched via voice_backend / stt_backend / tts_backend, we avoid vendor lock-in while
gradually moving toward production quality. This switchpoint also anticipates Phase 3's move to local voice.
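One way to picture this switchpoint is a name-to-factory registry per stage, so that changing a setting changes which class gets built without touching the pipeline code. This is a minimal sketch under assumptions: the Protocol names, the stub classes (including a stub TTS, which this section does not list), and the `build` and `run_turn` helpers are all hypothetical and not taken from the project.

```python
from typing import Callable, Dict, Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        return "hello"  # fixed transcript, ignores the audio


class StubLLM:
    def reply(self, prompt: str) -> str:
        return f"echo: {prompt}"  # fixed-shape response


class StubTTS:  # hypothetical: stands in for edge_tts / kokoro / piper
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


# One registry per stage; real backends would be registered under their setting names.
STT_BACKENDS: Dict[str, Callable[[], STT]] = {"stub": StubSTT}
LLM_MODES: Dict[str, Callable[[], LLM]] = {"stub": StubLLM}
TTS_BACKENDS: Dict[str, Callable[[], TTS]] = {"stub": StubTTS}


def build(registry: Dict[str, Callable[[], object]], name: str):
    """Instantiate the backend registered under `name`, or fail with a clear error."""
    try:
        return registry[name]()
    except KeyError:
        raise ValueError(f"unknown backend {name!r}; known: {sorted(registry)}") from None


def run_turn(audio: bytes, stt: STT, llm: LLM, tts: TTS) -> bytes:
    """Run one STT -> LLM -> TTS turn using whichever backends were selected."""
    return tts.synthesize(llm.reply(stt.transcribe(audio)))
```

Because `run_turn` depends only on the three protocols, swapping a hosted backend for a local one would be a registry entry plus a setting change rather than a pipeline rewrite.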