🧠Layer 2: ASR / LLM / TTS

The speech-recognition, language-model, and speech-synthesis backends — all swappable.

Each stage of the voice pipeline (STT → LLM → TTS) is swappable via environment variables. Source: INVENTORY.md §1.4 (the AEGIS_ settings in config.py).

STT (speech recognition) backends

🧪

stub (default)

AEGIS_STT_BACKEND=stub

🗣️

whisper

AEGIS_STT_BACKEND=whisper

🌐

viibevoice (HTTP)

AEGIS_STT_URL=...

☁️

elevenlabs

scribe_v2

LLM (language model) modes

🧪

stub (fixed responses, default)

AEGIS_LLM_MODE=stub

🔁

openai_compat (ollama / vLLM / LM Studio)

AEGIS_LLM_URL=http://localhost:11434/v1, model=gemma2:9b

🤖

OpenAI Realtime (Route C)

gpt-realtime (direct audio)

TTS (speech synthesis) backends

🔊

edge_tts (default, +28% speed)

AEGIS_TTS_BACKEND=edge_tts

🎵

kokoro

AEGIS_TTS_BACKEND=kokoro

🎶

piper

AEGIS_TTS_BACKEND=piper

☁️

elevenlabs (needs VOICE_ID)

eleven_multilingual_v2

ℹ️The design intent

Because each stage can be switched via voice_backend / stt_backend / tts_backend, we avoid vendor lock-in while gradually moving toward production quality — a switchpoint that anticipates Phase 3’s move to local voice.