Claudia Voice: Two-Tier Conversational AI Architecture

Abstract
Voice assistants should feel like conversations, not command interfaces. Most implementations use a serial pipeline (listen → transcribe → think → speak) with 3-5 second latencies that break conversational flow. Claudia Voice uses a two-tier routing architecture: roughly 85% of queries are handled by a local 30B MoE model at 300-800ms, while complex queries route to frontier models for deeper reasoning.
The Problem
The standard voice assistant pipeline is fundamentally serial:
Microphone → Speech-to-Text → LLM → Text-to-Speech → Speaker
Each step adds latency:
- Speech recognition: 500-1500ms
- LLM generation: 1000-3000ms (frontier models)
- TTS synthesis: 300-800ms
- Network roundtrips: 200-500ms per request
Result: 3-5 seconds from "Hey" to response. Users learn to treat voice assistants as command interfaces, not conversations.
Architecture
Two-Tier Routing
Claudia Voice uses a two-tier routing architecture that classifies incoming queries and routes them to the appropriate LLM:
              ┌─────────────────────────────────┐
              │       Voice Query Router        │
              └────────────────┬────────────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
┌───────────────────────────┐     ┌────────────────────────────┐
│  Fast Path (Local GLM)    │     │  Deep Path (Remote Claude) │
│  GLM-4.7-Flash (30B MoE)  │     │  Claude Sonnet/Opus        │
│  Latency: 300-800ms       │     │  Latency: 2000-5000ms      │
└───────────────────────────┘     └────────────────────────────┘
Simple queries:                   Complex queries:
• Greetings                       • Smart home control
• General knowledge               • Calendar, email, tasks
• Math, facts                     • Personal data
• Jokes, trivia                   • Multi-step reasoning
Classification prompt (GLM):
You are Claudia, a voice assistant. Answer directly or say ROUTE_DEEP if it involves:
- Smart home (lights, thermostat, locks)
- Calendar, email, tasks, scheduling
- Weather or real-time data
- Music playback, media control
- Web search, current events
- Actions (calling, messaging, notifications)
Why two tiers?
- Fast path: The large majority of voice queries (~85% in production) are simple: greetings, facts, casual questions. Answering these locally with GLM-4.7-Flash yields 300-800ms responses, fast enough to feel conversational.
- Deep path: Complex queries involving smart home control, personal data, or multi-step reasoning need frontier-model quality. Routing them to Claude preserves that depth.
Audio Pipeline
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Android │───►│ Whisper │───►│ Router │───►│ GLM/ │───►│ TTS │
│ App/Kiosk│ │ (local) │ │ Classify │ │ Claude │ │ Streaming│
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
Components:
- Android app: Primary voice interface, uses SpeechRecognizer for transcription
- Web kiosk: Browser-based voice input (Web Speech API), kitchen tablet use case
- Omi pendant: BLE audio streaming, offline Whisper transcription
- Whisper (local): Runs on Mac Studio M4 (100% GPU); a transcription sketch follows this list
- Router: Classifies and routes queries
- TTS: Kokoro (free, local MLX) + ElevenLabs (paid), sentence-level streaming for interruptibility
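As a rough illustration of the local transcription step, here is a minimal sketch using the open-source whisper package; the model size and audio path are assumptions, not the production configuration:

import whisper

# Minimal local transcription sketch. Model size and file path are illustrative.
model = whisper.load_model("base.en")        # loaded once at service startup
result = model.transcribe("utterance.wav")   # blocking call; returns a dict with "text"
query_text = result["text"].strip()          # forwarded to the router as the user query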
Smart Home Integration
Claudia Voice controls smart home devices through three integrations:
| System | Protocol | Control |
|---|---|---|
| Lutron Caseta | Telnet | Dimmers, switches throughout house |
| Philips Hue | HTTP API | Color bulbs, lightstrips, scenes |
| Nest Thermostats | Google SDM API | Temperature, HVAC mode |
Smart home queries route to Claude with tool access:
- "Turn on the living room lights to 50%" → Claude invokes Lutron API
- "Set the mood for movie night" → Claude activates Hue scene
Semantic Memory Integration
Claudia Voice integrates with Engram for persistent semantic memory:
- Auto-capture: After each turn, GLM extracts 0-5 facts worth remembering
- Auto-recall: Before responding, semantic search retrieves relevant memories
- Context continuity: The assistant references past conversations naturally
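A sketch of how that loop might wrap a single voice turn; the /recall and /capture endpoints and the extract_facts helper are illustrative names, not Engram's documented API:

import requests

ENGRAM_URL = "http://localhost:8700"   # placeholder address for the memory service

def handle_turn(query: str) -> str:
    # Auto-recall: semantic search for memories related to this query.
    memories = requests.post(f"{ENGRAM_URL}/recall",
                             json={"query": query, "top_k": 5}, timeout=5).json()
    # Memories are injected into the system prompt before classification (not shown).
    result = classify_and_respond(query)
    # Auto-capture: GLM extracts 0-5 facts worth remembering (extract_facts is a stand-in).
    facts = extract_facts(query, result["response"])
    if facts:
        requests.post(f"{ENGRAM_URL}/capture", json={"facts": facts}, timeout=5)
    return result["response"]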
Implementation
Router Implementation
The router is a Python FastAPI service:
SYSTEM_PROMPT = """You are Claudia, a voice assistant.
ANSWER DIRECTLY (warm, brief, conversational):
- Greetings, general knowledge, facts
- Simple math, jokes, creative responses
- Natural spoken language, no markdown
ROUTE_DEEP if:
- Smart home control
- Calendar, email, tasks
- Real-time data
- Personal info, contacts, files
- Actions (calling, messaging, notifications)
"""
import ollama  # local client for the Ollama-served GLM model

def classify_and_respond(query: str) -> dict:
    """Single GLM call: either answers directly or emits ROUTE_DEEP."""
    response = ollama.chat(
        model="glm-opus-distill",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        options={"temperature": 0.65},
    )
    content = response["message"]["content"]
    if "ROUTE_DEEP" in content:
        # Escalate to the deep path (Claude) with conversation context.
        deep_response = clawdbot.chat(query, context=...)
        return {"path": "deep", "response": deep_response}
    return {"path": "local", "response": content}
Key optimizations:
- Single GLM call: Classification + response in one shot, not two sequential calls
- Temperature 0.65: Balances creativity and consistency
- Timeout 15s: Headroom for GLM cold start
- Graceful degradation: If Ollama is down, everything routes deep
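A minimal sketch of that fallback, reusing the deep-path call from the router above (the broad except is deliberate: any local failure should degrade, not fail the request):

def classify_and_respond_safe(query: str) -> dict:
    """Wrap local classification so an Ollama outage routes everything deep."""
    try:
        return classify_and_respond(query)
    except Exception:
        # Local tier unreachable or timed out: answer via Claude instead.
        deep_response = clawdbot.chat(query, context=...)
        return {"path": "deep-fallback", "response": deep_response}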
TTS Streaming
The /voice/text/stream endpoint provides sentence-level TTS streaming:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/voice/text/stream")
async def stream_voice_response(query: str) -> StreamingResponse:
    # Answer first (fast or deep path), then synthesize sentence by sentence.
    text = classify_and_respond(query)["response"]

    async def generate():
        async for sentence_wav in tts_stream(text):  # each sentence as a WAV chunk
            yield sentence_wav

    return StreamingResponse(generate(), media_type="audio/wav")
Why sentence-level streaming?
- Interruptibility: User can start speaking mid-response, client stops TTS
- Perceived latency: First sentence plays at ~500ms, not after full generation (3-5s)
- Natural pacing: Human conversation has pauses between thoughts
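For reference, a sentence-level tts_stream helper could look roughly like this; the regex splitter and the synthesize_wav call (standing in for Kokoro or ElevenLabs synthesis) are illustrative, not the actual implementation:

import re
from typing import AsyncIterator

async def tts_stream(text: str) -> AsyncIterator[bytes]:
    """Yield one WAV-encoded chunk per sentence so playback can start immediately."""
    # Naive sentence-boundary split; the real pipeline may use smarter segmentation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield await synthesize_wav(sentence)  # stand-in for the Kokoro/ElevenLabs call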
Production Metrics
Latency Breakdown (Fast Path)
| Component | Latency |
|---|---|
| Whisper transcription | 200-400ms |
| GLM classification + response | 300-500ms |
| TTS synthesis (first sentence) | 150-300ms |
| Network roundtrips | 50-100ms |
| Total (first response) | 700-1300ms |
Routing Distribution
Based on production logs (Feb 2026):
| Path | Percentage | Avg Latency |
|---|---|---|
| Local (GLM) | ~85% | 700-1300ms |
| Deep (Claude) | ~15% | 3000-6000ms |
Finding: 85% of voice queries are simple enough for local handling.
Smart Home Device Coverage
| Device | Count | Control |
|---|---|---|
| Lutron dimmers | 15 | All rooms |
| Hue bulbs/strips | 8 | Family room, TV room |
| Nest thermostats | 2 | Upstairs, downstairs |
Lessons Learned
Local Models Are Underrated
GLM-4.7-Flash (30B MoE) handles 85% of voice queries at sub-second latency. Frontier models aren't necessary for simple greetings, facts, and casual conversation.
Classification Is the Critical Path
Routing accuracy determines UX. Misclassifying a simple query as "deep" adds 2-3 seconds of latency. The classification prompt needs regular tuning based on production logs.
TTS Streaming Is Non-Negotiable
Blocking TTS (generate full response → synthesize → play) feels broken. Sentence-level streaming with first-sentence playback at ~700ms transforms the experience from "command interface" to "conversation."
Voice Activity Detection Is Hard
Android's SpeechRecognizer has built-in endpoint detection tuned for dictation, not conversation. It waits 1-2 seconds of silence before returning. Users expect 200-400ms.
Solution (planned): WebSocket upgrade with server-side VAD (Silero) for proper turn-taking.
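As a sketch of that direction, Silero VAD can be loaded from torch.hub and used to score short audio frames; the 512-sample frame size (about 32ms at 16kHz) and the 0.5 threshold below are illustrative defaults:

import torch

# Load Silero VAD once at startup (downloads the model on first use).
vad_model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

def is_speech(frame_16k: torch.Tensor, threshold: float = 0.5) -> bool:
    """frame_16k: mono float32 tensor of 512 samples at 16 kHz."""
    return vad_model(frame_16k, 16000).item() > threshold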
Open Questions
Optimal Routing Thresholds
Current routing is binary (local vs deep). Would a third tier help? A medium model (~14B) might handle moderately complex queries better than the local GLM while staying faster than Claude.
Context Window Optimization
The router maintains 3-turn conversation context locally. Does this help or hurt? Would fresh context per query be better?
Prosody-Aware Responses
Hume EVI demonstrates emotion-responsive speech. Should Claudia vary tone based on query content?
Conclusion
Claudia Voice demonstrates that production conversational assistants don't require frontier models. A two-tier architecture routing 85% of queries to a local 30B model achieves conversational latency while preserving depth for complex queries. Smart home integration, semantic memory, and sentence-level TTS streaming complete the experience.
The architecture is intentionally modular—each layer can be upgraded independently. The WebSocket audio upgrade planned for Q2 2026 will bring sub-500ms latency, matching Gemini Live and GPT-4o Realtime.