Claudia Voice: Two-Tier Conversational AI Architecture

Abstract
Voice assistants should feel like conversations, not command interfaces. Most implementations use a serial pipeline (listen → transcribe → think → speak) with 3-5 second latencies that break conversational flow. Claudia Voice uses a two-tier routing architecture: roughly 85% of queries are handled by a local 30B MoE model at 300-800ms, while complex queries route to frontier models for deeper reasoning.
The Problem
The standard voice assistant pipeline is fundamentally serial:
Microphone → Speech-to-Text → LLM → Text-to-Speech → Speaker
Each step adds latency:
- Speech recognition: 500-1500ms
- LLM generation: 1000-3000ms (frontier models)
- TTS synthesis: 300-800ms
- Network roundtrips: 200-500ms per request
Result: 3-5 seconds from "Hey" to response. Users learn to treat voice assistants as command interfaces, not conversations.
Architecture
Two-Tier Routing
Claudia Voice uses a two-tier routing architecture that classifies incoming queries and routes them to the appropriate LLM:
              ┌─────────────────────────────────┐
              │       Voice Query Router        │
              └────────────────┬────────────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
┌───────────────────────────┐     ┌────────────────────────────┐
│  Fast Path (Local GLM)    │     │  Deep Path (Remote Claude) │
│  GLM-4.7-Flash (30B MoE)  │     │  Claude Sonnet/Opus        │
│  Latency: 300-800ms       │     │  Latency: 2000-5000ms      │
└───────────────────────────┘     └────────────────────────────┘
Simple queries:                   Complex queries:
• Greetings                       • Smart home control
• General knowledge               • Calendar, email, tasks
• Math, facts                     • Personal data
• Jokes, trivia                   • Multi-step reasoning
Classification prompt (GLM):
You are Claudia, a voice assistant. Answer directly or say ROUTE_DEEP if it involves:
- Smart home (lights, thermostat, locks)
- Calendar, email, tasks, scheduling
- Weather or real-time data
- Music playback, media control
- Web search, current events
- Actions (calling, messaging, notifications)
Why two tiers?
- Fast path: The large majority of voice queries (~85% in production) are simple: greetings, facts, casual questions. Answering these locally with GLM-4.7-Flash yields 300-800ms responses, fast enough to feel conversational.
- Deep path: Complex queries involving smart home control, personal data, or multi-step reasoning need frontier-model quality. Routing them to Claude preserves that depth.
Audio Pipeline
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Android │───►│ Whisper │───►│ Router │───►│ GLM/ │───►│ TTS │
│ App/Kiosk│ │ (local) │ │ Classify │ │ Claude │ │ Streaming│
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
Components:
- Android app: Primary voice interface, uses SpeechRecognizer for transcription
- Web kiosk: Browser-based voice input (Web Speech API), kitchen tablet use case
- Omi pendant: BLE audio streaming, offline Whisper transcription
- Whisper (local): Runs on Mac Studio M4 (100% GPU); a transcription sketch follows this list
- Router: Classifies and routes queries
- TTS: Kokoro (free, local MLX) + ElevenLabs (paid), sentence-level streaming for interruptibility
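As a rough illustration of the local transcription step, here is a minimal sketch using the open-source whisper package; the model size and audio path are assumptions, not the production configuration:

import whisper

# Minimal local transcription sketch. Model size and file path are illustrative.
model = whisper.load_model("base.en")        # loaded once at service startup
result = model.transcribe("utterance.wav")   # blocking call; returns a dict with "text"
query_text = result["text"].strip()          # forwarded to the router as the user query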
Smart Home Integration
Claudia Voice controls smart home devices through three integrations:
| System | Protocol | Control |
|---|---|---|
| Lutron Caseta | Telnet | Dimmers, switches throughout house |
| Philips Hue | HTTP API | Color bulbs, lightstrips, scenes |
| Nest Thermostats | Google SDM API | Temperature, HVAC mode |
Smart home queries route to Claude with tool access:
- "Turn on the living room lights to 50%" → Claude invokes Lutron API
- "Set the mood for movie night" → Claude activates Hue scene
Semantic Memory Integration
Claudia Voice integrates with Engram for persistent semantic memory:
- Auto-capture: After each turn, GLM extracts 0-5 facts worth remembering
- Auto-recall: Before responding, semantic search retrieves relevant memories
- Context continuity: The assistant references past conversations naturally
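A sketch of how that loop might wrap a single voice turn; the /recall and /capture endpoints and the extract_facts helper are illustrative names, not Engram's documented API:

import requests

ENGRAM_URL = "http://localhost:8700"   # placeholder address for the memory service

def handle_turn(query: str) -> str:
    # Auto-recall: semantic search for memories related to this query.
    memories = requests.post(f"{ENGRAM_URL}/recall",
                             json={"query": query, "top_k": 5}, timeout=5).json()
    # Memories are injected into the system prompt before classification (not shown).
    result = classify_and_respond(query)
    # Auto-capture: GLM extracts 0-5 facts worth remembering (extract_facts is a stand-in).
    facts = extract_facts(query, result["response"])
    if facts:
        requests.post(f"{ENGRAM_URL}/capture", json={"facts": facts}, timeout=5)
    return result["response"]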
Implementation
Router Implementation
The router is a Python FastAPI service:
SYSTEM_PROMPT = """You are Claudia, a voice assistant.
ANSWER DIRECTLY (warm, brief, conversational):
- Greetings, general knowledge, facts
- Simple math, jokes, creative responses
- Natural spoken language, no markdown
ROUTE_DEEP if:
- Smart home control
- Calendar, email, tasks
- Real-time data
- Personal info, contacts, files
- Actions (calling, messaging, notifications)
"""
import ollama  # local client for the Ollama-served GLM model

def classify_and_respond(query: str) -> dict:
    """Single GLM call: either answers directly or emits ROUTE_DEEP."""
    response = ollama.chat(
        model="glm-opus-distill",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        options={"temperature": 0.65},
    )
    content = response["message"]["content"]
    if "ROUTE_DEEP" in content:
        # Escalate to the deep path (Claude) with conversation context.
        deep_response = clawdbot.chat(query, context=...)
        return {"path": "deep", "response": deep_response}
    return {"path": "local", "response": content}
Key optimizations:
- Single GLM call: Classification + response in one shot, not two sequential calls
- Temperature 0.65: Balances creativity and consistency
- Timeout 15s: Headroom for GLM cold start
- Graceful degradation: If Ollama is down, everything routes deep
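A minimal sketch of that fallback, reusing the deep-path call from the router above (the broad except is deliberate: any local failure should degrade, not fail the request):

def classify_and_respond_safe(query: str) -> dict:
    """Wrap local classification so an Ollama outage routes everything deep."""
    try:
        return classify_and_respond(query)
    except Exception:
        # Local tier unreachable or timed out: answer via Claude instead.
        deep_response = clawdbot.chat(query, context=...)
        return {"path": "deep-fallback", "response": deep_response}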
TTS Streaming
The /voice/text/stream endpoint provides sentence-level TTS streaming:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/voice/text/stream")
async def stream_voice_response(query: str) -> StreamingResponse:
    # Answer first (fast or deep path), then synthesize sentence by sentence.
    text = classify_and_respond(query)["response"]

    async def generate():
        async for sentence_wav in tts_stream(text):  # each sentence as a WAV chunk
            yield sentence_wav

    return StreamingResponse(generate(), media_type="audio/wav")
Why sentence-level streaming?
- Interruptibility: User can start speaking mid-response, client stops TTS
- Perceived latency: First sentence plays at ~500ms, not after full generation (3-5s)
- Natural pacing: Human conversation has pauses between thoughts
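For reference, a sentence-level tts_stream helper could look roughly like this; the regex splitter and the synthesize_wav call (standing in for Kokoro or ElevenLabs synthesis) are illustrative, not the actual implementation:

import re
from typing import AsyncIterator

async def tts_stream(text: str) -> AsyncIterator[bytes]:
    """Yield one WAV-encoded chunk per sentence so playback can start immediately."""
    # Naive sentence-boundary split; the real pipeline may use smarter segmentation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield await synthesize_wav(sentence)  # stand-in for the Kokoro/ElevenLabs call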
Production Metrics
Latency Breakdown (Fast Path)
| Component | Latency |
|---|---|
| Whisper transcription | 200-400ms |
| GLM classification + response | 300-500ms |
| TTS synthesis (first sentence) | 150-300ms |
| Network roundtrips | 50-100ms |
| Total (first response) | 700-1300ms |
Routing Distribution
Based on production logs (Feb 2026):
| Path | Percentage | Avg Latency |
|---|---|---|
| Local (GLM) | ~85% | 700-1300ms |
| Deep (Claude) | ~15% | 3000-6000ms |
Finding: 85% of voice queries are simple enough for local handling.
Smart Home Device Coverage
| Device | Count | Control |
|---|---|---|
| Lutron dimmers | 15 | All rooms |
| Hue bulbs/strips | 8 | Family room, TV room |
| Nest thermostats | 2 | Upstairs, downstairs |
Lessons Learned
Local Models Are Underrated
GLM-4.7-Flash (30B MoE) handles 85% of voice queries at sub-second latency. Frontier models aren't necessary for simple greetings, facts, and casual conversation.
Classification Is the Critical Path
Routing accuracy determines UX. Misclassifying a simple query as "deep" adds 2-3 seconds of latency. The classification prompt needs regular tuning based on production logs.
TTS Streaming Is Non-Negotiable
Blocking TTS (generate full response → synthesize → play) feels broken. Sentence-level streaming with first-sentence playback at ~700ms transforms the experience from "command interface" to "conversation."
Voice Activity Detection Is Hard
Android's SpeechRecognizer has built-in endpoint detection tuned for dictation, not conversation. It waits 1-2 seconds of silence before returning. Users expect 200-400ms.
Solution (planned): WebSocket upgrade with server-side VAD (Silero) for proper turn-taking.
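As a sketch of that direction, Silero VAD can be loaded from torch.hub and used to score short audio frames; the 512-sample frame size (about 32ms at 16kHz) and the 0.5 threshold below are illustrative defaults:

import torch

# Load Silero VAD once at startup (downloads the model on first use).
vad_model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

def is_speech(frame_16k: torch.Tensor, threshold: float = 0.5) -> bool:
    """frame_16k: mono float32 tensor of 512 samples at 16 kHz."""
    return vad_model(frame_16k, 16000).item() > threshold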
Open Questions
Optimal Routing Thresholds
Current routing is binary (local vs deep). Would a third tier help? A medium model (~14B) might handle moderately complex queries better than the local GLM while staying faster than Claude.
Context Window Optimization
The router maintains 3-turn conversation context locally. Does this help or hurt? Would fresh context per query be better?
Prosody-Aware Responses
Hume EVI demonstrates emotion-responsive speech. Should Claudia vary tone based on query content?
Conclusion
Claudia Voice demonstrates that production conversational assistants don't require frontier models. A two-tier architecture routing 85% of queries to a local 30B model achieves conversational latency while preserving depth for complex queries. Smart home integration, semantic memory, and sentence-level TTS streaming complete the experience.
The architecture is intentionally modular—each layer can be upgraded independently. The WebSocket audio upgrade planned for Q2 2026 will bring sub-500ms latency, matching Gemini Live and GPT-4o Realtime.