
Claudia Voice: Two-Tier Conversational AI Architecture

Ben Zanghi
February 14, 2026
10 min read

Abstract

Voice assistants should feel like conversations, not command interfaces. Most implementations use a serial pipeline: listen → transcribe → think → speak, with 3-5 second latencies that break conversational flow. Claudia Voice implements a two-tier routing architecture: a local 30B MoE model answers simple queries (about 85% of traffic) in 300-800ms, while complex queries go to frontier models for deeper reasoning.

The Problem

The standard voice assistant pipeline is fundamentally serial:

Microphone → Speech-to-Text → LLM → Text-to-Speech → Speaker

Each stage adds latency, and because the pipeline is serial, the delays compound: the LLM cannot start until transcription finishes, and speech synthesis cannot start until the full response is generated.
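To see where the time goes, here is an illustrative breakdown for a frontier-model pipeline (assumed figures, not measurements from Claudia Voice):

STT (full utterance)        ~300-500ms
Frontier LLM (full reply)   ~2000-3500ms
TTS (full reply)            ~700-1000ms
────────────────────────────────────────
Total                       ~3000-5000ms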

Result: 3-5 seconds from "Hey" to response. Users learn to treat voice assistants as command interfaces, not conversations.

Architecture

Two-Tier Routing

Claudia Voice uses a two-tier routing architecture that classifies incoming queries and routes them to the appropriate LLM:

┌─────────────────────────────────────────────────────────────┐
│                     Voice Query Router                       │
└──────────────┬───────────────────────────────┬──────────────┘
               │                               │
               ▼                               ▼
┌───────────────────────────┐   ┌───────────────────────────┐
│  Fast Path (Local GLM)    │   │  Deep Path (Remote Claude)│
│  GLM-4.7-Flash (30B MoE)  │   │  Claude Sonnet/Opus       │
│  Latency: 300-800ms       │   │  Latency: 2000-5000ms     │
└───────────────────────────┘   └───────────────────────────┘
  Simple queries:                 Complex queries:
  • Greetings                     • Smart home control
  • General knowledge             • Calendar, email, tasks
  • Math, facts                   • Personal data
  • Jokes, trivia                 • Multi-step reasoning

Classification prompt (GLM):

You are Claudia, a voice assistant. Answer directly or say ROUTE_DEEP if it involves:
- Smart home (lights, thermostat, locks)
- Calendar, email, tasks, scheduling
- Weather or real-time data
- Music playback, media control
- Web search, current events
- Actions (calling, messaging, notifications)
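
In practice the protocol looks like this (illustrative exchanges, not production transcripts):

User: "What's 17 times 24?"
GLM:  "That's 408."                ← answered locally on the fast path

User: "Turn off the kitchen lights."
GLM:  ROUTE_DEEP                   ← escalated to Claude with tool access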

Why two tiers?

Latency and capability pull in opposite directions. Most voice queries are simple and latency-sensitive, and a local model answers them in under a second; the minority that touch tools, personal data, or multi-step reasoning justify the frontier model's extra seconds. Routing lets each query pay only for the intelligence it needs.

Audio Pipeline

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Android  │───►│ Whisper  │───►│ Router   │───►│  GLM/    │───►│  TTS     │
│ App/Kiosk│    │ (local)  │    │ Classify │    │  Claude  │    │ Streaming│
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘

Components:

- Android app/kiosk: captures microphone audio and plays back the streamed response
- Whisper (local): speech-to-text transcription
- Router: FastAPI service that classifies each query and dispatches it
- GLM-4.7-Flash (via Ollama) or Claude: fast-path and deep-path models
- Streaming TTS: synthesizes and returns audio sentence by sentence

Smart Home Integration

Claudia Voice controls smart home devices through three integrations:

System            Protocol        Control
Lutron Caseta     Telnet          Dimmers, switches throughout house
Philips Hue       HTTP API        Color bulbs, lightstrips, scenes
Nest Thermostats  Google SDM API  Temperature, HVAC mode

Smart home queries route to Claude with tool access.
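
Below is a minimal sketch of what that tool surface might look like, using the Anthropic Messages API tool format; the tool name, schema, and model ID are illustrative, not the production definitions.

import anthropic

client = anthropic.Anthropic()

# Hypothetical tool definition for light control (illustrative schema).
LIGHT_TOOL = {
    "name": "set_lights",
    "description": "Turn lights on or off, or set brightness, in a named room.",
    "input_schema": {
        "type": "object",
        "properties": {
            "room": {"type": "string"},
            "state": {"type": "string", "enum": ["on", "off"]},
            "brightness": {"type": "integer", "minimum": 0, "maximum": 100},
        },
        "required": ["room", "state"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    tools=[LIGHT_TOOL],
    messages=[{"role": "user", "content": "Turn off the kitchen lights."}],
)
# Tool-use blocks in the response get dispatched to Lutron/Hue/Nest.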

Semantic Memory Integration

Claudia Voice integrates with Engram for persistent semantic memory, so deep-path responses can recall earlier conversations.
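
Engram's API isn't detailed here; the sketch below assumes a typical retrieve-then-inject pattern, with a hypothetical EngramClient and method names.

# Hypothetical Engram client; real method names may differ.
from engram import EngramClient

memory = EngramClient()

def deep_path(query: str) -> str:
    # Retrieve semantically similar memories to ground the response.
    memories = memory.search(query, top_k=5)
    context = "\n".join(m.text for m in memories)

    response = clawdbot.chat(query, context=context)

    # Persist the exchange so future queries can recall it.
    memory.store(f"User asked: {query}\nClaudia replied: {response}")
    return response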

Implementation

Router Implementation

The router is a Python FastAPI service:

SYSTEM_PROMPT = """You are Claudia, a voice assistant.

ANSWER DIRECTLY (warm, brief, conversational):
- Greetings, general knowledge, facts
- Simple math, jokes, creative responses
- Natural spoken language, no markdown

ROUTE_DEEP if:
- Smart home control
- Calendar, email, tasks
- Real-time data
- Personal info, contacts, files
- Actions (calling, messaging, notifications)
"""

import ollama

def classify_and_respond(query: str) -> dict:
    # One call does double duty: GLM either answers the query directly
    # or emits the ROUTE_DEEP sentinel defined in the system prompt.
    response = ollama.chat(
        model="glm-opus-distill",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        options={"temperature": 0.65},
    )
    content = response["message"]["content"]

    if "ROUTE_DEEP" in content:
        # Escalate to the frontier model (tool and memory access live there).
        deep_response = clawdbot.chat(query, context=...)
        return {"path": "deep", "response": deep_response}

    return {"path": "local", "response": content}
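
Wired into FastAPI, the endpoint is a thin wrapper around this function (the route name here is assumed for illustration):

from fastapi import FastAPI

app = FastAPI()

@app.post("/voice/query")
def voice_query(query: str) -> dict:
    # Delegate to the router; the returned dict records which path ran.
    return classify_and_respond(query)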

Key optimizations:

- Classification and response are a single call: GLM either answers outright or emits ROUTE_DEEP, so fast-path queries pay for exactly one model invocation.
- The sentinel check is a plain substring match, adding no parsing overhead to the critical path.

TTS Streaming

The /voice/text/stream endpoint provides sentence-level TTS streaming:

from fastapi.responses import StreamingResponse

@app.post("/voice/text/stream")
async def stream_voice_response(query: str) -> StreamingResponse:
    # Generate the full text first (fast or deep path), then stream audio.
    generated_text = classify_and_respond(query)["response"]

    async def generate():
        # Yield audio one sentence at a time so playback can begin
        # before the rest of the response is synthesized.
        async for sentence_audio in tts_stream(generated_text):
            yield sentence_audio  # each sentence as a WAV chunk

    return StreamingResponse(generate(), media_type="audio/wav")
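
tts_stream itself can be a sentence splitter feeding a synthesizer. A sketch, assuming a blocking synthesize(sentence) -> bytes helper (hypothetical):

import re
import asyncio
from typing import AsyncIterator

async def tts_stream(text: str) -> AsyncIterator[bytes]:
    # Naive sentence split; production code would handle abbreviations.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        if not sentence:
            continue
        # Run the blocking synthesizer off the event loop.
        audio = await asyncio.to_thread(synthesize, sentence)  # hypothetical helper
        yield audio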

Why sentence-level streaming?

Blocking on the full response means the user hears nothing until every sentence has been generated and synthesized. Streaming lets the first sentence start playing at ~700ms while the rest is still being produced, which is the difference between a conversation and a loading spinner.

Production Metrics

Latency Breakdown (Fast Path)

Component                        Latency
Whisper transcription            200-400ms
GLM classification + response    300-500ms
TTS synthesis (first sentence)   150-300ms
Network roundtrips               50-100ms
Total (first response)           700-1300ms

Routing Distribution

Based on production logs (Feb 2026):

Path           Percentage    Avg Latency
Local (GLM)    ~85%          700-1300ms
Deep (Claude)  ~15%          3000-6000ms

Finding: 85% of voice queries are simple enough for local handling.

Smart Home Device Coverage

Device            Count    Control
Lutron dimmers    15       All rooms
Hue bulbs/strips  8        Family room, TV room
Nest thermostats  2        Upstairs, downstairs

Lessons Learned

Local Models Are Underrated

GLM-4.7-Flash (30B MoE) handles 85% of voice queries at sub-second latency. Frontier models aren't necessary for simple greetings, facts, and casual conversation.

Classification Is the Critical Path

Routing accuracy determines UX. Misclassifying a simple query as "deep" adds 2-3 seconds of latency. The classification prompt needs regular tuning based on production logs.
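
One way to keep the prompt honest is a small regression suite replayed against the classifier after every prompt change; the cases below are illustrative.

# Illustrative regression cases: (query, expected path).
CASES = [
    ("Good morning!", "local"),
    ("What's the capital of France?", "local"),
    ("Dim the family room lights to 30%", "deep"),
    ("What's on my calendar tomorrow?", "deep"),
]

def test_routing():
    failures = []
    for query, expected in CASES:
        result = classify_and_respond(query)
        if result["path"] != expected:
            failures.append((query, expected, result["path"]))
    assert not failures, f"Misrouted queries: {failures}"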

TTS Streaming Is Non-Negotiable

Blocking TTS (generate full response → synthesize → play) feels broken. Sentence-level streaming with first-sentence playback at ~700ms transforms the experience from "command interface" to "conversation."

Voice Activity Detection Is Hard

Android's SpeechRecognizer has built-in endpoint detection tuned for dictation, not conversation: it waits for 1-2 seconds of silence before returning, while users expect turns to end within 200-400ms.

Solution (planned): WebSocket upgrade with server-side VAD (Silero) for proper turn-taking.
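
Silero VAD's published torch.hub interface makes the offline version of this straightforward; the streaming server-side variant builds on the same model. The file name and threshold below are illustrative.

import torch

# Load Silero VAD and its helper utilities via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("utterance.wav", sampling_rate=16000)

# Tune the silence hangover toward conversational turn-taking
# (~300ms ends a turn, vs. SpeechRecognizer's 1-2 seconds).
speech = get_speech_timestamps(
    wav,
    model,
    sampling_rate=16000,
    min_silence_duration_ms=300,
)
print(speech)  # [{'start': ..., 'end': ...}, ...] offsets in samples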

Open Questions

Optimal Routing Thresholds

Current routing is binary (local vs. deep). Would a third tier help? A medium model (~14B) might handle moderately complex queries better than the local GLM while still responding faster than Claude.

Context Window Optimization

The router maintains 3-turn conversation context locally. Does this help or hurt? Would fresh context per query be better?

Prosody-Aware Responses

Hume EVI demonstrates emotion-responsive speech. Should Claudia vary tone based on query content?

Conclusion

Claudia Voice demonstrates that production conversational assistants don't require frontier models. A two-tier architecture routing 85% of queries to a local 30B model achieves conversational latency while preserving depth for complex queries. Smart home integration, semantic memory, and sentence-level TTS streaming complete the experience.

The architecture is intentionally modular—each layer can be upgraded independently. The WebSocket audio upgrade planned for Q2 2026 will bring sub-500ms latency, matching Gemini Live and GPT-4o Realtime.
