PersonaPlex-MLX: From Neural Codec to Full Voice AI Stack

Abstract
The codec port was supposed to take a week. Three weeks later: Mimi running at 2.7ms on Apple Silicon, a streaming ASR→LLM→TTS pipeline at under 1.3s total latency, a persona system, a WebSocket API, and a browser UI. Here's how it happened.
Most on-device AI focuses on text. Speech — real speech, with emotion, timing, and nuance — is harder. Large audio models require fast convolutions, resampling layers, and high-throughput streaming. They weren't designed for laptops.
Mimi is a neural audio codec developed by Kyutai that compresses raw audio into semantic tokens at 12.5 tokens/second. It's the audio backbone behind Moshi, Kyutai's real-time conversational AI. Getting Mimi running natively on Apple Silicon via MLX means fully local, real-time speech AI — no cloud, no round-trip latency, no data leaving the device.
This is the engineering story of that port — and everything that came after it.
Why Mimi?
Neural audio codecs are a recent innovation. Traditional codecs (MP3, Opus) compress audio with psychoacoustic tricks. Neural codecs use learned representations — encoder networks that compress audio into discrete tokens, decoder networks that reconstruct audio from tokens.
The advantage: semantic tokens. Unlike raw waveform samples, codec tokens encode meaning. A language model operating on codec tokens can understand and generate speech the same way it understands and generates text. This is the architecture behind GPT-4o's voice mode, Gemini Live, and Moshi.
Mimi specifically:
- Encodes at 12.5 tokens/second (vs 75 tokens/second for older codecs like EnCodec)
- Uses causal (streaming-capable) convolutions — designed for real-time use
- Combines semantic and acoustic tokens — richer representation than audio-only codecs
- Apache 2.0 licensed — fully open for research and commercial use
The goal: Run the full encode→decode pipeline on Apple Silicon with latency low enough for real-time use.
The MLX Constraint
PyTorch is the reference implementation for essentially all AI research. Mimi's weights are PyTorch checkpoints. Running them on Apple Silicon in PyTorch works — but it uses CPU fallbacks for unsupported ops, misses Metal GPU acceleration, and drains battery.
MLX is Apple's machine learning framework for Apple Silicon. It's NumPy-compatible, uses Metal natively, and is optimized for the M-series unified memory architecture. The tradeoff: it's newer, has a smaller op surface, and some PyTorch patterns don't translate directly.
The porting challenge: every op in Mimi's architecture needs an MLX equivalent that matches numerically and runs efficiently.
Architecture
Mimi has three major components:
1. Residual Vector Quantizer (RVQ)
The quantizer compresses encoder output into discrete tokens using learned codebooks. Eight codebooks run in series — each quantizes the residual error from the previous one.
MLX mapping: Straightforward. Codebook look-ups are embedding tables (nn.Embedding), which MLX handles natively.
2. Encoder (Waveform → Latents)
The encoder takes raw 24kHz audio and compresses it through strided convolutional blocks. Each block halves temporal resolution — 24kHz → 12kHz → ... → 75Hz — until the output runs at 75 Hz (12.5 tokens/second).
Key components:
- Dilated convolutions: Dilation rates [1, 3, 9] capture multi-scale temporal context without increasing parameter count
- Weight normalization: Conv weights parameterized as
weight = g × (v / ||v||) - Residual connections: Standard skip connections for gradient flow
MLX challenge: nn.Conv1d doesn't support dilation. Solution: custom DilatedConv1d that manually inserts zeros between filter taps.
class DilatedConv1d(nn.Module):
def __call__(self, x):
if self.dilation == 1:
return self.conv(x)
k = self.conv.weight
dilated_size = (self.kernel_size - 1) * self.dilation + 1
dilated_k = mx.zeros([k.shape[0], k.shape[1], dilated_size])
for i in range(self.kernel_size):
dilated_k[:, :, i * self.dilation] = k[:, :, i]
return mx.conv1d(x, dilated_k)
Benchmark on M4 Max: 2.15ms vs 18.7ms for a Python loop — an 8.7× speedup.
3. Decoder (Latents → Waveform)
The decoder mirrors the encoder with transposed convolutions upsampling quantized latents back to 24kHz. Each block doubles temporal resolution: 75Hz → ... → 24kHz.
MLX challenge: PyTorch's F.interpolate doesn't exist in MLX. Custom ConvTrUpsample1d and ConvDownsample1d modules replace it. Validation confirmed numerically equivalent output to the PyTorch reference — the key requirement for correct weight loading.
Weight Loading
Mimi's weights are distributed as a HuggingFace checkpoint (kyutai/mimi). Loading into MLX requires four transformations:
Key mapping: encoder.model.0.weight → encoder.layers.0.weight. Systematic translation of ~200 weight keys.
Shape transposition: PyTorch Conv1d is [out, in, k]; MLX is [k, in, out]. Every conv weight transposes.
LayerNorm initialization: Layers initialized to ones/zeros aren't stored in the checkpoint. MLX handles this via default initialization.
Weight norm materialization: PyTorch stores weight_g and weight_v separately. MLX needs materialized weight = g × (v / ||v||), computed during load.
Codec Performance
After Week 1, the Mimi codec was running significantly faster than expected on M4 Max (Mac Studio):
| Operation | Latency | vs. Target |
|---|---|---|
| Encoder | 0.92ms | 127× faster |
| Decoder | 1.19ms | 160× faster |
| Quantizer | ~1ms | within target |
| Full codec round-trip | 2.7ms | 182× faster than real-time |
Real-time factor: 0.021× — the codec processes audio 47.6× faster than the audio plays. The budget for streaming at 12.5 tokens/second is 80ms per chunk. We're sitting at 2.7ms.
End-to-end validation against the PyTorch reference achieved 3.98 dB SNR — numerically correct. For a GAN-trained neural audio codec, ~4 dB waveform SNR is expected behavior (the codec is perceptually optimized, not waveform-exact).
What We Built Next
The codec worked. That unlocked the next question: what does a full local voice conversation stack actually look like?
Week 2: Pipeline Assembly
With the codec validated, the goal became end-to-end voice conversation with zero cloud dependencies. The stack: faster-whisper for ASR, Ollama (llama3.2:1b) for the language model, Piper TTS for synthesis.
Each component required optimization:
ASR (Day 13): Replaced Whisper CLI (3.5s) with faster-whisper running on-device — latency dropped to 172ms (20× speedup). mlx-whisper benchmarked at 176ms — comparable, but faster-whisper won on compatibility.
TTS (Day 14): macOS say took 1,518ms. Replaced with Piper TTS — warm latency dropped to 69ms (22× speedup). Kokoro-ONNX int8 was benchmarked at 1,772ms and rejected. Piper's quality-to-latency tradeoff was the clear winner.
Real-time input (Day 15): Added push-to-talk loop via realtime_demo.py. Press ENTER to record, ENTER to stop — mic → Whisper → LLM → Piper → speaker. Warm latency: 1,723–2,564ms, best turns under 2 seconds.
LM speedup (Day 16): The language model was the bottleneck at 1,872ms average. Two changes: brevity system prompt + max_tokens=80. LM latency dropped to 381ms (80% reduction). Total pipeline average: ~1,253ms. Also added webrtcvad for hands-free auto-detection.
Week 3: Streaming and API
Streaming LM→TTS (Day 17): Instead of waiting for the full LM response, a split_sentences() generator streams tokens and synthesizes each sentence as it arrives. First-audio latency: ~216–357ms — roughly 140ms faster than sequential for typical responses.
Persona system (Day 18): Three named personas defined in config/personas.yaml — Claudia (witty, conversational), Aria (precise, technical), Sage (reflective, philosophical). Each has distinct system prompts, temperature settings, and voice assignments. Runtime switching via switch_persona() or interactive !persona <name> command.
HTTP API (Day 18): FastAPI server with GET /personas, POST /chat (text→text), POST /speak (text→WAV). Personality differentiation confirmed in testing — Claudia and Aria respond to the same prompt in measurably different styles.
Conversation sessions + SSE (Day 19): POST /session creates stateful UUID sessions with full multi-turn history. POST /chat/stream streams tokens as Server-Sent Events in real time. Backward compatible — stateless /chat unchanged.
WebSocket channel (Day 20): Persistent bidirectional WS /ws/chat endpoint with per-connection session history, real-time token streaming, ping/pong keep-alive, switch_persona and reset commands.
Web UI (Day 21): Self-contained dark-theme chat interface served at /. Features: persona switcher (loaded from API), real-time token streaming with blinking cursor, thinking indicator, latency badge per message, named session sidebar (create/switch/clear), auto-reconnect on disconnect. Also resolved all 10 previously-failing VAD tests by installing silero-vad 6.2.1.
Why It Matters
Running a neural audio codec on-device enables applications cloud systems can't:
Privacy-first voice AI: Conversations never leave the device. No audio uploaded, no transcripts logged. Relevant for clinical environments, legal work, anything where the conversation itself is sensitive.
Offline-capable: Works on planes, in air-gapped networks, anywhere without connectivity.
Ultra-low latency: No cloud round-trips. The codec alone processes audio at 47.6× real-time. End-to-end first-audio latency is under 400ms in streaming mode.
Composable with local LLMs: The same llama3.2:1b running locally can be swapped for any Ollama model. The persona system makes it easy to tune personality and style per use case.
Current Status
| Component | Status | Latency |
|---|---|---|
| Mimi codec in MLX | ✅ Complete | 2.7ms round-trip |
| Weight loading from PyTorch | ✅ Complete | — |
| PyTorch reference validation | ✅ Complete | 3.98 dB SNR |
| faster-whisper ASR | ✅ Complete | 172ms |
| Piper TTS synthesis | ✅ Complete | 69ms warm |
| Real-time push-to-talk | ✅ Complete | <2s warm latency |
| Streaming LM→TTS pipeline | ✅ Complete | 216–357ms first audio |
| Persona system (3 personas) | ✅ Complete | — |
| HTTP API (FastAPI) | ✅ Complete | — |
| WebSocket full-duplex channel | ✅ Complete | — |
| Browser chat UI | ✅ Complete | — |
| Native Moshi LM integration | 🔄 In progress | — |
| Voice cloning | ⬜ Planned | — |
The stack runs entirely on an M4 Max Mac Studio. No API keys. No network calls. No data leaving the machine.