AI Research

PersonaPlex-MLX: From Neural Codec to Full Voice AI Stack

Ben Zanghi
March 2026
14 min read

Abstract

The codec port was supposed to take a week. Three weeks later: Mimi running at 2.7ms on Apple Silicon, a streaming ASR→LLM→TTS pipeline at under 1.3s total latency, a persona system, a WebSocket API, and a browser UI. Here's how it happened.

Most on-device AI focuses on text. Speech — real speech, with emotion, timing, and nuance — is harder. Large audio models require fast convolutions, resampling layers, and high-throughput streaming. They weren't designed for laptops.

Mimi is a neural audio codec developed by Kyutai that compresses raw audio into semantic tokens at 12.5 tokens/second. It's the audio backbone behind Moshi, Kyutai's real-time conversational AI. Getting Mimi running natively on Apple Silicon via MLX means fully local, real-time speech AI — no cloud, no round-trip latency, no data leaving the device.

This is the engineering story of that port — and everything that came after it.

Why Mimi?

Neural audio codecs are a recent innovation. Traditional codecs (MP3, Opus) compress audio with psychoacoustic tricks. Neural codecs use learned representations — encoder networks that compress audio into discrete tokens, decoder networks that reconstruct audio from tokens.

The advantage: semantic tokens. Unlike raw waveform samples, codec tokens encode meaning. A language model operating on codec tokens can understand and generate speech the same way it understands and generates text. This is the architecture behind GPT-4o's voice mode, Gemini Live, and Moshi.

Mimi specifically:

- Compresses raw 24kHz audio into discrete tokens at 12.5 tokens/second
- Stacks eight residual codebooks (RVQ), so each frame becomes eight tokens
- Serves as the audio backbone of Moshi, Kyutai's real-time conversational AI

The goal: Run the full encode→decode pipeline on Apple Silicon with latency low enough for real-time use.

The MLX Constraint

PyTorch is the reference implementation for essentially all AI research. Mimi's weights are PyTorch checkpoints. Running them on Apple Silicon in PyTorch works — but it uses CPU fallbacks for unsupported ops, misses Metal GPU acceleration, and drains battery.

MLX is Apple's machine learning framework for Apple Silicon. It's NumPy-compatible, uses Metal natively, and is optimized for the M-series unified memory architecture. The tradeoff: it's newer, has a smaller op surface, and some PyTorch patterns don't translate directly.

The porting challenge: every op in Mimi's architecture needs an MLX equivalent that matches numerically and runs efficiently.
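In practice that means validating op by op. A parity check of the kind this implies might look like the following (a sketch — the helper name and tolerance are illustrative, not from the actual codebase):

```python
import numpy as np

def assert_parity(reference_out, mlx_out, atol=1e-4):
    """Compare a PyTorch reference activation against the MLX port's output.
    Raises if the max absolute error exceeds the tolerance."""
    a = np.asarray(reference_out, dtype=np.float64)
    b = np.asarray(mlx_out, dtype=np.float64)
    err = float(np.max(np.abs(a - b)))
    if err > atol:
        raise AssertionError(f"max abs error {err:.2e} exceeds atol={atol}")
    return err
```

Running a check like this after each ported layer localizes divergence to a single op instead of debugging the whole network at once.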

Architecture

Mimi has three major components:

1. Residual Vector Quantizer (RVQ)

The quantizer compresses encoder output into discrete tokens using learned codebooks. Eight codebooks run in series — each quantizes the residual error from the previous one.

MLX mapping: Straightforward. Codebook look-ups are embedding tables (nn.Embedding), which MLX handles natively.
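The residual scheme is easy to sketch in plain NumPy (illustrative only — names and shapes here are assumptions, not Mimi's actual code):

```python
import numpy as np

def rvq_encode(latent, codebooks):
    """Quantize one latent frame with a cascade of codebooks: each stage
    picks its nearest code vector, then passes the residual error on."""
    residual = np.asarray(latent, dtype=np.float64)
    tokens = []
    for cb in codebooks:                      # cb shape: (num_codes, dim)
        dists = ((residual - cb) ** 2).sum(axis=1)
        idx = int(dists.argmin())             # nearest code vector
        tokens.append(idx)
        residual = residual - cb[idx]         # what this stage couldn't encode
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the latent as the sum of the selected code vectors."""
    return sum(cb[t] for t, cb in zip(tokens, codebooks))
```

With eight codebooks, each frame becomes eight tokens, and later stages only have to encode what earlier stages missed.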

2. Encoder (Waveform → Latents)

The encoder takes raw 24kHz audio and compresses it through strided convolutional blocks. Each block halves temporal resolution — 24kHz → 12kHz → ... — until the output runs at 75 Hz (12.5 tokens/second).

Key components: strided Conv1d downsampling blocks and dilated residual convolutions.

MLX challenge: nn.Conv1d doesn't support dilation. Solution: a custom DilatedConv1d that manually inserts zeros between filter taps.

import mlx.core as mx
import mlx.nn as nn

class DilatedConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def __call__(self, x):
        if self.dilation == 1:
            return self.conv(x)
        # MLX conv weights are [out, k, in]: insert (dilation - 1) zeros
        # between taps along the kernel axis (axis 1).
        k = self.conv.weight
        dilated_size = (self.kernel_size - 1) * self.dilation + 1
        dilated_k = mx.zeros([k.shape[0], dilated_size, k.shape[2]])
        for i in range(self.kernel_size):
            dilated_k[:, i * self.dilation, :] = k[:, i, :]
        return mx.conv1d(x, dilated_k) + self.conv.bias

Benchmark on M4 Max: 2.15ms vs 18.7ms for a Python loop — an 8.7× speedup.

3. Decoder (Latents → Waveform)

The decoder mirrors the encoder with transposed convolutions upsampling quantized latents back to 24kHz. Each block doubles temporal resolution: 75Hz → ... → 24kHz.

MLX challenge: PyTorch's F.interpolate doesn't exist in MLX. Custom ConvTrUpsample1d and ConvDownsample1d modules replace it. Validation confirmed numerically equivalent output to the PyTorch reference — the key requirement for correct weight loading.
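Conceptually, the missing F.interpolate is just temporal resampling. A plain-array sketch of the two directions (not the actual ConvTrUpsample1d/ConvDownsample1d modules, which use learned transposed and strided convolutions):

```python
import numpy as np

def upsample_nearest_1d(x, factor):
    """Repeat each time step `factor` times along the last (time) axis."""
    return np.repeat(x, factor, axis=-1)

def downsample_mean_1d(x, factor):
    """Average non-overlapping windows of `factor` steps along the last axis."""
    t = x.shape[-1] - (x.shape[-1] % factor)   # drop any ragged tail
    return x[..., :t].reshape(*x.shape[:-1], -1, factor).mean(axis=-1)
```

The learned-convolution versions do the same rate change but with trainable filters, which is what makes numerical validation against the PyTorch reference essential.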

Weight Loading

Mimi's weights are distributed as a HuggingFace checkpoint (kyutai/mimi). Loading into MLX requires four transformations:

Key mapping: encoder.model.0.weight → encoder.layers.0.weight. Systematic translation of ~200 weight keys.

Shape transposition: PyTorch Conv1d weights are [out, in, k]; MLX stores [out, k, in]. Every conv weight transposes.

LayerNorm initialization: Layers initialized to ones/zeros aren't stored in the checkpoint. MLX handles this via default initialization.

Weight norm materialization: PyTorch stores weight_g and weight_v separately. MLX needs materialized weight = g × (v / ||v||), computed during load.
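The last two transformations are mechanical enough to sketch (a sketch under the assumptions above; PyTorch's weight_norm default normalizes over everything but the output-channel axis):

```python
import numpy as np

def pt_conv1d_to_mlx(w):
    """PyTorch Conv1d weight [out, in, k] -> MLX layout [out, k, in]."""
    return np.transpose(w, (0, 2, 1))

def materialize_weight_norm(g, v):
    """weight = g * (v / ||v||), with the norm taken per output channel
    (PyTorch weight_norm's default dim=0 for conv weights)."""
    norm = np.linalg.norm(v.reshape(v.shape[0], -1), axis=1)
    return g * v / norm.reshape(-1, 1, 1)
```

Both run once at load time, so the runtime graph sees only plain materialized weights.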

Codec Performance

After Week 1, the Mimi codec was running significantly faster than expected on M4 Max (Mac Studio):

| Operation | Latency | vs. Target |
| --- | --- | --- |
| Encoder | 0.92ms | 127× faster |
| Decoder | 1.19ms | 160× faster |
| Quantizer | ~1ms | within target |
| Full codec round-trip | 2.7ms | 182× faster than real-time |

Real-time factor: 0.021× — the codec processes audio 47.6× faster than the audio plays. The budget for streaming at 12.5 tokens/second is 80ms per chunk. We're sitting at 2.7ms.
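The headroom arithmetic is worth making explicit:

```python
# Streaming budget arithmetic from the numbers above.
frame_rate_hz = 12.5                        # Mimi tokens per second
chunk_budget_ms = 1000.0 / frame_rate_hz    # 80 ms of audio per token step
codec_ms = 2.7                              # measured round-trip on M4 Max
headroom = chunk_budget_ms / codec_ms       # ~29x spare capacity per chunk
```

That spare capacity is what leaves room for ASR, LM, and TTS to share the same real-time budget.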

End-to-end validation against the PyTorch reference achieved 3.98 dB SNR — numerically correct. For a GAN-trained neural audio codec, ~4 dB waveform SNR is expected behavior (the codec is perceptually optimized, not waveform-exact).
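SNR here is the standard waveform signal-to-noise ratio:

```python
import numpy as np

def snr_db(reference, estimate):
    """10*log10(signal power / error power) between reference and decoded audio."""
    ref = np.asarray(reference, dtype=np.float64)
    err = ref - np.asarray(estimate, dtype=np.float64)
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))
```

At ~4 dB the error power is roughly 40% of the signal power — large for a sample-exact match, but expected for a codec trained on perceptual and adversarial losses rather than waveform reconstruction.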

What We Built Next

The codec worked. That unlocked the next question: what does a full local voice conversation stack actually look like?

Week 2: Pipeline Assembly

With the codec validated, the goal became end-to-end voice conversation with zero cloud dependencies. The stack: faster-whisper for ASR, Ollama (llama3.2:1b) for the language model, Piper TTS for synthesis.

Each component required optimization:

ASR (Day 13): Replaced Whisper CLI (3.5s) with faster-whisper running on-device — latency dropped to 172ms (20× speedup). mlx-whisper benchmarked at 176ms — comparable, but faster-whisper won on compatibility.

TTS (Day 14): macOS say took 1,518ms. Replaced with Piper TTS — warm latency dropped to 69ms (22× speedup). Kokoro-ONNX int8 was benchmarked at 1,772ms and rejected. Piper's quality-to-latency tradeoff was the clear winner.

Real-time input (Day 15): Added push-to-talk loop via realtime_demo.py. Press ENTER to record, ENTER to stop — mic → Whisper → LLM → Piper → speaker. Warm latency: 1,723–2,564ms, best turns under 2 seconds.

LM speedup (Day 16): The language model was the bottleneck at 1,872ms average. Two changes: brevity system prompt + max_tokens=80. LM latency dropped to 381ms (80% reduction). Total pipeline average: ~1,253ms. Also added webrtcvad for hands-free auto-detection.
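The two LM changes amount to a small request tweak. A sketch against Ollama's /api/chat endpoint (num_predict is Ollama's max-tokens option; the exact system prompt wording is an assumption):

```python
import json
import urllib.request

def build_chat_request(user_text, max_tokens=80):
    """Request body for a terse, token-capped reply from a local Ollama."""
    return {
        "model": "llama3.2:1b",
        "stream": False,
        "messages": [
            {"role": "system",
             "content": "Be brief: answer in one or two short sentences."},
            {"role": "user", "content": user_text},
        ],
        "options": {"num_predict": max_tokens},  # hard cap on generated tokens
    }

def ask_llm(user_text, host="http://localhost:11434"):
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_request(user_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Capping tokens helps twice over: fewer tokens to generate, and fewer tokens for TTS to synthesize afterward.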

Week 3: Streaming and API

Streaming LM→TTS (Day 17): Instead of waiting for the full LM response, a split_sentences() generator streams tokens and synthesizes each sentence as it arrives. First-audio latency: ~216–357ms — roughly 140ms faster than sequential for typical responses.
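A minimal version of that generator (a sketch; the real split_sentences() may handle abbreviations and punctuation edge cases differently):

```python
import re

def split_sentences(token_stream):
    """Buffer streamed LM tokens and yield each sentence as soon as its
    terminator (. ! ?) plus trailing whitespace arrives, so TTS can start
    speaking before the LM has finished generating."""
    buf = ""
    for tok in token_stream:
        buf += tok
        while True:
            m = re.search(r"(.+?[.!?])\s+", buf)
            if not m:
                break
            yield m.group(1).strip()
            buf = buf[m.end():]
    if buf.strip():              # flush whatever remains at end of stream
        yield buf.strip()
```

Each yielded sentence goes straight to TTS while the LM keeps generating the next one, which is where the first-audio latency win comes from.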

Persona system (Day 18): Three named personas defined in config/personas.yaml — Claudia (witty, conversational), Aria (precise, technical), Sage (reflective, philosophical). Each has distinct system prompts, temperature settings, and voice assignments. Runtime switching via switch_persona() or interactive !persona <name> command.
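In code, a persona is just a bundle of prompt, sampling, and voice settings. A sketch of the registry and runtime switch (field names, temperatures, and voice IDs are assumptions, not the actual personas.yaml schema):

```python
PERSONAS = {
    "claudia": {"style": "witty, conversational",
                "system": "You are Claudia. Be witty and conversational.",
                "temperature": 0.9, "voice": "voice-a"},
    "aria":    {"style": "precise, technical",
                "system": "You are Aria. Be precise and technical.",
                "temperature": 0.3, "voice": "voice-b"},
    "sage":    {"style": "reflective, philosophical",
                "system": "You are Sage. Be reflective and philosophical.",
                "temperature": 0.7, "voice": "voice-c"},
}

class Conversation:
    def __init__(self, persona="claudia"):
        self.persona = PERSONAS[persona]

    def switch_persona(self, name):
        """Swap system prompt, sampling temperature, and voice at runtime."""
        if name not in PERSONAS:
            raise ValueError(f"unknown persona: {name!r}")
        self.persona = PERSONAS[name]
        return self.persona
```

Because the persona is resolved per request, switching takes effect on the very next turn with no model reload.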

HTTP API (Day 18): FastAPI server with GET /personas, POST /chat (text→text), POST /speak (text→WAV). Personality differentiation confirmed in testing — Claudia and Aria respond to the same prompt in measurably different styles.

Conversation sessions + SSE (Day 19): POST /session creates stateful UUID sessions with full multi-turn history. POST /chat/stream streams tokens as Server-Sent Events in real time. Backward compatible — stateless /chat unchanged.
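Under the hood, each streamed token is one SSE frame, and the wire format is simple. A sketch of the framing per the SSE spec (event names are assumptions):

```python
def sse_frame(data, event=None):
    """Serialize one Server-Sent Events frame: optional 'event:' line,
    one 'data:' line per payload line, blank-line terminator."""
    lines = [f"event: {event}"] if event else []
    lines += [f"data: {part}" for part in (str(data).splitlines() or [""])]
    return "\n".join(lines) + "\n\n"
```

A streaming endpoint can then emit one frame per LM token and a final frame marking end of turn.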

WebSocket channel (Day 20): Persistent bidirectional WS /ws/chat endpoint with per-connection session history, real-time token streaming, ping/pong keep-alive, switch_persona and reset commands.

Web UI (Day 21): Self-contained dark-theme chat interface served at /. Features: persona switcher (loaded from API), real-time token streaming with blinking cursor, thinking indicator, latency badge per message, named session sidebar (create/switch/clear), auto-reconnect on disconnect. Also resolved all 10 previously-failing VAD tests by installing silero-vad 6.2.1.

Why It Matters

Running a neural audio codec on-device enables applications cloud systems can't:

Privacy-first voice AI: Conversations never leave the device. No audio uploaded, no transcripts logged. Relevant for clinical environments, legal work, anything where the conversation itself is sensitive.

Offline-capable: Works on planes, in air-gapped networks, anywhere without connectivity.

Ultra-low latency: No cloud round-trips. The codec alone processes audio at 47.6× real-time. End-to-end first-audio latency is under 400ms in streaming mode.

Composable with local LLMs: The same llama3.2:1b running locally can be swapped for any Ollama model. The persona system makes it easy to tune personality and style per use case.

Current Status

| Component | Status | Latency |
| --- | --- | --- |
| Mimi codec in MLX | ✅ Complete | 2.7ms round-trip |
| Weight loading from PyTorch | ✅ Complete | |
| PyTorch reference validation | ✅ Complete | 3.98 dB SNR |
| faster-whisper ASR | ✅ Complete | 172ms |
| Piper TTS synthesis | ✅ Complete | 69ms warm |
| Real-time push-to-talk | ✅ Complete | <2s warm latency |
| Streaming LM→TTS pipeline | ✅ Complete | 216–357ms first audio |
| Persona system (3 personas) | ✅ Complete | |
| HTTP API (FastAPI) | ✅ Complete | |
| WebSocket full-duplex channel | ✅ Complete | |
| Browser chat UI | ✅ Complete | |
| Native Moshi LM integration | 🔄 In progress | |
| Voice cloning | ⬜ Planned | |

The stack runs entirely on an M4 Max Mac Studio. No API keys. No network calls. No data leaving the machine.
