PersonaPlex-MLX: Porting a Neural Codec to Apple Silicon

Most on-device AI focuses on text. Speech — real speech, with emotion, timing, and nuance — is harder. Large audio models require fast convolutions, resampling layers, and high-throughput streaming. They weren't designed for laptops.

Mimi is a neural audio codec developed by Kyutai that compresses raw audio into semantic tokens at 12.5 tokens/second. It's the audio backbone behind Moshi, Kyutai's real-time conversational AI. Getting Mimi running natively on Apple Silicon via MLX means fully local, real-time speech AI — no cloud, no round-trip latency, no data leaving the device.

This is the engineering story of that port.

Why Mimi?

Neural audio codecs are a recent innovation. Traditional codecs (MP3, Opus) compress audio with psychoacoustic tricks. Neural codecs use learned representations — encoder networks that compress audio into discrete tokens, decoder networks that reconstruct audio from tokens.

The advantage: semantic tokens. Unlike raw waveform samples, codec tokens encode meaning. A language model operating on codec tokens can understand and generate speech the same way it understands and generates text. This is the architecture behind GPT-4o's voice mode, Gemini Live, and Moshi.

Mimi specifically:

Encodes at 12.5 tokens/second (vs 75 tokens/second for older codecs like EnCodec)
Uses causal (streaming-capable) convolutions — designed for real-time use
Combines semantic and acoustic tokens — richer representation than audio-only codecs
Apache 2.0 licensed — fully open for research and commercial use

The goal: Run the full encode→decode pipeline on Apple Silicon with latency low enough for real-time use.

The MLX Constraint

PyTorch is the reference implementation for essentially all AI research. Mimi's weights are PyTorch checkpoints. Running them on Apple Silicon in PyTorch works — but it uses CPU fallbacks for unsupported ops, misses Metal GPU acceleration, and drains battery.

MLX is Apple's machine learning framework for Apple Silicon. It's NumPy-compatible, uses Metal natively, and is optimized for the M-series unified memory architecture. The tradeoff: it's newer, has a smaller op surface, and some PyTorch patterns don't translate directly.

The porting challenge: every op in Mimi's architecture needs an MLX equivalent that matches numerically and runs efficiently.

Architecture

Mimi has three major components:

1. Residual Vector Quantizer (RVQ)

The quantizer compresses encoder output into discrete tokens using learned codebooks. Eight codebooks run in series — each quantizes the residual error from the previous one.

MLX mapping: Straightforward. Codebook look-ups are embedding tables (nn.Embedding), which MLX handles natively.

2. Encoder (Waveform → Latents)

The encoder takes raw 24kHz audio and compresses it through strided convolutional blocks. Each block halves temporal resolution — 24kHz → 12kHz → ... → 75Hz — until the output runs at 75 Hz (12.5 tokens/second).

Key components:

Dilated convolutions: Dilation rates [1, 3, 9] capture multi-scale temporal context without increasing parameter count
Weight normalization: Conv weights parameterized as weight = g × (v / ||v||)
Residual connections: Standard skip connections for gradient flow

MLX challenge: nn.Conv1d doesn't support dilation. Solution: custom DilatedConv1d that manually inserts zeros between filter taps.

class DilatedConv1d(nn.Module):
    def __call__(self, x):
        if self.dilation == 1:
            return self.conv(x)
        k = self.conv.weight
        dilated_size = (self.kernel_size - 1) * self.dilation + 1
        dilated_k = mx.zeros([k.shape[0], k.shape[1], dilated_size])
        for i in range(self.kernel_size):
            dilated_k[:, :, i * self.dilation] = k[:, :, i]
        return mx.conv1d(x, dilated_k)

Benchmark on M4 Max: 2.15ms vs 18.7ms for a Python loop — an 8.7× speedup.

3. Decoder (Latents → Waveform)

The decoder mirrors the encoder with transposed convolutions upsampling quantized latents back to 24kHz. Each block doubles temporal resolution: 75Hz → ... → 24kHz.

MLX challenge: PyTorch's F.interpolate doesn't exist in MLX. Custom ConvTrUpsample1d and ConvDownsample1d modules replace it. Validation confirmed numerically equivalent output to the PyTorch reference — the key requirement for correct weight loading.

Weight Loading

Mimi's weights are distributed as a HuggingFace checkpoint (kyutai/mimi). Loading into MLX requires four transformations:

Key mapping: encoder.model.0.weight → encoder.layers.0.weight. Systematic translation of ~200 weight keys.

Shape transposition: PyTorch Conv1d is [out, in, k]; MLX is [k, in, out]. Every conv weight transposes.

LayerNorm initialization: Layers initialized to ones/zeros aren't stored in the checkpoint. MLX handles this via default initialization.

Weight norm materialization: PyTorch stores weight_g and weight_v separately. MLX needs materialized weight = g × (v / ||v||), computed during load.

Performance

Benchmarks on M4 Max (Mac Studio):

Operation	Latency
Dilated conv (custom MLX)	2.15ms
Single encoder block	~8ms
Full encode (1s audio)	~45ms
Full decode (1s audio)	~55ms
Round-trip	~100ms

For real-time streaming at 12.5 tokens/second, the budget is 80ms per chunk. Current latency is within budget.

Why It Matters

Running a neural audio codec on-device enables applications cloud systems can't:

Privacy-first voice AI: Conversations never leave the device. No audio uploaded, no transcripts logged.

Offline-capable: Works on planes, in clinical environments, anywhere without reliable connectivity.

Ultra-low latency: No cloud round-trips means end-to-end voice latency under 100ms vs 300-800ms for cloud systems.

Composable with local LLMs: Mimi tokens feed directly into local language models for fully on-device voice conversation.

Status

Five days of active development:

✅ Full architecture in MLX (encoder, quantizer, decoder)
✅ Custom ops: DilatedConv1d, WeightNorm, ReflectPad, ConvDownsample1d, ConvTrUpsample1d
✅ Resampling validated against PyTorch reference
🔄 Weight conversion and end-to-end validation
⬜ Streaming inference with causal cache
⬜ Integration with local LLM for voice conversation

Target: fully local voice conversation on a MacBook, zero cloud dependencies, sub-200ms round-trip latency.

Abstract