PersonaPlex-MLX: Toward a Full-Duplex Local Agent Interface
PersonaPlex-MLX asks whether a personal AI interface can feel conversational at human speed while still doing real, inspectable agent work. The project combines local speech infrastructure, cancellable interaction loops, task contracts, risk gates, and evidence-backed worker execution into a gated research program.

Most AI products still treat voice as a prettier command line: speak, wait, receive an answer, repeat. That interaction model is too slow and too brittle for real work. Real work involves interruption, correction, uncertainty, approval, inspection, and recovery. A useful personal agent needs to hear those signals while work is underway, not after a request has already turned into a sealed black box.
PersonaPlex-MLX is an attempt to build that interface locally. It started as an Apple Silicon speech stack: a Mimi neural codec port, local ASR, local language-model routing, local TTS, personas, sessions, and streaming APIs. That first chapter proved the substrate is viable: conversational speech can run on a Mac without sending audio to a cloud service. The current research question is larger: can that substrate become a full-duplex work interface for supervised agentic execution?
The project is now organized as a gated research program rather than a single demo. Each phase asks for evidence that the interface is becoming more usable, safer, and more capable. The target is not a generic voice assistant. The target is a workbench where a person can speak naturally, interrupt immediately, redirect mid-task, inspect state, approve risk, and receive evidence before anything leaves the machine.
Research Thesis
The central thesis is that agent interfaces should separate the fast loop from the slow loop.
The fast loop is the live interaction layer: microphone state, speech recognition, cancellable assistant speech, backchannels, interruption, correction, acknowledgement, and visible state. It should feel responsive on human conversational timescales. If the system is speaking and the user interrupts, the speech should stop. If the user says "wait" or "not that," the interface should preserve control rather than racing ahead.
The slow loop is the work layer: reasoning, task decomposition, coding agents, verification, review, memory synthesis, artifact creation, and long-running execution. It can take seconds, minutes, or hours, but it must remain supervisable. The user should not have to manage the mechanics of the worker to stay in control of the work.
Between those loops sits a coordination kernel: task contracts, scope boundaries, risk classes, approval gates, stop paths, progress events, verification evidence, and review summaries. This kernel is the difference between a voice chatbot that happens to call tools and a reliable interface for delegated work.
The core evaluation criterion is not whether the system can answer in a pleasant voice. It is whether a person can interrupt, redirect, recover, and supervise slow work without losing the thread.
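
To make the coordination kernel less abstract, here is a minimal sketch of an approval gate. The risk class names, the `gate` function, and its return strings are illustrative assumptions, not the project's actual API. The only ideas taken from the text are that risk is classified before execution, that risky work needs approval, and that a stop path is required.

```python
from enum import IntEnum


class Risk(IntEnum):
    # Hypothetical risk classes, ordered from least to most dangerous.
    READ_ONLY = 0
    LOCAL_WRITE = 1
    EXTERNAL_ACTION = 2


def gate(risk, user_approved, has_stop_path):
    """Decide what the kernel allows before a worker starts."""
    if not has_stop_path:
        return "blocked"         # no stop path means no work, whatever the risk
    if risk == Risk.READ_ONLY:
        return "run"             # read-only work needs no extra approval
    if not user_approved:
        return "needs_approval"  # anything that writes or leaves the machine waits
    return "run"


decision = gate(Risk.LOCAL_WRITE, user_approved=False, has_stop_path=True)
```

In this sketch the gate returns a decision instead of raising an error, so the interaction layer can turn "needs_approval" into a spoken or visible prompt.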
Background: The MLX Speech Substrate
The original PersonaPlex-MLX work focused on local speech AI for Apple Silicon. The most technical piece was a port of Kyutai's Mimi neural audio codec to MLX. Mimi compresses speech into discrete audio tokens and is one of the important building blocks behind real-time speech models such as Moshi.
That work established several useful results:
| Component | Result |
|---|---|
| Mimi codec in MLX | Full encode/decode path running locally |
| Codec performance | Approximately 2.7 ms round trip on an M4 Max Mac Studio |
| PyTorch checkpoint loading | Weight mapping, shape transposition, and weight-norm materialization implemented |
| Reference validation | Output shows expected lossy perceptual-codec behavior, not waveform-exact reconstruction |
| ASR pipeline | On-device faster-whisper path replacing slower CLI transcription |
| TTS pipeline | Low-latency local synthesis path tested against slower system defaults |
| API surface | FastAPI, streaming, WebSocket, sessions, and browser UI prototypes |
The important caveat: the earlier stack was not yet true simultaneous full-duplex conversation. It had strong pieces of a local voice stack, including streaming and bidirectional APIs, but the practical mic loop still behaved closer to push-to-talk or turn-taking. That distinction matters. Calling a system "full-duplex" because it has a WebSocket is too generous; the user experience has to support real interruption while audio and work are active.
The current project keeps the local-first philosophy and the PersonaPlex name, but shifts the emphasis from "can we run the audio stack locally?" to "can local interaction become a trustworthy control surface for agent work?"
System Architecture
The revived PersonaPlex project is structured around five layers.
1. Interaction Harness
The browser harness is the primary test surface. It includes transcript display, assistant state, manual utterance entry, microphone controls, optional browser speech recognition, local Whisper STT, assistant speech playback, pause/resume/cancel controls, and a structured event log. This layer exists to make interaction testable before real workers are attached.
The design choice is deliberate: the harness is not decoration around an agent. It is the experiment. Before a background worker edits files or launches a task, the interface must prove that the user can stop, correct, and understand what is happening.
2. Local Speech Providers
The current prototype supports a local Whisper STT path through a small Node server and keeps browser speech recognition as a fallback/debug path. The browser Web Speech path has shown network errors in Chrome, so the local path is treated as the serious route. TTS is cancellable, with local Rocky voice support where available and browser speech as a fallback.
The near-term goal is not to pick a final speech model. It is to preserve provider optionality while measuring the interaction properties that matter: acknowledgement time, speech start time, cancellation latency, transcript quality, and recovery from misrecognition.
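
As a rough illustration of how those interaction properties could be derived from a structured event log, the sketch below computes three latencies from timestamped events. The event names, the `Event` type, and the sample timestamps are all hypothetical. The timestamps are synthetic values chosen for the example, not measurements from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    t_ms: float  # milliseconds since session start


def latency_ms(events, start, end):
    """Time from the first `start` event to the first `end` event at or after it."""
    t0 = next((e.t_ms for e in events if e.name == start), None)
    if t0 is None:
        return None
    t1 = next((e.t_ms for e in events if e.name == end and e.t_ms >= t0), None)
    return None if t1 is None else t1 - t0


# Synthetic log for illustration only.
log = [
    Event("user_utterance_end", 1000.0),
    Event("assistant_ack", 1180.0),
    Event("assistant_speech_start", 1420.0),
    Event("barge_in_detected", 2600.0),
    Event("assistant_speech_stopped", 2665.0),
]

metrics = {
    "ack_ms": latency_ms(log, "user_utterance_end", "assistant_ack"),
    "speech_start_ms": latency_ms(log, "user_utterance_end", "assistant_speech_start"),
    "cancel_ms": latency_ms(log, "barge_in_detected", "assistant_speech_stopped"),
}
```

Because the metrics are computed from provider-neutral events, the same calculation would apply unchanged if the STT or TTS provider were swapped.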
3. Conversation State Machine
The harness tracks listening, speaking, paused, working, cancelled, and ready states. It also records barge-in, transcript rejection, readiness signals, and task transitions. These states are what make the experience observable. Without them, the system can appear alive while being impossible to debug.
One current relationship test starts with an assistant-initiated shared problem, invites correction, waits for readiness, and only then generates a task contract. The mic stays off until the user explicitly starts listening. That small rule encodes a larger principle: proactivity should be bounded by consent and visible state.
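
A minimal sketch of such a state machine follows. The six state names come from the description above. The transition table, the `Conversation` class, and the `by_user` flag are assumptions made for illustration. The sketch logs every attempted transition, including rejected ones, and refuses to enter the listening state without explicit user action.

```python
from enum import Enum


class State(Enum):
    READY = "ready"
    LISTENING = "listening"
    SPEAKING = "speaking"
    WORKING = "working"
    PAUSED = "paused"
    CANCELLED = "cancelled"


# Assumed transition table; the real harness may allow a different set.
ALLOWED = {
    State.READY: {State.LISTENING, State.SPEAKING, State.WORKING},
    State.LISTENING: {State.READY, State.SPEAKING, State.WORKING},
    State.SPEAKING: {State.LISTENING, State.READY, State.CANCELLED},
    State.WORKING: {State.PAUSED, State.SPEAKING, State.CANCELLED, State.READY},
    State.PAUSED: {State.WORKING, State.CANCELLED},
    State.CANCELLED: {State.READY},
}


class Conversation:
    def __init__(self):
        self.state = State.READY
        self.events = []  # structured log of every attempted transition

    def transition(self, target, reason, by_user=False):
        accepted = target in ALLOWED[self.state]
        if target is State.LISTENING and not by_user:
            accepted = False  # the mic never opens without explicit user action
        self.events.append({
            "from": self.state.value,
            "to": target.value,
            "reason": reason,
            "accepted": accepted,
        })
        if accepted:
            self.state = target
        return accepted


c = Conversation()
c.transition(State.LISTENING, "assistant_wants_input")           # rejected
c.transition(State.LISTENING, "user_pressed_mic", by_user=True)  # accepted
```

Recording rejected transitions alongside accepted ones is what keeps the log useful for debugging. A refused mic activation is as informative as a successful one.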
4. Safe Agent Workbench
Phase 2 introduces the workbench layer. It reads local repo context, generates bounded task contracts, classifies risk, specifies acceptance criteria, defines verification steps, and requires a stop path. The current worker path is intentionally dry-run oriented: it proves the safety envelope before attaching a real coding worker.
A task contract includes:
- goal
- in-scope and out-of-scope boundaries
- risk class
- acceptance criteria
- verification plan
- allowed write surface
- stop/cancel behavior
- evidence expected from the worker
This is the architectural hinge of the project. Voice without contracts becomes charming chaos. Agents without a live control surface become opaque automation. PersonaPlex is trying to join the two without giving up supervision.
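
The contract fields listed above could be represented as a small data structure. The sketch below is one possible shape, not the project's actual schema. The field names, the `REQUIRED` tuple, and the example values are assumptions. The point is that a contract with a missing verification plan or stop behavior can be rejected before any worker starts.

```python
from dataclasses import dataclass, field


@dataclass
class TaskContract:
    goal: str
    risk_class: str
    stop_behavior: str
    in_scope: list = field(default_factory=list)
    out_of_scope: list = field(default_factory=list)
    acceptance_criteria: list = field(default_factory=list)
    verification_plan: list = field(default_factory=list)
    allowed_write_surface: list = field(default_factory=list)  # empty means no writes
    expected_evidence: list = field(default_factory=list)

    # Fields that must be non-empty before a worker may launch (assumed policy).
    REQUIRED = ("goal", "risk_class", "acceptance_criteria",
                "verification_plan", "stop_behavior", "expected_evidence")

    def missing_fields(self):
        """Names of required fields that are still empty."""
        return [name for name in self.REQUIRED if not getattr(self, name)]

    def ready_for_worker(self):
        return not self.missing_fields()


contract = TaskContract(
    goal="Rename the state panel label",
    risk_class="local_write",
    stop_behavior="cancel on user interrupt and report completed steps",
    in_scope=["harness/ui"],
    acceptance_criteria=["label reads 'Assistant state'"],
    allowed_write_surface=["harness/ui"],
    expected_evidence=["diff summary"],
)
gaps = contract.missing_fields()
```

Here `gaps` names the missing verification plan, which gives the interface something concrete to ask the user about before proceeding.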
5. Evidence and Evaluation
The project uses phase gates, scripted validations, manual browser smoke tests, event logs, and acceptance criteria. Phase gates are intentionally phrased as decisions: proceed, refactor, narrow, or stop. That keeps the project honest. A demo that looks impressive but fails interruption is not a shipped interaction model; it is just a very confident loading spinner.
Progress To Date
Phase 0 established the project nucleus: prior art index, research summary, loop contract, OKR gates, evaluation pack, phase acceptance criteria, and a clear boundary between local prototypes and external action. This phase shipped.
Phase 1 produced a browser-first interaction harness. It supports manual utterances, transcript display, state panels, cancellable speech synthesis, pause/resume/cancel controls for simulated slow work, event logging, JSON export, local Whisper STT, auto-VAD, echo guard behavior, diagnostics, and a server API for health checks and speech transcription. Static validation and HTTP smoke checks passed.
Phase 2 is underway. The safe workbench scaffolding now includes read-only repo context, task contract generation, no-mutation dry-run workers, verification test workers, and a relationship/conversation smoke route. The current gate is to prove that a useful local task can be described, scoped, simulated, interrupted, and verified before any real worker is connected.
Current endpoint surface includes:
- GET /api/health for local STT availability
- POST /api/stt for short recorded utterances
- GET /api/tts/health and POST /api/tts for optional local TTS
- GET /api/repo/context for read-only project context
- POST /api/task/contract for bounded task contracts
- POST /api/agent/dry-run for no-mutation worker simulation
- POST /api/agent/test-run for contract-approved verification tests
The most important progress is conceptual as much as technical: the project has stopped treating "voice assistant" as the product category. The better category is the supervised agent interface.
Evaluation Method
The evaluation pack is built around felt control and instrumented evidence.
For the fast loop, the questions are:
- Does assistant speech cancel immediately when interrupted?
- Does the interface distinguish listening, speaking, paused, and working states?
- Can the user correct a misunderstanding without starting over?
- Does the transcript preserve enough context to recover from an error?
- Does the system avoid activating the mic without explicit user action?
For the slow loop, the questions are:
- Does the task contract accurately capture the user's goal?
- Are scope boundaries explicit enough to block unrelated work?
- Is risk classified before execution?
- Are acceptance criteria and verification steps present before worker launch?
- Does cancel produce an explicit artifact rather than a fake success?
- Is evidence summarized before any external action?
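
The question about cancellation producing an explicit artifact can be illustrated with a small dry-run worker. This is a sketch under assumptions: the `dry_run_worker` function, the `should_stop` callback, and the artifact keys are hypothetical. The worker only records planned steps and never mutates anything.

```python
def dry_run_worker(steps, should_stop):
    """Walk through planned steps without mutating anything.

    `should_stop` is called before each step with the steps completed so far.
    Cancellation returns an explicit artifact instead of pretending to succeed.
    """
    completed = []
    for step in steps:
        if should_stop(completed):
            return {
                "status": "cancelled",
                "completed_steps": completed,
                "remaining_steps": steps[len(completed):],
                "mutations": [],  # a dry run never writes
            }
        completed.append(step)
    return {
        "status": "completed",
        "completed_steps": completed,
        "remaining_steps": [],
        "mutations": [],
    }


plan = ["read repo context", "draft patch plan", "run verification"]
# Simulate the user cancelling after two steps have finished.
artifact = dry_run_worker(plan, should_stop=lambda done: len(done) >= 2)
```

The cancelled artifact lists both what finished and what did not, so a later review can tell an interrupted run apart from a successful one.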
For the combined loop, the hardest question is:
- Can the user supervise slow agent work conversationally without becoming the project manager for the machinery itself?
That last question is the real benchmark. Latency numbers matter, but the system can be fast and still feel wrong. A good interface should make control cheaper, not merely make answers arrive sooner.
Findings
Local-first speech remains strategically useful
The MLX and local speech work is still valuable, even though the current harness is more important than the codec itself. Local speech keeps private conversations private, reduces cloud dependency, supports offline experimentation, and gives tighter control over latency. It also makes the project more honest: if the system is supposed to be a personal work interface, the default should not be shipping every utterance to a remote service.
Full duplex is an interaction property, not a transport property
The earlier WebSocket and streaming work created necessary infrastructure, but true full duplex must be evaluated at the user experience level. Can the user interrupt while the system is speaking? Can the system update state while work continues? Can a correction alter an in-flight task? If not, the implementation may be streaming, but the interaction is still turn-based.
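
The first two of those questions can be sketched with Python's asyncio, using sleeps as stand-ins for audio playback and background work. Everything here is a simplified assumption: the function names, chunk list, and timings are illustrative, and a real harness would cancel actual audio output. The sketch shows speech being cancelled mid-utterance while a slow task keeps running to completion.

```python
import asyncio


async def speak(chunks, spoken):
    """Simulated TTS: plays one chunk at a time so it can stop between chunks."""
    for chunk in chunks:
        spoken.append(chunk)
        await asyncio.sleep(0.05)  # stand-in for audio playback time


async def slow_work(progress):
    """Simulated background task that reports progress as it goes."""
    for step in range(3):
        await asyncio.sleep(0.05)
        progress.append(step)


async def demo():
    spoken, progress = [], []
    work = asyncio.create_task(slow_work(progress))
    speech = asyncio.create_task(
        speak(["I", "will", "now", "edit", "the", "file"], spoken)
    )
    await asyncio.sleep(0.12)  # the user interrupts partway through
    speech.cancel()            # barge-in stops speech, not the work
    try:
        await speech
    except asyncio.CancelledError:
        pass
    await work                 # slow loop finishes independently
    return spoken, progress, speech.cancelled()


spoken, progress, was_cancelled = asyncio.run(demo())
```

Keeping speech and work in separate tasks is the design choice that matters. Interrupting one does not implicitly cancel the other, so a correction can be handled as its own decision.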
Task contracts are the bridge from conversation to work
Natural language is a poor execution boundary by itself. The task contract gives the system something inspectable before work begins. It also creates a place for risk, scope, acceptance criteria, and verification to live. This is especially important for coding agents, where a vague request can turn into broad file edits very quickly.
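
One concrete use of a contract is checking a worker's proposed file edits against the allowed write surface. The sketch below is an assumption about how such a check might look. The function name and example paths are hypothetical, and it treats the write surface as a list of relative directory prefixes.

```python
from pathlib import PurePosixPath


def out_of_scope_paths(proposed, allowed_write_surface):
    """Return the proposed paths that fall outside the allowed write surface."""
    roots = [PurePosixPath(p) for p in allowed_write_surface]
    violations = []
    for raw in proposed:
        path = PurePosixPath(raw)
        inside = any(path == root or root in path.parents for root in roots)
        # Reject absolute paths and ".." segments instead of trying to resolve them.
        if path.is_absolute() or ".." in path.parts or not inside:
            violations.append(raw)
    return violations


violations = out_of_scope_paths(
    ["harness/ui/state.js", "server/auth.js", "harness/../server/x.js"],
    allowed_write_surface=["harness"],
)
```

Only the first path passes. The other two are flagged because one lies outside the surface and the other tries to escape it with a parent-directory segment.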
The harness should be judged by recovery, not polish
The most revealing tests are not happy-path demos. They are misrecognition, interruption, cancellation, correction, and verification failure. A system that recovers cleanly from those moments is closer to real usability than one that gives a beautiful answer when everything goes right.
Limitations
PersonaPlex-MLX is still a prototype. It has not yet proven real worker execution through the live harness. The current workbench intentionally stops at dry runs and verification simulations. That is a feature, not a deficiency: attaching a powerful worker before the control surface is trustworthy would test the wrong thing.
The speech stack is also in transition. The original MLX codec work demonstrated impressive local performance, but the current browser harness relies on pragmatic STT/TTS paths while the interaction architecture stabilizes. Native realtime speech models, Moshi-style audio-token models, and future provider upgrades remain candidates rather than assumptions.
Finally, event logs are necessary but insufficient. A log can prove that a cancellation event fired. It cannot prove that the user felt in control. The project therefore needs both instrumented gates and manual felt gates.
Roadmap
The next milestone is Phase 2 completion: one bounded local task should move from spoken goal to task contract to dry-run worker to verification evidence while remaining interruptible and inspectable. If that works, the project can attach a real scoped worker in a constrained workspace.
Phase 3 adds memory and self-improvement. The system should summarize sessions, persist interaction metrics, classify user corrections, and turn failures into eval cases. The constraint is that it must not silently mutate safety policy or treat reflection as evidence. Learning has to be grounded in logs, tests, corrections, and reviewable proposals.
Phase 4 adds multimodal context: current app, selected text, repo state, and optional screenshots with explicit privacy controls. The goal is to let the user say "fix that" and have the system know what "that" means without becoming creepy or noisy.
Phase 5 keeps the architecture model-agnostic. STT, TTS, realtime models, local LLMs, and cloud providers should be swappable behind provider interfaces. The project should be ready for better native realtime models without rebuilding the control surface every time a new model ships.
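
A provider interface of this kind might look like the following sketch. The `SttProvider` protocol, the `CannedStt` stub, and `transcribe_with_fallback` are hypothetical names used for illustration. The stubs return canned text and do not call any real speech model.

```python
from typing import Protocol


class SttProvider(Protocol):
    name: str

    def transcribe(self, audio: bytes) -> str: ...


class CannedStt:
    """Stub provider: returns fixed text, or fails if no text is configured."""

    def __init__(self, name, text=None):
        self.name = name
        self._text = text

    def transcribe(self, audio):
        if self._text is None:
            raise ConnectionError(f"{self.name} unavailable")
        return self._text


def transcribe_with_fallback(providers, audio):
    """Try providers in order and return (provider_name, transcript)."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.transcribe(audio)
        except Exception as exc:  # record the failure and try the next provider
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all STT providers failed: " + "; ".join(errors))


used, text = transcribe_with_fallback(
    [CannedStt("local-whisper"), CannedStt("browser-speech", "pause the task")],
    audio=b"",
)
```

Because the caller depends only on the `name` attribute and the `transcribe` method, a future realtime model could be added as another provider without changes to the control surface.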
Conclusion
PersonaPlex-MLX began as a local speech experiment and is becoming a research program for agent interaction. The important bet is that the next useful AI interface is not just a faster chatbot or a better voice. It is a live control layer for delegated work.
The project is still early, but the direction is sharp: build the interaction harness first, prove interruption and recovery, wrap work in contracts, attach workers only after the safety envelope holds, and evaluate the system by whether it gives the user more control rather than more automation theater.
If successful, PersonaPlex will not feel like talking to an app. It will feel like supervising a capable local collaborator while keeping your hands on the actual levers.