PersonaPlex 7B on Apple Silicon with massive memory leak in full-duplex mode. Anyone get this working?
Posted by Excellent_Koala769@reddit | LocalLLaMA | 3 comments
I've been trying to run NVIDIA's PersonaPlex 7B (the full-duplex speech-to-speech model based on Moshi) locally on an M5 Max with 128GB unified memory. The goal is simple: a real-time voice chat demo where you talk to it like a phone call.
What I've tried:
1. speech-swift MLX 8-bit (PersonaPlexDemo + custom WebSocket server)
- Inference speed was great: 48-62ms/step (well under the 80ms real-time budget)
- But RAM goes from around 50% to 93% within 10 seconds of starting a full-duplex session, then crashes with `freed pointer was not the last allocation` (an MLX arena-allocator assertion)
- Root cause: `KVCacheSimple` uses `concatenated([old, new], axis: 2)` every step. Under MLX's lazy evaluation, old arrays aren't freed before new ones are allocated, resulting in O(n²) memory growth across 32 transformer layers
- Tried switching to `KVCachePreAllocated` (scatter writes into a fixed buffer). Memory was stable, but inference slowed to 413ms/step (8x slower); MLX's Metal kernels are heavily optimized for concat, not scatter
- Full-duplex audio quality was also bad: mostly gibberish and static, even when memory wasn't an issue
- Turn-based mode worked OK but defeats the purpose of the model
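The concat-vs-preallocated tradeoff can be sketched in plain NumPy (standing in for MLX arrays; the dimensions are illustrative, not PersonaPlex's actual config). Concat materializes a fresh, longer buffer every step, so if old buffers can't be reclaimed promptly the bytes touched grow quadratically; a pre-allocated cache writes in place with constant footprint:

```python
import numpy as np

HEAD_DIM, STEPS = 64, 100  # illustrative sizes, not the real model config

# Concat-style cache: every step materializes a brand-new, longer array.
# If the old array isn't freed before the new one exists (as under lazy
# evaluation), cumulative allocation grows quadratically with step count.
cache = np.zeros((1, HEAD_DIM, 0))
total_allocated = 0
for step in range(1, STEPS + 1):
    new = np.zeros((1, HEAD_DIM, 1))
    cache = np.concatenate([cache, new], axis=2)  # fresh step-long buffer
    total_allocated += cache.nbytes               # bytes allocated this step

# Pre-allocated cache: one fixed buffer, writes go into a slot in place.
prealloc = np.zeros((1, HEAD_DIM, STEPS))
for step in range(STEPS):
    prealloc[:, :, step] = 0.0  # in-place write, no new allocation

print(total_allocated)  # grows ~O(n^2) in STEPS
print(prealloc.nbytes)  # constant
```

The same shape of tradeoff shows up in the numbers above: concat is fast per-kernel but allocation-heavy, while the fixed buffer is allocation-free but depends on scatter-write performance.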
2. NVIDIA's official PyTorch server
- MPS support is literally commented out in their source (`#| Literal["mps"]`)
- CPU-only would never hit real-time on a 7B model
System specs: M5 Max, 128GB unified memory, macOS 26.4, Swift 6.3, MLX latest
What I'm looking for:
- Has anyone gotten PersonaPlex (or even base Moshi) running in stable full-duplex mode on Apple Silicon without the memory leak?
- Is `personaplex-mlx` (the Python MLX port) any better with memory management?
- Has anyone tried moshi.cpp with Metal/GGML for sustained real-time sessions?
- Any workarounds for the MLX KV cache memory issue? Periodic `mx.eval()` flushes? Manual `mx.metal.clear_cache()`?
- Or is this just fundamentally broken on MLX right now, and I need a CUDA GPU?
Happy to share the exact code and patches I tried if anyone wants to dig in.
vamsammy@reddit
Have you raised an issue at speech-swift? I've found the dev to be responsive.
irregardless@reddit
this post just saved me a lot of trouble. I've been in the planning stages of a personal project based on the full-duplex promise of the PersonaPlex model. if you're having trouble with that hardware, there's no way it'll work with my substantially less capable hardware.
my read on this is that mlx is fundamentally broken for this at the moment. looking into it, full-duplex must maintain 2 kv caches, one for user audio and one for model audio, updating continuously at 12.5Hz with no natural reset point, because a "phone call" is unbounded. so yeah, you're going to fill memory fast. half-duplex sidesteps this by allowing the cache to be reset between turns.
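For scale, a back-of-envelope growth rate for two always-on KV caches at 12.5Hz, using assumed 7B-class dimensions (GQA with 8 KV heads, head dim 128, 32 layers, fp16 — assumptions, not PersonaPlex's published config):

```python
# Back-of-envelope KV-cache growth for a 7B-class model at 12.5 Hz.
# All dimensions below are assumptions (a typical 7B GQA config),
# not PersonaPlex's actual architecture.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES = 2        # fp16/bf16 per element
HZ = 12.5        # full-duplex step rate
CACHES = 2       # user-audio stream + model-audio stream

per_step = CACHES * LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES  # K and V
per_minute_mb = per_step * HZ * 60 / 1e6
print(per_step)        # bytes appended per step
print(per_minute_mb)   # MB of linear cache growth per minute of call
```

Under those assumptions the *linear* growth is only on the order of 200 MB per minute, which wouldn't blow through 128GB in seconds on its own; that points at the quadratic concat behavior the OP diagnosed, rather than raw cache size, as the immediate killer.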
i ran the problem by claude and gemini, and both agreed that for full-duplex to work, mlx needs a pre-allocated ring buffer, streamingllm-style attention sinks, and metal scatter optimization. short of all three, "phone style" full-duplex is not possible.
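A minimal sketch of the ring-buffer-plus-attention-sinks idea in pure NumPy (the class name and layout are hypothetical, not MLX's or StreamingLLM's actual API): keep the first few "sink" positions forever and overwrite the oldest slot of a fixed recent window, so the cache footprint stays constant no matter how long the call runs:

```python
import numpy as np

SINKS, WINDOW, HEAD_DIM = 4, 8, 16  # tiny illustrative sizes

class RingKVCache:
    """Fixed-size cache: the first SINKS entries are kept permanently
    ('attention sinks'), plus a ring buffer of the WINDOW most recent."""
    def __init__(self):
        self.sinks = np.zeros((SINKS, HEAD_DIM))
        self.ring = np.zeros((WINDOW, HEAD_DIM))
        self.count = 0

    def write(self, vec):
        if self.count < SINKS:
            self.sinks[self.count] = vec          # fill sinks first
        else:
            self.ring[(self.count - SINKS) % WINDOW] = vec  # overwrite oldest
        self.count += 1

    def view(self):
        """Cache contents in temporal order: sinks, then recent window."""
        n = min(self.count, SINKS)
        m = min(max(self.count - SINKS, 0), WINDOW)
        if m < WINDOW:
            recent = self.ring[:m]
        else:
            oldest = (self.count - SINKS) % WINDOW
            recent = np.roll(self.ring, -oldest, axis=0)  # rotate oldest first
        return np.vstack([self.sinks[:n], recent])

cache = RingKVCache()
for t in range(100):
    cache.write(np.full(HEAD_DIM, float(t)))
print(cache.view()[:, 0])  # sink steps 0..3, then the last 8 steps in order
```

The point of the sketch is the memory behavior: after 100 writes the cache still holds exactly SINKS + WINDOW rows, while a concat-based cache would hold 100 and keep growing. Whether attention quality survives this eviction on a speech model is exactly the open question.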
Aaaaaaaaaeeeee@reddit
I've run base Moshi at 4-bit on an M2 16GB Air. Memory rises at a regular pace and can probably go on for 2+ minutes, which means his setup must be broken.