GGUF with MTP vs MLX without. Is mlx still the way to go for mac users?
Posted by mouseofcatofschrodi@reddit | LocalLLaMA | 22 comments
Have any Mac users tested the speed difference (token generation, prompt processing) between MLX quants without MTP and GGUF quants with MTP?
More or less once a month I wonder if MLX is still the correct path on a Mac. Some reasons:
- LM Studio has bad caching for MLX. And no MTP, of course.
- omlx has very good cache + turboquant + dflash, but no MTP (yet, I see it will come soon since it is already in the dev branch).
- I have discovered two other engine wrappers that look interesting: rapid-mlx and mtplx. I haven't tried them yet. The second has MTP.
In general, MLX has no equivalent of llama.cpp: a single engine that has it all, with so many configuration options.
I keep using MLX because it is more efficient on a Mac. But now that MTP is already in llama.cpp, I wonder whether llama.cpp on Metal + MTP would be faster than MLX.
And most importantly, the quant world has more options for GGUFs.
I'd appreciate it if anyone has experience or knowledge to share.
asankhs@reddit
You can try mlx-optiq.com
Pleasant-Shallot-707@reddit
I think the reason MLX doesn't have a llama.cpp is because MLX comes with every version of MacOS.
Kina_Kai@reddit
This is a nonsense sentence.
Long_Respond1735@reddit
I ran them side by side: the 26B non-MTP model was faster (MB M4 Pro, 24GB RAM), because MTP only wins when the drafter predicts multiple accepted tokens cheaply enough to offset its own overhead.
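The trade-off described here can be put into rough numbers. The sketch below uses the textbook speculative-decoding approximation, which is an assumption and not something measured in this thread: each drafted token is accepted independently with probability `alpha`, the drafter proposes `k` tokens per step, and costs are expressed relative to one normal forward pass of the main model. The names `draft_cost` and `verify_cost` are hypothetical parameters for illustration; a `verify_cost` above 1.0 is one way to model MoE verification touching more experts than a single-token step.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the verifier always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (values > 1.0 mean MTP wins).

    draft_cost:  cost of drafting one token, relative to one main-model pass.
    verify_cost: cost of one verification pass, relative to one main-model pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + verify_cost)


if __name__ == "__main__":
    # Illustrative values only, not benchmarks from any machine in this thread.
    for alpha in (0.3, 0.6, 0.8):
        cheap = estimated_speedup(alpha, k=3, draft_cost=0.05)
        costly = estimated_speedup(alpha, k=3, draft_cost=0.30)
        print(f"alpha={alpha:.1f}  cheap drafter: {cheap:.2f}x  costly drafter: {costly:.2f}x")
```

Under these assumptions, a low acceptance rate combined with an expensive drafter drops the estimate below 1.0x, which matches the "non-MTP was faster" observation in spirit; real numbers depend on the model, quant, and hardware.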
havnar-@reddit
I’ve used Dflash instead to pump my tokens per second. Seems to work pretty well.
FerradalFCG@reddit
I'm having the same experience. I don't know if I'm configuring something wrong with oMLX, but I don't see any improvements using MTP or dflash...
ex-arman68@reddit
This is misleading: MTP does not benefit everything. The larger the model, the more there is to gain from it, which is why an MoE of small models is a bad use case. It is meant for medium to large dense models, or MoEs of medium to large dense experts. Then there is the task as well: it does well on deterministic tasks such as coding (high token acceptance), and poorly on creative tasks such as brainstorming (low token acceptance, possibly below the benefit threshold).
wtfihavetonamemyself@reddit
I want to love MLX, but the bugs and conditions, particularly with MoE models, kill it for me. I’d love to move from llama and get faster token speed.
But with SSD cache or not, I haven’t gotten reliable TTFT improvements on subsequent answers with omlx, vllm-mlx, or mlx-lm.
jonnywhatshisface@reddit
So far from what I’ve seen, MLX has been a bit unstable.
My setup is a bit older - M2 Max with 64gb ram. The token generation speed is insane compared to running GGUF, but that’s offset by both the slower prompt processing times and the instability of triggering the interactivity watchdog (I’m running on a laptop).
The issue I’ve seen is that with GGUF, you’re pre-allocating all of the memory up front. This doesn’t initialize it by any means, but it gives you a few advantages. Firstly, it lets you see what your resource usage is going to be between both the model and the KV cache. Secondly, even though a few memory allocations here and there are not going to break the bank, doing tens and hundreds of thousands of them back to back will. You’ll notice MLX tends to allocate and wire the memory on demand, and as your context size grows the memory continues to grow as well.
What I’ve seen on my setup is that this triggers massive reclaim from the page cache and doubles the load on memory bandwidth. This massively slows down prefill, but even worse, it hangs up the GPU cores for longer periods of time. The interactivity watchdog in WindowServer eventually shoots the model in the head.
This can be worked around by running the laptop in clamshell mode and not having any display at all - but that defeats the purpose entirely of running the model on a laptop to begin with.
With Qwen3.6 35b a3b q4_k_m I’m getting about 450 tok/s on prompt processing and 65 tok/s on generation. This is, of course, after massive tweaking and tuning in every direction imaginable and with a context size of 131k. I can’t get anywhere near that context size with MLX or it’ll flat out get shot in the head during prefill. Though, I do love the speed difference, and the token caching with oMLX really makes a massive difference. I wish llama.cpp would do something similar with GGUF models personally…
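To make the "see your resource usage up front" point concrete, here is a rough back-of-the-envelope KV cache estimate. It assumes plain transformer attention with a full K and V tensor cached in every layer; models with hybrid, sliding-window, or linear attention layers, or a quantized KV cache, will need less. The layer and head counts below are hypothetical placeholders, not the actual configuration of the model mentioned above.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes for full attention in every layer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to an fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the arithmetic.
    size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                          context_len=131072)
    print(f"KV cache at 131072 tokens: {size / 2**30:.1f} GiB")
```

With an engine that reserves this up front, the figure shows up as soon as the model loads; with on-demand allocation, memory instead climbs toward it as the context fills.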
roninXpl@reddit
M3 Max 64GB here: unsloth qwen3.6-35b-a3b-mtp Q6_K_XL gives me up to 67 tok/s, lmstudio-community qwen/qwen3.6-35b-a3b Q6_K gives me 57 tok/s.
totosse17@reddit
From what I have seen, MTP improves performance on dense models. For MoE it reduces performance. Since Macs have lots of memory but not-so-strong compute, they mostly benefit from large MoE models with a small active parameter count. Since the MTP gains are not there yet, MLX is still the way to go.
Raregendary@reddit
No idea how you tested that. On qwen3.6 35B A3B on a 4090 I get 100-140 t/s depending on quant, and with MTP it's now 160-210. The more coding-intensive the task, the bigger the gain: with 5-6 draft tokens on html/css/js I have reached 230 t/s on a MoE model.
ex-arman68@reddit
MTP does not benefit everything. The larger the model, the more there is to gain from it, which is why an MoE of small models is a bad use case. It is meant for medium to large dense models, or MoEs of medium to large dense experts. Then there is the task as well: it does well on deterministic tasks such as coding (high token acceptance), and poorly on creative tasks such as brainstorming (low token acceptance, possibly below the benefit threshold).
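The "benefit threshold" can be estimated with a small sketch. It assumes a simplified model that is not taken from the thread: drafted tokens are accepted independently with probability `alpha`, `k` tokens are drafted per step, and each drafted token costs `draft_cost` relative to one pass of the main model. The function names are hypothetical, and the search is a plain bisection.

```python
from typing import Optional


def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup: expected tokens per step divided by step cost."""
    # 1 + alpha + ... + alpha^k tokens are expected per verification pass.
    expected_tokens = sum(alpha ** i for i in range(k + 1))
    return expected_tokens / (k * draft_cost + 1.0)


def break_even_acceptance(k: int, draft_cost: float,
                          tol: float = 1e-6) -> Optional[float]:
    """Smallest acceptance rate at which drafting stops being a net loss.

    Returns None if even perfect acceptance cannot beat plain decoding.
    """
    if speedup(1.0, k, draft_cost) <= 1.0:
        return None
    lo, hi = 0.0, 1.0
    # speedup() grows with alpha, so bisection finds the crossing point.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if speedup(mid, k, draft_cost) > 1.0:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Illustrative drafter costs only, not measurements.
    for cost in (0.05, 0.2, 0.5):
        print(f"draft_cost={cost}: break-even alpha ~ {break_even_acceptance(3, cost)}")
```

Read this way, a task with low token acceptance (the brainstorming case) can sit below the break-even rate while a coding task on the same setup sits above it.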
stavrosg@reddit
Wow, it works. I jumped from 15.2 t/s to 24.8... booya. Now can we get versions of big models like glm?
mouseofcatofschrodi@reddit (OP)
what works? do you mean you tried qwen3.6 27B with llama.cpp + MTP, or mtplx + MTP? or what exactly?
stavrosg@reddit
Downloaded the MTP version of 3.6, copied his params above. Works fantastic. The model seems sharper too; that's 1000% my imagination. Maybe.
Likeatr3b@reddit
I haven’t updated my llama.cpp on ToolPiper because a Mac user replied to the GitHub PR that it had degraded his token performance.
mouseofcatofschrodi@reddit (OP)
so you use llama.cpp instead of an mlx engine. Have you compared both? Why do you do so?
Likeatr3b@reddit
Yes, for sure. The result was that MLXE may become the go-to. At this moment llama.cpp has the momentum I need. The tradeoffs are huge though, and I think MLXE is going to win the race.
I'm all about focusing engineering initiative on 1 thing. The Apple ecosystem is about to win this local AI race so MLXE is going places.
Quirky-Persimmon3342@reddit
the memory bandwidth story favors MLX specifically because Apple Silicon is designed around it. MTP on gguf should close the gap as optimizations land but MLX having both the bandwidth advantage and native Metal kernels is hard to beat right now.
sammcj@reddit
M5 Max here, MTP improves llama.cpp performance but it still doesn't keep up with MLX, especially if you use MLX with MTP or a draft model. Hopefully llama.cpp will get there because the ecosystem is nice
Konamicoder@reddit
MLX all the way. You can install oMLX v0.3.9dev2 right now if you want to test MTP. Or just wait for the stable release.