MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon
Posted by YoussofAl@reddit | LocalLLaMA | 31 comments
TLDR: 28 tok/s → 63 tok/s on Qwen3.6-27B on a MacBook Pro M5 Max. 2.24× faster at real temperature 0.6.
Works for coding, creative writing, and chat
https://i.redd.it/i9x794c0q7zg1.gif
- Works on ANY MTP model: No external drafter. No extra memory usage. Uses the model's own built-in MTP heads. Works on any model that ships them.
- Not greedy: Unlike similar speculative decoding projects, we use mathematically exact temperature sampling with rejection sampling. Adjustable temperatures for any task. Every other speculative decode project on Apple Silicon is greedy-only.
- Custom kernel: Built on a patched MLX fork with custom Metal kernels, compiled verify graphs, innovation-tape GDN rollback, and a draft-only requantised LM head.
- Full CLI: mtplx start wizard, model download, model inspection with four-tier MTP compatibility detection, configurable depth 2-7+, OpenAI/Anthropic API server, browser chat, terminal chat, benchmarking suite, health diagnostics, crash-safe fan control with idle-aware auto-restore, and a 562-test suite.
- Full serving stack: OpenAI + Anthropic compatible API, browser chat UI, terminal chat. Point your editor at localhost and go.
What Is MTPLX?
MTPLX uses a model's built-in MTP heads as speculative drafters to increase decode speeds by up to 2.25x, all while preserving the model's default inference settings, so coding and creative writing tasks work unchanged.
QWEN 3.6 27B @ 63 TPS on a MacBook Pro M5 Max
Using MTPLX I increased decode speed on Qwen 3.6 27B 4-bit MLX from 28 tok/s → 63 tok/s on a MacBook Pro M5 Max at temperature 0.6 with top_p 0.95 and top_k 20, the exact sampling settings Qwen recommends for coding.
Qwen 3.6 27B ships with built-in MTP heads that support up to depth 5. I ran a sweep across D2, D3, D4, and D5 to find the optimal depth for this model on this hardware:

D3 was the sweet spot: its acceptance-to-verify-time ratio was high enough that TPS increased the most. D4 and D5 have good acceptance at the early positions, but the deeper positions start costing more in verify time than they save in accepted tokens.
These results are at real temperature 0.6 with exact probability-ratio rejection sampling and residual correction.
This means you can actually use Qwen 3.6 27B for real coding work with a 2.25x speed increase without sacrificing output quality.
How Is This Different From DFlash / DDTree?

DFlash MLX has greater absolute speed, but it is restricted to greedy (temp 0) sampling, which severely limits its real-world use. It also requires an external drafter model, which takes additional memory and has to be created for every new model release.
DDTree adds tree-based verification on top of DFlash so it inherits the same limitations: greedy only, external drafter required.
The reason for this comes down to how each system drafts. MTP heads draft sequentially. Each token sees the previous draft tokens, so every position produces a real probability distribution. DFlash drafts all 16 tokens simultaneously in a parallel diffusion pass. Token 8 does not know what token 7 is. Without that sequential dependency, there is no per-token probability distribution, which means you cannot do the rejection sampling maths that makes temperature work.
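To make the maths concrete, here is a minimal sketch of that per-position test in plain numpy (standard Leviathan-Chen speculative sampling, not MTPLX's actual code). It needs q, the drafter's distribution at that position, which is exactly what a parallel diffusion pass cannot provide:

import numpy as np

def accept_or_correct(p, q, token, rng):
    # p: target distribution; q: drafter distribution `token` was sampled
    # from; rng: np.random.default_rng(). Accept the drafted token with
    # probability min(1, p[token] / q[token]).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    # On rejection, resample from the residual (p - q)+ so the emitted
    # token is still distributed exactly according to p.
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum()))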
MTPLX works with any model that retains its MTP heads, lets you choose how many of those heads to use, and runs any locally saved or HuggingFace model that ships them.
Architecture

Layer 0: MLX Runtime
MTPLX runs on a patched MLX fork. Stock MLX's quantised matrix-vector kernel is tuned for large M (prefill). During MTP verify, M is 3 to 6, one position per draft token. Stock stalls at these shapes. The patch: wider simdgroups, loop unrolling, 10 lines of Metal. Exact, 0.0 diff against stock.
On top of the fork sit four custom Metal kernels registered as MLX primitives:
- Innovation-tape GDN capture: records KB-scale (token, gate, state-delta) tuples during draft. On rejection, replays from the tape instead of restoring full recurrent state. Replaces hundreds of MB of state snapshots with tiny deltas. Bit-exact against reference.
- GraphBank: a cache of mx.compile-compiled verify graphs keyed by (suffix_length, depth, profile). Each verify shape gets one compiled graph reused across all cycles. Capture-commit overhead: 0.073 ms per cycle versus 47 ms verify per cycle. Three orders of magnitude smaller than the work it manages (a sketch follows this list).
- Draft-only requantised LM head: the target's lm_head stays at model precision. A separate 4-bit LM head is built in memory for draft-only use. Cuts draft time by 29% without touching target accuracy.
- Small-M verify qmv: direct successor of dflash-mlx's M=16 approach, retuned for MTPLX's M=3 to 6 verify shapes.
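As a rough illustration of the GraphBank idea, assuming MLX's mx.compile (verify_fn and the key fields here are placeholders, not MTPLX's real API):

import mlx.core as mx

class GraphBank:
    # One compiled verify graph per distinct verify shape: the compile
    # cost is paid once per (suffix_length, depth, profile) key, then
    # every later cycle with that shape replays the recorded graph.
    def __init__(self, verify_fn):
        self.verify_fn = verify_fn
        self.graphs = {}

    def get(self, suffix_length, depth, profile):
        key = (suffix_length, depth, profile)
        if key not in self.graphs:
            self.graphs[key] = mx.compile(self.verify_fn)
        return self.graphs[key]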
Layer 1: Single-model runtime
One checkpoint. The target model and drafter are the same model. Qwen3.6-27B ships native MTP heads and MTPLX uses them. Zero RAM for a second model. The trunk's KV cache uses a committed-history contract verified against the vLLM CUDA reference at cosine > 0.9998 through depth 5.
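That kind of parity check is easy to sketch; the logits arrays here are placeholders:

import numpy as np

def cosine_parity(mlx_logits, ref_logits):
    # Cosine similarity between MLX-path and reference logits for the
    # same position; the contract demands > 0.9998 at every depth to 5.
    a, b = np.ravel(mlx_logits), np.ravel(ref_logits)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))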
Layer 2: Speculative cycle (the hot loop)
Per cycle: the MTP head drafts K tokens, each seeing the previous draft. The target verifies all K in one batched forward via a compiled GraphBank path. Probability-ratio acceptance (Leviathan-Chen) decides per position in fp32. Residual correction (p - q)+ emits a clean replacement on rejection. A bonus token falls out free when all K accept. The innovation tape commits accepted GDN state deltas and rolls back rejected ones.
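Schematically, one cycle looks like this (draft_fn, verify_fn, and the tape object are placeholders standing in for MTPLX internals; the sampling helpers are plain numpy):

import numpy as np

def sample(dist, rng):
    # Draw a token id from a probability vector.
    return int(rng.choice(len(dist), p=dist))

def sample_residual(p, q, rng):
    # Residual correction: sample from (p - q)+ normalised.
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum()))

def speculative_cycle(draft_fn, verify_fn, tape, ctx, K, rng):
    # 1) Sequential draft: each step conditions on the previous drafts,
    #    so every position i has a real drafter distribution q[i].
    drafts, q = [], []
    for _ in range(K):
        qi = draft_fn(ctx, drafts)
        tok = sample(qi, rng)
        tape.record(tok, qi)              # innovation tape: tiny deltas
        drafts.append(tok)
        q.append(qi)

    # 2) One batched target forward verifies all K positions and also
    #    yields the distribution for position K+1.
    p = verify_fn(ctx, drafts)            # len(p) == K + 1

    # 3) Probability-ratio acceptance in fp32, position by position.
    out = []
    for i, tok in enumerate(drafts):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
            continue
        # First rejection: emit a residual-corrected token and roll back
        # the rejected state deltas from the tape.
        out.append(sample_residual(p[i], q[i], rng))
        tape.rollback(position=i)
        return out

    # 4) All K accepted: the bonus token falls out of the same forward.
    out.append(sample(p[K], rng))
    tape.commit()
    return out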
Layer 3: Serving stack
Real API server. OpenAI-compatible /v1/chat/completions and /v1/completions with streaming SSE. Anthropic-compatible /v1/messages. /v1/models, /health, /metrics. Engine sessions with per-chat KV state. Session Bank preserves warm-prefix exact state across turns, verified at logits max_abs_diff = 0.0 against fresh forwards. Browser chat UI at localhost with live tok/s, markdown rendering, code-block copy, and stop button. Terminal chat via mtplx chat.
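Because the endpoints are standard, any OpenAI-compatible client can talk to it. A minimal example with the official openai Python package; the port and model name here are assumptions, use whatever mtplx start reports:

from openai import OpenAI

# Point a standard OpenAI client at the local MTPLX server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen3.6-27b",  # whichever model mtplx start loaded
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.6,
    top_p=0.95,
    stream=True,  # served as SSE, streamed token by token
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)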
What I Had To Solve

Native MTP on Apple Silicon did not work by default. There were four stacked problems:
1) Recursive depth collapse
Run MTP recursively and accuracy collapses after depth 1: 91% → 63% → 44% → 27% → 17%.
Everyone who tried native MTP saw this and gave up. I SSH'd into my 2x3090 PC running vLLM with MTP-5, traced the exact MTP execution, and compared it against MLX token-by-token. The finding: MLX was resetting the MTP attention KV cache every speculative cycle. vLLM does not. It persists MTP history across cycles. One contract fix: depth 2 acceptance jumped from 49% to 74%.
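In sketch form, the contract difference looks something like this (illustrative, not MTPLX's code):

class MTPHeadCache:
    # MTP attention KV cache with committed-history semantics,
    # matching what vLLM does across speculative cycles.
    def __init__(self):
        self.keys, self.values = [], []
        self.committed = 0                # prefix verified by the target

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def commit(self, n_accepted):
        # Accepted positions become permanent history for later cycles.
        self.committed += n_accepted

    def truncate_to_committed(self):
        # The bug: clearing everything here each cycle (committed = 0)
        # is what collapses recursive accuracy. The fix: trim only the
        # rejected suffix and keep the committed history.
        del self.keys[self.committed:]
        del self.values[self.committed:]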
2) Precision mismatch
Every project was using BF16 MTP heads on quantised 4-bit trunks. The MTP head is more precise than the hidden states it receives, which amplifies quantisation noise through recursive prediction. I grafted calibrated INT4 MTP weights onto the trunk, matching MTP precision to trunk precision. Depth 3 jumped from 30% to 88%.
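MLX ships the primitive for this. A minimal sketch of the requantisation step (the calibration MTPLX applies is not shown, and the weight name is illustrative):

import mlx.core as mx

def requantize_mtp_head(w_bf16, group_size=64, bits=4):
    # w_bf16: a 2-D MTP projection weight. mx.quantize returns packed
    # low-bit weights plus per-group scales and biases, so the head
    # predicts from the same quantised numerics the trunk produces
    # instead of amplifying its quantisation noise.
    w_q, scales, biases = mx.quantize(w_bf16, group_size=group_size, bits=bits)
    return w_q, scales, biases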
3) MLX verify bottleneck
Even with high acceptance, stock MLX's verify pass was so expensive that MTP was slower than plain autoregressive decode. MLP operations accounted for 51% of verify time.
I patched MLX's Metal qmv shader for the small verify shapes MTP produces (10 lines, wider simdgroups + loop unrolling), built an innovation-tape GDN capture system for efficient state rollback, batched target probability distributions into a single MLX eval boundary, and deferred MTP history materialisation.
Four stacked optimisations cut verify cycle time from ~90 ms to ~47 ms per call, taking MTP from slower than plain autoregressive to 2.24× faster.
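The innovation tape, one of those four, in a minimal sketch (illustrative; it assumes a purely additive state update, a simplification and part of why the real tape records the gate as well):

class InnovationTape:
    # Instead of snapshotting hundreds of MB of recurrent state before
    # each draft token, record the tiny per-token delta and undo the
    # pending deltas in reverse order on rejection.
    def __init__(self, state):
        self.state = state
        self.pending = []                 # (token, gate, delta) per draft

    def record(self, token, gate, delta):
        self.state = self.state + delta   # apply the draft's update
        self.pending.append((token, gate, delta))

    def commit(self):
        self.pending.clear()              # accepted deltas become final

    def rollback(self, position=0):
        # Undo everything from `position` onwards, newest first.
        for _, _, delta in reversed(self.pending[position:]):
            self.state = self.state - delta
        del self.pending[position:]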
4) TPS decay
On long responses (8k+ tokens), throughput collapsed. I spent 16 hours trying to figure out why TPS would decay from 50 to 25, a 50% decrease, investigating 24 different profiles: lazy-eval graph accumulation, cache growth, state provenance, paged attention, owned recurrent caches, two-pass Metal SDPA.
None of them solved it.
The problem was hilariously simple. It turns out the speculative decode loop sustains significantly heavier GPU load than plain autoregressive decode. Every cycle runs a full batched verify forward plus draft computation plus MTP history maintenance.
The additional sustained workload was pushing the M5 Max SoC to 103°C, and macOS's default fan curve ramps far too late. By the time the fans respond, the GPU has already downclocked.
I introduced a MAX mode into the CLI. Using ThermalForge, fans are locked at full speed before generation starts, with a detached watchdog that restores fans to auto if the process dies for any reason. TPS decay dropped from 50% to 6.7%, and GPU clock retention went from 85.6% to 97.1%.
16 hours of kernel debugging, solved by a fan controller.
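The crash-safe part is a pattern worth copying: a detached watchdog polls the parent PID and restores fans to auto if the parent dies for any reason. A sketch, with a placeholder fanctl command standing in for ThermalForge's actual CLI:

import os, subprocess, sys, time

def spawn_fan_watchdog(parent_pid):
    # start_new_session=True detaches the watchdog, so it survives
    # even if the parent crashes hard.
    subprocess.Popen(
        [sys.executable, __file__, "--watch", str(parent_pid)],
        start_new_session=True,
    )

def watch(parent_pid):
    while True:
        try:
            os.kill(parent_pid, 0)        # signal 0: existence check only
        except OSError:                   # parent gone, for any reason
            subprocess.run(["fanctl", "auto"])  # placeholder command
            return
        time.sleep(2)

if __name__ == "__main__" and "--watch" in sys.argv:
    watch(int(sys.argv[sys.argv.index("--watch") + 1]))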
Caveats
- The 63 TPS figure was achieved on a 160-token high-acceptance prompt. Real workflows on an M5 Max will most likely see 50-55 TPS.
- I am currently working on the thermal issue by optimising the kernel. If you do not run MAX mode (100% fan mode) you will see significant TPS decline on long prompts due to thermal throttling.
- Unsurprisingly, most MLX quants have MTP heads stripped since they used to be pointless on MLX. Many MLX models are incompatible with MTPLX for now. I am hoping my work with MTPLX will drive more people to create MLX quants with MTP heads present and optimised for inference.
In the meantime you can run my official Qwen 3.6 27B MTPLX Optimised model from https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized. The CLI makes it easy to set up and download.
If you publish MLX quants, please keep the MTP heads. They are around 200MB on a 27B model, cost almost nothing in memory, and are now worth a 2.25× speedup.
Really looking forward to everyone's thoughts and contributions to this project. Making local LLMs on MLX faster and more viable for everyone.
Powerful_Evening5495@reddit
what is happening with mlx, it's getting a lot of love these days
need to join in on the fun
YoussofAl@reddit (OP)
Haha, personally it's because I got annoyed that MTP on Qwen on my 2x3090 setup was getting 130 tps and I wanted the same experience on my new laptop.
Barry_22@reddit
Waiit what
How
(As a fellow 2x3090 owner)
justan0therusername1@reddit
“Cheap” big RAM that is fairly quick.
Longjumping-Sweet818@reddit
Are you planning on creating a Gemma-4-31B variant with MTP?
Beamsters@reddit
Curling/downloading yours right now; I saw only one 4-bit option for your default model. Can you release a 5/6-bit version? 4-bit intelligence doesn't really cut it for me. Also, is it possible to use another model (like oQ from oMLX, which has had the MTP layer stripped) with a separate MTP layer file?
YoussofAl@reddit (OP)
Way ahead of you!
This is labelled as 4-bit but it is a dynamic 5-bit-ish variant I made that preserves the attention heads in 8 bits. Try it out and let me know what you think. I will release 6- and 8-bit variants soon too: https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized
As for adding your own MTP layer, yes you can. You can use other models with a separate MTP file. MTPLX supports MTP sidecar grafting.
The current repo-side command is:
python scripts/graft_mtp_sidecar.py \
--source /path/to/stripped-mlx-model \
--mtp-file /path/to/mtp.safetensors \
--output /path/to/model-mtp-graft
Then validate it with:
mtplx inspect /path/to/model-mtp-graft --require-mtp
mtplx start --model /path/to/model-mtp-graft
Keep in mind the MTP sidecar has to match the base model architecture. It is not a universal "plug any MTP into any model" thing, but compatible stripped MLX trunks can be wrapped this way.
riceinmybelly@reddit
Sorry if this is a stupid question, but in LM Studio you can set a model for speculative decoding; I was planning to use 3.6 9B for this because the real bottleneck for me is the prefill. Do you have any tips on making prefill more bearable for ~100k context runs?
YoussofAl@reddit (OP)
MTPLX focuses on decode speed, not prefill. My best tip would be not to use LM Studio; it does not have great optimisations for MLX. Something like vmlx probably has 2x better prefill with prompt caching and JIT. Granted, it is more complex than LM Studio's UX, but worth it if you are constantly going above 100k context.
I am releasing an incredibly simple (simpler than LM Studio) MLX Swift based app for local LLMs, but it is still under development!
Evening_Ad6637@reddit
Do you have a github account where we can check or follow progress?
riceinmybelly@reddit
Thank you!
Beamsters@reddit
I'm testing one right now with the default 4-bit mtplx-qwen36-27b-optimized-speed and the no-sustain option. I got 13.9 tok/s · 4313 tokens · 2696 thinking · ttft 26.87s · 173.9 ms/verify · 1566 verifies. On an M1 Max 32GB this is not an improvement, since oMLX got this from a 4-bit/fp16 model: 2262t prompt · 3712t generated · prefill 135 tok/s · gen 19 tok/s · ttft 16.76s · 215.97s total.
YoussofAl@reddit (OP)
Good catch, I've gotten various reports of issues on the M1 chips now. 179ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.
On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.
I have also released a Qwen 3.5 4B model; let me know how that performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX
In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware.
Beamsters@reddit
One draft depth still doesn't cut it. Also, the M1/M2 architecture needs FP16, not BF16, to do prompt processing fast.
16.6 tok/s · 3684 tokens · 1945 thinking · ttft 26.79s · 96.1 ms/verify · 2067 verifies
lochyw@reddit
brew install?
nomorebuttsplz@reddit
will this be merged into mlx?
YoussofAl@reddit (OP)
MTPLX is a separate runtime built on top of MLX, not a feature patch. Think of it like how vLLM is built on PyTorch but isn't part of PyTorch.
You can run it right now at: https://github.com/youssofal/MTPLX
lochyw@reddit
So no LMStudio integration?
iansltx_@reddit
Hmm, tried with my M1 Max (64GB so ~400 GB/s of memory bandwidth) and maybe I'm compute bound, because LMStudio (Jundot OQ4, which has similar weights size and is mixed precision) was turning in ~12.4 t/s with fans set at full blast via Macs Fan Control while MTPLX turned in 12.9 t/s. Is the default model here full fp4? In which case that would explain how I was getting consistently different answers for the same prompt with the same params.
Happy to test newer builds of this because breaking 20 t/s on Qwen 27B would be cool.
YoussofAl@reddit (OP)
Just replied to someone else with the same issue. To summarise, it's a compute issue; try lowering the number of MTP heads or using a smaller model:
"Good catch, Ive gotten various reports of issues on the M1 chips now. 179ms verify time is extremely long so at the same acceptance ratio you'll see a decrease in TPS.
On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.
I have also released a Qwen 3.5 4B model let me know how that performs MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX
In general, this is a preview to prove the viability of MTP on MLX I am aware of the inefficieny of my approach and am working to bring down compute cost and verify time which will make it much better on your hardware."
iansltx_@reddit
Sweet, thanks. Yeah, that makes sense. I'll try with a lower depth in the AM, and also with the smaller model just to A/B test, though for coding use cases I want the benefits that the 27B dense model provides; otherwise I'd just run 35BA3B at fp8. Also need to try with oMLX.
I'm sure I'll find this out soon enough, but I'm guessing the compute limitation would mean that an fp8 version would be even less likely to have a performance improvement here? Hitting 30 t/s on fp8 on an M5 Max though (given that you're testing with one of those) would be pretty epic.
leonbollerup@reddit
Can 35b get the same love ?
YoussofAl@reddit (OP)
I tested 35B (granted, an unoptimised variant) and got only a 1.2x speed increase. So it works, but since it's MoE, the TPS increase isn't "wow" like 27B.
I will work on it though and try to make a variant that scores at least 1.5x.
chollingsbollings@reddit
I’m trying to run this on my openclaw agent via the API key but it keeps spitting out odd behavior rather than actually doing it.
examples:
printing these out instead of actually doing anything.
YoussofAl@reddit (OP)
I just patched it, it should be working now!
Please reinstall MTPLX and restart OpenClaw. Let me know if OpenClaw still acts weird after that, and send me the exact OpenClaw config/request.
YoussofAl@reddit (OP)
Looking into it now. Thanks for letting me know.
InternetNavigator23@reddit
A bit over my head but is this context limited like dflash?
Love the speed!
YoussofAl@reddit (OP)
Nope! No context limitation beyond the model's native context length. TPS will naturally decline over long contexts but you'll still see a speed increase.
Raredisarray@reddit
Really nice work here 🔥🔥
YoussofAl@reddit (OP)
Thank you!