2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
Posted by ex-arman68@reddit | LocalLLaMA | 19 comments
The recent PR to llama.cpp brings MTP support to Qwen 3.6 27B. It uses the model's built-in tensor layers for speculative decoding. None of the existing GGUFs have it, as they need to be reconverted with this PR.
I have tested it locally on my Mac M2 Max 96GB, and the results are amazing: a 2.5x speed increase, bringing it to 28 tok/s! In addition, recent releases of llama.cpp also support turboquants, which help a lot with memory usage in more constrained environments (and give an additional speed boost).
I have converted the most useful quants and uploaded them to HF. Even if you are on Apple silicon, you should use these instead of MLX. You can download them here:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF
This also includes 7 fixes I made to the original jinja chat template, which relied on vLLM-specific behaviour that broke in other tools:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
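If a downstream tool still chokes on the template embedded in a GGUF, llama-server can be pointed at the fixed file directly. A minimal sketch; the template filename is an assumption, use whichever file you download from the repo above:

```bash
# Override the GGUF's embedded chat template with the fixed jinja file
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --jinja --chat-template-file qwen3.6-fixed.jinja
```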
For now, you will need to compile your own version of llama.cpp to use them. It is fairly simple to do:
```bash
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
# Metal build for Apple silicon; on NVIDIA use -DGGML_CUDA=ON instead
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
```
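To sanity-check the build before serving, a quick one-off generation works. A sketch, assuming the model path from the HF repo above; `--spec-type` comes from the PR:

```bash
# Smoke test: binaries land in build/bin; confirm MTP decoding runs at all
./build/bin/llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  -p "Write a one-line hello world in Python." -n 64
```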
Then to start serving with the API endpoint, use a command similar to:
```bash
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k turbo4 --cache-type-v turbo4 \
  -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
```
That's it. Three optimizations in one command:
| Flag | What it does | Impact |
|---|---|---|
| `--spec-type mtp --spec-draft-n-max 5` | Multi-Token Prediction (built into the model) | 2.5x faster generation |
| `--cache-type-k turbo4 --cache-type-v turbo4` | 4.25-bit KV cache (instead of 16-bit) | Quarter the KV memory |
| `-c 262144` | 262K context window | Full native context on a 48 GB Mac with turbo4 KV |
Adjust `-m`, `-c`, and `--cache-type-k/v` for your hardware, according to the tables below.
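Once it's up, the server speaks the standard OpenAI chat completions protocol, so a quick curl is enough to verify the endpoint (the model name is just a label here):

```bash
# Minimal OpenAI-compatible smoke test against the server started above
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```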
Here are my recommendations based on your hardware:
Apple Silicon
| RAM | Quant | KV cache | Max context | Memory used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M | turbo3 | 48K | 11.7 GB | ✗ |
| 24 GB | IQ3_M | turbo4 | 64K | 15.4 GB | ✗ |
| 24 GB | IQ4_XS | turbo3 | 48K | 15.9 GB | ✗ |
| 32 GB | Q4_K_M | turbo4 | 128K | 22.8 GB | ✓ |
| 32 GB | IQ4_XS | turbo4 | 160K | 23.4 GB | ✓ |
| 32 GB | Q5_K_M | turbo4 | 80K | 23.1 GB | ✓ |
| 48 GB | Q6_K | q8_0 | 128K | 33.8 GB | ✓ |
| 48 GB | Q5_K_M | turbo4 | 262K | 32.8 GB | ✓ |
| 48 GB | Q8_0 | q8_0 | 80K | 35.0 GB | ✓ |
| 64+ GB | Q8_0 | q8_0 | 262K | 53.2 GB | ✓ |
NVIDIA GPU
| VRAM | Quant | KV cache | Max context | Memory used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ3_M | turbo4 | 48K | 14.6 GB | ✗ |
| 16 GB | IQ2_M | turbo4 | 80K | 14.0 GB | ✓ |
| 24 GB | Q4_K_M | turbo4 | 128K | 22.8 GB | ✗ |
| 24 GB | IQ4_XS | turbo4 | 128K | 21.7 GB | ✓ |
| 48 GB | Q8_0 | q8_0 | 128K | 39.8 GB | ✓ |
| 48 GB | Q6_K | turbo4 | 262K | 35.8 GB | ✓ |
| 80 GB | Q8_0 | q8_0 | 262K | 53.2 GB | ✓ |
- **24 GB Mac:** `IQ4_XS` for quality (48K), or `IQ3_M` for more context (64K).
- **32 GB Mac:** `IQ4_XS` reaches 160K (imatrix). `Q5_K_M` for quality at 80K.
- **48 GB Mac:** `Q5_K_M`/turbo4 reaches 262K. `Q6_K` at 128K or `Q8_0` at 80K for higher quality.
- **24 GB GPU:** `IQ4_XS` enables vision at 128K (`Q4_K_M` can't fit both).
- **48 GB GPU:** `Q6_K`/turbo4 reaches 262K.
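If you want to sanity-check the KV figures behind these tables, here is a back-of-the-envelope sizing sketch. The layer and head dimensions below are placeholder assumptions, not the model's real config; read the actual values from the GGUF metadata:

```bash
# Rough KV cache sizing: 2 tensors (K and V) x layers x kv_heads x head_dim
# x context length x bytes per element. Dimensions are illustrative guesses.
ctx=262144; layers=48; kv_heads=8; head_dim=128
for bits in 16 8 4.25; do   # f16, 8-bit, turbo4
  awk -v b="$bits" -v c="$ctx" -v l="$layers" -v h="$kv_heads" -v d="$head_dim" \
    'BEGIN { printf "%5.2f-bit KV @ %dK ctx: %.1f GB\n", b, c/1000, 2*l*h*d*c*(b/8)/2^30 }'
done
```

With these placeholder dims, turbo4 at 4.25 bits is 4.25/16 ≈ 27% of the f16 cache, which is where the "quarter the KV memory" figure in the flag table comes from.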
For coding and reasoning, prioritize higher quants with q8_0 KV. For general chat and RAG, IQ4_XS with turbo4 and larger context is often sufficient.
Vision adds 0.9 GB for the mmproj. The recommendations reserve 8 GB for macOS (4 GB on 16 GB machines); this is conservative, not OS-enforced. You can increase available VRAM by raising the wired memory limit, e.g. for a 96 GB Mac: `sudo sysctl iogpu.wired_limit_mb=90112` (88 GB). Adjust the value for your RAM size. On NVIDIA, about 1 GB is reserved for CUDA.
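A small helper to derive that sysctl value from the reserve rule above (4 GB reserved on a 16 GB Mac, 8 GB otherwise). Apple silicon only, and note the setting does not persist across reboots:

```bash
# Compute iogpu.wired_limit_mb from installed RAM
ram_gb=$(( $(sysctl -n hw.memsize) / 1073741824 ))
reserve_gb=8
[ "$ram_gb" -le 16 ] && reserve_gb=4
limit_mb=$(( (ram_gb - reserve_gb) * 1024 ))
echo "sudo sysctl iogpu.wired_limit_mb=$limit_mb"   # 96 GB -> 90112 (88 GB)
```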
yes_i_tried_google@reddit
Same success here on an RTX 3090 Ti: IQ4 with MTP enabled (custom build from the open PRs).
Qwen 3.6 27B. Full 256k ctx, IQ4_XS. q4/q4. 100 tok/sec
Qwen 3.6 35B. 200k ctx, IQ4_XS. q4/q4. 200 tok/sec
https://huggingface.co/localweights/Qwen3.6-27B-MTP-IQ4_XS-GGUF
https://huggingface.co/localweights/Qwen3.6-35B-A3B-MTP-IQ4_XS-GGUF
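For anyone trying to reproduce this, a command sketch matching those settings; the filename is a guess from the repo, `--spec-type` is the PR's flag, and a quantized V cache needs flash attention enabled:

```bash
# Approximation of the 3090 Ti setup above: IQ4_XS weights, q4_0 KV both
# ways, full 256k context. -fa is required for a quantized V cache.
llama-server -m Qwen3.6-27B-MTP-IQ4_XS.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k q4_0 --cache-type-v q4_0 -fa \
  -c 262144 -ngl 99 --port 8081
```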
jacek2023@reddit
When was turbo3/turbo4 merged? Or is this part of the MTP PR?
pmttyji@reddit
Custom fork, probably.
Links related to TurboQuant in llama.cpp are here, to track progress.
No_Algae1753@reddit
Just went through the release notes and couldn't find anything on TQ.
No_Algae1753@reddit
Wait the draft is already merged? I thought it was still in "beta"
MrBIMC@reddit
Not merged; that's why his instructions specify checking out the MTP branch and building yourself.
I'm more surprised that support for turbo4 is merged. Didn't know it was already available. Though idk how much worse/better it is compared to q4 or q8 on llama-server.
Glum-Atmosphere9248@reddit
Don't they offer attn-rot which should be similar?
No_Algae1753@reddit
I think they do, however it is still lossy
pmttyji@reddit
That's only for quality, I think. Basically it's beneficial only for people who didn't quantize the KV cache: they could use Q8 now to save memory. No benefit for people who already quantize the KV cache (still, quality is improved for Q8).
No 'speed boost' or 'less memory usage'.
No_Algae1753@reddit
Yeah am just surprised as you are.
MrBIMC@reddit
Just tried building for CUDA, and it tells me turbo4 is not supported -_-
And with anything above 64k ctx, Q4_K_M + KV q8_0 runs out of memory on startup.
Welp.
No_Algae1753@reddit
So what was his claim about? Or is CUDA not supported for turbo4?
MrBIMC@reddit
Idk, the Hugging Face page says to use turbo4 even on NVIDIA setups, so I assumed it should work.
pmttyji@reddit
Yep. Still in draft
https://github.com/ggml-org/llama.cpp/pull/22673
rerri@reddit
Is 5 really optimal for the draft max? I'm mostly seeing 2 and 3 recommended elsewhere.
Also, does mmproj work with speculative decoding on llama.cpp? I tried it just now with PR 22673 and it crashes for me.
No_Algae1753@reddit
Hey, I have 2 questions; this one's off topic:
> You can increase available VRAM by raising the wired memory limit, e.g. for a 96 GB Mac: `sudo sysctl iogpu.wired_limit_mb=90112` (88 GB)
I have set mine to 96000 (I'm using the same Mac). Have you noticed any swapping happening as the context increasingly grows? That happens on my Mac, and I don't want swap being used for cache offloading. Also, why do you limit the VRAM to 88 GB?
Second question: did I read it right that they also released turboquant? If so, are there some benchmarks / more info on it?
cleversmoke@reddit
Amazing work! I just downloaded! Thank you!
ResidentPositive4122@reddit
Legend.
Man, these past 6 months have brought us more than the last 2 years combined. On the one hand we've seen really powerful open models (glms, kimis, deepseeks, minimaxs, mimos, etc) and more importantly for this community, really useful "good enough" truly local models in gemmas and qwens.
Now we're seeing lots of inference improvements that can be run on consumer hardware, and that's what we mostly care about. Insane progress in a very short timespan.
ps5cfw@reddit
I am a fan of your template and truly appreciate your work. Are you using a similar strategy to AesSedai in terms of what you quantize and how? In my experience, for coding purposes his quants are the best around; his Q6 Qwen 3.6 35B has actively outmatched unsloth's Q8_K_XL in my usage scenarios when paired with your template.