2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
Posted by ex-arman68@reddit | LocalLLaMA | 19 comments
The recent PR to llama.cpp brings MTP support to Qwen 3.6 27B. It uses the model's built-in tensor layers for speculative decoding. None of the existing GGUFs have it, as they need to be reconverted with this PR.
I have tested it locally on my Mac M2 Max 96GB, and the results are amazing: a 2.5x speed increase, bringing it to 28 tok/s! In addition, recent releases of llama.cpp also support turboquants, which help a lot with memory usage in more constrained environments (and give an additional speed boost).
I have converted the most useful quants and uploaded them to HF. Even if you are on Apple silicon, you should use these instead of MLX. You can download them here:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF
This also includes 7 fixes I made to the original jinja chat template, which relied on vLLM-specific behaviour that broke in other tools:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
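If a downstream tool still chokes on the template embedded in a GGUF, llama-server can be pointed at the fixed file directly. A minimal sketch; the template filename is an assumption, use whichever file you download from the repo above:

```bash
# Override the GGUF's embedded chat template with the fixed jinja file
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --jinja --chat-template-file qwen3.6-fixed.jinja
```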
For now, you will need to compile your own version of llama.cpp to use them. It is fairly simple to do:
```bash
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
# Metal build for Apple silicon; on NVIDIA use -DGGML_CUDA=ON instead
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
```
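To sanity-check the build before serving, a quick one-off generation works. A sketch, assuming the model path from the HF repo above; `--spec-type` comes from the PR:

```bash
# Smoke test: binaries land in build/bin; confirm MTP decoding runs at all
./build/bin/llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  -p "Write a one-line hello world in Python." -n 64
```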
Then to start serving with the API endpoint, use a command similar to:
```bash
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k turbo4 --cache-type-v turbo4 \
  -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
```
That's it. Three optimizations in one command:
| Flag | What it does | Impact |
|---|---|---|
| `--spec-type mtp --spec-draft-n-max 5` | Multi-Token Prediction (built into the model) | 2.5x faster generation |
| `--cache-type-k turbo4 --cache-type-v turbo4` | 4.25-bit KV cache (instead of 16-bit) | Quarter the KV memory |
| `-c 262144` | 262K context window | Full native context on a 48 GB Mac with turbo4 KV |
Adjust `-m`, `-c`, and `--cache-type-k/v` for your hardware, according to the tables below.
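Once it's up, the server speaks the standard OpenAI chat completions protocol, so a quick curl is enough to verify the endpoint (the model name is just a label here):

```bash
# Minimal OpenAI-compatible smoke test against the server started above
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```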
Here are my recommendations based on your hardware:
Apple Silicon
| RAM | Quant | KV cache | Max context | Memory used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M | turbo3 | 48K | 11.7 GB | ✗ |
| 24 GB | IQ3_M | turbo4 | 64K | 15.4 GB | ✗ |
| 24 GB | IQ4_XS | turbo3 | 48K | 15.9 GB | ✗ |
| 32 GB | Q4_K_M | turbo4 | 128K | 22.8 GB | ✓ |
| 32 GB | IQ4_XS | turbo4 | 160K | 23.4 GB | ✓ |
| 32 GB | Q5_K_M | turbo4 | 80K | 23.1 GB | ✓ |
| 48 GB | Q6_K | q8_0 | 128K | 33.8 GB | ✓ |
| 48 GB | Q5_K_M | turbo4 | 262K | 32.8 GB | ✓ |
| 48 GB | Q8_0 | q8_0 | 80K | 35.0 GB | ✓ |
| 64+ GB | Q8_0 | q8_0 | 262K | 53.2 GB | ✓ |
NVIDIA GPU
| VRAM | Quant | KV cache | Max context | Memory used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ3_M | turbo4 | 48K | 14.6 GB | ✗ |
| 16 GB | IQ2_M | turbo4 | 80K | 14.0 GB | ✓ |
| 24 GB | Q4_K_M | turbo4 | 128K | 22.8 GB | ✗ |
| 24 GB | IQ4_XS | turbo4 | 128K | 21.7 GB | ✓ |
| 48 GB | Q8_0 | q8_0 | 128K | 39.8 GB | ✓ |
| 48 GB | Q6_K | turbo4 | 262K | 35.8 GB | ✓ |
| 80 GB | Q8_0 | q8_0 | 262K | 53.2 GB | ✓ |
- **24 GB Mac:** `IQ4_XS` for quality (48K), or `IQ3_M` for more context (64K).
- **32 GB Mac:** `IQ4_XS` reaches 160K (imatrix). `Q5_K_M` for quality at 80K.
- **48 GB Mac:** `Q5_K_M`/turbo4 reaches 262K. `Q6_K` at 128K or `Q8_0` at 80K for higher quality.
- **24 GB GPU:** `IQ4_XS` enables vision at 128K (`Q4_K_M` can't fit both).
- **48 GB GPU:** `Q6_K`/turbo4 reaches 262K.
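If you want to sanity-check the KV figures behind these tables, here is a back-of-the-envelope sizing sketch. The layer and head dimensions below are placeholder assumptions, not the model's real config; read the actual values from the GGUF metadata:

```bash
# Rough KV cache sizing: 2 tensors (K and V) x layers x kv_heads x head_dim
# x context length x bytes per element. Dimensions are illustrative guesses.
ctx=262144; layers=48; kv_heads=8; head_dim=128
for bits in 16 8 4.25; do   # f16, 8-bit, turbo4
  awk -v b="$bits" -v c="$ctx" -v l="$layers" -v h="$kv_heads" -v d="$head_dim" \
    'BEGIN { printf "%5.2f-bit KV @ %dK ctx: %.1f GB\n", b, c/1000, 2*l*h*d*c*(b/8)/2^30 }'
done
```

With these placeholder dims, turbo4 at 4.25 bits is 4.25/16 ≈ 27% of the f16 cache, which is where the "quarter the KV memory" figure in the flag table comes from.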
For coding and reasoning, prioritize higher quants with q8_0 KV. For general chat and RAG, IQ4_XS with turbo4 and larger context is often sufficient.
Vision adds 0.9 GB for the mmproj. The recommendations reserve 8 GB for macOS (4 GB on 16 GB machines); this is conservative, not OS-enforced. You can increase available VRAM by raising the wired memory limit, e.g. for a 96 GB Mac: `sudo sysctl iogpu.wired_limit_mb=90112` (88 GB). Adjust the value for your RAM size. On NVIDIA, about 1 GB is reserved for CUDA.
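A small helper to derive that sysctl value from the reserve rule above (4 GB reserved on a 16 GB Mac, 8 GB otherwise). Apple silicon only, and note the setting does not persist across reboots:

```bash
# Compute iogpu.wired_limit_mb from installed RAM
ram_gb=$(( $(sysctl -n hw.memsize) / 1073741824 ))
reserve_gb=8
[ "$ram_gb" -le 16 ] && reserve_gb=4
limit_mb=$(( (ram_gb - reserve_gb) * 1024 ))
echo "sudo sysctl iogpu.wired_limit_mb=$limit_mb"   # 96 GB -> 90112 (88 GB)
```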
yes_i_tried_google@reddit
Same success here on an RTX 3090 Ti: IQ4 with MTP enabled (custom build from the open PRs).
Qwen 3.6 27B. Full 256k ctx, IQ4_XS. q4/q4. 100 tok/sec
Qwen 3.6 35B. 200k ctx, IQ4_XS. q4/q4. 200 tok/sec
https://huggingface.co/localweights/Qwen3.6-27B-MTP-IQ4_XS-GGUF
https://huggingface.co/localweights/Qwen3.6-35B-A3B-MTP-IQ4_XS-GGUF
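For anyone trying to reproduce this, a command sketch matching those settings; the filename is a guess from the repo, `--spec-type` is the PR's flag, and a quantized V cache needs flash attention enabled:

```bash
# Approximation of the 3090 Ti setup above: IQ4_XS weights, q4_0 KV both
# ways, full 256k context. -fa is required for a quantized V cache.
llama-server -m Qwen3.6-27B-MTP-IQ4_XS.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k q4_0 --cache-type-v q4_0 -fa \
  -c 262144 -ngl 99 --port 8081
```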
jacek2023@reddit
When was turbo3/turbo4 merged? Or is this part of the MTP PR?
pmttyji@reddit
Custom fork, probably.
Links related to TurboQuant in llama.cpp are here, to track progress.
No_Algae1753@reddit
Just went through the release notes and couldn't find anything on TQ.
No_Algae1753@reddit
Wait the draft is already merged? I thought it was still in "beta"
MrBIMC@reddit
Not merged; that's why his instructions specify checking out the MTP branch and building yourself.
I'm more surprised that support for turbo4 is merged. Didn't know it was already available. Though idk how much worse/better it is compared to q4 or q8 on llama-server.
Glum-Atmosphere9248@reddit
Don't they offer attn-rot which should be similar?
No_Algae1753@reddit
I think they do, however it is still lossy
pmttyji@reddit
That's only for quality, I think. Basically it's beneficial only for people who didn't quantize the KV cache: they could use Q8 now to save memory. No benefit for people who already quantize the KV cache (still, quality is improved for Q8).
No 'speed boost' or 'less memory usage'.
No_Algae1753@reddit
Yeah am just surprised as you are.
MrBIMC@reddit
Just tried building for CUDA, and it tells me turbo4 is not supported -_-
And with anything above 64k ctx, Q4_K_M + KV q8_0 runs out of memory on startup.
Welp.
No_Algae1753@reddit
So what was his claim about? Or is CUDA not supported for turbo4?
MrBIMC@reddit
Idk, the Hugging Face page says to use turbo4 even on NVIDIA setups, so I assumed it should work.
pmttyji@reddit
Yep. Still in draft
https://github.com/ggml-org/llama.cpp/pull/22673
rerri@reddit
Is 5 really optimal for the draft max? I'm mostly seeing 2 and 3 recommended elsewhere.
Also, does mmproj work with speculative decoding on llama.cpp? I tried it just now with PR 22673 and it crashes for me.
No_Algae1753@reddit
Hey, I have 2 questions; this one's off topic:
> You can increase available VRAM by raising the wired memory limit, e.g. for a 96 GB Mac: `sudo sysctl iogpu.wired_limit_mb=90112` (88 GB)
I have set mine to 96000 (I'm using the same Mac). Have you noticed any swapping happening as the context increasingly grows? That happens on my Mac, and I don't want swap being used for cache offloading. Also, why do you limit the VRAM to 88 GB?
Second question: did I read it right that they also released turboquant? If so, are there some benchmarks / more info on it?
cleversmoke@reddit
Amazing work! I just downloaded! Thank you!
ResidentPositive4122@reddit
Legend.
Man, these past 6 months have brought us more than the last 2 years combined. On the one hand we've seen really powerful open models (glms, kimis, deepseeks, minimaxs, mimos, etc) and more importantly for this community, really useful "good enough" truly local models in gemmas and qwens.
Now we're seeing lots of inference improvements that can be run on consumer hardware, and that's what we mostly care about. Insane progress in a very short timespan.
ps5cfw@reddit
I am a fan of your template and truly appreciate your work. Are you using a similar strategy to AesSedai in terms of what you quantize and how? In my experience, for coding purposes his quants are the best around; his Q6 Qwen 3.6 35B has actively outmatched unsloth's Q8_K_XL in my usage scenarios when paired with your template.