80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Posted by janvitos@reddit | LocalLLaMA

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speed with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec on the benchmark found here: https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py
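
A minimal way to fetch and run the benchmark (this assumes the script talks to a local llama-server at its defaults; check the script itself for any flags or URLs it expects):

curl -L -o mtp-bench.py https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py
python3 mtp-bench.py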

This is on an RTX 4070 Super, so results with other cards might vary.

To run llama.cpp with MTP support, you need to build it from source with a draft PR applied that hasn't yet been merged into the master branch. You can find a very nice guide on how to do that, as well as the Qwen3.6 MTP GGUF download, here: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
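
For reference, a rough sketch of that kind of build looks like this. The exact PR number and branch name come from the guide above (<PR_NUMBER> is just a placeholder here), and -DGGML_CUDA=ON assumes an NVIDIA card:

# clone llama.cpp and check out the MTP draft PR on top of master
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/<PR_NUMBER>/head:mtp-draft
git checkout mtp-draft

# build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j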

llama.cpp command:

llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -fitt 1664 \
  -c 131072 \
  -n 32768 \
  -fa on \
  -np 1 \
  -ctk q8_0 \
  -ctv q8_0 \
  -ctkd q8_0 \
  -ctvd q8_0 \
  -ctxcp 128 \
  --no-mmap \
  --mlock \
  --no-warmup \
  --spec-type mtp \
  --spec-draft-n-max 2 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

The most important parameter here is -fitt 1664. Since part of the model is offloaded to the CPU because of its size, this tells llama.cpp how to balance the load between GPU and CPU for the best possible performance, and it leaves 1664 MB of VRAM free for the MTP draft model and KV cache. Since I'm running my dGPU as a secondary GPU (monitor plugged into the iGPU), I can use all of the available 12 GB of VRAM for inference; 1664 might be too small if your dGPU is also your primary (display) GPU.
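
To see how much headroom your desktop is already eating before you pick a value, you can check VRAM usage while the system is idle (standard nvidia-smi query, nothing MTP-specific):

nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv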

Benchmark results:

mtp-bench.py

 code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
 code_cpp           pred=  58 draft=  40 acc=  37 rate=0.925 tok/s=81.8
 explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
 summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=75.4
 qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
 translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=81.9
 creative_short     pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
 stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
 long_code_review   pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2
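
If you just want a quick tok/s sanity check without the benchmark script, llama-server's native /completion endpoint reports generation timings directly (assuming jq is installed):

curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short Python function that reverses a string.", "n_predict": 192}' \
  | jq '.timings.predicted_per_second'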

If you have any questions, feel free to ask :)

Cheers.