Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs
Posted by enrique-byteshape@reddit | LocalLLaMA | 57 comments
Hey r/LocalLLaMA,
We’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (Next Token Prediction or non-MTP) and MTP.
Blog / Download NTP Models / Download MTP Models
TL;DR
- For NTP, “pick the largest quant that fits” worked surprisingly well.
- Lower bpw was not automatically better: our largest model was very hard to beat on quality/speed, including prompt processing and token generation.
- MTP gave a real GPU generation-speed boost, usually around 20–40%, but the extra memory footprint can change what fits.
- MTP speedup is heavily workload dependent.
- CPU MTP was not attractive in our tests, so our CPU recommendation remains NTP.
- We excluded MMLU from this release because Qwen 3.6 showed answer-format compliance issues in full precision, making it a noisy quantization-comparison signal.
For this release, we tried to make the comparison more of a small hardware study than just a model drop. We benchmarked the original model and a broader set of quantized variants across RTX 4090, 5090, Pro 6000, 4080, 5060 Ti, plus Intel i7, Intel Ultra 7, Ryzen 9, and Raspberry Pi 5. Shoutout to the quantizers we included in the comparisons: Bartowski, Unsloth, Mudler, and AesSedai. We picked a few of the most recommended quants from each of the quantizers, since you probably wouldn’t care about these results if we took the time to evaluate every single quant (or once 3.7 comes out ;) ).
The main NTP result was a bit counterintuitive. Usually, you expect smaller bpw quants to win clearly on speed. Here our largest release variant often stayed competitive not only in quality but also in prompt processing and token generation. So bpw is not something to minimize blindly: if the larger model fits your memory and context budget, it may still be the better choice.
There are hardware-specific exceptions, especially on 16GB devices and Raspberry Pi 5, so we put the full recommendations and plots in the blog rather than trying to compress all of them here.
For MTP, the trade-off is different. On GPUs, we saw a meaningful generation-speed boost, usually around 20–40% (this is heavily workload dependent, so test it on your own workload). But MTP also increases runtime memory, so on 16GB GPUs the larger MTP model was no longer practical at our context settings, making the GPU-2 MTP model the usable recommendation. The MTP results also support the same bpw observation: in some cases, the larger model basically catches up with the smaller model in throughput.
CPU MTP was not attractive in our tests. Prompt processing is already slow on CPUs, and MTP makes it worse. For now, our CPU recommendation remains NTP.
Methodology note: we found an answer-format compliance issue in Qwen 3.6 that we did not see in the same way with Qwen 3.5. In several MMLU cases, the full-precision model appeared to know the answer, but did not respond in the strict format expected by the benchmark, despite the prompts being 5-shot. Since this was already a baseline-model behavior rather than a quantization artifact, we excluded MMLU from the benchmarking for this release.
So, the important takeaway is:
For this model, “pick the largest quant that fits” worked surprisingly well for NTP. MTP is worth it on GPUs if you have the memory headroom, but it changes what fits and is not automatically better on CPUs.
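A rough way to sanity-check "pick the largest quant that fits" is to estimate weight size as parameters × bpw / 8 and add whatever the runtime needs on top. The sketch below is only an illustration: the 3.0 and 4.19 bpw values are the ones mentioned in this thread, while the 22 GB budget and the 3 GB / 5 GB overhead figures (KV cache, buffers, extra MTP state) are hypothetical placeholders, not measurements.

```python
def weights_gb(total_params_b: float, bpw: float) -> float:
    """Approximate weight size in GB: params (in billions) * bits per weight / 8."""
    return total_params_b * bpw / 8.0


def largest_quant_that_fits(quants, total_params_b, memory_gb, overhead_gb):
    """Return the highest-bpw (name, bpw) whose weights plus overhead fit, else None."""
    fitting = [
        (name, bpw)
        for name, bpw in quants
        if weights_gb(total_params_b, bpw) + overhead_gb <= memory_gb
    ]
    return max(fitting, key=lambda q: q[1]) if fitting else None


# bpw values taken from the thread; the labels are just the bpw figures.
QUANTS = [("3.0bpw", 3.0), ("4.19bpw", 4.19)]

# Hypothetical 22 GB budget: 3 GB overhead without MTP, 5 GB with MTP.
print(largest_quant_that_fits(QUANTS, 35, memory_gb=22, overhead_gb=3))
print(largest_quant_that_fits(QUANTS, 35, memory_gb=22, overhead_gb=5))
```

With these made-up overheads, the larger quant fits in the first case but not the second, which is the "MTP changes what fits" effect in miniature. Real file sizes and runtime memory will differ, so treat this as a first filter before actually loading the model.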
We’ll keep Reddit short-ish. The blog has the full graphs, experiments, hardware breakdowns, and methodology details.
mukz_mckz@reddit
This is very cool, thanks for your work!
Do you plan on doing something similar to this: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks ? Testing KL divergence across different quants that you've made? Would be interesting to see how they compare to the other community benchmarks. You can probably hit up the unsloth people either here or on their subreddit, they might help you get this up and running :)
I think many people are now looking at and using these benchmarks to select what model fits their use case the most, I also like yours since it benchmarks tps too!
enrique-byteshape@reddit (OP)
Thank you for the suggestion. KLD is a very well known proxy metric. We are not covering KLD or perplexity in this release post. In our experience with instruction-tuned models, they are useful for catching clearly broken quantizations, but once models are roughly in the right range, we have not found them to reliably predict downstream benchmark performance. We will discuss this, along with how we approach evaluation, in a dedicated blog post soon. We have been gathering data to show this behaviour and other behaviours of KLD specifically
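For readers unfamiliar with the metric being discussed: KLD here means the KL divergence between the next-token distribution of the reference model and that of the quantized model, typically averaged over many token positions. The snippet below is a minimal single-position sketch in plain Python, assuming you already have the two logit vectors; the function names are made up for illustration and this is not how any particular tool computes it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_sum for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Identical logits give zero divergence; perturbed logits give a positive value.
print(kl_divergence([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(kl_divergence([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```

A low average value says the quantized model's token probabilities track the reference closely; as the reply above notes, that does not by itself guarantee matching downstream benchmark scores.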
mukz_mckz@reddit
That sounds interesting! Thank you for your work, looking forward to that post :)
Serious-Affect-6410@reddit
Thanks for your work! It’s really impressive.
May I ask which one I should pick for an Apple Silicon machine? NTP or MTP?
I always get confused by the Apple processor… it has GPUs but sometimes performs like a CPU…
andy2na@reddit
Thanks! Any plan on qwen3.6-27B?
enrique-byteshape@reddit (OP)
Not right now. We're holding our breath a little bit to see what happens with Qwen 3.7, since it seems like the Qwen team have started to quickly iterate over models. For us, evaluating dense models takes much much longer, so by the time we would release Qwen 3.6 27B, there's a high chance 3.7 is out. Stay tuned though, we're always lurking within our compute possibilities :)
PM_ME_UR_COFFEE_CUPS@reddit
Stupid question: I tried the unsloth quant but it was dense so I got like 4tps with mtp. The 35b moe variant I got 40TPS. Is there any moe variant for 27b? 4tps was too slow
PurpleWinterDawn@reddit
You are misunderstanding what "dense" and "MoE" mean. They are intrinsic properties of a model that are non-interchangeable.
The 35b MoE has 3b "active" parameters per token. This means only 3 billion parameters are taken into consideration instead of the whole model, including a common set of early parameters used to decide which experts to pick for inferring the next token.
A "dense" model is a model in which all parameters are active for every single token.
MoE models are a trade-off of reducing memory bandwidth by increasing the memory footprint:
- A rule of thumb is to take √(MoE active params × MoE total params) to get an idea of the parameter count of an equivalent dense model in terms of capabilities. This would mean the 35b-a3b model would be equivalent to a 10.2b dense model (hence why I think it's outdated, it feels much more capable than a 10b model).
- Another benefit: since the number of active parameters in MoE models is reduced compared to a dense model, they allow hybrid inference on CPU + GPU, putting the common set of parameters in VRAM and leaving the experts in the much larger RAM.
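The square-root estimate above is just the geometric mean of active and total parameter counts. A tiny sketch for checking the 35b-a3b figure (the function name is made up for illustration):

```python
import math


def dense_equivalent_b(active_params_b: float, total_params_b: float) -> float:
    """Geometric-mean estimate of the 'equivalent dense' size, in billions."""
    return math.sqrt(active_params_b * total_params_b)


# 35b total, 3b active: sqrt(3 * 35) = sqrt(105)
print(round(dense_equivalent_b(3, 35), 1))  # prints 10.2
```

As the comment says, this is a heuristic and may well underestimate newer MoE models.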
PM_ME_UR_COFFEE_CUPS@reddit
Thanks for such a detailed explanation.
andy2na@reddit
no, 27B is dense and 35b is MoE, it's just how the model was built.
For gemma4, 26B is MoE and 31B is Dense
PM_ME_UR_COFFEE_CUPS@reddit
Got it, thanks for explaining. I will also try out the MLX for 27b, which should be faster on my Apple Silicon. Hope it’s usable
Middle_Bullfrog_6173@reddit
What benchmarks are in the accuracy score? Only MMLU is listed as not being benchmarked.
enrique-byteshape@reddit (OP)
Hey, we probably forgot to append it to the end of the blog. It's the same as in previous models: BFCL_V3 for tool calling, GSM8K_V for vision + math, LiveCodeBench V6 and HumanEval for coding, GSM8K for math and IFEVAL for instruction following. We only removed MMLU because it was inconsistent even for the baseline.
Middle_Bullfrog_6173@reddit
Thanks, sounds good for covering relevant capabilities.
cato_gts@reddit
Thank you. I have been running the IQ3 3.0 bpw model on the BC250 with kv cache q5/q5 since yesterday, and I haven't encountered a dead loop yet.
enrique-byteshape@reddit (OP)
That's awesome! Maybe we should look into KV cache compression 0.0
EggDroppedSoup@reddit
Amazing release! Wanted to ask if there have been any benchmarked results on off loading setups. I expect an increase in tps compared to unsloth but I wanted to ask first before I test (when i get back to my setup)
enrique-byteshape@reddit (OP)
We don't currently have benchmarks for off loading setups since we focus on purely keeping the model on-device for now, but someone posted a comment on our HuggingFace page saying that with MTP active our 4.19bpw model got faster with more context on a Strix Halo. We'll see if we can introduce some off-loading benchmarking in the future without extending the scope of releases too much.
EggDroppedSoup@reddit
like medium (32) ram, low (8) vram scenarios
73td@reddit
thanks for this. I didn’t think i’d get this model to run on my rtx4090 and now even with GPU5 I can use at least 80k context.
enrique-byteshape@reddit (OP)
Love to hear that! Let us know how it behaves and if the quant works well
janvitos@reddit
Benchmarked (mtp-bench.py) Qwen3.6-35B-A3B-IQ4_XS-4.19bpw (GPU-5) MTP model with ik_llama.cpp. It's blazing fast! Getting 110.24 tok/s average, which is almost 20 tok/s higher than Qwen3.6-35B-A3B-UD-IQ4_XS MTP:
ik_llama.cpp command:
The secret sauce is using ik_llama.cpp and --fit --fit-margin 1664. You might need to tweak --fit-margin depending on your VRAM.
Cheers.
inddiepack@reddit
How are you getting 110 tk/s, on a 12 GB VRAM, which means you have to offload about a third of the model's layers to the CPU? Especially as the MTP models, as described in the description as well, run poorly when offloaded partially to the CPU?
enrique-byteshape@reddit (OP)
This is great to hear! Squeezing performance out of llama.cpp and ik_llama.cpp is hard, but still our goal, so this is awesome. Thank you for the benchmarking, really appreciate it!
moahmo88@reddit
Amazing! Thanks!
kiwibonga@reddit
Thanks for this, I used your 3.5 35B for a while, has been pretty solid
VoidAlchemy@reddit
Pretty graph! I looked at the blog methodologies section but don't see your full llama-server command? I assume by "NTP" you mean --spec-type ngram-mod but don't see it explained in detail anywhere.
Also I believe on mainline llama.cpp you can run both ngram-mod and MTP at the same time e.g.:
So it might not be a simple "either/or" ?
Anyway, thanks for sharing some more data points for consideration!
ali_byteshape@reddit
Thank you for the kind words, and for taking the time to look through the blog and share this!
By NTP, we mean the simple vanilla next-token prediction setup, without any speculation such as n-gram or a draft model.
For MTP, yes, you are right that you can chain n-gram speculation and a draft model, and some people have reported that this improves effective TG speed. In our tests, using the same type of data we use for our quality benchmarks, the ngram-mod option did not help much.
That said, this is very workload-dependent, so there is no one-size-fits-all answer. It may very well help in other settings.
machrider@reddit
The x-axis on this graph is super misleading!
enrique-byteshape@reddit (OP)
What would you suggest to make it less misleading? We plot the full range of TPS from the model that has the least TPS to the one that has the most. Otherwise the graph would not be useful in our opinion
ps5cfw@reddit
Hey this Is pretty nice!
I am one of those CPU hybrid users who only sees incredible slowdowns from using MTP with this model, so I can relate to your findings pretty well!
I'd be interested to try your quants, do you intend to release any Q6 GGUF? I avoid going lower than Q6 for this model
enrique-byteshape@reddit (OP)
We let ShapeLearn learn the datatype selections at different aggressiveness, but always traversing the quality-TPS curve. So technically speaking with our models you shouldn't need Q6, as our highest BPW model has better accuracy than all Q5 and Q6 quants. You can give it a go and test its reliability, we're really proud of it.
ps5cfw@reddit
I guess I'll give it a try and see how pi works with your GGUFs, hope for the best!
enrique-byteshape@reddit (OP)
Please do! And let us know how it goes! We're always excited to hear feedback from you guys to improve our quants.
ps5cfw@reddit
I wasn't expecting much, but I have to say I am extremely impressed so far on what is essentially real world typescript and C# tasks! It seems to work decently, but not at longer contexts.
Mind you, I am running an unquantized cache. I can provide the full settings if that helps you.
It starts to show signs of wear as early as 100k tokens in, with misplaced thinking tags and stopping on its own mid-task.
Still fairly impressed, might use this for whatever needs less than 64K tokens perhaps! But I still think there would be a lot to gain from a Q5 and Q6 quant of this. Maybe give it a chance?
OsmanthusBloom@reddit
You didn't compare against any Q6 quants afaict
enrique-byteshape@reddit (OP)
We compared mostly in our BPW range, but there are a couple of Q5_K_XL quants in there, which will be very similar
EsotericTechnique@reddit
Same, it does speed up tg about 60%, but I go from 1k TPS pp to 200. Not worth it in the slightest sadly
enrique-byteshape@reddit (OP)
Exactly. In the end MTP is trading off PP for TG. It's the whole premise. CPU is already slow on PP, so you end up just adding overhead. You might get some TG speedups with certain workloads (maybe shorter prompts+short replies), but at longer context lengths and longer replies, you will see a slowdown most of the time. It is very workload dependent though, from what we've observed.
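To see why the outcome depends so much on prompt and reply length, here is a deliberately simplified latency model: total time = prompt tokens / PP speed + generated tokens / TG speed. The PP drop (1000 to 200 tok/s) and the roughly 60% TG gain come from the comment above; the 10 tok/s baseline TG speed and the token counts are hypothetical, and real runs will not follow this model exactly.

```python
def total_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Simplified end-to-end time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


def mtp_speedup(prompt_tokens, gen_tokens, pp_ntp, tg_ntp, pp_mtp, tg_mtp):
    """Ratio of NTP time to MTP time; above 1.0 means MTP is faster overall."""
    ntp = total_seconds(prompt_tokens, gen_tokens, pp_ntp, tg_ntp)
    mtp = total_seconds(prompt_tokens, gen_tokens, pp_mtp, tg_mtp)
    return ntp / mtp


# Hypothetical CPU baseline of 10 tok/s TG; MTP: PP 1000 -> 200, TG +60%.
RATES = dict(pp_ntp=1000.0, tg_ntp=10.0, pp_mtp=200.0, tg_mtp=16.0)

print(f"{mtp_speedup(1_000, 500, **RATES):.2f}")   # prints 1.41 (short prompt)
print(f"{mtp_speedup(20_000, 500, **RATES):.2f}")  # prints 0.53 (long prompt)
```

Under these assumptions MTP helps with a short prompt but roughly halves end-to-end speed with a 20k-token prompt, which matches the "very workload dependent" observation.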
seemaze@reddit
What is your takeaway on low-bandwidth high-capacity UMA hardware like Strix Halo, DGX Spark, and Apple Silicon?
enrique-byteshape@reddit (OP)
We haven't benchmarked any of them since we don't have access to them, but it should be similar to just CPU, since in the end the bottlenecks are compute and memory bandwidth.
joakim_ogren@reddit
For DGX Spark (and other GB10), how do they compare to NVFP specialized Qwen 3.6 MTP models?
enrique-byteshape@reddit (OP)
We can't really tell you for sure since we can't test it, but it depends on how good the underlying kernels and setup are. History tells us that any datatype that is natively supported in hardware should be better for performance as you don't need quant-dequant stubs, but it really depends.
vastaaja@reddit
Do you test the models for accuracy at long context? My issue with the Qwen 3.6 35B quants is that it hits thinking loops and tool call issues with a long context. I'll see if it reproduces with the GPU-5 recommendation here.
The tool call one seems weird - the model can tell the correct tool call parameters when asked, but is still unable to put the same values into an actual tool call. I guess this is the kind of behavior that loss of precision in the model or kv cache can cause?
enrique-byteshape@reddit (OP)
Our models are trained and benchmarked on all sorts of context lengths, so that the resulting quant is as good as it can be for general use. We usually do better than other quants at longer context lengths. Do test our model and let us know whether it has the behaviour you are describing. Keep in mind it is known that Qwen 3.6 in general tends to not follow instructions and sometimes overthink, so that might have to do with it
hackiv@reddit
Didn't know ByteShape is crazy good
enrique-byteshape@reddit (OP)
lol "I never ask my clients to judge me on my winners, I ask them to judge me on my losers, because I have so few"
Interpause@reddit
from what i can tell, yall only benchmark at short context? im a bit concerned about the long context coherence for agentic stuff (haven't tested yet) since i noticed the sensitive ssm_alpha/beta weights got quantized quite heavily in the gguf.
enrique-byteshape@reddit (OP)
We do benchmark at long contexts as well. Our benchmarks are a mix so that they are representative. The fact that we quantize alphas and betas is just an artifact of our quantization algorithm. If they are quantized, then they are fine to be quantized ;)
Skystunt@reddit
This is the kind of comparison we need ! Speed vs quality of different model quants !
lkarlslund@reddit
Thank you for the effort. The Qwen 3.6 27b model really changed everyone's perception on what's doable locally in 2026.
Can you share what you did differently from the other models you've tested?
enrique-byteshape@reddit (OP)
The idea is that we use gradient descent itself with a calibration dataset to learn the optimal quality-TPS curve for our models. Our tech, ShapeLearn, learns at a per-layer granularity and will always find the best trade-off for the objective we set it to run for. In the case of our model releases, we've always believed that what matters the most to users is best quality at the best token generation/prompt processing speeds you can get. As long as the model fits in your hardware, you no longer care about BPW because the speed-BPW trade-off isn't linear at all.
OsmanthusBloom@reddit
Am I right that these quants are optimized mainly for small size and high speed, not quality? The largest model GPU-5 is just 4.15bpw, comparable to smaller Q4 quants from others.
I'm currently running 35B-A3B Q5 partially CPU-offloaded on 16GB VRAM, but considering switching to a higher quant to get better quality. Higher generation and PP speeds would also be nice of course, with or without MTP, whatever works best. But these ByteShape quants don't seem to offer anything in this direction.
enrique-byteshape@reddit (OP)
We optimize for the quality-TPS curve. Larger BPWs don't necessarily translate into better quality. In fact, the trade-off curve is not uniform at all and sometimes a good 4-bit or even 3-bit quant will outperform a 6-bit one because quantization can act as a regularizer of sorts. If your worry is quality, you should test our highest BPW quant, I think you'll be surprised by its reliability versus higher bit-length quants.
OsmanthusBloom@reddit
Thanks for the explanation!
Icy-Degree6161@reddit
Been waiting for this. Love you guys!
enrique-byteshape@reddit (OP)
We've also been waiting for this :) Hope you enjoy the models and let us know of any feedback you might have