Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)
Posted by Jorlen@reddit | LocalLLaMA | View on Reddit | 65 comments
In my opinion, this is a 100% game changer for local LLMs.
In terms of speed, I was getting around 1.5x the tok/sec of previous tests.
The project was a test: building a pygame project iteratively, step by step - a small mystery dungeon-style game. At first I set 100-200k context and raised it to 300k. This is at KV Q8_0 quant. I use VSCodium and Roo. The idea was to see how far I can push the context window and gauge (by feel) whether a large context window with a multi-file project slows it down too much to be effective.
Model used: Qwen3.6-35B-A3B-UD-Q5_K_S (MTP version) - link
OS/Software: Ubuntu 24.04 - Vulkan - To use MTP I had to use the MTP prototype of the llama.cpp server (image: havenoammo/llama:vulkan-server)
My current window is 300k context but I feel like I can go even higher as my VRAM used is 28.3gb / 32gb. Likely 400k is viable.
GPU: Asus Radeon R9700 AI Pro card
Just want to shoot my appreciation for the local LLM community and everyone responsible for enabling us to run these kinds of powerful models at home. Amazing when I think where we were just a year ago.
CatTwoYes@reddit
MoE and MTP feel like a natural pairing — the active parameter count is already low, so the extra decode head cost per step is proportionally smaller too. The "whoops I was on Q4 KV cache the whole time" edit is genuinely useful data though. If the model held up at 200k context on Q4 KV without immediately falling apart, that's a real-world datapoint for people trying to stretch VRAM.
Southern_Sun_2106@reddit
I just ran qwen 3.6 35B in LM Studio on a Mac to full 265K context. The model itself is just amazing - no sign of slowing down, no mistakes when calling tools, if I didn't know the number, I would think I am still in the first 5K. I remember the days when context doubled from 2K to 4K, and it felt like a miracle. Now with 300-500 'pages' in context, this is some sort of crazy mind-blowing reality we live in. All on my little laptop. Anyway, didn't mean to go off-topic - just wanted to say, Qwen 3.6 is a historic model for local inference.
Jorlen@reddit (OP)
I have tested 40+ models and none work better, I agree. With VSCodium and Roo, it's just fucking perfect. It's doing all the tool calls and knows exactly when to call APIs and everything else.
My fucking mind is blown at how good this is. Local LLM be eating good in 2026.
polawiaczperel@reddit
I was testing this model yesterday with my codebase. I am a heavy Claude Code and GPT Pro user with many years of experience in engineering. For a local model it is really great. I do not know how they did it. It is fast, has a big context and it is really good. Definitely usable for speeding up a workflow for professionals who know what they need. Of course it is not Opus, but sometimes good enough with 100% privacy is good enough. It is a great gift for the people.
cinnapear@reddit
Yeah, in five years things are going to be cray.
openSourcerer9000@reddit
What quant? I'm running qwen 397 q2 and it gets real confused after a few k tokens. Like it will re-answer a previous question instead of the latest message
Southern_Sun_2106@reddit
Surprisingly, one of the smallest ones - q4_K_M gguf from the official qwen hugging face.
Tiny_Recording6633@reddit
Local models, hardware, data - making crazy progress on quality of output and speed.
I am not a dev, but I'm building a tech business and can see ridiculous economic value in the future apps and knowledge bases built on local models for business and consumers.
Inspiring how this trend is accelerating and separating from the monolithic models.
All on this sub, keep building, it will pay off real soon!
Iajah@reddit
On RTX Pro 6000 with vLLM running Qwen 3.6 MoE FP16 through VS Code Copilot it could not complete tasks anymore from around 150k tokens. It starts working on the issue but then bails out pretty fast. Compacting the conversation fixes it. I've since reduced the size of the context. I assume the Dense model has similar issues but I have not tested it with larger contexts.
wojtek15@reddit
MTP is big progress. But can we expect DFlash/DDTree in llama.cpp?
msrdatha@reddit
Thanks for sharing the progress on this.
May I ask what token generation and pp speeds you got with MTP (and before)?
Could you also try Qwen3.6-27B to see the improvement?
Jorlen@reddit (OP)
I actually just switched (I had downloaded both) and am trying Qwen3.6-27b (MTP) version now. It's taking longer of course (MoE are faster) but my last session ended around 200k context with the other LLM getting stuck in loops.
I don't have metrics because the /metrics panel is disabled in this prototype. I had made a small webserver that tracked it, it works beautifully for the normal llama but does not work with this version.
I can however do some back-to-back tests with just llama.cpp's basic GUI for some accurate measurements. As accurate as I can provide, being new to all this...
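In the meantime, for anyone who wants rough numbers without /metrics: mainline llama-server includes a "timings" block in its /completion responses, and if the prototype kept that behavior (I haven't confirmed it did), a throwaway script along these lines should give ballpark speeds. The port is just the one I mapped for the MTP container; adjust to yours.

```python
# Rough tok/s check against a llama.cpp-style server. Assumes the MTP
# prototype still returns the "timings" block that mainline llama-server
# includes in /completion responses; falls back to wall-clock timing if not.
import time
import requests

URL = "http://localhost:8081/completion"  # host port I mapped for the MTP container

payload = {"prompt": "Write a short pygame main loop.", "n_predict": 256}
start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

timings = resp.get("timings", {})
if timings:
    print(f"prompt eval: {timings.get('prompt_per_second', 0):.1f} t/s, "
          f"generation: {timings.get('predicted_per_second', 0):.1f} t/s")
else:
    # crude fallback: generated tokens / wall clock (includes prompt processing)
    n = resp.get("tokens_predicted", payload["n_predict"])
    print(f"~{n / elapsed:.1f} t/s overall")
```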
Snoo_81913@reddit
The metrics don't show up in the server window?
Jorlen@reddit (OP)
When I tried, it said this prototype (MTP) version of llama.cpp doesn't support it. Or I did something wrong. Did you get them working on the prototype image / version?
Snoo_81913@reddit
No, I was just curious. Some forks don't have metrics in the console window to keep it clean, but they usually have a --verbose flag or some sort of summary at the end you can print out. Seems wild that you can't see the prefill.
Snoo_81913@reddit
To be clear I'm talking about the actual console window not the metrics panel.
msrdatha@reddit
"being new to all this.." - I think you are doing quite good.
Jorlen@reddit (OP)
Three weeks ago I was hitting KV cache walls in Windows with LM Studio at 16k context. I decided to learn Linux and Docker, and built an AI stack with ComfyUI, speaches (STT/TTS), Open WebUI and a few other tools. It all works in tandem. 16 hour days including my day job. It was very frustrating not knowing fuck all about any of this, but I've also never really felt this excited about a new technology. I'm totally sucked in.
I am learning to code as well at the same time but just in Python for now, it's really easy to pick up, actually.
I hit many walls along the way. I installed Ubuntu 26.04 at first, not realizing that a lot of software packages don't support it. I had already set everything up but went to 24.04 and rebuilt everything there. Through repetition, many mistakes and failures, I've finally got everything working. It's a lot of fun!
suprjami@reddit
For me, 27B is running at least 160% speed at all times. Usually somewhere between 160% and 200%. I have had as high as 260%.
That's a baseline of 25 tok/sec, increased to 40 tok/sec in the worst case, usually 45-50 tok/sec in most cases, and up to 62 tok/sec in the best case.
2x RTX 3080 20Gb
Jorlen@reddit (OP)
This was my experience as well with my first test just comparing llama.cpp normal model to the MTP version of the same model. I was on the lower end, around 1.5x but I underclock my card just to keep thermals reasonable and the fan not sounding like a fuckin' jet engine.
suprjami@reddit
As Unsloth show, 35B has less improvement from MTP than 27B. See benchmarks at the bottom: https://unsloth.ai/docs/models/qwen3.6
I also set power limit down from 320W to 250W. It only makes a few percent difference to pp. Blowers always sound like a jet engine :P
msrdatha@reddit
Thanks, that was informative. Congrats, and enjoy the new speed with MTP!
Jorlen@reddit (OP)
NEW TEST: Q8_0 quant - with Qwen3.6-35B-A3B-UD-Q5_K_S (MTP version) - at 300k context (vram 30/32gb used)
Honest-Kangaroo-1830@reddit
Can you report speeds on 27B w/ MTP? I just ordered a setup identical to yours and I'm very curious, there's no benchmarks I can find on R9700 and 27B with MTP.
BeautyxArt@reddit
which uncensored qwen3.6 27b model would you recommend ?
MrClickstoomuch@reddit
Stupid question, but is there a process written up somewhere for how to convert a model to an MTP one? Or is that just not feasible for most models? I was looking for an MTP model of the smaller Qwen models (9b and below) with the idea that the higher tokens per second would be good for smart home hardware like mini PCs. I know the focus is on bigger models while it is still in its infancy, but curious if there is a clear workflow to convert to an MTP model.
akmoney@reddit
I'm testing 27B and seeing an improvement of roughly 26 -> 31 tps (~20%).
R9700 Pro 32GB running the havenoammo/llama:vulkan-server docker image. Here's my specs.
akmoney@reddit
Yuge increase seen with Qwen 35B. 66 -> 98 tps (~48%).
akmoney@reddit
And setting spec-draft-n-max = 2 bumps it to 105 tps!
redblood252@reddit
I hope this will work for my iq3 dense 27b on my 16gb vram
ea_man@reddit
I dunno if I wanna trade IQ4 for IQ3 + MTP, I guess you can't run NGRAM with MTP at the same time?
jtjstock@reddit
someone did both, but the PR is a bit of a minefield right now
ea_man@reddit
Well that would be intriguing.
I don't wanna sound negative, but poor peeps like me with 16GB or 12GB GPUs may have to skip the MTP revolution, because it adds even more weight in VRAM and performance-wise it may be worth it only for coding.
You know, in such scenario NGRAM would be better.
That's because 27B is too much for 16GB at IQ4, or for 12GB at IQ3; we could use a ~22B model to get context + nice features.
jtjstock@reddit
The MTP heads are more sensitive to quantization too; I've had the best luck with Q8 MTP attached to something like a Q4K_M. For coding, it is a dramatic improvement. On coding tasks you can see 4 tokens ahead still being 75-80% accurate, and on things like HTML it jumps higher. Whereas on writing tasks you are stuck around 2 tokens ahead and 50-65% accuracy.
The other thing I've noticed in my own testing is the MTP heads don't need much context at all to work well.
For coding though, 24GB isn't really enough either lol. Q4K_M is ok'ish, Q5K_M is better.
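If you want a feel for what those acceptance numbers buy you, here's a back-of-envelope sketch. Big caveat: it assumes every drafted position is accepted independently with the same probability, which real MTP heads don't do, and it ignores the cost of running the heads themselves, so treat it as an optimistic ceiling rather than a prediction.

```python
# Rough expected tokens emitted per verification step for draft-and-verify
# decoding, assuming an independent per-position acceptance probability p.
# Real MTP acceptance is position-dependent, so this is an upper bound.

def tokens_per_step(p: float, n_draft: int) -> float:
    # 1 guaranteed token from the verifier + accepted draft tokens:
    # sum_{k=0..n_draft} p^k = (1 - p^(n_draft+1)) / (1 - p)
    return n_draft + 1 if p == 1.0 else (1 - p ** (n_draft + 1)) / (1 - p)

for label, p in (("code", 0.78), ("prose", 0.55)):  # rough midpoints of my numbers
    for n in (2, 3, 4):
        print(f"{label:5s} p={p} n_draft={n} -> {tokens_per_step(p, n):.2f} tokens/step")
```

With those inputs, code lands around 2.9-3.2 tokens per step at a draft length of 3-4, while prose stays around 2, which roughly matches why coding sees the bigger win.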
ea_man@reddit
Oh yeah, that's what I mean by a significant increase in size: in order to use MTP you should have q8 or even better q16, and that's worth ~600MB on systems that already have to reduce context to ~60k to run on 16GB; it would make it not worth it.
Yet you know, you gotta use what you have. If you have little peasant VRAM you gotta reduce context and compromise on your temperature / top-k; it ain't about getting the best result, it's about spitting out something useful without going bananas after 30k of context.
Ha.
redblood252@reddit
I'm intrigued. Is the PR submitted to llama.cpp or vLLM?
jtjstock@reddit
In the llama.cpp PR thread there is someone claiming they did both and posted results. I don't know if they used the PR at some particular state, or if they modified things locally.
Jorlen@reddit (OP)
Test it! I'd love to know how the 3-bit quant fares.
redblood252@reddit
As soon as I get a PC !
sprinter21@reddit
Just curious, does lmstudio support MTP already?
PrometheusZer0@reddit
Also curious about this
relmny@reddit
I was running it today with 27b-q6k with Open WebUI and after the 3rd turn I started to get more than twice the usual TG (from 20 t/s to 42 t/s and sometimes 46 t/s).
But I also started to see some "truncated" answers or loops in aider/pi (I started testing them today).
I set "-np 2" (had it at 4, even when I had about 6gb VRAM free), but still some random loops.
But this really feels like a big deal... and something I'll keep trying.
leonbollerup@reddit
it's def great.. but I would like to see this in LM Studio or Unsloth Studio.. I don't want to be forced/limited to SSH'ing into my LLM servers every time I want to change or test something.. it's such an extreme waste of time
Jorlen@reddit (OP)
It's definitely annoying. My understanding is most of these all-in-one software packages like LM studio use llama.cpp as a backend anyway so it's just a matter of time before the mainline images get updated to support them. As I'm new to this realm, I couldn't tell you what a good guess would even be in terms of timeline.
leonbollerup@reddit
They do.. but first you need to see that code land in llama.cpp.. and then make it over to e.g. LM Studio.. it will take time, and not all code ends up in llama.cpp
Enough-Astronaut9278@reddit
MTP on Vulkan with 28gb VRAM usage at 300k context is pretty tight, would be interesting to see if KV cache offloading helps push it further.
Jorlen@reddit (OP)
The 27b dense model has my VRAM at 31.1gb / 32gb lol... at 300k context. Near the danger zone, but it's stable so far.
caetydid@reddit
How can you use 300k context? ROPE?
Jorlen@reddit (OP)
The system SHOWS 300k but whether or not I can actually get into that high window and have the LLM work as it should might be another thing. I'll be doing more testing today into the 200k+ context range at Q8_0 quant (for real this time) and I'll report my findings back here, since many are curious.
tmvr@reddit
300K context is ambitious; I have my doubts that the model can successfully handle that. Even if it can support 256K (262144) without trickery, that is probably already very high; cracks in the wall can (and do) appear close to and over 128K.
Jorlen@reddit (OP)
You might be right. My session with the MoE Qwen (35b) hit a dead end around 200k. I switched to the dense 27b model but with this larger model/quant I cannot exceed 300k as I'm at 31.2gb used VRAM.
I would be more than happy to report back if I can fill that 300k as I continue working on my pygame project. Tomorrow though, been a long day.
tmvr@reddit
I did not mean whether the context fits into VRAM or not, I mean the model losing its plot once you reach a certain threshold of active context. Even if it fits 256K for example and you have everything in VRAM with proper speed, I would not fully trust its output once the context length is over 128K. Most models have issues with proper functionality and coherence when reaching higher context, including frontier SOTA ones like Claude Sonnet and Opus.
Jorlen@reddit (OP)
Right before 200k USED context, things were fine. I think it depends on the model. I grabbed a smaller quant of the 27b MTP so I can jam even more Q8_0 context in there, now that I've actually flipped my yaml to USE Q8_0 that is.
I will be in the 150-300k range of used context (not just loaded context) and I will be able to tell when it falls apart. In my tests, it has to keep track of like 10 files, update 5 of them on the fly and remember all the logic. I will know instantly when it loses its mind.
Maleficent-Ad5999@reddit
Llamacpp used to be my only inference engine. Now that I've moved to vLLM, I could try turboquant and MTP on the Qwen 3.6 27B models, which gets me 100 tps on a single 5090
Pleasant-Shallot-707@reddit
What server are you running?
jacknjill101@reddit
I'm running the 27B-Q8-UD-MTP-M_k_xl on a Mac mini Pro and it's very slow, like 4.5 tok/s generation. I get 6 tok/s using oMLX with the MLX-Q8.
Pleasant-Shallot-707@reddit
The M4 isn’t great for dense models. My M5 Max gets around 24 on standard Qwen 27b. I’m working to set up MTP and see how it improves from there.
MisticRain69@reddit
MTP Q8 Qwen3.6-35B-A3B, 200k context, f16 KV cache, on Strix Halo plus a 3090 Ti eGPU (I make sure it uses 23GB, then the other 35GB or so goes on the 8060S). That results in average token gen of 66-100 tk/s and around 790 tk/s PP.
techlatest_net@reddit
That's a really cool stress test, and the 1.5x tok/sec part is the bit I'd mention too. Seriously impressive; 300k context on a local setup is wild. Curious how the MoE version behaved once you got deep into the session.
techlatest_net@reddit
Nice write-up — the main takeaway for me is that chat templates matter way more than people admit, especially when tool calls start getting weird. Also glad to see MTP getting the nod over DFlash for quantized setups.
iamapizza@reddit
Could you share some of the llama flags for mtp? If I compile from latest in the git repo will it have mtp enabled?
Jorlen@reddit (OP)
Here's my llama prototype snippet. I have it set up so that I can call the profile depending on whether I want to enable the prototype in my existing stack, or just the regular one (since the MTP build cannot use normal LLM GGUFs).
My setup is rather unique so you can likely ignore most of it, and just take what you want out of it.
  llama-mtp:
    image: havenoammo/llama:vulkan-server
    container_name: llama-mtp-prototype
    profiles: ["mtp"]   # <--- ADD THIS
    restart: unless-stopped
    devices:
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/renderD128:/dev/dri/renderD128
    volumes:
      - ./models-MTP:/models-MTP
    ports:
      - "8081:8080"   # Avoid clashing with your main 8081 stack
    environment:
      - GGML_VK_VISIBLE_DEVICES=0   # Explicitly target your R9700
      - AMD_VULKAN_ICD=RADV         # Common fix for AMD Vulkan stability
    command:
      - "--model"
      - "/models-MTP/MTP--Qwen3.6-27B-UD-Q6_K_XL.gguf"   # Example MTP model
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8080"
      - "-ngl"
      - "99"
      - "--ctx-size"
      - "300000"
      - "--no-mmap"
      # --- THE CRITICAL MTP FLAGS ---
      - "--spec-type"
      - "mtp"
      - "--spec-draft-n-max"
      - "3"   # Reddit testing shows 3 is the "sweet spot" for speed
      # ------------------------------
      - "--cache-type-k"
      - "q4_0"
      - "--cache-type-v"
      - "q4_0"
    networks:
      - ai_network
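With the profile in place, the prototype only comes up when I ask for it: docker compose --profile mtp up -d starts it, while a plain docker compose up leaves it down, so my regular llama container keeps running untouched. At least that's how it behaves on my setup.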
OsmanthusBloom@reddit
You are using q4_0 KV cache quantization. That will probably hurt quality in long context tasks. q8_0 would be safer, though obviously takes more VRAM.
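For a rough sense of the tradeoff: the KV cache scales linearly with context length and with bytes per element. The architecture numbers in this sketch are placeholders (pull the real n_layers / n_kv_heads / head_dim from the GGUF metadata), and the bytes-per-element figures only approximate llama.cpp's block formats, so treat the output as order-of-magnitude.

```python
# Back-of-envelope KV cache size vs. context length and cache quantization.
# PLACEHOLDER architecture values below -- substitute the real numbers from
# the model's GGUF metadata. Bytes/element are approximate (incl. block overhead).

BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.06, "q4_0": 0.56}

def kv_cache_gib(n_ctx, n_layers=48, n_kv_heads=4, head_dim=128, cache_type="q8_0"):
    # K and V -> factor of 2; one vector per layer, per KV head, per position
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 1024**3

for ct in ("f16", "q8_0", "q4_0"):
    print(f"{ct:4s}: {kv_cache_gib(300_000, cache_type=ct):5.1f} GiB at 300k context")
```

Whatever the exact architecture, dropping from q8_0 to q4_0 roughly halves the cache, which is why the context headroom looked too good to be true.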
Jorlen@reddit (OP)
Ah shit! That's no good! That wasn't supposed to be in there. Small wonder I had so much context room, that's disappointing. I knew it was too good to be true lol.
I have to say though, the model is doing remarkably well so far with the project I'm working on, considering the Q4 KV quant. I would have not intentionally tried this so it turned out to be a cool little test.
In short - don't sell the Q4 KV quant short! Whether or not it's worth just resetting context, dividing it up into more sessions and sticking with q8 remains to be seen, at least for my purposes. I'll do more tests.
iamapizza@reddit
Very nice, thanks for sharing that and highlighting the important bits. I'll be trying these out later today when I get home.