Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant
Posted by Defilan@reddit | LocalLLaMA | View on Reddit | 51 comments
Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM. Not a workstation card in sight.
I ran the same bench harness across three configs back to back so the comparison is at least fair on the hardware side. Stock ghcr.io/ggml-org/llama.cpp:server-cuda13 for the MoE runs, our TurboQuant build for the dense. Sequential: 10 iterations, 128 max tokens, 2 warmup. Stress: 4 concurrent workers, 256 max tokens, 5 min. Prompt is the same for all.
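For anyone wanting to reproduce the sequential mode, here's a minimal Python sketch of that kind of timed loop (a simplified stand-in for the actual harness; the endpoint and field names are llama.cpp's native /completion API, the port and prompt are placeholders, and P50 here is a plain median):

```python
import json, time, urllib.request

def p50(samples):
    """Median of a list of latencies (ms)."""
    s = sorted(samples)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def bench_once(url, prompt, max_tokens=128):
    """One request against llama-server's /completion endpoint.
    Returns (latency_ms, tokens_generated)."""
    body = json.dumps({"prompt": prompt, "n_predict": max_tokens}).encode()
    t0 = time.perf_counter()
    req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        out = json.load(r)
    dt_ms = (time.perf_counter() - t0) * 1000
    return dt_ms, out.get("tokens_predicted", max_tokens)

if __name__ == "__main__":
    latencies, total_tokens = [], 0
    for _ in range(10):  # 10 sequential iterations, matching the bench config
        ms, n = bench_once("http://127.0.0.1:8080/completion", "Hello")
        latencies.append(ms)
        total_tokens += n
    gen_rate = total_tokens / (sum(latencies) / 1000)
    print(f"P50 {p50(latencies):.0f} ms, {gen_rate:.1f} tok/s")
```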
The MoE flags:
--cpu-moe
--no-kv-offload
--cache-type-k q8_0
--cache-type-v q8_0
--ctx-size 90112
--flash-attn on
--n-gpu-layers 99
--split-mode layer
--tensor-split 1,1
Results:
| Model / Config | Generation | P50 latency | Stress (4 concurrent) |
|---|---|---|---|
| Qwen 3.5-27B dense (full GPU, TurboQuant KV) | 18.3 tok/s | 7,196 ms | 10.4 tok/s, 52 req/5min |
| Qwen 3-Coder-30B-A3B (--cpu-moe hybrid) | 31.1 tok/s | 2,286 ms | 12.0 tok/s, 113 req/5min |
| Qwen 3.6-35B-A3B (--cpu-moe hybrid) | 21.7 tok/s | 6,160 ms | 6.8 tok/s, 38 req/5min |
A few things I did not expect.
The jump from dense 3.5 to Coder hybrid is basically free performance if you have a MoE model. 70% faster generation on the same two GPUs, P50 latency cut to a third. I always knew hybrid offloading was useful on paper but seeing the raw numbers side by side made me wish I had tried it sooner.
Qwen 3.6 is slower than the Coder variant even though both are 3B active. The extra 5B of total params means more expert weight traffic through system RAM per token. But the quality delta is not subtle: 73.4% vs 50.3% on SWE-bench Verified and +11 points on Terminal-Bench 2.0. For anything agentic or multi-step I'm grabbing 3.6. For fast code completion the Coder is still the move.
Dense wins prompt processing by a mile, 160 tok/s vs 30-95 for the hybrid runs. If you live in long-context RAG or heavy prompt ingestion that is not going away. Generation speed is where hybrid pulls ahead because the PCIe round trip only happens for the active experts.
Tried pushing further. Wanted to combine --cpu-moe with our TurboQuant KV cache build (tbqp3/tbq3) to get to 131K context with a much smaller KV footprint. Crashed on warmup, exit code 139. Stack pointed at fused Gated Delta Net kernels in the TurboQuant fork. Looks like that optimization path has not been updated for the Qwen 3 MoE architecture yet. Stock llama.cpp with q8_0 at 90K is fine for now.
What I actually used it for once it was running: gave it a spec doc for the next feature of the K8s operator I wrote to deploy it and let it rip overnight. 56 tool calls, 100% success, 9 unit tests, all verification commands green. Merge-ready PR when I woke up. The model I deployed ended up shipping the operator's next feature. Bit of a recursion moment. Full writeup here if you want the longer version.
Happy to share more of the config, the bench harness, or the raw numbers if anyone wants them.
Bulky-Priority6824@reddit
Curious about the excitement around 21.7 tok/s on Qwen 3.6 with --cpu-moe. On the same dual 5060 Ti hardware I'm getting 83-86 tok/s full GPU without hybrid offloading. What's the advantage you're seeing that justifies the 4x speed tradeoff?
Traditional_Half2443@reddit
Hey could you send your command? also are you running vllm or llama cpp?
Defilan@reddit (OP)
Fair point, honestly. If you fit the whole thing on GPU at modest context, you're gonna crush hybrid on throughput. No argument there. I should have run that config as a proper baseline in the post. Where hybrid starts earning its keep is when you push context hard: 90K at q8_0 KV is ~20GB alone, and I can't fit that plus a 20GB model on 32GB VRAM. My setup was built around long agentic coding, so context mattered more than raw speed for my use case. The other thing is VRAM headroom. Full GPU eats basically the whole budget. With hybrid I'm only using ~10GB for the active path, so I've got 22GB free for embeddings or a second service on the same box. Doesn't matter if Qwen 3.6 is all you're running, but useful if you're juggling.
Ultimately what I'm trying to test is a pattern that would work with other models in this space, so it wasn't optimized 100% for this sole use case.
Confident_Ideal_5385@reddit
90k is ~20GB in qwen3 because no deltanet. You'll have a ton of free memory with 3.5/3.6 at the same 90k context size.
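That KV-cache arithmetic is easy to sanity-check in a few lines of Python. A sketch, where the layer count, KV-head count, and head dim are assumed placeholder values rather than confirmed Qwen specs, and q8_0 is taken as roughly 1.06 bytes per element:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_ctx * n_kv_heads * head_dim elements per layer
    return n_ctx * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
# Assumed full-attention config: 48 layers, 8 KV heads, head_dim 128, q8_0 KV
full = kv_cache_bytes(90112, 48, 8, 128, 1.0625)
# DeltaNet-style hybrid: only ~1 in 4 layers keeps a standard KV cache
hybrid = kv_cache_bytes(90112, 48 // 4, 8, 128, 1.0625)
# with these assumed numbers: full attention: 8.8 GiB, hybrid: 2.2 GiB
print(f"full attention: {full / GIB:.1f} GiB, hybrid: {hybrid / GIB:.1f} GiB")
```

The exact totals shift with the real layer/head counts, but the 3-4x shrink from hybrid attention is the point: it's why the same 90k context leaves so much more memory free on 3.5/3.6.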
Bulky-Priority6824@reddit
ok makes sense. We were just lacking a little "context", no pun intended lol
Defilan@reddit (OP)
hehe, I see what you did there
specify_@reddit
If you can take advantage of tensor parallelism and speculative decoding, the throughput is insane. Qwen 3.5 27B was my goto but I think I might stick with this until they release a Qwen 3.6 27B variant.
4x 5060 Ti 16GB, VLLM v0.19.0 with MTP speculative decoding:
VLLM launch command:
Dear_Training_4346@reddit
motherboard model please :)
Defilan@reddit (OP)
That's a serious stack! I'm drooling a bit over here. MTP spec decoding with 4-way tensor parallel is a very different optimization target than what I was testing, and the 55 to 205 tok/s range with the acceptance rate swings is fascinating. I will say that LLMKube does have a vLLM runtime (added a couple weeks back), but right now it only exposes the basic flags (tensor-parallel-size, max-model-len, quant, dtype). Speculative config and expert parallelism aren't plumbed through yet, and there's no extraArgs passthrough on that runtime to bypass it either. Seeing your config is putting that squarely on my todo list.
Going to run a scaled-down version of your setup on my dual 5060 Ti (basic tensor parallel, no spec decoding yet) as a sanity check and share numbers. Expect I'll be nowhere near your 205 tok/s given half the hardware and no MTP, but the gap itself is interesting data. Thanks for posting the full config, it's useful as a north star! Love seeing the setups others are using.
specify_@reddit
It's so fast that it's a pleasant feeling knowing that you have a SOTA-like model running locally all for yourself. I've also noticed that pipeline parallelism also works pretty fast, using Q8_0 in llama.cpp, achieving around 80-100 toks/sec and this is without spec decoding.
Qwen 3.5 27B also works very nicely with tensor parallelism+MTP, achieving around 60-80 toks/sec. When I had 3 RTX 5060 TI's, and ran it with pipeline parallelism, that number hovered around 23 tokens/sec.
Defilan@reddit (OP)
It really is a great feeling and the way things are going, models keep getting better and the barrier to entry is lowering too. Love not having to send everything to the cloud providers!
80-100 tok/s pipeline parallel at Q8 is a nicer number than I would've guessed without spec decoding. On my side I just retested Qwen 3.6 Q4 on dual 5060 Ti without --cpu-moe: 107.8 tok/s sequential at 90K. Ballpark of your pipeline numbers despite very different configs. Apparently getting out of the way and letting everything sit in VRAM was most of the battle...oh well, always learning something new
What do you think was the bottleneck with your 3 RTX test? Was it bandwidth-limited at PCIe handoff or something else?
specify_@reddit
Not too sure, I noticed that the more tokens there are as input, the throughput degrades very slowly. I don't think it could be PCIe bandwidth but maybe just being more memory-bound. I did a retest and I found that it initially gets 95 tokens/sec but it slowly degrades the more tokens are generated.
inrea1time@reddit
I got 2 x 5060 Ti, picking up a 3rd today, and I got a 5070 Ti, all on the same machine. How much VRAM does the 35B actually need?
specify_@reddit
35B for q4 is around 22GB. You'd want to use the remaining for context. I believe you can achieve full context limit size with 3 x 5060 ti
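That ~22GB figure falls out of a quick back-of-envelope: parameter count times average bits per weight. Q4_K_M-class quants land around 4.8 effective bits/weight (an approximation, and real GGUF files add some overhead for metadata and non-quantized tensors):

```python
def gguf_size_gb(n_params_billion, bits_per_weight):
    # rough file size in GB: every parameter stored at the quant's
    # average bit width (8 bits per byte)
    return n_params_billion * bits_per_weight / 8

print(round(gguf_size_gb(35, 4.8), 1))  # → 21.0, close to the ~22GB observed
```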
StardockEngineer@reddit
This AI slop bot doesn’t know what it’s talking about.
tmvr@reddit
But why though?...
Maybe I'm missing something, but this makes little sense to me. You have 32GB VRAM, why are you putting experts into the system RAM? Even with the 22.5 GB / 20.7 GiB Q4_K_XL quant it does not use the full 32GB VRAM with the maximum of 256K (262144) context when you set KV to q8_0 while still getting the 63 tok/s decode performance.
Bulky-Priority6824@reddit
thats what im asking too, there must be some reason for the hybrid approach that i am missing?
Defilan@reddit (OP)
I replied a minute ago, but really I'm testing a wider pattern. It wasn't optimized for this one use case. Great callout though. Trying to test for a variety of use cases. For mine, it was long-running, agentic coding (think overnight jobs that would be too pricey for cloud use but still helpful). All about experimenting.
StardockEngineer@reddit
Huh. This isn’t even an answer. The model fits at full context. What are you doing?
tmvr@reddit
What quant are you using, and why the 88K context? With that, even with 4 users and non-unified KV (so 352K total context memory needed), you would still just about fit into 32GB, so no need for pushing the experts to system RAM?
Defilan@reddit (OP)
Appreciate the question. Q4_K_M, and honestly yeah the math does work out. Single slot at 88K with q8_0 KV is ~7GB plus the 20GB model, still leaves room on 32GB VRAM. I was really testing the hybrid pattern more than optimizing for this specific config. 88K wasn't a hard constraint, it was what I had open for a Qwen Code session that night. What I was actually after was VRAM headroom: keeping experts on CPU frees up ~22GB for running a second service on the same box (embeddings, second inference target, whatever else). For single-user Qwen 3.6 you're right though, straight GPU wins. Should've been clearer in the post about what I was benching for.
tmvr@reddit
Well, if you need space for other stuff there, then still don't do the full offload with
-cmoe, but rather calculate what the total requirement of that other stuff is and use -ncmoe to put some of the experts back into VRAM, or use the --fit-target parameter to adjust how much max VRAM it should use when fitting it all in.
Defilan@reddit (OP)
That's actually a great point. Going full --cpu-moe when partial offload via --n-cpu-moe sized to actual headroom would've been smarter. Scalpel, not hammer. Appreciate you mentioning this because I haven't tried --fit-target yet. Declaring a VRAM budget and letting llama.cpp figure out the split sounds much cleaner than calculating it manually.
Appreciate the callout. Going to revisit with this approach and keep testing. Lots of great stuff coming out of this, feeling super pumped to keep digging.
Bulky-Priority6824@reddit
concurrency, ok makes more sense.
Defilan@reddit (OP)
Bingo. Again, should have mentioned that. Great question.
Adventurous-Paper566@reddit
I'm using a Q5_K_M quant (AesSedai) and my hardware is a bit lower-end than yours (4060 Ti 16GB + 5060 Ti 16GB).
I just tested it with a 90k context window: offloading 39/40 layers to the GPU, without KV cache quantization, I'm still getting over 60 tok/s.
Something is wrong with your settings. Normally, with Q4, everything should fit on the GPU and you should be getting around 70 tok/s.
Defilan@reddit (OP)
Tested it, and you were right by a lot. Same dual 5060 Ti, same Q4_K_M, same 90K context and q8_0 KV, just without --cpu-moe and --no-kv-offload:
- Sequential: 107.8 tok/s (vs 21.7 with cpu-moe). About 5x.
- P50 latency: 1288ms (vs 2286ms). Roughly 2x faster.
- 4-concurrent stress test over 5 min: 45.2 tok/s per-request, 231 total requests (vs 6.8 and 38 with cpu-moe). About 6x.
The --cpu-moe override was costing ~80% of available perf on this specific model. Qwen 3.6's DeltaNet hybrid attention keeps the KV cache small enough that offloading solves a problem that doesn't actually exist here.
Thanks for taking the time to post your config and push me to retest. Should've just benched the straight GPU path as the baseline in the original post! Appreciate the guidance.
Defilan@reddit (OP)
You're right. Honestly I over-engineered the config. Q4 fits on 32GB VRAM at 90K context no problem, --cpu-moe was unnecessary for this specific model and context combo. Qwen 3.6's DeltaNet hybrid attention keeps the KV cache small too (only ~25% of layers carry standard KV), which is why you're fitting Q5 with FP16 KV at 90K on the same class of hardware. Good catch.
I was testing the hybrid pattern broadly rather than optimizing for Qwen 3.6 itself. A couple other users made the same point earlier, but your numbers on actual hardware are the cleanest version yet. 60+ tok/s at Q5 with 39/40 offload is where this should land. Going to redo the benchmark without --cpu-moe and share updated numbers. Thanks for testing and sharing the details. I'm planning on redoing this with some lessons learned from today. All part of the journey. Appreciate you sharing your setup and results.
tecneeq@reddit
Strix Halo board, 50 t/s:
Defilan@reddit (OP)
Love it! Totally tracks for the Halo. That unified memory is pure magic. Haven't had a chance to play around with that myself yet. I have this rig and Mac Studio that I work with. Can see how my dual rig is getting hit with those PCIe round trips. Thanks for sharing!
tecneeq@reddit
No need to buy the 128GB either. Models that large are pretty slow (Qwen 3.5 122B is 22 t/s). If I could buy again, I would get a 64GB one.
ProfessionalSpend589@reddit
I agree.
I think if clustering or running a desktop OS alongside a MoE is not the intended purpose, then the lesser variants are good too.
MrHighVoltage@reddit
Just upvoting for the dog.
jopereira@reddit
My setup is a 5070Ti 16GB VRAM + 96GB DDR5:
llama-server -m "J:\LM_Studio_Models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" --jinja -c 131072 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget 0 --host 127.0.0.1 --port 4321
Prompt: "ramble about USA history"
Defilan@reddit (OP)
Nice! A single 5070 Ti with 96GB DDR5 is basically the case for big system RAM on consumer builds. I'm a big fan of seeing what folks can do on the consumer side, so this is a cool setup. 35B at 131K on a single 16GB card has to be leaning on auto-fit or partial offload under the hood. What's your tok/s looking like on that prompt? Also curious about setting --reasoning-budget 0. Is that disabling thinking entirely or just lifting the cap?
Far_Cat9782@reddit
Disabled it / sets the cap at 0. Same thing.
Defilan@reddit (OP)
Makes sense, thanks
jopereira@reddit
It's in the image: 67t/s.
I disable thinking (entirely) because I don't have hard problems to think about. As it is, it debugs and finds solutions for problems just fine. The model still thinks!
Thinking can output a lot of tokens (delay...) that I find totally unnecessary (for my kind of coding problems).
Defilan@reddit (OP)
Ah that makes sense! For straightforward coding work the visible thinking is pure latency tax, the model's still doing the actual reasoning under the hood, just not spitting it out. Clever tune for your workflow. Appreciate you sharing all the details!
ComprehensiveJury509@reddit
How's that possible? I get 35tk/s on a single RTX3060 with some offloading.
Defilan@reddit (OP)
Depends on the config. What quant, context size, and flags are you running? My 21.7 was at 90K context with q8_0 KV plus --cpu-moe and --no-kv-offload, which eats bandwidth hard on the expert traffic. Shorter context or smaller quant would flip the numbers significantly. This was for long running, agentic coding sessions that would run overnight (where tk/s wasn't a huge concern). Curious what you're running. Always love learning more about how folks are using their setup.
andy2na@reddit
why not use the following? You should be getting 90+t/s on 3.6-35B with two 5060s, you can also try the new tensor split mode (--split-mode tensor)
Defilan@reddit (OP)
Great idea! Layer split with auto-calculated --tensor-split is actually the default path I use for multi-GPU when running through my operator, --cpu-moe was an explicit override I set just for this particular test. I haven't tried --split-mode tensor yet, that's a newer variant I haven't exposed yet. Adding it to the list. Looking forward to testing this. Going to be beating up my system pretty good this afternoon with the recommended tweaks :)
RIP26770@reddit
I achieved the same speed with a 1M context on an Intel Arc IGPU.
Defilan@reddit (OP)
Wait, 1M on an Arc iGPU is wild! I haven't worked with an Arc yet but keep up to date on them. What's the setup? Specific chip (Lunar Lake, Arrow Lake, one of the Core Ultras?), model quant, KV cache settings? Doing the math I'd expect 1M context KV alone to blow past any Arc iGPU's unified memory budget unless you're on a really aggressive quant or doing something clever with context streaming. Genuinely curious, not skeptical. If this actually works I'd love to know how. Part of what I'm trying to build is something that not only works on Nvidia/AMD but also Intel, so this is super interesting.
RIP26770@reddit
XccesSv2@reddit
I can run Q8 k xl with full context on a Radeon pro w7800 48gb with 70-80tok/s
lacerating_aura@reddit
RTX A4000, Ampere, 16GB, cpu-moe on DDR4, bf16 context, f32 mmproj, full 262k context: 508 tk/s pp, 23 tk/s tg, 83k context.
Bulky-Priority6824@reddit
What am I missing? Same setup and I'm getting 82 tk/s.
Suspicious_Bit_3106@reddit
Wow i did not expect to see this, fantastic!