Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant
Posted by Defilan@reddit | LocalLLaMA | View on Reddit | 51 comments
Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM. Not a workstation card in sight.
I ran the same bench harness across three configs back to back so the comparison is at least fair on the hardware side. Stock ghcr.io/ggml-org/llama.cpp:server-cuda13 for the MoE runs, our TurboQuant build for the dense. Sequential: 10 iterations, 128 max tokens, 2 warmup. Stress: 4 concurrent workers, 256 max tokens, 5 min. Prompt is the same for all.
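For anyone wanting to reproduce the sequential mode, here's a minimal Python sketch of that kind of timed loop (a simplified stand-in for the actual harness; the endpoint and field names are llama.cpp's native /completion API, the port and prompt are placeholders, and P50 here is a plain median):

```python
import json, time, urllib.request

def p50(samples):
    """Median of a list of latencies (ms)."""
    s = sorted(samples)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def bench_once(url, prompt, max_tokens=128):
    """One request against llama-server's /completion endpoint.
    Returns (latency_ms, tokens_generated)."""
    body = json.dumps({"prompt": prompt, "n_predict": max_tokens}).encode()
    t0 = time.perf_counter()
    req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        out = json.load(r)
    dt_ms = (time.perf_counter() - t0) * 1000
    return dt_ms, out.get("tokens_predicted", max_tokens)

if __name__ == "__main__":
    latencies, total_tokens = [], 0
    for _ in range(10):  # 10 sequential iterations, matching the bench config
        ms, n = bench_once("http://127.0.0.1:8080/completion", "Hello")
        latencies.append(ms)
        total_tokens += n
    gen_rate = total_tokens / (sum(latencies) / 1000)
    print(f"P50 {p50(latencies):.0f} ms, {gen_rate:.1f} tok/s")
```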
The MoE flags:
--cpu-moe
--no-kv-offload
--cache-type-k q8_0
--cache-type-v q8_0
--ctx-size 90112
--flash-attn on
--n-gpu-layers 99
--split-mode layer
--tensor-split 1,1
Results:
| Model / Config | Generation | P50 latency | Stress (4 concurrent) |
|---|---|---|---|
| Qwen 3.5-27B dense (full GPU, TurboQuant KV) | 18.3 tok/s | 7,196 ms | 10.4 tok/s, 52 req/5min |
| Qwen 3-Coder-30B-A3B (--cpu-moe hybrid) | 31.1 tok/s | 2,286 ms | 12.0 tok/s, 113 req/5min |
| Qwen 3.6-35B-A3B (--cpu-moe hybrid) | 21.7 tok/s | 6,160 ms | 6.8 tok/s, 38 req/5min |
A few things I did not expect.
The jump from dense 3.5 to Coder hybrid is basically free performance if you have a MoE model. 70% faster generation on the same two GPUs, P50 latency cut to a third. I always knew hybrid offloading was useful on paper but seeing the raw numbers side by side made me wish I had tried it sooner.
Qwen 3.6 is slower than the Coder variant even though both are 3B active. The extra 5B of total params means more expert weight traffic through system RAM per token. But the quality delta is not subtle: 73.4% vs 50.3% on SWE-bench Verified and +11 points on Terminal-Bench 2.0. For anything agentic or multi-step I'm grabbing 3.6. For fast code completion the Coder is still the move.
Dense wins prompt processing by a mile, 160 tok/s vs 30-95 for the hybrid runs. If you live in long-context RAG or heavy prompt ingestion that is not going away. Generation speed is where hybrid pulls ahead because the PCIe round trip only happens for the active experts.
Tried pushing further. Wanted to combine --cpu-moe with our TurboQuant KV cache build (tbqp3/tbq3) to get to 131K context with a much smaller KV footprint. Crashed on warmup, exit code 139. Stack pointed at fused Gated Delta Net kernels in the TurboQuant fork. Looks like that optimization path has not been updated for the Qwen 3 MoE architecture yet. Stock llama.cpp with q8_0 at 90K is fine for now.
What I actually used it for once it was running: gave it a spec doc for the next feature of the K8s operator I wrote to deploy it and let it rip overnight. 56 tool calls, 100% success, 9 unit tests, all verification commands green. Merge-ready PR when I woke up. The model I deployed ended up shipping the operator's next feature. Bit of a recursion moment. Full writeup here if you want the longer version.
Happy to share more of the config, the bench harness, or the raw numbers if anyone wants them.
Bulky-Priority6824@reddit
Curious about the excitement around 21.7 tok/s on Qwen 3.6 with --cpu-moe. On the same dual 5060 Ti hardware I'm getting 83-86 tok/s full GPU without hybrid offloading. What's the advantage you're seeing that justifies the 4x speed tradeoff?
Traditional_Half2443@reddit
Hey could you send your command? also are you running vllm or llama cpp?
Defilan@reddit (OP)
Fair point, honestly. If you fit the whole thing on GPU at modest context, you're gonna crush hybrid on throughput. No argument there. I should have run that config as a proper baseline in the post. Where hybrid starts earning its keep is when you push context hard: 90K at q8_0 KV is ~20GB alone, and I can't fit that plus a 20GB model on 32GB VRAM. My setup was built around long agentic coding, so context mattered more than raw speed for my use case. The other thing is VRAM headroom. Full GPU eats basically the whole budget. With hybrid I'm only using ~10GB for the active path, so I've got 22GB free for embeddings or a second service on the same box. Doesn't matter if Qwen 3.6 is all you're running, but useful if you're juggling.
Ultimately what I'm trying to test is a pattern that would work with other models in this space, so it wasn't optimized 100% for this sole use case.
Confident_Ideal_5385@reddit
90k is ~20GB in qwen3 because no deltanet. You'll have a ton of free memory with 3.5/3.6 at the same 90k context size.
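That KV-cache arithmetic is easy to sanity-check in a few lines of Python. A sketch, where the layer count, KV-head count, and head dim are assumed placeholder values rather than confirmed Qwen specs, and q8_0 is taken as roughly 1.06 bytes per element:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold n_ctx * n_kv_heads * head_dim elements per layer
    return n_ctx * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
# Assumed full-attention config: 48 layers, 8 KV heads, head_dim 128, q8_0 KV
full = kv_cache_bytes(90112, 48, 8, 128, 1.0625)
# DeltaNet-style hybrid: only ~1 in 4 layers keeps a standard KV cache
hybrid = kv_cache_bytes(90112, 48 // 4, 8, 128, 1.0625)
# with these assumed numbers: full attention: 8.8 GiB, hybrid: 2.2 GiB
print(f"full attention: {full / GIB:.1f} GiB, hybrid: {hybrid / GIB:.1f} GiB")
```

The exact totals shift with the real layer/head counts, but the 3-4x shrink from hybrid attention is the point: it's why the same 90k context leaves so much more memory free on 3.5/3.6.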
Bulky-Priority6824@reddit
ok makes sense. We were just lacking a little "context", no pun intended lol
Defilan@reddit (OP)
hehe, I see what you did there
specify_@reddit
If you can take advantage of tensor parallelism and speculative decoding, the throughput is insane. Qwen 3.5 27B was my goto but I think I might stick with this until they release a Qwen 3.6 27B variant.
4x 5060 Ti 16GB, VLLM v0.19.0 with MTP speculative decoding:
VLLM launch command:
Dear_Training_4346@reddit
motherboard model please :)
Defilan@reddit (OP)
That's a serious stack! I'm drooling a bit over here. MTP spec decoding with 4-way tensor parallel is a very different optimization target than what I was testing, and the 55 to 205 tok/s range with the acceptance rate swings is fascinating. I will say that LLMKube does have a vLLM runtime (added a couple weeks back), but right now it only exposes the basic flags (tensor-parallel-size, max-model-len, quant, dtype). Speculative config and expert parallelism aren't plumbed through yet, and there's no extraArgs passthrough on that runtime to bypass it either. Seeing your config is putting that squarely on my todo list.
Going to run a scaled-down version of your setup on my dual 5060 Ti (basic tensor parallel, no spec decoding yet) as a sanity check and share numbers. Expect I'll be nowhere near your 205 tok/s given half the hardware and no MTP, but the gap itself is interesting data. Thanks for posting the full config, it's useful as a north star! Love seeing the setups others are using.
specify_@reddit
It's so fast that it's a pleasant feeling knowing that you have a SOTA-like model running locally all for yourself. I've also noticed that pipeline parallelism also works pretty fast, using Q8_0 in llama.cpp, achieving around 80-100 toks/sec and this is without spec decoding.
Qwen 3.5 27B also works very nicely with tensor parallelism+MTP, achieving around 60-80 toks/sec. When I had 3 RTX 5060 TI's, and ran it with pipeline parallelism, that number hovered around 23 tokens/sec.
Defilan@reddit (OP)
It really is a great feeling and the way things are going, models keep getting better and the barrier to entry is lowering too. Love not having to send everything to the cloud providers!
80-100 tok/s pipeline parallel at Q8 is a nicer number than I would've guessed without spec decoding. On my side I just retested Qwen 3.6 Q4 on dual 5060 Ti without --cpu-moe: 107.8 tok/s sequential at 90K. Ballpark of your pipeline numbers despite very different configs. Apparently getting out of the way and letting everything sit in VRAM was most of the battle...oh well, always learning something new
What do you think was the bottleneck with your 3 RTX test? Was it bandwidth-limited at PCIe handoff or something else?
specify_@reddit
Not too sure, I noticed that the more tokens there are as input, the throughput degrades very slowly. I don't think it could be PCIe bandwidth but maybe just being more memory-bound. I did a retest and I found that it initially gets 95 tokens/sec but it slowly degrades the more tokens are generated.
inrea1time@reddit
I got 2 x 5060 Ti, picking up a 3rd today, and I got a 5070 Ti, all on the same machine. How much VRAM does the 35B actually need?
specify_@reddit
35B for q4 is around 22GB. You'd want to use the remaining for context. I believe you can achieve full context limit size with 3 x 5060 ti
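That ~22GB figure falls out of a quick back-of-envelope: parameter count times average bits per weight. Q4_K_M-class quants land around 4.8 effective bits/weight (an approximation, and real GGUF files add some overhead for metadata and non-quantized tensors):

```python
def gguf_size_gb(n_params_billion, bits_per_weight):
    # rough file size in GB: every parameter stored at the quant's
    # average bit width (8 bits per byte)
    return n_params_billion * bits_per_weight / 8

print(round(gguf_size_gb(35, 4.8), 1))  # → 21.0, close to the ~22GB observed
```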
StardockEngineer@reddit
This AI slop bot doesn’t know what it’s talking about.
tmvr@reddit
But why though?...
Maybe I'm missing something, but this makes little sense to me. You have 32GB VRAM, why are you putting experts into the system RAM? Even with the 22.5 GB / 20.7 GiB Q4_K_XL quant it does not use the full 32GB VRAM with the maximum of 256K (262144) context when you set KV to q8_0 while still getting the 63 tok/s decode performance.
Bulky-Priority6824@reddit
thats what im asking too, there must be some reason for the hybrid approach that i am missing?
Defilan@reddit (OP)
I replied a minute ago, but really I'm testing a wider pattern. It wasn't optimized for this one use case. Great callout though. Trying to test for a variety of use cases. For mine, it was long-running, agentic coding (think overnight jobs that would be too pricey for cloud use but still helpful). All about experimenting.
StardockEngineer@reddit
Huh. This isn’t even an answer. The model fits at full context. What are you doing?
tmvr@reddit
What quant are you using, and why the 88K context? With that, even with 4 users and non-unified KV (so 352K total context memory needed), you would still just about fit into 32GB, so no need for pushing the experts to system RAM?
Defilan@reddit (OP)
Appreciate the question. Q4_K_M, and honestly yeah the math does work out. Single slot at 88K with q8_0 KV is ~7GB plus the 20GB model, still leaves room on 32GB VRAM. I was really testing the hybrid pattern more than optimizing for this specific config. 88K wasn't a hard constraint, it was what I had open for a Qwen Code session that night. What I was actually after was VRAM headroom: keeping experts on CPU frees up ~22GB for running a second service on the same box (embeddings, second inference target, whatever else). For single-user Qwen 3.6 you're right though, straight GPU wins. Should've been clearer in the post about what I was benching for.
tmvr@reddit
Well, if you need space for other stuff there, then still don't do the full offload with
-cmoe, but rather calculate what the total requirement of that other stuff is and use -ncmoe to put some of the experts back into VRAM, or use the --fit-target parameter to adjust how much max VRAM it should use when fitting it all in.
Defilan@reddit (OP)
That's actually a great point. Going full --cpu-moe when partial offload via --n-cpu-moe sized to actual headroom would've been smarter. Scalpel, not hammer. Appreciate you mentioning this because I haven't tried --fit-target yet. Declaring a VRAM budget and letting llama.cpp figure out the split sounds much cleaner than calculating it manually.
Appreciate the callout. Going to revisit with this approach and keep testing. Lots of great stuff coming out of this, feeling super pumped to keep digging.
Bulky-Priority6824@reddit
concurrency, ok makes more sense.
Defilan@reddit (OP)
Bingo. Again, should have mentioned that. Great question.
Adventurous-Paper566@reddit
I'm using a Q5_K_M quant (AesSedai) and my hardware is a bit lower-end than yours (4060 Ti 16GB + 5060 Ti 16GB).
I just tested it with a 90k context window: offloading 39/40 layers to the GPU, without KV cache quantization, I'm still getting over 60 tok/s.
Something is wrong with your settings. Normally, with Q4, everything should fit on the GPU and you should be getting around 70 tok/s.
Defilan@reddit (OP)
Tested it, and you were right by a lot. Same dual 5060 Ti, same Q4_K_M, same 90K context and q8_0 KV, just without --cpu-moe and --no-kv-offload:
- Sequential: 107.8 tok/s (vs 21.7 with cpu-moe). About 5x.
- P50 latency: 1288ms (vs 2286ms). Roughly 2x faster.
- 4-concurrent stress test over 5 min: 45.2 tok/s per-request, 231 total requests (vs 6.8 and 38 with cpu-moe). About 6x.
The --cpu-moe override was costing ~80% of available perf on this specific model. Qwen 3.6's DeltaNet hybrid attention keeps the KV cache small enough that offloading solves a problem that doesn't actually exist here.
Thanks for taking the time to post your config and push me to retest. Should've just benched the straight GPU path as the baseline in the original post! Appreciate the guidance.
Defilan@reddit (OP)
You're right. Honestly I over-engineered the config. Q4 fits on 32GB VRAM at 90K context no problem, --cpu-moe was unnecessary for this specific model and context combo. Qwen 3.6's DeltaNet hybrid attention keeps the KV cache small too (only ~25% of layers carry standard KV), which is why you're fitting Q5 with FP16 KV at 90K on the same class of hardware. Good catch.
I was testing the hybrid pattern broadly rather than optimizing for Qwen 3.6 itself. A couple other users made the same point earlier, but your numbers on actual hardware are the cleanest version yet. 60+ tok/s at Q5 with 39/40 offload is where this should land. Going to redo the benchmark without --cpu-moe and share updated numbers. Thanks for testing and sharing the details. I'm planning on redoing this with some lessons learned from today. All part of the journey. Appreciate you sharing your setup and results.
tecneeq@reddit
Strix Halo board, 50 t/s:
Defilan@reddit (OP)
Love it! Totally tracks for the Halo. That unified memory is pure magic. Haven't had a chance to play around with that myself yet. I have this rig and Mac Studio that I work with. Can see how my dual rig is getting hit with those PCIe round trips. Thanks for sharing!
tecneeq@reddit
No need to buy the 128GB either. Models that large are pretty slow (Qwen 3.5 122B is 22 t/s). If I could buy again, I would get a 64GB one.
ProfessionalSpend589@reddit
I agree.
I think if clustering or running a desktop OS alongside a MoE is not the intended purpose, then the lesser variants are good too.
MrHighVoltage@reddit
Just upvoting for the dog.
jopereira@reddit
My setup is a 5070Ti 16GB VRAM + 96GB DDR5:
llama-server -m "J:\LM_Studio_Models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" --jinja -c 131072 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget 0 --host 127.0.0.1 --port 4321
Prompt: "ramble about USA history"
Defilan@reddit (OP)
Nice! A single 5070 Ti with 96GB DDR5 is basically the case for big system RAM on consumer builds. I'm a big fan of seeing what folks can do on the consumer side, so this is a cool setup. 35B at 131K on a single 16GB card has to be leaning on auto-fit or partial offload under the hood. What's your tok/s looking like on that prompt? Also curious about setting --reasoning-budget 0. Is that disabling thinking entirely or just lifting the cap?
Far_Cat9782@reddit
Disabled it / sets the cap at 0. Same thing.
Defilan@reddit (OP)
Makes sense, thanks
jopereira@reddit
It's in the image: 67t/s.
I disable thinking (entirely) because I don't have hard problems to think about. As it is, it debugs and finds solutions for problems just fine. The model still thinks!
Thinking can output a lot of tokens (delay...) that I find totally unnecessary (for my kind of coding problems).
Defilan@reddit (OP)
Ah that makes sense! For straightforward coding work the visible thinking is pure latency tax, the model's still doing the actual reasoning under the hood, just not spitting it out. Clever tune for your workflow. Appreciate you sharing all the details!
ComprehensiveJury509@reddit
How's that possible? I get 35tk/s on a single RTX3060 with some offloading.
Defilan@reddit (OP)
Depends on the config. What quant, context size, and flags are you running? My 21.7 was at 90K context with q8_0 KV plus --cpu-moe and --no-kv-offload, which eats bandwidth hard on the expert traffic. Shorter context or smaller quant would flip the numbers significantly. This was for long running, agentic coding sessions that would run overnight (where tk/s wasn't a huge concern). Curious what you're running. Always love learning more about how folks are using their setup.
andy2na@reddit
why not use the following? You should be getting 90+t/s on 3.6-35B with two 5060s, you can also try the new tensor split mode (--split-mode tensor)
Defilan@reddit (OP)
Great idea! Layer split with auto-calculated --tensor-split is actually the default path I use for multi-GPU when running through my operator, --cpu-moe was an explicit override I set just for this particular test. I haven't tried --split-mode tensor yet, that's a newer variant I haven't exposed yet. Adding it to the list. Looking forward to testing this. Going to be beating up my system pretty good this afternoon with the recommended tweaks :)
RIP26770@reddit
I achieved the same speed with a 1M context on an Intel Arc IGPU.
Defilan@reddit (OP)
Wait, 1M on an Arc iGPU is wild! I haven't worked with an Arc yet but keep up to date on them. What's the setup? Specific chip (Lunar Lake, Arrow Lake, one of the Core Ultras?), model quant, KV cache settings? Doing the math I'd expect 1M context KV alone to blow past any Arc iGPU's unified memory budget unless you're on a really aggressive quant or doing something clever with context streaming. Genuinely curious, not skeptical. If this actually works I'd love to know how. Part of what I'm trying to build is something that not only works on Nvidia/AMD but also Intel, so this is super interesting.
RIP26770@reddit
XccesSv2@reddit
I can run Q8 k xl with full context on a Radeon pro w7800 48gb with 70-80tok/s
lacerating_aura@reddit
RTX A4000, Ampere, 16GB, cpu-moe on DDR4, bf16 context, f32 mmproj, full 262k context: 508 tk/s pp, 23 tk/s tg, 83k context.
Bulky-Priority6824@reddit
What am I missing? Same setup and I'm getting 82 tk/s.
Suspicious_Bit_3106@reddit
Wow i did not expect to see this, fantastic!