RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

Posted by marlang@reddit | LocalLLaMA | 148 comments

Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had Claude Opus 4.7 (just the $20 sub) build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning — basically did the whole thing autonomously. I just told it what hardware I have and what I wanted to run.

Sharing because the common --cpu-moe advice leaves a 54% speedup on the table on 16 GB GPUs.

Hardware

GPU: RTX 5070 Ti (16 GB VRAM)
CPU: AMD Ryzen 7 9800X3D

The finding — --cpu-moe vs --n-cpu-moe N

Everyone’s using --cpu-moe, which pushes ALL MoE expert weights to CPU. On a 16 GB GPU with a 22 GB MoE model, that means only ~3.5 GB of VRAM gets used (the ~1.9 GB of non-MoE weights plus cache); the other ~12 GB sits idle.

--n-cpu-moe N keeps the experts of the first N layers on CPU and puts the rest on GPU. With N=20 on a 40-layer model, the remaining 20 layers' experts (~10.6 GB) land on the GPU, so the VRAM actually gets used.

Benchmarks (300-token generation, Q4_K_M)

| Config | Gen t/s | Prompt t/s | VRAM used |
|---|---|---|---|
| `--cpu-moe` (baseline) | 51.2 | 87.9 | 3.5 GB |
| `--n-cpu-moe 20` | 78.7 | 100.6 | 12.7 GB |
| `--n-cpu-moe 20` + `-np 1` + 128K ctx | 79.3 | 135.8 | 13.2 GB |

+54% generation speed and +54% prompt speed vs. the naive --cpu-moe baseline. Jumping to 128K context is essentially free: with -np 1 the recurrent state is allocated for a single slot instead of one copy per parallel slot.
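Quick sanity check on the headline percentages, using only the figures from the table above:

```python
def speedup_pct(new, old):
    """Percent improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0

# Baseline --cpu-moe row vs. the --n-cpu-moe 20 + -np 1 + 128K ctx row
gen_gain = speedup_pct(79.3, 51.2)      # ~54.9% faster generation
prompt_gain = speedup_pct(135.8, 87.9)  # ~54.5% faster prompt processing
```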

Startup command that works

llama-server.exe ^
  -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
  --n-cpu-moe 20 ^
  -ngl 99 ^
  -np 1 ^
  -fa on ^
  -ctk q8_0 -ctv q8_0 ^
  -c 131072 ^
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
  --presence-penalty 0.0 --repeat-penalty 1.0 ^
  --reasoning-budget -1 ^
  --host 0.0.0.0 --port 8080

That’s Unsloth’s “Precise Coding” sampling preset. For general use: --temp 1.0 --presence-penalty 1.5.
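Once it's up, llama-server speaks an OpenAI-compatible API, so you can hit it with nothing but the Python stdlib. A minimal sketch, assuming the command above is running on localhost:8080 (the "model" field is arbitrary; llama-server serves whatever model it was launched with):

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # matches --host/--port in the command above

def build_request(prompt, max_tokens=300):
    """Build an OpenAI-style chat payload for /v1/chat/completions."""
    return {
        "model": "qwen3",  # name is not checked; the loaded GGUF is used
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt):
    """POST a chat completion and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)["choices"][0]["message"]["content"]
```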

Gotchas I hit (well, that Opus hit and fixed)

How to tune N for your GPU

Each MoE layer's experts on GPU cost ~530 MB VRAM. Non-MoE weights are ~1.9 GB fixed. For a 40-layer model:

| GPU VRAM | Recommended N |
|---|---|
| 8 GB | stay with `--cpu-moe` |
| 12 GB | N=26 |
| 16 GB | N=20 (sweet spot) |
| 24 GB | N=8 (fits almost everything) |

Start conservative (high N), watch VRAM during a long-context generation, then step N down by 2-3 at a time until you're left with ~2 GB of headroom.
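The rule of thumb above can be turned into a quick calculator. A rough sketch using the post's estimates (~1.9 GB fixed weights, ~530 MB per GPU-resident MoE layer, ~2 GB headroom); it does not model cache growth at long context, so the measured sweet spots in the table differ by a couple of layers. Treat its output as a starting N, then tune:

```python
def recommend_n_cpu_moe(vram_gb, total_layers=40, per_layer_gb=0.53,
                        fixed_gb=1.9, headroom_gb=2.0):
    """Rough starting point for --n-cpu-moe: how many layers' experts
    to keep on CPU so the rest fit in VRAM with some headroom.

    Constants are the post's estimates for this model at Q4_K_M; KV /
    recurrent-state memory is NOT modeled, so hand-tune from here.
    """
    budget = vram_gb - headroom_gb - fixed_gb        # VRAM left for experts
    gpu_layers = max(0, min(total_layers, int(budget / per_layer_gb)))
    return total_layers - gpu_layers                 # layers left on CPU

# e.g. recommend_n_cpu_moe(16) -> 18, near the measured sweet spot of 20;
# a tiny GPU gets N=40, i.e. effectively plain --cpu-moe
```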

TL;DR

Replace --cpu-moe with --n-cpu-moe 20, add -np 1, and you get 79 t/s + 128K context on a 5070 Ti. The 9800X3D’s V-Cache carries the CPU side effortlessly.

And Claude Opus 4.7 on the $20 Pro sub is genuinely good enough now to run this kind of hardware-tuning loop end-to-end — launch servers in background, parse logs, iterate — without hand-holding. Kind of wild.

Happy to test other configs if anyone wants comparisons.