Is there a way to speed up prompt processing with some layers on CPU with qwen-3-coder-next or similar MoEs?

Borkato@reddit (OP)

llama.cpp’s `llama-server`. I just send it an API request. Other models work perfectly fine 🤔 but then again, they aren’t offloading much to CPU!

[-]

During generation only a couple experts fire per token so it's fast, but during prompt processing the whole batch routes tokens to different experts, so on CPU layers you're hitting almost all of them at once. That's your bottleneck.

But wait, at 30B in MXFP4 the model should be like ~15-18GB. With 30GB VRAM you might be able to fit all or nearly all layers on GPU. Have you tried cranking `-ngl` higher? If you can get everything on the GPU the prefill problem basically goes away.

* `-ub 64` or `-ub 128` instead of the default. Smaller micro batches = less expert activation per pass = way better CPU cache utilization. Biggest single improvement for prefill
* `-fa` (flash attention) if not already on
* `-t` set to physical cores only, hyperthreading usually hurts here
* `--override-tensor` for more granular control over what sits where instead of just `-ngl`

But seriously check if you can just load the whole thing into VRAM first. At that size it should be close.
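
The routing argument above can be sketched numerically. This is a minimal model, assuming every token picks its experts uniformly at random (real routers are not uniform); the expert counts used here (`n_experts=128`, `top_k=8`) are illustrative placeholders, not the configuration of any model in this thread.

```python
def expected_active_experts(batch_tokens: int, n_experts: int, top_k: int) -> float:
    """Expected number of distinct experts touched by one batch in one MoE layer.

    Assumption: each token independently selects top_k distinct experts
    uniformly at random, which is a simplification of real routing.
    """
    # Probability that one given expert is NOT chosen by one given token.
    p_miss_per_token = 1.0 - top_k / n_experts
    # Probability that the expert is missed by every token in the batch,
    # turned into the expected count of experts hit at least once.
    return n_experts * (1.0 - p_miss_per_token ** batch_tokens)


if __name__ == "__main__":
    # Placeholder sizes: 1 token is decode, larger values are prefill batches.
    for n in (1, 8, 64, 512):
        print(n, round(expected_active_experts(n, 128, 8), 1))
```

Under these assumptions a single decoded token touches only `top_k` experts, while the count saturates towards `n_experts` quickly as the batch grows, which is the prefill behaviour described above.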

Xantrk@reddit

> `-ub 64` or `-ub 128` instead of the default. Smaller micro batches = less expert activation per pass = way better CPU cache utilization. Biggest single improvement for prefill
>
> `-fa` (flash attention) if not already on
>
> `-t` set to physical cores only, hyperthreading usually hurts here
>
> `--override-tensor` for more granular control over what sits where instead of just `-ngl`

Am I missing something on my test? I'm getting much better PP speeds with bigger batches with some experts offloaded.

`llama-bench -m "...Qwen3-Coder-Next-UD-IQ3_XXS.gguf"` with `-ngl 99 -fa 1 --n-cpu-moe 42`

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 64 | 64 | 1 | pp512 | 61.60 ± 12.74 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 64 | 128 | 1 | pp512 | 80.03 ± 2.80 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 64 | 256 | 1 | pp512 | 80.91 ± 2.42 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 64 | 2048 | 1 | pp512 | 84.93 ± 1.19 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 128 | 64 | 1 | pp512 | 85.57 ± 1.17 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 128 | 128 | 1 | pp512 | 126.93 ± 2.65 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 128 | 256 | 1 | pp512 | 126.67 ± 3.31 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 128 | 2048 | 1 | pp512 | 124.17 ± 3.03 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 256 | 64 | 1 | pp512 | 88.09 ± 2.13 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 256 | 128 | 1 | pp512 | 125.50 ± 2.56 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 256 | 256 | 1 | pp512 | 195.99 ± 5.55 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 256 | 2048 | 1 | pp512 | 197.63 ± 4.36 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 64 | 1 | pp512 | 89.29 ± 0.57 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 128 | 1 | pp512 | 132.23 ± 2.80 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 256 | 1 | pp512 | 201.18 ± 2.79 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 2048 | 1 | pp512 | 316.59 ± 7.16 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 512 | 512 | 1 | pp512 | 262.28 ± 44.30 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 512 | 1024 | 1 | pp512 | 311.39 ± 9.56 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 512 | 2048 | 1 | pp512 | 307.72 ± 10.48 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 1024 | 512 | 1 | pp512 | 308.95 ± 9.91 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 1024 | 1024 | 1 | pp512 | 307.18 ± 6.28 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 1024 | 2048 | 1 | pp512 | 318.72 ± 7.90 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 512 | 1 | pp512 | 318.29 ± 12.45 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 1024 | 1 | pp512 | 314.56 ± 11.92 |
| qwen3next 80B.A3B IQ3_XXS - 3.0625 bpw | 30.45 GiB | 79.67 B | CUDA,Vulkan | 99 | 2048 | 2048 | 1 | pp512 | 313.81 ± 3.65 |
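
One possible way to read the table above is to group the rows by the smaller of `n_batch` and `n_ubatch`, on the guess (an assumption, not something stated in the thread) that the smaller value is what actually limits the micro-batch. A rough sketch, with the numbers copied from the benchmark rows:

```python
from collections import defaultdict
from statistics import mean

# (n_batch, n_ubatch, pp512 tokens/s), copied from the llama-bench table.
RESULTS = [
    (64, 64, 61.60), (64, 128, 80.03), (64, 256, 80.91), (64, 2048, 84.93),
    (128, 64, 85.57), (128, 128, 126.93), (128, 256, 126.67), (128, 2048, 124.17),
    (256, 64, 88.09), (256, 128, 125.50), (256, 256, 195.99), (256, 2048, 197.63),
    (2048, 64, 89.29), (2048, 128, 132.23), (2048, 256, 201.18), (2048, 2048, 316.59),
    (512, 512, 262.28), (512, 1024, 311.39), (512, 2048, 307.72),
    (1024, 512, 308.95), (1024, 1024, 307.18), (1024, 2048, 318.72),
    (2048, 512, 318.29), (2048, 1024, 314.56), (2048, 2048, 313.81),
]


def mean_by_effective_ubatch(results):
    """Average throughput per min(n_batch, n_ubatch), sorted by that size."""
    groups = defaultdict(list)
    for n_batch, n_ubatch, tokens_per_s in results:
        groups[min(n_batch, n_ubatch)].append(tokens_per_s)
    return {size: mean(values) for size, values in sorted(groups.items())}


if __name__ == "__main__":
    for size, tokens_per_s in mean_by_effective_ubatch(RESULTS).items():
        print(f"{size:>5} {tokens_per_s:7.1f}")
```

Grouped this way, the averages rise with the effective micro-batch size and flatten out from 512 upward, which matches the observation that bigger batches gave better PP on this setup.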

Borkato@reddit (OP)

Wait, but qwen 3 coder next mxfp4 is 43GB file size. The model itself is 80B A3B. But I’m redownloading and will try again with your suggestions!! Thank you so much!

Possible_Statement84@reddit

i think you used 30b version lol

Borkato@reddit (OP)

Wha? I’m downloading noctrex’s qwen-3-coder-next-mxfp4_moe.gguf

Possible_Statement84@reddit

[**https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct**](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) **30b variant exists btw**

Borkato@reddit (OP)

Oh, right, but that one is older and not the one I’m asking about lol; that’s not the Next version!

Possible_Statement84@reddit

im blind xD

Borkato@reddit (OP)

Omg no worries. They are annoyingly similarly named!!

Useful-Process9033@reddit

This is the best explanation of the MoE prompt processing bottleneck I've seen on here. People keep comparing MoE generation speed to dense models and missing that the prefill phase hits every expert. For agentic workloads where you're constantly injecting large tool outputs, this makes MoE on partial CPU offload basically unusable.

DistanceAlert5706@reddit

What speeds are your GPU ports running at? Anything lower than PCIe 4.0 x4 will lower PP speed drastically. I swapped to a single GPU because I was running one at PCIe 3.0 x1 and speeds were sad. MoE models with CPU offload need very high bandwidth on the PCIe lanes.

ABLPHA@reddit

Well... this is quite unfortunate to read after I've decided to save up for a couple of eGPUs for running MoE models via USB4 + CPU lol

lemondrops9@reddit

If you can fully offload to VRAM, eGPUs are great.

ABLPHA@reddit

Was planning to run giant models like Qwen3.5 397B with non-expert layers on the eGPUs and expert layers on CPU (256GB RAM and potential NVMe PCIe 5.0 offload too), so I guess that isn't going to happen without a massive preprocessing penalty

lemondrops9@reddit

4 of my GPUs run off of PCIe 3.0 x1. The real trick is running Linux.

ABLPHA@reddit

I am :)

notdba@reddit

Qwen3.5 397B A17B might be fine, since it has more always active parameters (9.8B) than sparsely/selectively activated parameters (7.5B).

I have a strix halo + a 3090 eGPU via oculink. By keeping the routed experts on CPU (IQ2_KL, ~121 GiB) and the rest on the eGPU:

* without GPU offload, i.e. no weight transfer over PCIe during prompt processing, PP is about 140 t/s
* with GPU offload and a batch size of 4096 (`-ub 4096`), PP is also about 140 t/s
* while the 3090 has a lot more compute, it takes about 18 seconds to transfer ~121 GiB over the slow PCIe 4.0 x4

Agentic usage typically has a lot of small exchanges that are way smaller than 4096 tokens. In such cases, without GPU offload, PP is still above 100 t/s with Qwen3.5 397B A17B. With the default of `-ub 512`, the compute buffer can also stay very small, such that I can even fit the full 256k context at F16.
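
The transfer-time figure above can be sanity-checked with some back-of-the-envelope arithmetic. This sketch uses the theoretical PCIe 4.0 link rate (16 GT/s per lane, 128b/130b encoding) and ignores protocol overhead; the function names are made up for illustration, and the "every micro-batch streams all offloaded weights once" model behind the ceiling is a simplifying assumption.

```python
GIB = 1024 ** 3


def pcie_bandwidth_bytes_per_s(lanes: int, gt_per_s: float = 16.0) -> float:
    """Theoretical one-direction bandwidth; defaults are the PCIe 4.0 figures."""
    # GT/s -> bits/s, apply 128b/130b line encoding, convert to bytes, scale by lanes.
    return gt_per_s * 1e9 * (128 / 130) / 8 * lanes


def transfer_seconds(weights_gib: float, lanes: int) -> float:
    """Time to push the given amount of weights across the link once."""
    return weights_gib * GIB / pcie_bandwidth_bytes_per_s(lanes)


def pp_ceiling_tokens_per_s(ubatch_tokens: int, weights_gib: float, lanes: int) -> float:
    """Upper bound on PP if each micro-batch must stream all weights once (assumption)."""
    return ubatch_tokens / transfer_seconds(weights_gib, lanes)


if __name__ == "__main__":
    print(f"{transfer_seconds(121, 4):.1f} s to move 121 GiB over x4")
    for ub in (512, 4096):
        print(ub, f"{pp_ceiling_tokens_per_s(ub, 121, 4):.0f} t/s ceiling")
```

The theoretical transfer time comes out at roughly 16.5 s, in the same range as the ~18 s reported, and the ceiling scales linearly with the micro-batch size, which is why small exchanges gain little from streaming weights to the GPU.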

Borkato@reddit (OP)

👀 Claude said that since I’m running one on pcie x2, it may be worth using just one gpu. Will absolutely try this, thank you. I’m gonna get a whole darn table of every combo haha

Borkato@reddit (OP)

This is a great point, I’m gonna look into it, thank you!

D9scene@reddit

I have 16GB VRAM and 64GB RAM. Through batch testing I figured out the optimal config: I get ~450 t/s prompt processing and ~25 t/s tg. Also, with an 8c/16t processor it is better to leave threads at 8.

    E:\qwen\llama-b8087-bin-win-cuda-13.1-x64\llama-server.exe ^
      -m E:\qwen\qwen3-coder-next\Qwen3-Coder-Next-MXFP4_MOE.gguf ^
      --n-gpu-layers 999 ^
      -ot ".ffn_.*_exps.=CPU" ^
      --ctx-size 32768 ^
      --cache-type-k q8_0 ^
      --cache-type-v q8_0 ^
      --threads 8 ^
      --threads-batch 8 ^
      --batch-size 4096 ^
      --ubatch-size 1024 ^
      --flash-attn on ^
      --mlock ^
      --host 0.0.0.0 ^
      --port 8080 ^
      --parallel 1 ^
      --cont-batching

Responsible_Pain3278@reddit

Qwen3-Coder-Next is highly optimized for context size. Have you tried removing context quantization? That should give you an additional speed boost.

Borkato@reddit (OP)

👀 this is super fucking helpful thank you!!! Can’t wait for the model to finish redownloading so I can try it haha

D9scene@reddit

happy to help!! share your results after testing

Borkato@reddit (OP)

WTF. I’m getting 130T/s prompt processing. I don’t get it. Maybe it’s my ram sticks?

D9scene@reddit

Try this huge test and then feed the output into Claude or ChatGPT to analyze the data:

    E:\qwen\llama-b8087-bin-win-cuda-13.1-x64\llama-bench.exe -m E:\qwen\qwen3-coder-next-q5\Qwen3-Coder-Next-MXFP4_MOE.gguf -ot ".ffn_.*_exps.=CPU" -ctk q8_0 -ctv q8_0 -fa 1 --numa isolate -ngl 49,999 -t 8,16 -b 2048,4096 -ub 512,1024 -p 512,1024,4096 -n 128 -r 3 -o md
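
To get a feel for how big that sweep is, here is a rough count of the runs it implies. It assumes llama-bench takes the cross product of the comma-separated values and runs every `-p` and `-n` size as its own test for each combination; the dictionary below just mirrors the flags in the command and is not part of any llama.cpp API.

```python
from itertools import product

# Comma-separated values from the llama-bench command, keyed by flag name.
SWEEP = {
    "ngl": [49, 999],
    "t": [8, 16],
    "b": [2048, 4096],
    "ub": [512, 1024],
}
PROMPT_SIZES = [512, 1024, 4096]  # -p
GEN_SIZES = [128]                 # -n
REPETITIONS = 3                   # -r


def sweep_combinations(sweep):
    """Every combination of the swept flag values, as a list of dicts."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]


def total_tests(sweep, prompt_sizes, gen_sizes):
    """Number of result rows, assuming one row per prompt or generation size."""
    return len(sweep_combinations(sweep)) * (len(prompt_sizes) + len(gen_sizes))


if __name__ == "__main__":
    rows = total_tests(SWEEP, PROMPT_SIZES, GEN_SIZES)
    print(rows, "result rows,", rows * REPETITIONS, "timed runs")
```

Under that assumption the command produces 64 result rows from 192 timed runs, so dropping one swept value from any flag roughly halves the wait.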

Borkato@reddit (OP)

I tried llama-bench and it says it can’t work with this model, maybe I need to update llama.cpp? But I thought I just did 😭

D9scene@reddit

Try to get the latest llama.cpp release. What does it say in cmd after you start the test, and what is your rig (CPU, GPU, RAM)?

Borkato@reddit (OP)

“Failed to load model Qwen3-Coder-Next-MXFP4_MOE_F16.gguf”. I just updated and built too, that’s why it took me a while to answer lol. But it loads just fine when I run llama-server or similar.

DistanceAlert5706@reddit

If you're on Intel and have E cores, use taskset and set the number of threads to the number of P cores.
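
A small sketch of what that could look like when launching from a script. The helper name, the CPU IDs, and `model.gguf` are all hypothetical; which logical CPUs belong to P cores differs per machine, so check `lscpu --extended` before copying any numbers.

```python
import shlex


def pinned_command(server_argv, p_core_ids):
    """Wrap a llama.cpp command in `taskset -c` and match -t to the pinned CPUs.

    p_core_ids is a caller-supplied list of logical CPU IDs (hypothetical here).
    """
    cpu_list = ",".join(str(cpu) for cpu in p_core_ids)
    return ["taskset", "-c", cpu_list, *server_argv, "-t", str(len(p_core_ids))]


if __name__ == "__main__":
    # Hypothetical layout: one logical CPU per P core on an 8 P-core chip.
    argv = pinned_command(["llama-server", "-m", "model.gguf"], [0, 2, 4, 6, 8, 10, 12, 14])
    print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run` would start the server restricted to those CPUs, with the thread count kept equal to the number of pinned cores.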

D9scene@reddit

I have a Ryzen 7 5700X. Claude told me 16 threads perform a little bit worse due to "contention" within the MoE, but I don't know the actual reason.

ABLPHA@reddit

30GB VRAM and 43GB RAM seems very very very oddly specific. Are you mixing GPUs and/or RAM sticks? If so, are you sure the PCIe connection is fast and wide enough between the GPUs, and the RAM sticks don't fall back to a very low frequency?

National_Meeting_749@reddit

It's MX. He's on a Mac almost certainly with unified memory.

ABLPHA@reddit

Pretty sure MXFP4 has nothing to do with Macs?

National_Meeting_749@reddit

I might be crazy, but I'm like 90% sure that means it's an apple Metal optimized model. The FP4 has nothing to do with macs, but I swore MX meant metal optimized

ABLPHA@reddit

Yeah, MXFP4 is just Microscaling FP4, it's an OCP standard, not exclusive to Metal

National_Meeting_749@reddit

Still think bro is on a unified memory machine tho

Borkato@reddit (OP)

Nope, I’m on a Linux desktop. Rtx 3090, 2060. Disabling the 2060 did raise my pp to 200 though which is much better than the 100. I think my ram is slow and bottlenecked at 2770 or whatever tho

Borkato@reddit (OP)

That’s a good question! I will look this up, as I am not sure!

suicidaleggroll@reddit

What kind of prompt? Have you actually benched it to get the pp speed? What context are you using and how many layers are you offloading to the CPU? What GPU and what CPU/memory?

Borkato@reddit (OP)

Pp speed is 100T/s. Can’t seem to get it higher than that. Tried various values for things like ub (64, 1024, 2048, etc). My gpu is a 3090 and a 2060. Disabling the 2060 did speed it up to 230 which is great and interesting! But I know it can go higher. I’m wondering if I should disable my two lower ram sticks.

Borkato@reddit (OP)

I’ll redownload it and get back to you, as I didn’t write detailed enough notes 😭 downloading now!

mr_zerolith@reddit

I also notice that the MoE CPU offloading option reduces prompt processing speed proportionally. I'm using LMStudio so i don't have fine control over how it works.

Borkato@reddit (OP)

Interesting!!

merica420_69@reddit

MoE seems to be CPU intensive in the reasoning process for me.
