Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload
Posted by TriWrite@reddit | LocalLLaMA | View on Reddit | 25 comments
Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough. 23 tok/s is still rough but honestly noticeably faster when streaming responses.
Tl;dr:
- We keep track of which experts get routed to most frequently over the past N tokens. We make a bet that the processing speed-up from serving these frequently routed-to experts out of VRAM will outweigh the latency penalty of transferring expert tensors from system RAM (cold) into VRAM (hot). Rinse and repeat every N tokens.
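The policy in that bullet can be sketched roughly as follows. This is a minimal Python sketch of the idea, not the actual llama.cpp code; `HotExpertCache`, `hot_k`, and `window` are illustrative names, with `window` standing in for the rebalance interval N:

```python
from collections import Counter, deque

class HotExpertCache:
    """Sketch of the routing-frequency policy: count expert hits over the
    last `window` tokens, and every `window` tokens promote the top-K
    experts to the 'hot' (VRAM-resident) set."""

    def __init__(self, hot_k: int, window: int):
        self.hot_k = hot_k      # number of VRAM expert slots
        self.window = window    # rebalance interval in tokens
        self.history = deque()  # per-token lists of routed expert ids
        self.counts = Counter() # hit counts over the sliding window
        self.hot = set()        # experts currently resident in VRAM
        self.tokens_seen = 0

    def observe(self, routed_experts):
        """Record which experts the router picked for one token."""
        self.history.append(list(routed_experts))
        self.counts.update(routed_experts)
        # Drop counts that fell out of the sliding window
        while len(self.history) > self.window:
            for e in self.history.popleft():
                self.counts[e] -= 1
        self.tokens_seen += 1
        if self.tokens_seen % self.window == 0:
            self.rebalance()

    def rebalance(self):
        """Promote the K most frequently routed experts. The difference
        between the old and new hot sets is what would actually be
        uploaded to (or evicted from) VRAM."""
        new_hot = {e for e, _ in self.counts.most_common(self.hot_k)}
        uploads = new_hot - self.hot  # cold -> VRAM transfers
        self.hot = new_hot
        return uploads

    def is_hot(self, expert_id) -> bool:
        return expert_id in self.hot
```

The interesting trade-off lives in `rebalance`: a smaller `window` tracks routing shifts faster but pays the PCIe transfer cost more often, which is exactly the bet described above.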
First off, results:
- vs. all-CPU experts baseline:
- +44.8% token generation (15.65 tok/s -> 22.67 tok/s)
- no prompt processing regression
- vs. layer-based offload at equivalent VRAM commitment:
- +26.8% token generation (17.87 tok/s -> 22.67 tok/s)
- very slightly slower prompt processing
Baseline: All experts offloaded to CPU (LLAMA_ARG_OVERRIDE_TENSOR=exps=CPU)
- Prompt processing (tok/s, n=2928): 514.93, 534.64, 531.26
- Token generation (tok/s, n=~300): 15.60, 15.67, 15.69
Partial Layer Offload (22.6 GB VRAM used): 8 layers loaded on GPU (LLAMA_ARG_N_CPU_MOE=40)
- Prompt processing (tok/s, n=2929): 556.42, 581.73, 618.08
- Token generation (tok/s, n=~300): 17.93, 17.81, 17.87
Hot expert cache (22.2 GB VRAM used): 44 expert slots in VRAM cache (LLAMA_ARG_MOE_HOT_K = 44, LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL=60, LLAMA_MOE_HOT_PP_BYPASS_N_TOKENS=64)
- Prompt processing (tok/s, n=2929): 557.18, 542.76, 546.77
- Token generation (tok/s, n=~300): 22.26, 22.97, 22.77
Setup:
- RTX 4090 24GB + Ryzen 9 7950X 96GB
- bartowski's Qwen3.5-122B-A10B Q4_K_L + bf16 vision mmproj
- KV Cache 131K tokens @ Q8_0/Q8_0
- For prompt processing, ubatch=3072 & batch=3072
Repo here with more details (code only for now, no binaries, still cooking): https://github.com/ParmesanParty/llama.cpp
hkdennis-@reddit
Exactly what I am looking for.
How stable is it on Gemma 4 26b?
s1lvs@reddit
Seems similar to https://github.com/vllm-project/vllm/pull/37190.
Tartarus116@reddit
My system would also be running slow if I did that. Just let llama-server optimize for you.
Also, by offloading non-consecutive layers - e.g. layer 50 in system RAM, then 51 on GPU, then 52 in system RAM - you introduce more graph splits. So, don't do that. llama.cpp's fit starts optimizing by offloading the last few layers first.
xaocon@reddit
I'm not really clear on what knobs autofit has access to or how "smart" it is in deciding how to use them. Don't the cpumoe options start offloading MoE layers from front to back as well?
Tartarus116@reddit
You can generate the static `ot` params from CLI with `llama-fit-params`. Saves you time on future restarts.
E.g. for Qwen3.5-397b, this offloads the last 3 layers' ffn up/down/gate exp tensors and runs at near-native speed:
```
ot = blk\.58\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.59\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.60\.ffn_(up|down|gate)_(ch|)exps=CPU
```
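That override string is easy to get subtly wrong, so it's worth sanity-checking one of its patterns against tensor names before committing to a restart. A quick Python sketch (the tensor names below are illustrative, following llama.cpp's `blk.N.<tensor>` naming scheme):

```python
import re

# One of the patterns from the -ot string above: layer 58's expert FFN
# tensors. The "(ch|)" alternation also catches "chexps" variants on
# models that use them.
pattern = re.compile(r"blk\.58\.ffn_(up|down|gate)_(ch|)exps")

# Illustrative tensor names
names = [
    "blk.58.ffn_up_exps.weight",    # matches -> kept on CPU
    "blk.58.ffn_gate_exps.weight",  # matches -> kept on CPU
    "blk.58.attn_q.weight",         # attention tensor, stays on GPU
    "blk.57.ffn_up_exps.weight",    # different layer, not matched
]
matched = [n for n in names if pattern.search(n)]
```

Only the layer-58 expert tensors end up in `matched`; attention tensors and other layers are untouched, which is the intent of the override.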
`cpumoe` starts at the front, yes. Meaning, higher time to first token.
buttplugs4life4me@reddit
Thanks! I didn't know about the fit params tool. Model went from crashing to 500/30 tokens a second
beneath_steel_sky@reddit
Interesting, so would this work for Qwen3.5-122B too?
xaocon@reddit
Hadn't even looked at that command yet, very slick. I've been trying to figure out the best way to run the large Gemma 4 MoE on my 16 GB of VRAM. There are so many knobs to adjust; this should at least help narrow down what I should be benchmarking.
Long_War8748@reddit
That sounds like some kind of Porno Spam Mail lol 😁
AlwaysLateToThaParty@reddit
I use the Qwen3.5 122b/10b heretic mxfp4_MOE version. I've been very impressed with the model, and my speeds are pretty much the same as yours. I would look into the heretic version though. Second-guessing whether the model is going to refuse because it doesn't like what you're talking about isn't anything I'm interested in. Heretic fixes that.
Darke@reddit
This is very similar to TiinyAI's PowerInfer. https://github.com/Tiiny-AI/PowerInfer
I would love to see your fork merged into mainline llama.cpp
sgmv@reddit
Does this work only for single-GPU systems? It's not clear.
DefNattyBoii@reddit
Compared to --fit, how and what does this improve in terms of actual speeds? If the code is good, I'd recommend keeping it minimal; your other changes are cool but would eventually make it harder for you to get a PR into upstream llama.cpp. Btw, small request, I'm lazy: can you auto-wire it into --fit?
If you zero in on a good result, review the code by hand and spend some time understanding it, then maybe send in a PR from a handwritten branch.
Opening-Broccoli9190@reddit
To me the speeds seem similar to those that people are getting on Sparks and similar unified-memory setups with no VRAM whatsoever. Have you tried comparing your setup with those?
Prudent-Ad4509@reddit
Time to try the same thing in static mode, choosing experts at load time according to the imatrix.
Pentium95@reddit
Sadly, while ik_llama.cpp will gladly merge this, I think llama.cpp will not.
I'm gonna test this and share a few benchmarks with my hardware. It's for sure the best solution for hybrid CPU+GPU inference, especially with PCIe x16 Gen 4+.
mumblerit@reddit
Is this similar to hot singles in my area
Chromix_@reddit
No, it's single GPUs near you. Sometimes even hot GPUs, depending on what you do.
Global_Persimmon_469@reddit
There is another project on github that does something similar: https://github.com/brontoguana/krasis
king_of_jupyter@reddit
I tried to get this merged last month, they banned me for it.
BP041@reddit
27% is real enough that I'd care, especially on mixed CPU plus GPU boxes where PCIe thrash is the actual tax.
The thing I'd want to see is latency split by prefill vs. generation, because some optimizations look huge until prompt-heavy workloads hit them. Still, caching hot experts in VRAM feels like the right direction instead of pretending every layer deserves the same treatment.
PhotographerUSA@reddit
Just run 35 almost the same.
SadGuitar5306@reddit
Have you tried the -ncmoe flag with a value that gets you to 22 GB of VRAM used? It should be better than offloading whole layers.
ThisWillPass@reddit
Thank you chef!
Capable_Diamond_4039@reddit
Good idea!