Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload
Posted by TriWrite@reddit | LocalLLaMA | View on Reddit | 25 comments
Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough. 23 tok/s is still rough but honestly noticeably faster when streaming responses.
Tl;dr:
- We keep track of which experts get routed to most frequently over the past N tokens. We make a bet that the processing speed-up from serving these frequently routed-to experts out of VRAM will outweigh the latency penalty of transferring expert tensors from system RAM (cold) into VRAM (hot). Rinse and repeat every N tokens.
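The policy in that bullet can be sketched roughly as follows. This is a minimal Python sketch of the idea, not the actual llama.cpp code; `HotExpertCache`, `hot_k`, and `window` are illustrative names, with `window` standing in for the rebalance interval N:

```python
from collections import Counter, deque

class HotExpertCache:
    """Sketch of the routing-frequency policy: count expert hits over the
    last `window` tokens, and every `window` tokens promote the top-K
    experts to the 'hot' (VRAM-resident) set."""

    def __init__(self, hot_k: int, window: int):
        self.hot_k = hot_k      # number of VRAM expert slots
        self.window = window    # rebalance interval in tokens
        self.history = deque()  # per-token lists of routed expert ids
        self.counts = Counter() # hit counts over the sliding window
        self.hot = set()        # experts currently resident in VRAM
        self.tokens_seen = 0

    def observe(self, routed_experts):
        """Record which experts the router picked for one token."""
        self.history.append(list(routed_experts))
        self.counts.update(routed_experts)
        # Drop counts that fell out of the sliding window
        while len(self.history) > self.window:
            for e in self.history.popleft():
                self.counts[e] -= 1
        self.tokens_seen += 1
        if self.tokens_seen % self.window == 0:
            self.rebalance()

    def rebalance(self):
        """Promote the K most frequently routed experts. The difference
        between the old and new hot sets is what would actually be
        uploaded to (or evicted from) VRAM."""
        new_hot = {e for e, _ in self.counts.most_common(self.hot_k)}
        uploads = new_hot - self.hot  # cold -> VRAM transfers
        self.hot = new_hot
        return uploads

    def is_hot(self, expert_id) -> bool:
        return expert_id in self.hot
```

The interesting trade-off lives in `rebalance`: a smaller `window` tracks routing shifts faster but pays the PCIe transfer cost more often, which is exactly the bet described above.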
First off, results:
- vs. all-CPU experts baseline:
- +44.8% token generation (15.65 tok/s -> 22.67 tok/s)
- no prompt processing regression
- vs. layer-based offload at equivalent VRAM commitment:
- +26.8% token generation (17.87 tok/s -> 22.67 tok/s)
- very slightly slower prompt processing
Baseline: All experts offloaded to CPU (LLAMA_ARG_OVERRIDE_TENSOR=exps=CPU)
- Prompt processing (tok/s, n=2928): 514.93, 534.64, 531.26
- Token generation (tok/s, n=~300): 15.60, 15.67, 15.69
Partial Layer Offload (22.6 GB VRAM used): 8 layers loaded on GPU (LLAMA_ARG_N_CPU_MOE=40)
- Prompt processing (tok/s, n=2929): 556.42, 581.73, 618.08
- Token generation (tok/s, n=~300): 17.93, 17.81, 17.87
Hot expert cache (22.2 GB VRAM used): 44 expert slots in VRAM cache (LLAMA_ARG_MOE_HOT_K = 44, LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL=60, LLAMA_MOE_HOT_PP_BYPASS_N_TOKENS=64)
- Prompt processing (tok/s, n=2929): 557.18, 542.76, 546.77
- Token generation (tok/s, n=~300): 22.26, 22.97, 22.77
Setup:
- RTX 4090 24GB + Ryzen 9 7950X 96GB
- bartowski's Qwen3.5-122B-A10B Q4_K_L + bf16 vision mmproj
- KV Cache 131K tokens @ Q8_0/Q8_0
- For prompt processing, ubatch=3072 & batch=3072
Repo here with more details (code only for now, no binaries, still cooking): https://github.com/ParmesanParty/llama.cpp
hkdennis-@reddit
Exactly what I am looking for.
How stable is it on Gemma 4 26b?
s1lvs@reddit
Seems similar to https://github.com/vllm-project/vllm/pull/37190.
Tartarus116@reddit
My system would also be running slow if I did that. Just let llama-server optimize for you.
Also, by offloading non-consecutive layers - e.g. layer 50 in system RAM, then 51 on GPU, then 52 in system RAM - you introduce more graph splits. So, don't do that. llama.cpp's fit starts optimizing by offloading the last few layers first.
xaocon@reddit
I'm not really clear on what knobs autofit has access to or how "smart" it is in deciding how to use them. Don't the cpumoe options start offloading MoE layers from front to back as well?
Tartarus116@reddit
You can generate the static `ot` params from CLI with `llama-fit-params`. Saves you time on future restarts.
E.g. for Qwen3.5-397b, this offloads the last 3 layers' ffn up/down/gate exp tensors and runs at near-native speed:
```
ot = blk\.58\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.59\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.60\.ffn_(up|down|gate)_(ch|)exps=CPU
```
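That override string is easy to get subtly wrong, so it's worth sanity-checking one of its patterns against tensor names before committing to a restart. A quick Python sketch (the tensor names below are illustrative, following llama.cpp's `blk.N.<tensor>` naming scheme):

```python
import re

# One of the patterns from the -ot string above: layer 58's expert FFN
# tensors. The "(ch|)" alternation also catches "chexps" variants on
# models that use them.
pattern = re.compile(r"blk\.58\.ffn_(up|down|gate)_(ch|)exps")

# Illustrative tensor names
names = [
    "blk.58.ffn_up_exps.weight",    # matches -> kept on CPU
    "blk.58.ffn_gate_exps.weight",  # matches -> kept on CPU
    "blk.58.attn_q.weight",         # attention tensor, stays on GPU
    "blk.57.ffn_up_exps.weight",    # different layer, not matched
]
matched = [n for n in names if pattern.search(n)]
```

Only the layer-58 expert tensors end up in `matched`; attention tensors and other layers are untouched, which is the intent of the override.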
`cpumoe` starts at the front, yes. Meaning, higher time to first token.
buttplugs4life4me@reddit
Thanks! I didn't know about the fit params tool. Model went from crashing to 500/30 tokens a second
beneath_steel_sky@reddit
Interesting, so would this work for Qwen3.5-122B too?
xaocon@reddit
Hadn't even looked at that command yet, very slick. I've been trying to figure out the best way to run the large Gemma 4 MoE on my 16 GB of VRAM. There are so many knobs to adjust; this should at least help narrow down what I should be benchmarking.
Long_War8748@reddit
That sounds like some kind of Porno Spam Mail lol 😁
AlwaysLateToThaParty@reddit
I use the Qwen3.5 122b/10b heretic mxfp4_MOE version. I've been very impressed with the model, and my speeds are pretty much the same as yours. I would look into the heretic version though. Second-guessing whether the model is going to refuse because it doesn't like what you're talking about isn't anything I'm interested in. Heretic fixes that.
Darke@reddit
This is very similar to TiinyAI's PowerInfer. https://github.com/Tiiny-AI/PowerInfer
I would love to see your fork merged into mainline llama.cpp
sgmv@reddit
Does this work only for single-GPU systems? It's not clear.
DefNattyBoii@reddit
Compared to --fit, how and what does this improve in terms of actual speeds? If the code is good, I'd recommend keeping it minimal; your other changes are cool but would eventually make it harder for you to get a PR into upstream llama.cpp. Btw, small request, I'm lazy: can you auto-wire it into --fit?
If you zero in on a good result, review the code by hand and spend some time understanding it, then maybe send in a PR from a handwritten branch.
Opening-Broccoli9190@reddit
To me the speeds seem similar to those that people are getting on Sparks and similar unified-memory setups with no VRAM whatsoever. Have you tried comparing your setup with those?
Prudent-Ad4509@reddit
Time to try the same thing in static mode, choosing experts at load time according to the imatrix.
Pentium95@reddit
Sadly, while ik_llama.cpp will gladly merge this, I think llama.cpp will not.
I'm gonna test this and share a few benchmarks with my hardware. It's for sure the best solution for hybrid CPU+GPU inference, especially with PCIe x16 Gen 4+.
mumblerit@reddit
Is this similar to hot singles in my area
Chromix_@reddit
No, it's single GPUs near you. Sometimes even hot GPUs, depending on what you do.
Global_Persimmon_469@reddit
There is another project on github that does something similar: https://github.com/brontoguana/krasis
king_of_jupyter@reddit
I tried to get this merged last month, they banned me for it.
BP041@reddit
27% is real enough that I'd care, especially on mixed CPU plus GPU boxes where PCIe thrash is the actual tax.
The thing I'd want to see is latency split by prefill vs. generation, because some optimizations look huge until prompt-heavy workloads hit them. Still, caching hot experts in VRAM feels like the right direction instead of pretending every layer deserves the same treatment.
PhotographerUSA@reddit
Just run 35 almost the same.
SadGuitar5306@reddit
Have you tried the -ncmoe flag with a value that gets you to 22 GB of VRAM used? It should be better than offloading whole layers.
ThisWillPass@reddit
Thank you chef!
Capable_Diamond_4039@reddit
Good idea!