Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

Posted by TriWrite@reddit | LocalLLaMA | 25 comments

Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough. 23 tok/s is still rough but honestly noticeably faster when streaming responses.

Tl;dr:

Results first:

Baseline: All experts offloaded to CPU (LLAMA_ARG_OVERRIDE_TENSOR=exps=CPU)

Partial layer offload (22.6 GB VRAM used): 8 layers loaded on GPU (LLAMA_ARG_N_CPU_MOE=40)

Hot expert cache (22.2 GB VRAM used): 44 expert slots in VRAM cache (LLAMA_ARG_MOE_HOT_K=44, LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL=60, LLAMA_MOE_HOT_PP_BYPASS_N_TOKENS=64)
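For reference, here's roughly how the three configurations above would be launched. The model filename is illustrative, and the LLAMA_ARG_MOE_HOT_* / LLAMA_MOE_HOT_* variables exist only in the linked fork, not in upstream llama.cpp:

```shell
#!/bin/sh
# Baseline: keep all MoE expert tensors on CPU (stock llama.cpp env var).
LLAMA_ARG_OVERRIDE_TENSOR="exps=CPU" \
  ./llama-server -m Qwen3.5-122B-A10B.gguf   # model filename is illustrative

# Partial layer offload: experts of 40 MoE layers stay on CPU,
# the remaining layers' experts go to GPU (~22.6 GB VRAM in my test).
LLAMA_ARG_N_CPU_MOE=40 \
  ./llama-server -m Qwen3.5-122B-A10B.gguf

# Hot expert cache (fork-only variables): 44 expert slots in VRAM,
# rebalanced every 60 tokens; as I understand the bypass variable,
# prompt-processing batches of 64+ tokens skip the cache.
LLAMA_ARG_MOE_HOT_K=44 \
LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL=60 \
LLAMA_MOE_HOT_PP_BYPASS_N_TOKENS=64 \
  ./llama-server -m Qwen3.5-122B-A10B.gguf
```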

Setup:

Repo here with more details (code only for now, no binaries, still cooking): https://github.com/ParmesanParty/llama.cpp
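To make the cache idea concrete, here's a toy Python sketch of how I'd describe the mechanism: count how often the router picks each expert, and every rebalance interval promote the top-K hottest experts into the fixed VRAM slots. This is a reading of the behavior, not the fork's actual code; see the repo for the real implementation:

```python
# Toy sketch of a "hot expert" cache: frequency-counted experts,
# top-K promoted to fast-memory slots at each rebalance.
# Names here (HotExpertCache, record_token) are made up for illustration.
from collections import Counter


class HotExpertCache:
    def __init__(self, n_slots, rebalance_interval):
        self.n_slots = n_slots                        # e.g. LLAMA_ARG_MOE_HOT_K=44
        self.rebalance_interval = rebalance_interval  # e.g. 60 tokens
        self.counts = Counter()   # expert id -> activation count
        self.hot = set()          # expert ids currently in VRAM slots
        self.tokens_seen = 0

    def record_token(self, routed_experts):
        """Call once per generated token with the expert ids the router picked."""
        self.counts.update(routed_experts)
        self.tokens_seen += 1
        if self.tokens_seen % self.rebalance_interval == 0:
            self.rebalance()

    def rebalance(self):
        # Promote the K most frequently used experts; the rest run on CPU.
        self.hot = {e for e, _ in self.counts.most_common(self.n_slots)}

    def is_hot(self, expert_id):
        return expert_id in self.hot
```

The real version additionally has to copy expert weights between host RAM and VRAM on promotion/eviction, which is where the rebalance interval matters: rebalancing too often would spend more PCIe bandwidth on swaps than the cache saves.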