I’ve been experimenting with MoE inference bottlenecks in llama.cpp, specifically expert movement over PCIe when the model doesn’t fit in VRAM. I implemented a small prototype (~500 LoC) that adds a GPU expert cache + predictive prefetching.

Posted by ongunm@reddit | LocalLLaMA

Qwen3-30B-A3B (Q4_K_M) — 4070 Ti 12GB
33.74 → 64.45 tok/s (1.91×), 99.5% hit rate

I've been working on a llama.cpp fork called FATE: a small (~500 lines of C++/CUDA) extension that adds a GPU-resident expert cache with cross-layer + temporal predictive prefetching for sparse MoE models.

The idea is simple:

In MoE inference, most of the model sits in expert FFN weights, but only a small subset of experts is active for each token. If the model is larger than VRAM, vanilla offloading keeps pulling those expert weights from system RAM over PCIe again and again.
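To get a feel for the cost, here is a back-of-envelope model of that per-token traffic. All constants in the usage note are illustrative assumptions (expert counts roughly matching Qwen3-30B-A3B, Q4_K_M taken as ~0.57 bytes/param), not measurements from FATE:

```cpp
#include <cassert>

// Back-of-envelope model of per-token expert traffic under naive
// offloading: every routed expert is pulled over PCIe each token.
// All inputs are illustrative assumptions, not measured values.
double bytes_moved_per_token(int moe_layers, int experts_per_token,
                             double params_per_expert,
                             double bytes_per_param) {
    return static_cast<double>(moe_layers) * experts_per_token *
           params_per_expert * bytes_per_param;
}

// Seconds per token spent purely on weight transfer at a given
// effective PCIe bandwidth (bytes/second).
double transfer_seconds_per_token(double bytes_per_token,
                                  double pcie_bytes_per_sec) {
    return bytes_per_token / pcie_bytes_per_sec;
}
```

With, say, 48 MoE layers, 8 routed experts per token, ~5M params per expert, and ~0.57 bytes/param, that is on the order of a gigabyte of weights per token; at ~20 GB/s effective PCIe bandwidth, transfer alone would cap decode at a few tens of tokens per second, which is roughly where vanilla offloading lands.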

FATE tries to break that bottleneck by keeping a pool of expert weights resident in VRAM (the GPU expert cache), so hot experts don't have to be re-uploaded every token, and by predictively prefetching experts before they are needed, using cross-layer and temporal patterns in the router's selections.

Benchmark

Model: Qwen3-30B-A3B Q4_K_M

GPU: RTX 4070 Ti 12 GB

Model size: 18.6 GB

Vanilla llama.cpp offloading: 33.74 tok/s decode

FATE: 64.45 tok/s decode (1.91× faster, 99.5% expert cache hit rate)

So on this setup, it nearly doubles decode speed on a model that does not fit in VRAM.

Scaling to larger models

We're also testing on Qwen3-235B-A22B (132 GB at Q4_K_M) on an RTX 4090 (24 GB VRAM). Early results show the cache architecture working — 99.8% hit rate with the pool comfortably holding the full expert working set. The architecture scales; the bottleneck shifts from VRAM size to prefetch pipeline throughput, which is actively being worked on.

Why this works especially well on Qwen3-30B-A3B

Qwen3-30B-A3B is a very sparse MoE: roughly 30B total parameters, but only about 3B active per token (the "A3B" in the name), with 128 experts per MoE layer and only 8 routed per token.

That matters because for systems like this, the limiting factor is often not total parameter count but the active working set that has to be moved for each token.

On this model, the per-token expert working set is small enough that the cache pool can hold it with room to spare.
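As a rough sanity check (all inputs assumed for illustration, not taken from the repo): ~3B active parameters at ~0.57 bytes/param under Q4_K_M is under 2 GB of expert weights, which comfortably fits in the VRAM a 12 GB card has left over after dense weights and KV cache.

```cpp
#include <cassert>

// Rough fit check: does the per-token expert working set fit in the
// VRAM cache pool? All inputs are assumptions for illustration.
bool working_set_fits(double active_params, double bytes_per_param,
                      double pool_bytes) {
    return active_params * bytes_per_param <= pool_bytes;
}
```

The same check explains the failure mode noted under limitations: an ~22B-active model at the same quantization needs well over 10 GB of hot expert weights, which no small pool can hold.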

So instead of churning constantly, the cache can keep hot experts resident across tokens. That is why sparsity is such a big lever here.
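A minimal sketch of the kind of GPU-resident expert cache this implies (hypothetical, not FATE's actual code): a fixed pool of slots keyed by (layer, expert_id), with LRU eviction when the pool is full. In a real implementation each slot would own a VRAM buffer and a miss would trigger an async upload.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <utility>

// Hypothetical LRU expert cache: a fixed pool of GPU slots keyed by
// (layer, expert_id). Lookups would happen before each expert FFN runs.
class ExpertCache {
    using Key = std::pair<int, int>;          // (layer, expert_id)
    size_t capacity_;
    std::list<Key> lru_;                      // front = most recently used
    std::map<Key, std::list<Key>::iterator> slots_;
public:
    size_t hits = 0, misses = 0;

    explicit ExpertCache(size_t capacity) : capacity_(capacity) {}

    // Returns true on a hit; on a miss, "uploads" the expert,
    // evicting the least recently used entry if the pool is full.
    bool access(int layer, int expert) {
        Key k{layer, expert};
        auto it = slots_.find(k);
        if (it != slots_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // mark hot
            ++hits;
            return true;
        }
        ++misses;
        if (slots_.size() == capacity_) {                 // evict LRU
            slots_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(k);
        slots_[k] = lru_.begin();
        return false;
    }

    double hit_rate() const {
        size_t n = hits + misses;
        return n ? static_cast<double>(hits) / n : 0.0;
    }
};
```

With stable routing across tokens, almost every access after warmup is a hit, which is exactly the 99%+ hit-rate regime reported above.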

Why I think this matters beyond one benchmark

A lot of the biggest open models are moving toward sparse/MoE architectures rather than purely dense scaling: Mixtral is sparse MoE, Qwen3 ships MoE variants, and DeepSeek-V3 is MoE as well.

That means optimization work like this becomes more relevant as models get larger.

In other words, this is not just about one Qwen benchmark. It is about whether MoE sparsity can be turned into a practical local inference advantage.

Important limitations

Prefill/prompt evaluation is currently much slower than vanilla in this prototype. Right now this is mainly a decode-side optimization, not a full end-to-end win yet. The fix is to bypass the cache during prefill (which is compute-bound, not memory-bound) and only activate it for decode — this is being implemented and should bring prefill back to vanilla speed.
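That planned fix boils down to a simple dispatch (hypothetical sketch, not the actual patch): multi-token prefill batches go straight through the existing offload path, and only single-token decode steps touch the cache.

```cpp
#include <cassert>

// Hypothetical dispatch for the planned prefill bypass: prefill is
// compute-bound, so streaming weights through the existing offload
// path is fine; only decode (one token per step) benefits from the
// expert cache.
enum class ExpertPath { DirectOffload, CachedDecode };

ExpertPath choose_expert_path(int n_tokens_in_batch) {
    return n_tokens_in_batch > 1 ? ExpertPath::DirectOffload
                                 : ExpertPath::CachedDecode;
}
```

Since prefill touches nearly every expert anyway, skipping the cache there avoids both the lookup overhead and pointless churn of the pool.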

On models where the expert working set exceeds the cache pool (e.g., a 235B model on 16 GB VRAM), FATE can actually be slower than vanilla due to cache churn overhead. The sweet spot is models where the per-token working set fits inside the available VRAM pool.
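One way to guard against that regression (my suggestion, not something FATE currently does) is a runtime kill switch: after a warmup window, fall back to vanilla offloading if the observed hit rate is too low for caching to pay for its bookkeeping.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical churn guard: disable the cache at runtime when the
// hit rate after warmup is below a threshold, rather than paying
// eviction overhead on a working set that can never fit.
bool should_disable_cache(std::size_t hits, std::size_t misses,
                          std::size_t warmup_accesses,
                          double min_hit_rate) {
    std::size_t total = hits + misses;
    if (total < warmup_accesses) return false;   // still warming up
    return static_cast<double>(hits) / total < min_hit_rate;
}
```

That would make the "slower than vanilla" case degrade gracefully instead of requiring the user to know their working set in advance.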

Repo

https://github.com/ongunm/llama-moe-cache

Would love feedback from people working on llama.cpp, MoE serving, or low-VRAM inference. If people want, I can also post more details.