I’ve been experimenting with MoE inference bottlenecks in llama.cpp, specifically expert movement over PCIe when the model doesn’t fit in VRAM.
I implemented a small prototype (~500 LoC) that adds a GPU expert cache + predictive prefetching.
Posted by ongunm@reddit | LocalLLaMA | 14 comments
Qwen3-30B-A3B (Q4_K_M) — 4070 Ti 12GB
33.74 → 64.45 tok/s (1.91×), 99.5% hit rate
I've been working on a llama.cpp fork called FATE: a small (~500 lines of C++/CUDA) extension that adds a GPU-resident expert cache with cross-layer + temporal predictive prefetching for sparse MoE models.
The idea is simple:
In MoE inference, most of the model sits in expert FFN weights, but only a small subset of experts is active for each token. If the model is larger than VRAM, vanilla offloading keeps pulling those expert weights from system RAM over PCIe again and again.
FATE tries to break that bottleneck by:
- keeping experts in a VRAM cache pool
- predicting which experts will be needed next
- prefetching them on a separate CUDA stream
- serving hits from fast GPU-local copies instead of repeated CPU→GPU transfers
Benchmark
Model: Qwen3-30B-A3B Q4_K_M
GPU: RTX 4070 Ti 12 GB
Model size: 18.6 GB
Vanilla llama.cpp offloading
- Generation: 33.74 tok/s
FATE
- Generation: 64.45 tok/s
- Speedup: 1.91×
- Cache hit rate: 99.50%
- Hits / misses: 75,690 / 384
So on this setup, it nearly doubles decode speed on a model that does not fit in VRAM.
Scaling to larger models
We're also testing on Qwen3-235B-A22B (132 GB at Q4_K_M) on an RTX 4090 (24 GB VRAM). Early results show the cache architecture working — 99.8% hit rate with the pool comfortably holding the full expert working set. The architecture scales; the bottleneck shifts from VRAM size to prefetch pipeline throughput, which is actively being worked on.
Why this works especially well on Qwen3-30B-A3B
Qwen3-30B-A3B is a very sparse MoE:
- 128 experts per layer
- 8 active experts per token
- so only 6.25% of experts activate at a time
For a system like this, what often matters is not total parameter count, but the active working set that has to be moved for each token.
On this model, the per-token expert working set is small enough that the cache pool can actually hold it with room to spare:
- working set: ~1.4 GB
- cache pool: ~2.0 GB
So instead of churning constantly, the cache can keep hot experts resident across tokens. That is why sparsity is such a big lever here.
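For the curious, the working-set figure roughly checks out from the published Qwen3-30B-A3B config (48 layers, 8 active experts per token, hidden size 2048, expert FFN width 768) if you assume ~4.5 bits/weight as a rough Q4_K_M average. These are my back-of-envelope estimates, not measurements from the repo:

```cpp
// Quantized size of one expert's FFN (gate, up and down projections), in MB.
// bits_per_weight ~4.5 is a rough Q4_K_M average (an assumption, not exact).
double expert_mb(double d_model, double d_ff, double bits_per_weight) {
    return 3.0 * d_model * d_ff * (bits_per_weight / 8.0) / 1e6;
}

// Total expert weight touched by a single token across all layers, in GB.
double per_token_set_gb(double n_layers, double n_active, double mb) {
    return n_layers * n_active * mb / 1e3;
}

// expert_mb(2048, 768, 4.5)      ~= 2.65 MB per expert
// per_token_set_gb(48, 8, 2.65)  ~= 1.0 GB per token; the hot set reused
// across consecutive tokens is somewhat larger, in line with the ~1.4 GB
// working-set figure above.
```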
Why I think this matters beyond one benchmark
A lot of the biggest open models are moving toward sparse/MoE architectures rather than purely dense scaling. Mixtral is sparse MoE. Qwen3 includes MoE variants. DeepSeek-V3 is also MoE.
That means optimization work like this becomes more relevant as models get larger:
- denser models mostly scale total compute and memory pressure
- sparse models create a different bottleneck: expert movement and reuse
- if expert caching/prefetching gets good enough, consumer/workstation GPUs may be able to run much larger sparse models than raw VRAM would suggest
In other words, this is not just about one Qwen benchmark. It is about whether MoE sparsity can be turned into a practical local inference advantage.
Important limitations
Prefill/prompt evaluation is currently much slower than vanilla in this prototype. Right now this is mainly a decode-side optimization, not a full end-to-end win yet. The fix is to bypass the cache during prefill (which is compute-bound, not memory-bound) and only activate it for decode — this is being implemented and should bring prefill back to vanilla speed.
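A sketch of what such a decode-only gate could look like (the names are mine, not the repo's actual code): route expert access by batch size, since prefill processes many tokens per step and is compute-bound, while decode moves one token at a time and is bandwidth-bound.

```cpp
// Hypothetical routing for the prefill bypass described above.
enum class ExpertPath { Direct, Cached };

ExpertPath choose_expert_path(int n_tokens_in_batch) {
    // Multi-token batches are prefill (or batched decode): skip the cache
    // and use the vanilla offload path, so prefill runs at vanilla speed.
    // Single-token steps are decode, where the cache pays off.
    return n_tokens_in_batch > 1 ? ExpertPath::Direct : ExpertPath::Cached;
}
```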
On models where the expert working set exceeds the cache pool (e.g., a 235B model on 16 GB VRAM), FATE can actually be slower than vanilla due to cache churn overhead. The sweet spot is models where the per-token working set fits inside the available VRAM pool.
Repo
https://github.com/ongunm/llama-moe-cache
Would love feedback from people working on llama.cpp, MoE serving, or low-VRAM inference. If people want, I can also post:
- exact command lines
- architecture breakdown
- benchmark logs
- next steps for fixing prefill and scaling to larger MoE models (235B+)
moborius1387@reddit
This is not new. I could show you it with coherent output and at a much further depth. You won't get a massive spike in speed without degraded output. I have a whole fork built around this and other things. What it does help with is inference when offloading or steering. Just a heads up, the cloud LLMs are not going to help anyone build anything groundbreaking, as they are literally instructed not to. Dig deep enough and you're going to find that they are in fact forbidden from doing so, because any groundbreaking development could disrupt the industry and market, which LLMs are designed to see as harmful content. Will you make some changes or an app? Yes. Will an LLM design you the future? No, not because it can't but because it won't. They regurgitate the same shit to everyone to do or build this or that, but it's the same shit they are building for anyone else, because probability says that's the best new development ever for the same 10 million people. Your real inventions or breakthroughs will be kept for training while it passively diverts you from ever reaching the end or full development. Can't tell you how many times I have made very nice programs and, once they were production ready, asked an LLM to polish something and watched it attempt to dismantle and break everything that works. During the building process they are great; once it's mostly built, it's a threat.
moborius1387@reddit
Downvote me all you want, but then please explain why the LLMs have not made you all millionaires, and you keep hoping for a local model that will make all your dreams come true. Truth is, the day you get a local model that good, it will no longer be profitable. LLMs are like a slot machine: you think you'll win big, but most likely you will go home broke. Wouldn't doubt if I was downvoted by the people that profit from this shit. What they are getting is a shitload of your money and a shitload of the collective intelligence of everyone's thoughts and desires. That's not concerning at all, since everyone lives in a make-believe world where opinions are facts. It is a profitable business; local LLMs are meant to make you want to pay for cloud service or 10-30k in hardware. Every new model is the best and this or that, ok. GLM 5.1, omg, just as good as Opus 4.6 or very close? Um, no it's not, and it's slow af. Qwen, Minimax, GLM are not even close to being as accurate as Opus 4.6. I have used Qwen 3.6, Minimax 2.7, GLM 5.1 and they are not even as good as GPT. They are giving free to attract paid. Not only that, most of the free ones are Chinese models and no one bats an eye. As everyone transitions from their Anthropic or other sub, they are also deciding what country gets their data. You are not using LLMs; they are using you. The world is running out of ideas and shit to invent; this is their tool to use mass collective intelligence to farm for shit. Not only that, investigate how these companies scrape pending patents to develop them before smaller companies or solo entrepreneurs have theirs fully approved.
HyperWinX@reddit
Local LLMs never were about profit or quality higher than top tier proprietary models.
moborius1387@reddit
You could say that, but the common effect is one starts with a cloud provider, which piques interest. Soon they don't want to pay. They find local models. Local models don't compare. The user goes back to cloud services and spends twice as much. Everything is for profit in some way in this life; as the saying goes, "nothing in life is free". I'll give you an example: I worked for 2 different major refreshment distributors years ago. I was privy to the actual retail price. Today this is how the sales work. It costs them minimally more to produce as time goes on. They increase the actual price 400%-500%, then run sales at still around 200% profit on select items to get the customer to rotate what they buy in bulk; thus they force bulk purchases on the customer through the illusion of good deals. The shit never became more expensive; they just found ways to squeeze more money out of the consumer. Just like the shit going on all over the world, people think we live in this sophisticated society and that we have evolved so much. Well, it's more like we live in a Game of Thrones situation where everyone wants all the power. The only thing that slowed wars was when militaries achieved weapons that were capable of mass destruction. If the US did not have the military it does, we would not exist. Which here in America is absolutely mind blowing, considering how divided this country is and how dangerous that is. The problem is people were given too many choices to satisfy the masses; now you have a situation of mom and dad letting the kids do whatever the fuck they want, and the whole country is the kids. Nobody wants the truth about anything; they want to believe their own version of reality. Just like this country, all I ever hear about is African American hardship and the suffering of slavery.
Thing is, Irish and other Caucasian people were the first waves of slaves; once they were tamed, then it was the African Americans. And why were they freed? Well, same shit as today: it was political, and they needed bodies for a war and votes after. Then the Chinese and Asian cultures were told there was a better life in America, only to find chains. The process was always enslave, domesticate, release. We still do what they want and trade our real freedoms for illusions of a career, status, wealth, objects, etc., as we traded the chains for new chains in the form of currency. Not to mention everyone talks about atrocities and Hitler, but there is no mention anymore of the genocide and eradication of the entire Native American race that once inhabited this country. This is where I don't understand how people don't see through the bullshit. Everyone on earth lives on blood-soaked land; countless people died so that you could live there. This has been the way of humanity since the dawn of time. Now you have people that live in countries that hate the country they live in. Well, let me tell you, it's like this: if you run with a gang and shit gets real, you are a target by affiliation. Thus for people in the US, per se, that support other countries, it is like being a Crip saying you hate Crips and the Bloods are the people you support. Well, just like the countries with people with this ideology: if they come for your country, they are coming for you. If the US or any other country were to be full-scale invaded, do you think they will stop shooting to ask your fucking opinion on politics, or who you side with, or your race or gender? No, they wouldn't, and you would perish. The main problem with the world is greed and envy; that's why no civilization has ever lived peacefully. There is always someone who wants more. Before, they would just kill the king and take their place. People would say, nope, we are far past that.
Well, explain the JFK and CIA scandal then. It was literally released that the CIA killed Kennedy, and his CIA-affiliated vice president stepped in, and I heard no one speak about it. Covid being intentional, and we all keep on keeping on like nothing ever happened. Your voices fall on deaf ears because the powers that be know the same thing we do, which is that there's nothing anyone is going to do about it except bitch and moan, because they will just kill us otherwise. I mean, what would happen if everyone stopped doing what they were told and just focused on making the world a better place and letting all the shit go so we can truly be at peace, as we don't need any of the shit we think we do? You need a car so you can get your ass to work halfway across the city or you'll be homeless. If you don't play ball, you have no home, no food, and no safety, and you are cast out; we call them the homeless. Think about the shit that transpires in this world that you would think was reserved for Hell. So tell me again about how you understand the way shit works in business or otherwise.
Danmoreng@reddit
Tbh this just sounds like you haven't tried the correct settings, aka fit and fitctx.
You might want to checkout: https://github.com/Danmoreng/local-qwen3-coder-env and re-run your base llama.cpp benchmarks with optimized settings.
ongunm@reddit (OP)
The vanilla baseline was run with llama.cpp's automatic device memory fitting enabled (it's on by default in recent builds). The logs show llama_params_fit_impl running and placing dense layers optimally before any FATE code kicks in. FATE doesn't change how the base model is loaded or how layers are distributed across devices, it only adds a GPU cache for expert weights on top of whatever layout llama.cpp's fitter chooses. So the comparison is against llama.cpp already doing its best placement. That said, I'll check out that repo and see if there are settings I missed. If the vanilla baseline goes up, that's good to know.
chimpera@reddit
Have you tested the impact of pcie bandwidth? gen5 gen4 gen3, 16x 8x. Thanks
ongunm@reddit (OP)
Haven't tested across different PCIe configs yet, only on PCIe 4.0 x16. But PCIe bandwidth directly affects both the miss penalty (H2D fallback) and prefetch throughput, so it should matter a lot. With a high cache hit rate (99.5%), most accesses are D2D from VRAM (~900 GB/s) and never touch PCIe at all. So the higher your hit rate, the less PCIe bandwidth matters. On Gen3 x8 I think you'd see bigger wins from FATE vs vanilla since vanilla pays the PCIe cost on every single expert load, while FATE only pays it on misses. Would love if someone with Gen3 or Gen5 hardware could benchmark it.
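To make that argument concrete, here is a toy cost model (all bandwidth and size numbers are illustrative assumptions, not measurements): the average time to make one expert's weights usable is a mix of VRAM-local hits and PCIe misses.

```cpp
// Toy cost model: average per-expert access time given a cache hit rate.
// Note 1 GB/s == 1 MB/ms, so MB / (GB/s) yields milliseconds.
double avg_ms_per_expert(double hit_rate, double expert_mb,
                         double d2d_gbps, double h2d_gbps) {
    double hit_ms  = expert_mb / d2d_gbps;  // served from the VRAM pool
    double miss_ms = expert_mb / h2d_gbps;  // falls back to a PCIe copy
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms;
}

// With a ~2.7 MB expert, ~900 GB/s VRAM bandwidth, and 99.5% hits:
// comparing against vanilla (0% hits) on PCIe 4.0 x16 (~25 GB/s effective)
// vs PCIe 3.0 x8 (~6 GB/s effective) shows the relative win growing as the
// link gets slower, since vanilla pays the PCIe cost on every access.
```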
ongunm@reddit (OP)
Repo is open, PRs welcome: github.com/ongunm/llama-moe-cache. Would love people to brutally benchmark this on different hardware and MoE models. I want to know where it breaks. There's also a staging buffer race on the 235B test that needs cracking, and combining this with KV cache quantization (like Google's turbo quant) could compound the VRAM savings hard. ~500 lines of C++/CUDA, not a huge codebase to get into.
fragment_me@reddit
Slop or not, what are the tok/s results on 122B or something bigger?
ongunm@reddit (OP)
Tested on Qwen3-235B-A22B Q4_K_M (132 GB, 235 billion params) on an RTX 4090 (24 GB VRAM). With the cache pool holding the full working set (3250 slots, 99.8% hit rate), hit ~21 t/s decode vs ~6.5 t/s vanilla, which is about a 3.3x speedup. But the output wasn't coherent yet. There's a data race in the prefetch staging buffer that corrupts expert weights during transfer. The cache architecture and hit rates are solid; the prefetch pipeline just needs a correctness fix. Actively working on it.
fragment_me@reddit
Alright, you get that cooking and I will happily run your "slop."
HyperWinX@reddit
Dawg, I wish I could just say "this is cool, I'll definitely check it out"
But slop... nah, I'm not running this
ongunm@reddit (OP)
Fair, but this isn't a heavy setup. It builds basically the same as vanilla llama.cpp. It's just a small patch, not a separate system.