What is the point of MoE models, beyond being faster?
Posted by ihatebeinganonymous@reddit | LocalLLaMA | 119 comments
Hi. Besides the fact that an xB MoE model with yB active parameters runs about as fast as a yB dense model but produces better results, what are the other benefits of pursuing an MoE architecture rather than a dense one with e.g. x/2 (or x/3) parameters?
Given that we need enough RAM for xB parameters anyway, aren't MoEs at a disadvantage when RAM is scarce, like in the current situation?
And thinking of limit cases, is there a limit on x/y, so that it doesn't make sense to train e.g. a 100B-A1B MoE model?
Thanks.
ProposalOrganic1043@reddit
If you are in a production environment, the gains of MoE combined with KV cache hit rate are much higher.
mikael110@reddit
The main use case is actually the fact that they require way less compute, both during training and inference. That's not that useful for local users like us, given we are mainly memory constrained (both in terms of bandwidth and size), but for the large inference providers and labs it matters massively.
It's what allows them to offer a huge 1TB+ model at high speeds. When you are serving a lot of users at once and can heavily parallelize your requests, compute becomes a much bigger bottleneck.
ihatebeinganonymous@reddit (OP)
So you assume frontier models are more likely to be MoE than dense?
Ok_Appearance3584@reddit
Yes, all of them are. If we're talking OpenAI, Claude, DeepSeek etc.
DanzakFromEurope@reddit
Claude is probably/maybe a dense model actually.
-p-e-w-@reddit
No way. They would be wasting insane amounts of compute for no benefit.
DanzakFromEurope@reddit
I mean, I don't know about the "no benefit" part. Until recently Opus had a pretty big lead on competition.
But yeah, until we get some leak or they release it, we won't know. As I said, I just read some analyses and reverse-engineering evaluations some time back that pointed in the direction of it being dense with some hybrid architecture.
Qwoctopussy@reddit
any time you read an analysis by mckinsey or rand or w/e, take it with a huge grain of salt
somerussianbear@reddit
Source?
DanzakFromEurope@reddit
Just some analysis and evaluations I've read some time ago about training compute, sparsity etc.
Also looks like with Mythos they are actually switching to MoE.
somerussianbear@reddit
No offense, but hard to believe you mate. No big models are dense. Nothing on the 1T range would be runnable, and definitely not with that speed. Maybe a batch processor model would do the job with all time in the world, but not real time.
DanzakFromEurope@reddit
Yeah, np. I am definitely not putting it in as a fact. Obviously even I don't know.
cibernox@reddit
I am certain that anyone doing AI at the scale of Anthropic is not doing dense models. It may be moe, it might be a router paired with submodels. But at scale with the amount of money they are throwing into compute it would be asinine to not do performance optimization at the model side.
lacerating_aura@reddit
If you want a heavy dense example, look at the Llama 3 405B model: compare its training and inference compute requirements, and its performance, against the smaller Llama 70B model. Then you can get an idea of why going larger with dense is not so good.
Neilblaze@reddit
mostly, nowadays
Blizado@reddit
Look at the last OpenAI GPT-OSS open weight models from last year. Both MoE. And I think OpenAI didn't do much different on their closed weight models.
Fedor_Doc@reddit
Yes. It is impossible compute-wise to run 1T dense models at decent speeds
MeganDryer@reddit
It's enormously useful, because it reduces the RAM _speed_ requirements.
krzyk@reddit
And it reduces RAM requirements, which is the most important benefit. Not everybody is sitting on 128/256 GB of unified memory.
Odd_Science@reddit
That's precisely what MoEs don't do. They need more memory for the same performance.
krzyk@reddit
I stand corrected. Somehow I thought that was the main advantage of MoE (where some "experts" sit in VRAM and the rest are dormant in normal RAM).
That will make my laptop 6GB VRAM dGPU quite sad.
Equivalent-Costumes@reddit
In some sense you're right, but I think you're mistaken about what that actually implies.
In a dense model, if the model can't fit into VRAM, part of it will be loaded into RAM (or even left on disk). Your inference will happen partly on GPU and partly on CPU.
For MoE, you probably thought that only some experts are activated per prompt, so you just load the experts you need into VRAM and use RAM for dormant experts. If this were true, MoE would have been a huge improvement, but unfortunately no: MoE means only a small fraction of parameters are activated PER TOKEN. Across a single prompt it's quite possible that almost all parameters are activated. If you have low VRAM, what's going to happen is that the most used parameters will be loaded into VRAM and the less used ones stay in RAM. A lot of your inference will happen on CPU (swapping would take too much time, since the parameters you need change per token), which means you have the exact same problem as a dense model that can't fit.
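To make the "per token, not per prompt" point concrete, here is a rough sketch of how quickly a prompt touches every expert in one layer. It assumes routing is uniformly random, which real routers are not, and the expert counts (128 experts, 8 active per token) are made-up illustration values rather than any specific model's configuration.

```python
import random


def expected_coverage(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected fraction of one layer's experts hit at least once.

    Assumes each token independently picks top_k distinct experts
    uniformly at random (a simplification; real routing is learned).
    """
    miss_per_token = 1.0 - top_k / num_experts
    return 1.0 - miss_per_token ** num_tokens


def simulated_coverage(num_experts: int, top_k: int, num_tokens: int, seed: int = 0) -> float:
    """Same quantity, measured by simulating the random routing."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        touched.update(rng.sample(range(num_experts), top_k))
    return len(touched) / num_experts


if __name__ == "__main__":
    # Hypothetical layer: 128 experts, 8 active per token.
    for tokens in (1, 16, 128, 1024):
        print(tokens, round(expected_coverage(128, 8, tokens), 4))
```

Under these assumptions a single token touches only a small slice of the layer, but a prompt of a hundred or so tokens already touches nearly all of it, which is why the whole model has to live somewhere in memory.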
Ariquitaun@reddit
You can do that using offloading, but all of the model still needs to be loaded into memory somewhere.
AmazinglyNatural6545@reddit
They're misguiding you. Do your own research.
PrinceOfLeon@reddit
Yes you are correct, no idea what these other folks are talking about. The VRAM is your operating memory and the system RAM acts like a swap disk or file (much slower but makes it possible to work at all).
6GB is still going to be starved: you'll only practically be able to fit a single layer of most MoEs, so you will almost certainly have to "swap" constantly, but it will work at all, which is still good.
The only folks who don't see as much benefit in that regard are those with unified memory (such as all Apple Silicon systems) since all of the memory operates at the same speed anyway.
Puzzleheaded_Base302@reddit
it is not practical to run 400B or 1T dense model. The MoE is a workaround for limited bandwidth and compute capacity.
anykeyh@reddit
A MoE's dense-equivalent intelligence is usually around sqrt(active parameters * total parameters) (not a rule, more an empirical observation).
So Qwen MoE 35B intelligence would be around sqrt(35*3) ≈ 10B parameters dense model, but it uses about 3x less compute to generate a token. You get roughly 3x more intelligence per unit of compute.
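The rule of thumb above is just a geometric mean, so it is easy to try on other sizes. A minimal sketch, treating the rule as the empirical heuristic the comment says it is, not as an established law; the function name is made up for illustration.

```python
import math


def dense_equivalent_b(total_params_b: float, active_params_b: float) -> float:
    """Geometric-mean heuristic: sqrt(total * active), in billions of parameters."""
    return math.sqrt(total_params_b * active_params_b)


if __name__ == "__main__":
    # The 35B total / 3B active example from the comment.
    print(round(dense_equivalent_b(35, 3), 1))
    # A dense model is its own equivalent, since total == active.
    print(dense_equivalent_b(27, 27))
```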
Also, it allows more knowledge (embedded into the weights).
It's important to understand that having RAM is not the problem in running an LLM. The problem is retrieving data from RAM, i.e. memory bandwidth. When a model needs only 3B parameters to generate a token, that's way less bandwidth needed.
A hypothetical 10T-parameter model with 5B active parameters firing per token would be cheap to manage for a company. Storing 10T weights is "relatively simple", while moving those parameters to the GPU processing cores to compute on them is what is expensive today.
I simplify a lot, but I think you get my point.
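A back-of-the-envelope version of the bandwidth argument: if single-user decoding is limited by memory bandwidth, every active weight has to be read once per token, so the ceiling on speed is bandwidth divided by the bytes of active weights. This sketch assumes batch size 1 and ignores KV cache reads, compute time, and any overlap; the 100 GB/s bandwidth and 8-bit weights are hypothetical numbers picked for illustration.

```python
def decode_tokens_per_second(active_params_b: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when decoding is purely bandwidth-bound.

    active_params_b: parameters read per token, in billions
    bytes_per_param: e.g. 2.0 for 16-bit, 1.0 for 8-bit, ~0.5 for 4-bit
    bandwidth_gb_s:  memory bandwidth in GB/s
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    bandwidth = 100.0  # hypothetical GB/s
    print("3B active MoE:", round(decode_tokens_per_second(3, 1.0, bandwidth), 1))
    print("27B dense:    ", round(decode_tokens_per_second(27, 1.0, bandwidth), 1))
```

The ratio between the two results is simply the ratio of active parameters, which is the whole point: total parameter count does not appear in the formula.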
RevolutionaryWater31@reddit
What about compute for prompt processing? Let's say Qwen 3.5 35B-A3B vs Qwen 9B: I'm seeing quite a bit slower prompt processing for the MoE model
No_Block8640@reddit
Mistral 128B is the perfect example of why no one trains dense models. Total parameters aren't intelligence, they're just a memory pool. Kimi has 1T params to remember everything, but only uses 32B active compute for inference. It's the best of both worlds
ihatebeinganonymous@reddit (OP)
Wouldn't that make much more sense if we didn't need to load the whole model into memory, but only some "active" subset? Why isn't anyone pursuing such an architecture? (or are they?)
BigYoSpeck@reddit
With MoE they route to different experts on a per-token basis, not for the entire prompt. They aren't experts in the sense of one being good at maths and one at pirate haikus, but instead at the token level
You don't know which expert you need until the router selects it, and if it's not in memory, then rather than taking milliseconds or less per token it now takes several seconds to pull from disk
ihatebeinganonymous@reddit (OP)
That's my point. Is it impossible/infeasible to have a model that decides on experts per prompt instead of per token?
blakeman8192@reddit
Nope, because LLMs only generate one token at a time architecturally.
When you input a prompt, the system is actually feeding the entire input context into the model each time it generates a single output token, adding the last/newest output token as input to every subsequent iteration, until the model generates a special "stop" token.
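The loop described above can be sketched in a few lines. The `toy_model` here is a made-up stand-in for a real LLM (it just counts down), and the `<stop>` string is a hypothetical stop token. Real inference engines also cache attention keys and values so earlier tokens are not recomputed on every step, which this sketch leaves out.

```python
from typing import Callable, List

STOP = "<stop>"  # hypothetical stop token for this toy example


def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int = 32) -> List[str]:
    """Autoregressive decoding: one token per step, fed back in as input."""
    context = list(prompt)
    output: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context)   # model sees the whole context so far
        if token == STOP:
            break
        output.append(token)
        context.append(token)         # newest token becomes part of the input
    return output


def toy_model(context: List[str]) -> str:
    """Stand-in for an LLM: counts down from the last number, then stops."""
    last = int(context[-1])
    return STOP if last <= 1 else str(last - 1)


if __name__ == "__main__":
    print(generate(["3"], toy_model))
```

Because an MoE router runs inside `next_token` on every step, the set of experts it picks can change from one generated token to the next.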
Equivalent-Costumes@reddit
You might as well just have completely different models at that point.
If you can make a model that has 1 router that routes to a choice of 1000 experts per prompt, then what you basically have is 1 embedding model and 1000 generation models.
BigYoSpeck@reddit
You don't need to do that at the model level; an agent harness is more than capable of routing to different models
Cline does this at a simple level with different agents for plan and act
And if you wrap them within workflows you can have automatic handoff to subagents. Then it's a matter of your inference backend like llama.cpp in routing mode only allowing one loaded at a time and hot swapping them
kaisurniwurer@reddit
In that case you would be using 3B model to generate a response, completely bypassing moe advantages.
Besides it's not really "per token" but rather per router/layer. Segments might change even per token.
Pristine-Woodpecker@reddit
Given the fact that no-one is doing this, it's safe to assume it doesn't work as well.
kweglinski@reddit
Even if it was possible, it would mean catastrophic failure if anything goes sideways. Currently a model is capable of catching that it said something dumb within a single response. If the "expert" choice is wrong (or the task requires multiple), you end up with an absolutely bad answer. Not to mention things like unrelated data (i.e. languages other than the one you use with the model) actually improving model intelligence, which would get lost with per-prompt experts. What seems to be somewhat close to what is being looked for here is a mixture of models: the clown car could have a router to pick the model most suitable for a task, but that's better solved at the infrastructure level, not the model level.
dsanft@reddit
Maybe, but none of them are trained that way. There's a shared expert that's active for every token, and a router that selects which experts should be active for each token, then for that token you route to those experts to get your prediction for token n+1. Repeat.
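A toy version of the layer described above: one shared expert that always runs, plus a router that picks the top-k routed experts for each token and mixes their outputs. This is a sketch under simplifying assumptions (experts are plain Python functions, the router is a dot product, gates are a softmax over the chosen experts); the names and shapes are illustrative and not taken from any particular model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def moe_layer(x: Vector,
              router_weights: List[Vector],
              experts: List[Expert],
              shared_expert: Expert,
              top_k: int) -> Tuple[Vector, List[int]]:
    """Route one token: shared expert always runs, top_k routed experts are mixed in."""
    # One router score per routed expert for this token.
    logits = [dot(w, x) for w in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Softmax over the chosen experts only, so the gates sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    gates = [e / sum(exps) for e in exps]

    out = shared_expert(x)
    for gate, i in zip(gates, chosen):
        expert_out = experts[i](x)  # only the chosen experts do any compute
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen


def make_scaler(factor: float) -> Expert:
    """Toy expert that just scales its input."""
    return lambda v: [factor * a for a in v]


if __name__ == "__main__":
    router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    experts = [make_scaler(f) for f in (1.0, 2.0, 3.0, 4.0)]
    shared = make_scaler(0.5)
    for token in ([2.0, 1.0], [-1.0, -3.0]):
        _, picked = moe_layer(token, router, experts, shared, top_k=2)
        print(token, "->", picked)
```

The two toy tokens land on different expert pairs, which is the behaviour that makes per-prompt expert preloading impractical.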
kweglinski@reddit
that's the case of model offloading to ram (not vram)
ihatebeinganonymous@reddit (OP)
So it's done, not between disk and RAM, but between RAM and vRAM?
kweglinski@reddit
well, most common pattern yes. There are some people who do this with disk but it kills performance dramatically and the disk as well. In general MoE doesn't mean there are unused paths and the overhead just hangs around but there are more and less active parts. Putting the significant parts in gpu and less significant parts (should I say less active?) somewhere else (in moe) allows you to reasonably run bigger models than the gpu allows.
grunt_monkey_@reddit
Whats wrong with mistral 128b?
t4a8945@reddit
128B, dense, awful to run locally, doesn't match Qwen 3.6 27B. Quite the useless model
a_beautiful_rhind@reddit
Not awful for me. On creative stuff qwen 3.6 sucks. Mistral takes it to the cleaners.
MoE is only as smart as the active parameters. Takes a model with much larger total memory footprint to match dense. This is admitted by the labs themselves in what models they compare with their little charts.
Gemma 31b demonstrates the previous point, and the value of good training data over architecture, spectacularly. Many a "100b" MoE falls to it on many real world uses, not just benchmark-go-up and coding.
A ton of dense hate is simply sour grapes because people can't run it or run it fast.
Double_Cause4609@reddit
MoE isn't "as smart as its active parameters". In a few specific synthetic tasks that's true but it's not the full story. Almost any real-world task is some combination of memory and reasoning (in fact the two are hard to truly disentangle cleanly).
Almost all MoE models perform somewhere inbetween their active and total parameter count in real world usage.
For creative tasks I do agree that models are generally closer to their active count than typical, partially because often you're creating a world that doesn't exist in-memory, but it's really hard to evaluate how much of that is the attention bottleneck (some MoE arches are measurably sub-pareto in their allocation of attention vs FFN FLOPs; see the Ling Lite 2 ablations) versus how much is a consequence of a sparse FFN.
For syntax-heavy tasks (some aspects of coding for example) models do seem to basically scale with their active parameter count in isolation, yeah, but MoEs do still have a few modest advantages even here. If you recall Engram, one of the chief arguments for it was that it effectively "deepens" the model by freeing early layers from the burden of mapping text to the latent space (and even if you don't like Engram this applies to basically any of the papers that showed scaling input embeddings is an orthogonal sparse axis of scaling), but the same thing actually applies here. MoEs have more capacity to memorize information with less interference (in fact, for memorization, they basically scale with their total parameter count), which actually reduces the burden on parameters for reasoning. So, even in the worst case they get at least some modest scaling beyond their active parameter count.
Also: Compute isn't free. If a lab has X FLOPs and X FLOPs would give you a massively undertrained dense model versus a well-trained MoE that matches current training regimes, then the MoE can actually be preferable in many tasks.
a_beautiful_rhind@reddit
I'm sort of with you there. They can do better than root mean or pure active, but nooooo where near the total parameters. I've found you need a lot more total params to make up for the lack of active and it becomes a bum deal.
Yes, that's where many of those extra parameters go. My first experience with this was dots, it had so much trivia but shit understanding. Labs have been playing with the ratio ever since and that's how we have things like deepseek and kimi. As a rule of thumb though, stuff <~20b active hasn't performed well for me while taking up more space and coincidentally, this mirrors plain dense models. What other conclusion should I take away?
That goes without saying. Well trained model is always better, but I'm also tired of labs dropping MoE that don't perform at the size of their memory footprint. Theoretical advantages are nice until they end up in weights on your drive.
t4a8945@reddit
I'm pleased to learn it beats it on creative stuff, that's at least one thing going for it.
From your experience, compared to sub 128B other models, does it eclipse Gemma 31B for instance?
I own 2x Spark and wouldn't touch a 128B dense model, as they're not made to handle this kind of dense size.
Running it with context requires a very fast GPU, fast VRAM and lots of it. How are you achieving that?
a_beautiful_rhind@reddit
I just have 4x3090. So I've been using the mistral models for a long time. It's actually very similar to gemma compute-wise. Q8 gemma == Q4 mistral in terms of requirements.
Gemma is kinda close to it in the raw, but gemma has a variety problem. I have way more choices on mistral models to sound natural and vary the writing. Gemma I only have the IT (base has small model smell). Anyone finetuning will probably undo the intelligence.
In terms of speeds, mistral fits in vram and something like qwen397b doesn't so the dense model is faster on my system.
kaisurniwurer@reddit
Dense models have a lot more activation paths possibilities allowing for more nuance in both understanding and in reply. This is especially noticeable in small activation moe models.
If you want insights over structure try comparing a dense model to a moe one.
Dense models can be good at structure if they are rigorously trained, they just need a lot more effort to become effective at it.
Murgatroyd314@reddit
The point is to get the output quality of a medium-sized model at the speed of a small model. The cost of this is that it has the resource demands of a large model.
graypasser@reddit
More like, what is the point of dense model? just make it a MoE model with active parameter of dense model.
ihatebeinganonymous@reddit (OP)
My main "issue" was that you need to fit the total parameters in RAM anyway and not just the active parameters.
Based on some very useful answers and discussions, it makes more sense from a "data center economy" point of view (versus a single-user point of view) and also the distinction between RAM and vRAM.
graypasser@reddit
It still makes sense to be MoE even from single user PoV if you don't mind spending cash, to be honest.
Generally speaking, 3x faster model with 80% intelligence is not just 80% intelligence with 3x throughput, intelligence itself is heavily reliant on thinking (total token count) itself, thus even though it's not exactly 240% smartness, in the end it's still like 95% of smartness and 1.5x throughput or something like that.
aaaqqq@reddit
that's it. Is that not enough?
Do you also want there to be a need to have more compute in addition to the high RAM?
ihatebeinganonymous@reddit (OP)
Considering a RAM-constrained setup, I prefer to get the best quality out of my xB of memory usage. Doesn't that make sense?
ea_man@reddit
Yes because you are a local user with a GPU with little VRAM pool, so you want to get the most out of it.
In datacenter is different: they have to load the weights for a multitude of users and want them to be traversed fast.
People with Macs with unified RAM are in a similar position: they have larger RAM pools but little compute to traverse it all, better to have just few active parameters.
cakemates@reddit
mate, MoEs were made to save on vram and compute because vram is way more constrained than ram. Dense models often suck at running from ram.
xadiant@reddit
Reasoning can take 20k tokens. 10 tps is gonna take forever.
mtmttuan@reddit
Problem is if you're RAM-constrained then you're very likely also compute-constrained. It's also simpler to scale RAM than to scale compute.
danishkirel@reddit
Unless it's too slow to be practical. Say you have a 64GB Mac. No chance you code interactively on a 27B dense model at the highest quantization that fits. With MoE you do.
FullOf_Bad_Ideas@reddit
I think the main point is training cost.
You need the same amount of FLOPs to train a 32B dense model as you do to train a 1T A32B MoE. It's a really good deal a lot of the time.
nacholunchable@reddit
^This^ Don't get me wrong, as a proud APU user I do really appreciate the inference speed and have lower VRAM restrictions myself, but that's a happy side effect. Large dense models are a bitch to train, and the theoretical benefits are lost if you can't bake 'em in properly. Even as we move toward inference cost optimization from training cost optimization, the scale of compute and time for training is still incredible, and if you've got a fixed GPU-hour training budget, a large MoE will likely come out better, simply due to absorbing more knowledge in the same amount of time.
ihatebeinganonymous@reddit (OP)
Really? How? Any source I can read?
FullOf_Bad_Ideas@reddit
Yes, sparsity is their premise. Each token, both trained and inferenced, activates only the "activated parameters", so you train only those 32B activated parameters on each token. There are complexities arising from the fact that you need more memory to store the weights if your MoE is sparse and has a lot of experts, so it may be harder to avoid idle time and get good utilization rate on the training cluster, but the amount of computation that goes into the model is indeed correlated with activated parameter count.
The formula for calculating the number of FLOPs used for training is:
FLOPs = 6 × (tokens trained on) × (active parameters)
For dense models, all parameters are active. The 6 is fixed because, per active parameter and per token, training needs one forward pass (about 2 FLOPs) and one backward pass (about 4 FLOPs). As you can see, training a 1T model with only 32B active params on a given dataset has the same theoretical computational cost as training a 32B dense model.
Here's a good deep-dive exploration of the premise that MoE training is more compute efficient - https://arxiv.org/abs/2501.12370
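Plugging numbers into the 6 × tokens × active parameters estimate shows the gap directly. A minimal sketch: the 15T-token dataset size is a hypothetical figure chosen for illustration, and the estimate ignores the utilization and memory overheads mentioned above.

```python
def training_flops(tokens: float, active_params: float) -> float:
    """Approximate training compute: 6 FLOPs per active parameter per token."""
    return 6 * tokens * active_params


if __name__ == "__main__":
    tokens = 15e12  # hypothetical dataset size

    dense_32b = training_flops(tokens, 32e9)
    moe_1t_a32b = training_flops(tokens, 32e9)   # only the 32B active params count
    dense_1t = training_flops(tokens, 1e12)

    print(f"32B dense:    {dense_32b:.2e} FLOPs")
    print(f"1T A32B MoE:  {moe_1t_a32b:.2e} FLOPs")
    print(f"1T dense:     {dense_1t:.2e} FLOPs")
    print(f"1T dense / 1T A32B MoE: {dense_1t / moe_1t_a32b:.2f}x")
```

The dense-to-MoE cost ratio at equal total size is just total parameters divided by active parameters, independent of the dataset size.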
ea_man@reddit
training is about 7x faster than moe: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
GoldenX86@reddit
They are great at running fine on my 8GB GPU, that's good enough of a reason for me.
ihatebeinganonymous@reddit (OP)
What models do you run? Is the 8GB vRAM that you have a limit on active parameters or total parameters?
GoldenX86@reddit
Qwen3.6-35b-a3b, q5_k_s and 131k of context barely fits in 8gb of vram and 32gb of ram. Extra experts on ram lets the GPU only use around 3gb of vram, so the rest can be used for context.
All of that is good enough for 33 t/s with a 3060ti and a 5600x.
Internal-Science2137@reddit
the underrated benefit is specialization: different experts actually learn different skills, not just random partitioning. some handle code, some handle math, etc.
Silver-Champion-4846@reddit
That's not what experts do. Only recently did some lab try forcing the model to split its subnetworks per domain
Internal-Science2137@reddit
fair, clean domain splits dont happen naturally. token-level specialization does though - and DeepSeek's auxiliary loss was the first real push for domain-level
Silver-Champion-4846@reddit
Like calling them experts was misleading since they're just grammatical element subnetworks as far as I understand it.
twack3r@reddit
MoEs did just that: they shifted the bottleneck from compute to memory bandwidth. As a result, the price for memory and memory bandwidth went up, and as we all know, massively so.
Still, even at current memory prices, this is way cheaper both on CAPEX as well as OPEX compared to a compute-constrained, equivalent dense model.
So MoE made models cheaper to run and using less energy.
Serprotease@reddit
Memory bandwidth? If anything MoE are less bandwidth intensive and make older hardware with tons of slow-ish ram/vram actually usable. Like, you can probably cobble together something with 4/6 channels of ddr3 or slow ddr4 + a second hand ampere gpu with 16gb of vram and run 120b MoE at 10+ tk/s tg.
twack3r@reddit
Yes you can do that and it will work.
But use more channels and/or faster RAM and it will run faster without any change to compute requirement.
Hence: MoEs moved the bottleneck [for performance] from compute to memory bandwidth.
sdkgierjgioperjki0@reddit
All the big models were MoE from the beginning (original GPT-4 was MoE); it was only Llama that was dense, which is why people on here think that MoE is some new thing and are under the misconception that most proprietary models are dense, when they never have been.
MoE didn't move any bottleneck and didn't cause any prices of hardware to shift, just general demand for more hardware across the board, especially everything involved in Nvidia GPUs and Google TPUs, increased most in price.
Mguyen@reddit
The evolution of LLMs naturally leads to MoE models. Of all the xB weights of a dense LLM, not all of them will be strongly activated. If you look into it, you will find that a large number of weights do not meaningfully contribute to each token (but this is different for each token). This is what's referred to as activation sparsity and it is a naturally emergent behavior. We've known about this for at least a few years. Interesting enough, an analog to this is the old saying that people only use 10% of the brain (the brain has different regions that are active at different moments in time. It's probably not actually 10% but what's important is that it's also a sparse activation)
Trimming a model and optimizing it so that it knows which experts to send a token prediction to is the hard part. You need to balance it so that your chosen experts all get similar usage and that you're not trimming away parameters that are important. This would get similar results to a heavily quantized model in that it preserves the parameters that correspond to trained knowledge but that their weights are modified. The activations won't be exactly as trained.
Pleasant-Shallot-707@reddit
"What is their point beyond their point?"
droptableadventures@reddit
vasileer@reddit
My rule of thumb:
- MoE for Macs (more RAM less FLOPS)
- Dense for GPU (less RAM more FLOPS)
ihatebeinganonymous@reddit (OP)
Interesting point of view, thanks. Can't we slightly generalise it to CPU vs GPU indeed?
my_name_isnt_clever@reddit
You could describe it as dense for raw VRAM (GPUs) vs MoE for unified RAM (DGX Spark, Mac, AMD Strix Halo).
Ok_Appearance3584@reddit
No, unified memory systems like DGX Spark do include a GPU, but benefit more from MoE due to high RAM, decent processing power and low bandwidth.
Adventurous-Paper566@reddit
They cost less at inference.
Monk_Boy@reddit
MoE uses less RAM for the KV, so you get a larger effective context, relative to the amount of RAM used for the model.
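For reference, a rough way to estimate KV cache size for standard multi-head or grouped-query attention is 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per value. Note that the number of experts does not appear in it; the cache depends on the attention configuration. The configuration below is a hypothetical one for illustration, not the published shape of any specific model.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: keys + values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical config: 48 layers, 8 KV heads of size 128, 32k context, 16-bit cache.
    size = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, seq_len=32768)
    print(f"{size / 1024**3:.1f} GiB")
```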
Evgeny_19@reddit
The breadth of knowledge is better on a bigger model. So depending on your use case, a MoE model could produce better results than a dense one. Just the other day there was a discussion where one person shared that for them Qwen3 Coder Next performs better than Qwen 3.6 27b. Same could be applied to other models. It is quite possible that Qwen 3.5 122b would be better for some use cases than 3.6 27b despite having only 10B of active parameters.
mild_geese@reddit
"What is the point of it, beyond the whole point of it"
Being faster *is* the point. Dense models are often just too slow to run at practical speeds without expensive hardware. Dense models do give you better results for their memory footprint, but that doesn't matter if they aren't running at usable speeds.
Confusion_Senior@reddit
"Just" speed? The crucial factor in this economy is intelligence per unit of energy, and they improve that by what, 400%?
Also, it is known that there are diminishing returns around some density values, so they use these for the size of the experts
Kahvana@reddit
It requires less GPU compute and more RAM/NVMe.
Most consumer devices do come with a decent amount of RAM, but dedicated GPUs are usually reserved for gaming enthusiasts. NPUs are a very recent thing.
With MoE models, the compute requirement is low enough for low-end NPUs, low-end GPUs and CPUs to run the model while still being "good enough" in intelligence.
For models larger than ~150GB, the GPU compute starts to give diminishing returns; for 500B+ it's almost a requirement to make it cheap enough to run.
SillyLilBear@reddit
That is all they are for, they are inferior in every way but speed. As models get bigger, they are harder and harder to maintain usable performance.
ParaboloidalCrest@reddit
As simple as that. Not sure why we have 100x comments * 2 paragraphs each when the answer was in OP's question.
VoiceApprehensive893@reddit
crazy s p e e d
at the cost of intelligence but not knowledge, also a dense model with the same benchmarks will likely feel "smarter" but with less knowledge
ImportancePitiful795@reddit
Seems you are confused and this directly stems from your statement
"MoEs at a disadvantage when RAM is scarce" when VRAM is more scarce and expensive for the dense models...
MoEs manage resources WAY better, hence they are ideal when RAM is scarce, not the other way around.
Think MoE like this.
You want to make a cake and you have a recipe library of 1000 books. Do you read the whole library and then go back to the page to make the specific cake? Nope. You go straight to the book containing cake recipes.
After that when you want to make Stifado, you do not go through the whole library also, you go straight to the Greek Cuisine book.
If you want to continue on Greek cuisine with another dish, the model doesn't have to go back to the library to search again for the Greek cuisine book, nor having loaded in memory French & Italian cuisine also, like a dense model.
This is how MoEs work.
The second part is how large the loaded experts are. 4B or 8B is ideal, as it performs like an 8B dense model when you stick to the current "subject". It has the knowledge of the whole library, e.g. 120B, but only the book needed is being pulled forward. However, if something is needed from another book, it goes to get it. (and here we have copy flags etc)
A dense model because it has to be loaded completely, you are limited by RAM/VRAM.
So has to load the whole 1000 book library when trying to make a cake. Not just the book for the cakes.
So if cannot load the whole 1000 book library, needs to load a 300 book library, and immediately is 700 books short, without having access to them nor knowledge that they do exist.
Or worse, has very brief knowledge of all those 1000 books, and cannot give you exact detailed steps how to make the cake, or stifado. Just "general" overview, which might lead to a crap result.
Pristine-Woodpecker@reddit
I don't understand how this is upvoted, it's massively misleading. MoE doesn't save anything on RAM, only compute. The experts are all over the network and every layer triggers different ones almost for every token, they're not subject matter experts as we would think of in a human sense.
ImportancePitiful795@reddit
You forget few fundamentals.
RAM is cheaper than VRAM and Bigger model at higher quant are better.
Pristine-Woodpecker@reddit
No serious deployment runs models out of RAM. Nobody is running production on your cheap at home server that runs at 4 tg/s.
a_beautiful_rhind@reddit
They might save your KV prompt cache to ram :P
But yea.. people here are struggling with 10 active params so their views will be skewed. They only see model goes faster at better than Q2K.
I would actually take R1 in your scenario because it has way more social intelligence than qwen, has more active params and more knowledge.
ImportancePitiful795@reddit
You are talking about PRODUCTION.
Yet you forget the most successful PRODUCTION systems composed of HBM RAM and LPDDR5X running in HYBRID MODE.
GH200, GH300, GB300, MI300A etc.
All of them have CPU + RAM (LPDDR5X) + GPU with much smaller VRAM.
That's how the whole AI infrastructure is being built in PRODUCTION.
And these are scaled in clusters.
You won't find single cards with 1TB HBM3.
edsonmedina@reddit
> aren't MoEs at a disadvantage when RAM is scarce, like the current situation?
I think the market is already moving past that with unified memory systems like Macs, DGX Spark, Strix Halo, etc
These systems have A LOT more VRAM but less memory bandwidth, which makes them perfect for MoE.
Clear_Subconscious@reddit
Main point of MoE isn't just speed, it's scaling capacity without scaling compute.
You get many specialized "experts" but only a few run per token, so better quality per FLOP than dense models. Tradeoff is more RAM/storage and harder routing/training stability. Extreme splits eventually give diminishing returns.
danihend@reddit
MoEs are amazing for those with a setup like mine. I have 10GB RTX3080. I can offload any number of experts to the CPU and reserve the VRAM for the rest plus KV cache. This means I can run qwen3.6-35B-A3B at like 30 tps with 100k+ context at q8 or Q4 with even more context or faster.
27b model I can run at like 5 tps
ihatebeinganonymous@reddit (OP)
Thanks. Do I understand correctly that 10GB is the limit on the number of "active" parameters your setup handle, and not total parameters, which is bound by system RAM?
Pristine-Woodpecker@reddit
It doesn't work this way. A 35B model will have 35B parameters active for any given prompt, but 3B active per token. You can't swap them in/out per token without killing performance.
The reason it works at all is because 3B active is small enough having to run it from system RAM/CPU doesn't *totally* kill performance.
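A rough way to see why per-token swapping doesn't help: assume, purely for illustration, that each token picks k of E experts uniformly at random in a given layer (real routers are not uniform, and E=128, k=8 are hypothetical numbers). Then the expected fraction of experts touched after n tokens is 1 - (1 - k/E)^n, which goes to 1 very quickly.

```python
def expected_fraction_touched(num_experts, k, num_tokens):
    """Expected share of experts hit at least once, assuming uniform random top-k routing."""
    miss_per_token = 1.0 - k / num_experts  # chance one token skips a given expert
    return 1.0 - miss_per_token ** num_tokens

# Hypothetical layer: 128 experts, 8 active per token.
for n in (1, 10, 100):
    print(n, round(expected_fraction_touched(128, 8, n), 4))
```

Under these assumptions a single token touches about 6% of the experts, but a prompt of a hundred tokens already touches nearly all of them, so the whole model has to live somewhere reachable.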
Scared-Tip7914@reddit
It's all about the speed, and don't think about MoE with our GPU-poor mindset. I mean, just look at Kimi K2.6: it's a 1 trillion parameter MoE model, and no one around here (or very few lucky bastards) is running that thing at home.
This is so that the data centers serving these models can get very good speed-to-quality ratios, because they can get the reasoning and depth of, let's say for Kimi K2.6, a 1 trillion parameter model (I know that's not the exact MoE-to-dense conversion ratio, but let's assume) while paying for the compute and enjoying the speed of an "only" 32B model. Even though the model is actively occupying hundreds of gigs of VRAM, it doesn't really matter because the throughput makes up for it and then some, as opposed to having a dense model on that same VRAM.
So it's more big datacenter economics, but it trickles down to us as well, hence we get to enjoy the likes of qwen3.5-35B.
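The datacenter math can be sketched with the usual rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token (attention cost ignored). The 1T total / 32B active figures are the ones from the comment above; the 1 byte per parameter (8-bit weights) is an assumption for the example.

```python
def forward_flops_per_token(active_params):
    # Rule of thumb: one multiply and one add per active weight per token.
    return 2 * active_params

def weight_memory_bytes(total_params, bytes_per_param):
    # Every parameter must be stored, whether or not it is active for a token.
    return total_params * bytes_per_param

moe_total, moe_active = 1_000_000_000_000, 32_000_000_000
dense_total = 1_000_000_000_000
bytes_per_param = 1  # assumption: 8-bit weights

moe_flops = forward_flops_per_token(moe_active)
dense_flops = forward_flops_per_token(dense_total)
compute_ratio = dense_flops / moe_flops

print(f"MoE:   {moe_flops / 1e9:.0f} GFLOPs/token")
print(f"Dense: {dense_flops / 1e9:.0f} GFLOPs/token")
print(f"Dense needs {compute_ratio:.2f}x the compute per token")
print(weight_memory_bytes(moe_total, bytes_per_param) ==
      weight_memory_bytes(dense_total, bytes_per_param))
```

Same weight footprint, roughly 31x less compute per token under these assumptions, which is the whole economic argument in two lines.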
Eyelbee@reddit
It's hard to train an MoE. Things start going south when you go very sparse, like 100B-A1B; it creates architectural challenges. Otherwise I'd train a 1.6T A0.8B omega-sparse MoE on my 3090.
And yeah, going faster is important, especially for larger models. If you had a 1T dense model you'd get extremely slow t/s even with a GB300 cluster. Fitting it all in VRAM doesn't matter when you have to push all parameters through for every token.
havnar-@reddit
I use MOE on my MacBook m5 Pro 64GB. Qwen 3.6 MOE 8bit. 50-60 Tok/s
27b is only pulling like 8 or 10 tokens per second.
So it's perfect for me.
datbackup@reddit
the point is increasing the ratio of intelligence to compute/energy. Huge dense models are "smarter" but also wasteful. If you can get equally good answers using a fraction of the active parameters, your answers will be smarter per unit of compute (and electricity)
Speed is also extremely important. While I wouldn't want to sacrifice intelligence for speed, actually building things that use AI (like agents) requires many iterations. If each iteration can be sped up by just 50% that adds up to massive time savings.
Regular_Car_9458@reddit
This is for poor people who can only afford 5060 with 16GB VRAM.
Otherwise_Economy576@reddit
moe is not just speed - you get wider "expert" specialization without activating the whole model. tradeoff is routing instability and weird failures on edge prompts. for local use i still default dense models under 14b for predictability.
Formal-Exam-8767@reddit
The main issue of running LLMs is not the amount of memory or even compute but memory bandwidth.
You can ignore the amount of memory, since having enough is a given: if you don't, you can't run the model, period.
MoEs have fewer active parameters, so they make better use of memory bandwidth and require less compute, since they work with a smaller amount of data than dense models.
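A back-of-the-envelope version of the bandwidth argument: for single-user decoding, every active weight has to be read from memory once per token, so tokens/s is capped at roughly bandwidth divided by bytes of active weights. This is only an upper bound (it ignores KV cache reads, compute, and overhead), and the 100 GB/s bandwidth and 1 byte per parameter below are hypothetical numbers; the 3B-active and 27B-dense sizes are the ones mentioned elsewhere in this thread.

```python
def decode_tps_upper_bound(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling on decode speed: one full read of active weights per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

bandwidth = 100e9    # hypothetical: 100 GB/s memory bandwidth
bytes_per_param = 1  # assumption: 8-bit weights

moe_tps = decode_tps_upper_bound(3e9, bytes_per_param, bandwidth)     # 3B active
dense_tps = decode_tps_upper_bound(27e9, bytes_per_param, bandwidth)  # 27B dense

print(f"MoE (3B active): up to ~{moe_tps:.1f} tok/s")
print(f"Dense 27B:       up to ~{dense_tps:.1f} tok/s")
```

The absolute numbers depend entirely on the assumed bandwidth, but the 9x ratio between the two only depends on the ratio of active parameters.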
Zeeplankton@reddit
Everything is simply down to economics assuredly. MoE runs faster.
The faster you can fill a user query the faster compute releases for the next user. MoE probably saves billions.
The LLM race is not about intelligence. It's about "enough intelligence to just stay competitive, and be compute optimized".
Every company lives and dies by infra costs right now.
fishyfishy27@reddit
There is a technique called MoE offload, where only the active experts are pulled into VRAM to process each token. If you have a lot of system RAM but not much VRAM, it allows you to run much larger MoE models than you could with dense models.
Herr_Drosselmeyer@reddit
With a single user, or just a few, VRAM is indeed the bottleneck. Once you have enough of that, you'll be able to run most everything at a conversational pace.
However, once you're trying to serve thousands of concurrent requests, compute is no longer negligible. Same if you're running complex agentic setups that churn through tokens at very high rates. In these cases, you'll want to prioritize throughput, which is why MoE models are quite popular.
Better-Struggle9958@reddit
sometimes you get lucky and get a good answer
Euphoric_Emotion5397@reddit
To me, the Qwen 3.6 MoE and dense models are equivalent at Q4-KM and Q8 KV cache.
I did some impromptu cut-and-paste of the whole reasoning process into Gemini and Claude to rate it, and both seem to think Qwen 3.6 MoE is better at agentic workflows.
So, I used MOE all the time, it's super fast with 200k context.
Routine_Plastic4311@reddit
RAM is the bottleneck for sure, but the point is you get better inference speed without proportional memory cost. Dense models that match quality would need way more params, so MoE is a practical tradeoff. The extreme ratios stop making sense because expert collapse and routing overhead eat the gains.
DeltaSqueezer@reddit
Besides faster? Cheaper.
123vovochen@reddit
Great questions for AI actually, Mistral is really cheap