How much faster is Gemma 4 26B-A4B during inference vs 31B?
Posted by alex20_202020@reddit | LocalLLaMA | View on Reddit | 16 comments
I want to download one. I usually do inference on CPU since I have an old GPU, so I'm concerned with speed.
One link on the web (I previously posted with it, but the post was removed):
Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs significantly slower than Qwen 3.5's equivalent.
I guess it could be due to early versions of the backend engine. With the newest llama.cpp now, what is the inference speed of 26B-A4B vs 31B?
Poha_Best_Breakfast@reddit
Both on RTX 3090.
I’m getting 117 tok/s on 26B Q4_K_XL with 256k ctx (30-40k filled usually)
38 tok/s on 31B iQ4_XS with 128k ctx
With speculative decoding the 31B dense gets around 30-40% speed up in coding overall (worst case 0 speed up, best case 3x). No gain on MoE
alex20_202020@reddit (OP)
It's new to me. How do you do that for 31B exactly? Web search found e.g. https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf/speculative-decoding, but the instructions are not clear:
But later it says to use --device-draft CUDA0,CUDA1 instead of --model-draft.
mtmttuan@reddit
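For reference, a speculative-decoding invocation with llama.cpp's llama-server looks roughly like this. This is a sketch: the GGUF file names are placeholders, and the draft model must use the same tokenizer family as the target model.

```shell
# Sketch: llama.cpp speculative decoding (file names are placeholders).
# -md selects the small draft model; -ngld offloads its layers to GPU
# the same way -ngl does for the main model.
llama-server \
  -m gemma-4-31b-it-iQ4_XS.gguf \
  -md gemma-4-draft-Q4_K_M.gguf \
  -ngl 99 -ngld 99 \
  -c 16384
```

As I understand it, --device-draft only matters on multi-GPU setups where you want to pin the draft model to specific devices; it supplements --model-draft rather than replacing it.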
4B active vs 31B active, so ~8x faster. Maybe a bit more if you can offload to GPU.
putrasherni@reddit
That’s the wrong way to calculate MoE performance.
It’s sqrt(total × active),
which gives roughly a 10.2B-param-equivalent model for 26B-A4B.
Confident_Ideal_5385@reddit
For "intelligence", sure. For speed, no. Your main bottleneck is a trip through the active parameters for each token (or token batch, if you're ingesting or doing batched decoding), so a factor of 8 is about right.
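As a quick sanity check, here is a small Python sketch of both rules of thumb from this exchange: the geometric-mean "effective capacity" estimate and the active-parameter speed estimate. Both are folklore heuristics, not exact laws.

```python
from math import sqrt

total, active, dense = 26, 4, 31  # billions of params (Gemma 4 26B-A4B vs 31B)

# Rule of thumb: an MoE's "effective capacity" is the geometric
# mean of total and active parameter counts.
effective = sqrt(total * active)  # ~10.2B-equivalent for "intelligence"

# Speed, however, scales with *active* parameters per token,
# since each token only passes through the activated experts.
speedup = dense / active  # 7.75x expected over the 31B dense model

print(f"effective size ~{effective:.1f}B, expected speedup ~{speedup:.2f}x")
```

The two numbers answer different questions: the geometric mean estimates quality, while the active-parameter ratio estimates tokens/second.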
putrasherni@reddit
always learn new things, ty
alex20_202020@reddit (OP)
Have you fact-checked already?
alex20_202020@reddit (OP)
TIL: sqrt(total × active) for intelligence. Is that close to the truth?
jacek2023@reddit
Source: trust me bro
stddealer@reddit
On my system it's a bit more than 4x faster.
SadGuitar5306@reddit
3-4 times faster
ttkciar@reddit
Pure-CPU inference on my dual E5-2660v3 Xeon (DDR4-2133):
Gemma-4-31B-it: 1.6 tokens/second
Gemma-4-26B-A4B-it: 11.5 tokens/second
That's roughly proportional to the number of active parameters (31B for the dense, 4B for the MoE) which is exactly what is expected.
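Those two measurements can be checked against the active-parameter rule of thumb directly (numbers taken from the comment above):

```python
# Tokens/second reported above for pure-CPU inference.
dense_tps = 1.6   # Gemma-4-31B-it (31B active params)
moe_tps = 11.5    # Gemma-4-26B-A4B-it (4B active params)

measured_speedup = moe_tps / dense_tps  # ~7.19x
predicted_speedup = 31 / 4              # 7.75x from the parameter ratio

# The small shortfall vs the predicted ratio is plausibly overhead
# from routing and any layers shared across experts.
print(f"measured {measured_speedup:.2f}x vs predicted {predicted_speedup:.2f}x")
```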
PrysmX@reddit
The "A4B" means only 4 billion parameters are active at any given time. That's why it's so much faster. A dense model will be more capable, but much slower. It's a matter of weighing the need for a given task. An MoE model may be just fine in some cases.
jojorne@reddit
16GB VRAM and 16GB RAM.
26B-A4B at full context size was 33 tok/s.
chensium@reddit
You're comparing the 2 Gemma models but you're quoting a comparison to Qwen3.5?
In any case, Gemma 4 26B is MoE; Gemma 4 31B is dense. MoE is way, way faster.
vSphere-Cluster-1234@reddit
The quote you are citing suggests that Gemma 4's MoE is slower than Qwen 3.5's MoE.
But you are asking about the inference speed of Gemma 4 MoE vs Gemma 4 dense, so I have no idea what you are trying to say here.
If you are doing pure CPU with no GPU, then MoE is the only realistic choice for workable speeds.