How much faster is Gemma 4 26B-A4B during inference vs 31B?
Posted by alex20_202020@reddit | LocalLLaMA | View on Reddit | 16 comments
I want to download one. I usually do inference on CPU since I have an old GPU, so I'm concerned with speed.
One link on the web (I previously posted with it, but the post was removed):
Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs significantly slower than Qwen 3.5's equivalent.
I guess it could be due to early versions of the backend engine. With the newest llama.cpp now, what is the inference speed of 26B-A4B vs 31B?
Poha_Best_Breakfast@reddit
Both on RTX 3090.
I’m getting 117 tok/s on 26B Q4_K_XL with 256k ctx (30-40k filled usually)
38 tok/s on 31B iQ4_XS with 128k ctx
With speculative decoding the 31B dense gets around 30-40% speed up in coding overall (worst case 0 speed up, best case 3x). No gain on MoE
alex20_202020@reddit (OP)
It's new to me. How do you do that for 31B exactly? Web search found e.g. https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf/speculative-decoding, but the instructions are not clear:
But later it says to use --device-draft CUDA0,CUDA1 instead of --model-draft.
mtmttuan@reddit
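For reference, a speculative-decoding invocation with llama.cpp's llama-server looks roughly like this. This is a sketch: the GGUF file names are placeholders, and the draft model must use the same tokenizer family as the target model.

```shell
# Sketch: llama.cpp speculative decoding (file names are placeholders).
# -md selects the small draft model; -ngld offloads its layers to GPU
# the same way -ngl does for the main model.
llama-server \
  -m gemma-4-31b-it-iQ4_XS.gguf \
  -md gemma-4-draft-Q4_K_M.gguf \
  -ngl 99 -ngld 99 \
  -c 16384
```

As I understand it, --device-draft only matters on multi-GPU setups where you want to pin the draft model to specific devices; it supplements --model-draft rather than replacing it.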
4B active vs 31B active, so ~8x faster. Maybe a bit more if you can offload to GPU.
putrasherni@reddit
That’s the wrong way to calculate MoE performance.
It’s sqrt(total × active),
which gives roughly a 10.2B-param-equivalent model for 26B-A4B.
Confident_Ideal_5385@reddit
For "intelligence", sure. For speed, no. Your main bottleneck is a trip through the active parameters for each token (or token batch, if you're ingesting or doing batched decoding), so a factor of 8 is about right.
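As a quick sanity check, here is a small Python sketch of both rules of thumb from this exchange: the geometric-mean "effective capacity" estimate and the active-parameter speed estimate. Both are folklore heuristics, not exact laws.

```python
from math import sqrt

total, active, dense = 26, 4, 31  # billions of params (Gemma 4 26B-A4B vs 31B)

# Rule of thumb: an MoE's "effective capacity" is the geometric
# mean of total and active parameter counts.
effective = sqrt(total * active)  # ~10.2B-equivalent for "intelligence"

# Speed, however, scales with *active* parameters per token,
# since each token only passes through the activated experts.
speedup = dense / active  # 7.75x expected over the 31B dense model

print(f"effective size ~{effective:.1f}B, expected speedup ~{speedup:.2f}x")
```

The two numbers answer different questions: the geometric mean estimates quality, while the active-parameter ratio estimates tokens/second.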
putrasherni@reddit
always learn new things, ty
alex20_202020@reddit (OP)
Have you fact-checked already?
alex20_202020@reddit (OP)
TIL: sqrt(total × active) for intelligence. Is that close to the truth?
jacek2023@reddit
Source: trust me bro
stddealer@reddit
On my system it's a bit more than 4x faster.
SadGuitar5306@reddit
3-4 times faster
ttkciar@reddit
Pure-CPU inference on my dual E5-2660v3 Xeon (DDR4-2133):
Gemma-4-31B-it: 1.6 tokens/second
Gemma-4-26B-A4B-it: 11.5 tokens/second
That's roughly proportional to the number of active parameters (31B for the dense, 4B for the MoE) which is exactly what is expected.
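Those two measurements can be checked against the active-parameter rule of thumb directly (numbers taken from the comment above):

```python
# Tokens/second reported above for pure-CPU inference.
dense_tps = 1.6   # Gemma-4-31B-it (31B active params)
moe_tps = 11.5    # Gemma-4-26B-A4B-it (4B active params)

measured_speedup = moe_tps / dense_tps  # ~7.19x
predicted_speedup = 31 / 4              # 7.75x from the parameter ratio

# The small shortfall vs the predicted ratio is plausibly overhead
# from routing and any layers shared across experts.
print(f"measured {measured_speedup:.2f}x vs predicted {predicted_speedup:.2f}x")
```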
PrysmX@reddit
The "A4B" means only 4 billion parameters are active at any given time. That's why it's so much faster. A dense model will be more capable, but much slower. It's a matter of weighing the need for a given task. An MoE model may be just fine in some cases.
jojorne@reddit
16GB VRAM and 16GB RAM.
26B-A4B at full context size was 33 tok/s.
chensium@reddit
You're comparing the 2 Gemma models but you're quoting a comparison to Qwen3.5?
In any case, Gemma 4 26B is MoE; Gemma 4 31B is dense. MoE is way, way faster.
vSphere-Cluster-1234@reddit
The quote you are citing suggests that Gemma 4's MoE is slower than Qwen 3.5's MoE.
But you are asking about the inference speed of Gemma 4 MoE vs Gemma 4 dense, so I have no idea what you are trying to say here.
If you are doing pure CPU with no GPU, then MoE is the only realistic choice for workable speeds.