Lower inference speed of Gemma4 26B A4B on vLLM
Posted by everyoneisodd@reddit | LocalLLaMA | 8 comments
For my earlier use case I hosted Qwen 2.5 VL 7B GPTQ int4. Now I was looking to switch to Gemma4 26B A4B, since it should improve performance as well as latency, considering only 4B parameters are active per token. However, Gemma4 turns out to be slower. What could be the reason for this?
Special-Lawyer-7253@reddit
It's 26B parameters total; the other number is the active parameters per token. So you must fit all of it in your VRAM. If the model doesn't fit, it gets offloaded to RAM, and you get slower responses.
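(A rough back-of-the-envelope check, weights only and assuming round numbers, of why all 26B parameters still have to be resident even though only ~4B are active per token.)

```python
# Rough weight-memory estimate; assumed round numbers, weights only
# (ignores KV cache, activations, and framework overhead).
def weight_vram_gib(total_params_billions: float, bits_per_weight: float) -> float:
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

# Every expert's weights must sit in VRAM, even if only ~4B params fire per token.
print(f"26B @ int4: {weight_vram_gib(26, 4):.1f} GiB")   # ~12.1 GiB
print(f"26B @ fp16: {weight_vram_gib(26, 16):.1f} GiB")  # ~48.4 GiB
print(f" 7B @ int4: {weight_vram_gib(7, 4):.1f} GiB")    # ~3.3 GiB
```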
everyoneisodd@reddit (OP)
All params are on the GPU. The Gemma4 model is GPTQ int4 quantised as well.
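(For reference, a minimal sketch of loading a GPTQ int4 checkpoint with vLLM's Python API and timing a single generation; the model ID is a placeholder and exact arguments can differ between vLLM versions.)

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model ID; quantization="gptq" tells vLLM to use the GPTQ kernels.
llm = LLM(
    model="your-org/gemma4-26b-a4b-gptq-int4",  # hypothetical repo name
    quantization="gptq",
    gpu_memory_utilization=0.90,  # fraction of GPU VRAM vLLM may claim
)

params = SamplingParams(max_tokens=256, temperature=0.0)
start = time.perf_counter()
out = llm.generate(["Explain mixture-of-experts routing in two sentences."], params)
elapsed = time.perf_counter() - start

generated = len(out[0].outputs[0].token_ids)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```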
Special-Lawyer-7253@reddit
All the Q4 quants exceed 12GB in size, so there's no way it all fits in VRAM. Reduce the number of GPU layers until you have about 1GB of VRAM free; the rest will be offloaded to RAM.
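(The "reduce GPU layers" advice maps to the llama.cpp/Ollama-style offload knob rather than anything in vLLM; a minimal sketch with the llama-cpp-python bindings, where the GGUF path and layer count are assumptions.)

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers stay in VRAM;
# the rest are kept in system RAM. Path and layer count are placeholders.
llm = Llama(
    model_path="./gemma4-26b-a4b-q4_k_m.gguf",  # hypothetical local quant
    n_gpu_layers=35,   # lower this until ~1 GB of VRAM stays free
    n_ctx=4096,
)

print(llm("Say hi in five words.", max_tokens=16)["choices"][0]["text"])
```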
everyoneisodd@reddit (OP)
I have access to an H100.
Special-Lawyer-7253@reddit
No clue then, but you could try qwen3.5 9B or Gemma 4 E4B (it's really 8B) and see if you get better results there. I got 25 t/s on a 1070 8GB with those, and 6.5 t/s with Gemma4 26B.
Jester14@reddit
I used a 2-year-old 7B model. Now I use a brand-new 26B MoE and it's slower. I refuse to give any other information. What's wrong with my setup?
everyoneisodd@reddit (OP)
Just want to understand if this is expected behaviour or not.
Ok_Ocelot2268@reddit
For me, the vLLM Gemma4 Docker image (ROCm!) is fast (based on 0.18); 0.19 is slow on bigger contexts (very slow). The latest Ollama is fast too.
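(To check the "slow on bigger context" observation on your own build, a small timing sketch comparing short vs. long prompts; the model ID is a placeholder and the long prompt must stay within the model's context window.)

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/gemma4-26b-a4b-gptq-int4")  # placeholder model ID
params = SamplingParams(max_tokens=128, temperature=0.0)

for n_words in (100, 4000):  # short vs. long synthetic prompt
    prompt = "word " * n_words + "\nSummarise the above in one sentence."
    start = time.perf_counter()
    out = llm.generate([prompt], params)
    dt = time.perf_counter() - start
    toks = len(out[0].outputs[0].token_ids)
    print(f"{n_words:>5}-word prompt: {toks} tokens in {dt:.1f}s ({toks / dt:.1f} tok/s)")
```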