Lower inference speed of Gemma4 26B A4B on vLLM.

Posted by everyoneisodd@reddit | LocalLLaMA | View on Reddit | 8 comments

For my earlier use case I hosted Qwen 2.5 VL 7B (GPTQ Int4). Now I was looking to switch to Gemma4 26B A4B, expecting it to improve quality as well as latency, since only 4B parameters are active. However, it seems that Gemma4 is slower. What could be the reason for this?
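One plausible explanation: decode speed is usually memory-bandwidth bound, and what matters is the bytes read per token, not the active parameter count alone. A 7B dense model in 4-bit weights can move fewer bytes per token than a 4B-active MoE running at bf16. A back-of-envelope sketch (all numbers here are hypothetical assumptions for illustration, not measurements of either model):

```python
# Back-of-envelope decode latency: each generated token must read all
# weights involved in that token's forward pass from GPU memory.
def per_token_ms(active_params_b: float, bytes_per_param: float,
                 mem_bw_gbps: float) -> float:
    """Rough per-token decode time (ms), assuming memory-bandwidth bound."""
    gb_read = active_params_b * bytes_per_param  # GB moved per token
    return gb_read / mem_bw_gbps * 1000

# Hypothetical numbers for illustration only (not measured):
BW = 1000.0  # GB/s, e.g. a high-end consumer GPU

dense_int4 = per_token_ms(7.0, 0.5, BW)  # 7B dense, 4-bit weights -> 3.5 GB/token
moe_bf16 = per_token_ms(4.0, 2.0, BW)    # 4B active, bf16 weights -> 8.0 GB/token

print(f"dense int4: {dense_int4:.1f} ms/token")  # 3.5 ms/token
print(f"moe bf16:   {moe_bf16:.1f} ms/token")    # 8.0 ms/token
```

Under these assumed numbers, the MoE model reads more than twice as many bytes per token despite fewer active parameters. Other common culprits are less-optimized MoE kernels in the serving stack and expert-routing overhead, so quantizing the MoE or checking vLLM's kernel support for the architecture may be worth trying.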