Anyone got Gemma 4 26B-A4B running on VLLM?
Posted by toughcentaur9018@reddit | LocalLLaMA | View on Reddit | 8 comments
If yes, which quantized model are you using and what's your vllm serve command?
I've been struggling to get that model up and running on my DGX Spark GB10. I tried the Intel int4 quant of the 31B and it seems to work well, but it's way too slow.
Anyone have any luck with the 26B?
pfn0@reddit
is what I throw at my compose.yaml on my gb10; it runs on top of the spark vllm w/ transformers5 image generated by eugr/spark-vllm-docker
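For reference, a minimal compose.yaml sketch for this kind of setup. The service name, image tag, port, and flags here are all assumptions, not the commenter's actual config; adapt them to whatever the eugr/spark-vllm-docker build actually produces:

```yaml
services:
  vllm:
    # Assumed image tag; substitute the image spark-vllm-docker actually builds.
    image: spark-vllm:transformers5
    runtime: nvidia      # expose the GB10 GPU inside the container
    ipc: host            # vLLM workers need host shared memory
    ports:
      - "8000:8000"
    command: >
      vllm serve google/gemma-4-26b-it
      --max-model-len 16384
```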
uacode@reddit
Why not use the MoE, which will be faster?
traveddit@reddit
I have the 5090 and ran it on vLLM 19 with:
https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-NVFP4
I just patched my vLLM 19 with their fix and it worked, although vLLM still has template issues that are being actively worked on.
Blaisun@reddit
Have you tried eugr's community build of vllm for the Spark? It has a recipe for it: https://github.com/eugr/spark-vllm-docker/tree/main I haven't tried it for that specific model, but it works pretty well for other ones.
toughcentaur9018@reddit (OP)
Hmmm, I did look at it but didn't really try implementing it after reading that it does on-the-fly fp8 quantization, but I guess it's time to try it out since it's one of the only working examples.
Status_Record_1839@reddit
The 26B-A4B works on vLLM but needs at least 0.8.5+ and the right dtype flag. Try this:
```
vllm serve google/gemma-4-26b-it \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --tensor-parallel-size 1
```
The MoE architecture means only ~4B parameters are active per token, so it fits comfortably in 24GB VRAM. If you're on the DGX Spark (GB10) you should have plenty of headroom.
For quantized, the BitsAndBytes int4 on the 26B works better than the Intel int4 quant you tried for the 31B (different quantization path). Alternatively, check whether there's an AWQ or GPTQ version on Hugging Face; those tend to integrate more cleanly with vLLM's engine.
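A back-of-envelope check on which quants actually fit. This counts weight memory only (KV cache and activations add more, so treat these as lower bounds), and note that MoE sparsity doesn't reduce the resident footprint:

```python
def weight_gib(n_params: float, bits: int) -> float:
    """Rough weight-only memory footprint in GiB at a given quant bit width."""
    return n_params * bits / 8 / 2**30

# All 26B weights stay resident; the A4B part only means ~4B parameters
# are *active* per token, which helps speed, not memory capacity.
print(f"bf16: {weight_gib(26e9, 16):.1f} GiB")  # ≈ 48.4 GiB
print(f"int8: {weight_gib(26e9, 8):.1f} GiB")   # ≈ 24.2 GiB
print(f"int4: {weight_gib(26e9, 4):.1f} GiB")   # ≈ 12.1 GiB
```

So on a 24GB card only the int4 (or a similarly aggressive) quant leaves room for KV cache, while the GB10's unified memory has headroom even at bf16.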
toughcentaur9018@reddit (OP)
Thanks for the AI-generated response, but I'd really prefer a quantized model (that isn't AWQ, because my use case wouldn't really show up much in the calibration datasets) that could potentially replace the Qwen3.5 35B-A3B fp8 quant.
Cferra@reddit
I've been trying to use Gemma 4 MoE and turboquant; so far I can't get it to work.