5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp

Posted by see_spot_ruminate@reddit | LocalLLaMA | View on Reddit | 8 comments

Hey all,

While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.

Here is my working set up in a venv with uv:

cuda 13.1 && nvidia driver 590.48.01 (driver 595 and ubuntu 26.04 had difficulty finding all the cards and would only show 3/4 for some reason)

Environment="CUDA_HOME=/usr/local/cuda"

Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64"

Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"

Environment="VLLM_SKIP_P2P_CHECK=1"

vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \

--kv-cache-dtype fp8 \

--tensor-parallel-size 4 \

--max-num-seqs 2 \

--max-model-len auto \

--enable-auto-tool-choice \

--tool-call-parser gemma4 \

--chat-template examples/tool_chat_template_gemma4.jinja \

--language-model-only \

--reasoning-parser gemma4 \

--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}' \

--port 9999

Now, I got this off of the vllm recipes website with some caveats. In the speculative config, the recipe website does not list "method":"mtp" as being needed but the github documentation does say it is needed. It also seems that either will work and there is a closed issue with current comments about mtp and gemma documentation being inconsistent.

I have some environmental variables set. This is because on ubuntu 24.04 there is a mismatch with what cuda version it comes with and what I installed. So you need to declare it. I am also skipping the p2p check for right now, since I didn't go through the trouble of installing it and it has a slight speedup in boot.

Other issues. The kv cache is at fp8, I tried changing it but it crashes at start. This is from the recipe and I guess it might be in the model card or something. Probably something I have been too lazy to look into. Right now it is working well.

Unlike tool calling with qwen, gemma seems to do okay with mtp of 4 tokens (instead of 2, at least for me). You will also need to template in a template folder, see the vllm recipe website. I gave up after like 2 minutes with mistral-vibe and using it. There is an issue on their github (mistral-vibe) talking about issues with tool calling and vllm. I switched over to pi dev and it is so much faster that I probably wont go back.

Overall I am able to reach ~60 t/s on generation with this setup as a single user. Random generation is around 40 t/s and there are bursts up to 90 t/s sometimes, but these are just bursts.

I have my concurrency at 2, but this is because my wife sometimes uses it through openwebui and she never uses a lot of context. Context with the current settings says I can load up around 470k tokens or around 1.85x. For me and my setup this is fine. You may need more vram and probably wont use a 5060ti setup if you have like a company with a lot of users or something anyway.

While nvfp4 support is not all ironed out, it seems to be doing okay right now with the latest vllm. Have fun.

[-]

farkinga@reddit

I vacillate between Gemma-4-31b and Qwen3.6-27b. Qwen3.5 to Qwen3.6 is a bigger jump than I realized. Gemma-4 is very good at following instructions. But so is Qwen3.6; far beyond 3.5.

I have 2x 5060ti and I am getting 75 t/s generation up to 90 t/s (and as low as 60 t/s). It's so good. Gemma was just slightly too big; so I was quantizing the cache and it was just not worth it. I could do 64k on Gemma-4 but 128k on Gemma3.6.

I am running qwen3.6 27b without quantizing the cache. I have the weights quantized to Q4_k_m, which is harsh enough.

This one is hard ... I really like Gemma-4 but the practical dimensions of Qwen3.6 are my current north star.

[-]

chocofoxy@reddit

vllm eats the vram like eating cake i ran Q4 qwen 3.6 35b on LM studio i get 100 - 60 t/s on my 2 5060ti full context while in vllm i can't pass 1/4 and i get worst token generation ( i think its a tp and pp problem) but even in llama-server same thing by default i get 70 60t/s ( mtp didn't help much) it's weird i thought that lm studio also run on llama.cpp

[-]

see_spot_ruminate@reddit (OP)

vllm and/or mtp do better with the dense models from what I have read. Those are some good numbers on the MOE though.

[-]

see_spot_ruminate@reddit (OP)

Oh, qwen 3.6 is soo good. I just wanted to try something different.

It is not on the vllm recipe website, but there is a Qwen3.6 27b nvfp4 on unsloth's huggingface.

https://huggingface.co/unsloth/Qwen3.6-27B-NVFP4

Last I tried, I didn't get mtp working with it, but I haven't tried in a bit. Maybe that would work out with you too.