5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp
Posted by see_spot_ruminate@reddit | LocalLLaMA | View on Reddit | 8 comments
Hey all,
While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.
Here is my working set up in a venv with uv:
cuda 13.1 && nvidia driver 590.48.01 (driver 595 and ubuntu 26.04 had difficulty finding all the cards and would only show 3/4 for some reason)
Environment="CUDA_HOME=/usr/local/cuda"
Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="VLLM_SKIP_P2P_CHECK=1"
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 4 \
--max-num-seqs 2 \
--max-model-len auto \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--language-model-only \
--reasoning-parser gemma4 \
--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}' \
--port 9999
Now, I got this off of the vllm recipes website with some caveats. In the speculative config, the recipe website does not list "method":"mtp" as being needed but the github documentation does say it is needed. It also seems that either will work and there is a closed issue with current comments about mtp and gemma documentation being inconsistent.
I have some environmental variables set. This is because on ubuntu 24.04 there is a mismatch with what cuda version it comes with and what I installed. So you need to declare it. I am also skipping the p2p check for right now, since I didn't go through the trouble of installing it and it has a slight speedup in boot.
Other issues. The kv cache is at fp8, I tried changing it but it crashes at start. This is from the recipe and I guess it might be in the model card or something. Probably something I have been too lazy to look into. Right now it is working well.
Unlike tool calling with qwen, gemma seems to do okay with mtp of 4 tokens (instead of 2, at least for me). You will also need to template in a template folder, see the vllm recipe website. I gave up after like 2 minutes with mistral-vibe and using it. There is an issue on their github (mistral-vibe) talking about issues with tool calling and vllm. I switched over to pi dev and it is so much faster that I probably wont go back.
Overall I am able to reach ~60 t/s on generation with this setup as a single user. Random generation is around 40 t/s and there are bursts up to 90 t/s sometimes, but these are just bursts.
I have my concurrency at 2, but this is because my wife sometimes uses it through openwebui and she never uses a lot of context. Context with the current settings says I can load up around 470k tokens or around 1.85x. For me and my setup this is fine. You may need more vram and probably wont use a 5060ti setup if you have like a company with a lot of users or something anyway.
While nvfp4 support is not all ironed out, it seems to be doing okay right now with the latest vllm. Have fun.
moahmo88@reddit
Can just a single RTX 5060 Ti card run a 31B model?
Pixer---@reddit
What mainboard are you using, what’s your setup ?
see_spot_ruminate@reddit (OP)
7600x3d + asus b650
The cards are not in the best and I recently have been eyeing a plx board... we will see.
Right now it is x8, x1 (due to shitty bifurcation), x4 (nvme to oculink egpu), x4 (nvme to oculink egpu)
Pixer---@reddit
It’s still decent I would say, I have 4 ancient mi50. You could try getting a plx pcie switch for better multi gpu scaling in tensor parallel mode. It’s like 300€ for 4 GPUs. You would need to use the custom p2p NVIDIA driver, but that could double your token generation
farkinga@reddit
I vacillate between Gemma-4-31b and Qwen3.6-27b. Qwen3.5 to Qwen3.6 is a bigger jump than I realized. Gemma-4 is very good at following instructions. But so is Qwen3.6; far beyond 3.5.
I have 2x 5060ti and I am getting 75 t/s generation up to 90 t/s (and as low as 60 t/s). It's so good. Gemma was just slightly too big; so I was quantizing the cache and it was just not worth it. I could do 64k on Gemma-4 but 128k on Gemma3.6.
I am running qwen3.6 27b without quantizing the cache. I have the weights quantized to Q4_k_m, which is harsh enough.
This one is hard ... I really like Gemma-4 but the practical dimensions of Qwen3.6 are my current north star.
chocofoxy@reddit
vllm eats the vram like eating cake i ran Q4 qwen 3.6 35b on LM studio i get 100 - 60 t/s on my 2 5060ti full context while in vllm i can't pass 1/4 and i get worst token generation ( i think its a tp and pp problem) but even in llama-server same thing by default i get 70 60t/s ( mtp didn't help much) it's weird i thought that lm studio also run on llama.cpp
see_spot_ruminate@reddit (OP)
vllm and/or mtp do better with the dense models from what I have read. Those are some good numbers on the MOE though.
see_spot_ruminate@reddit (OP)
Oh, qwen 3.6 is soo good. I just wanted to try something different.
It is not on the vllm recipe website, but there is a Qwen3.6 27b nvfp4 on unsloth's huggingface.
https://huggingface.co/unsloth/Qwen3.6-27B-NVFP4
Last I tried, I didn't get mtp working with it, but I haven't tried in a bit. Maybe that would work out with you too.