VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do?
Posted by superloser48@reddit | LocalLLaMA | View on Reddit | 4 comments
Hi - I want to run unsloth dynamic quant on vllm. Why?
- vllm is giving faster prefill speed
- Llama - i get 800-1000 tokens/sec
- Vllm - i get 5k-10K tokens/sec
Tried using Qwen3.6-35B-A3B FP8 official. Machine is RTX A6000 - ampere 48gb
- Unsloth q8 quant (on llama testing) gives correct pandas code, even official FP8 sucks
Why unsloth quant? For some reason - with my task - writing pandas - unsloth quant at 8bit gives much better results than the official fp8 quant. I dont know why.
(As a side note - all qwen q4 awq/gptq i tried give horrible results for pandas coding)
-
unsloth does not make safetensors/(any non gguf anymore).
-
So key question again - how to make unsloth gguf quant run on vllm? (or any gguf quant run on vllm through conversion or something?) Currently vllm gives error - says unsupported architecture
-
I tried single file gguf for both gemma4 and qwen3.6 moe
Thanks a lot
(edit - deleted old post which did not clearly have performance difference)
Qwen_os_has_died@reddit
For a single thread , llamacpp is the fastest. Vllm wins the parallel game.
DinoAmino@reddit
No, you don't. vLLM is just not optimized for running GGUFs. Llama.cpp is fully optimized for running GGUFs because that's all it does. If you want to run vLLM stick to the usual FP8s or AWQs ... or plain unquantized fp16.
pyroserenus@reddit
The only realistic reason fp8 would do measurably worse than q8 is not the quant, but the engine.
I suspect you are having issues with your prompt structuring on vllm, check that the jinja template is present and being correctly imported.
Fit_Split_9933@reddit
Your prefill speed on llama is definitely wrong. I get over 5k+ tokens/sec on my laptop.
try -ub 2048 or more.