unsloth dynamic quants for vllm?
Posted by superloser48@reddit | LocalLLaMA | 2 comments
Hi - I want to run an unsloth dynamic quant on vLLM. Why?
- vLLM is giving me at least 5x faster prefill speed
- Llama - I get 800-1000 tokens/sec
- vLLM - I get 5k-10k tokens/sec
I tried the official Qwen3.6-35B-A3B FP8. The machine is an RTX A6000 (Ampere, 48 GB).
- The unsloth q8 quant (tested on llama) gives correct pandas code; even the official FP8 sucks
Why an unsloth quant? For some reason, on my task (writing pandas code), the unsloth 8-bit quant gives much better results than the official FP8 quant. I don't know why.
(As a side note - all the Qwen q4 AWQ/GPTQ quants I tried give horrible results for pandas coding)
3. unsloth does not publish safetensors (or anything non-GGUF) anymore.
4. So the key question again - how do I make an unsloth GGUF quant run on vLLM? (Or any GGUF quant, through conversion or something?) Currently vLLM gives an error saying "unsupported architecture".
Thanks a lot
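One way to narrow down an "unsupported architecture" error is to look at which architecture name the GGUF file itself declares. Below is a minimal sketch, assuming the file follows the GGUF v2/v3 header layout (magic, version, tensor count, metadata key/value count, then the key/value pairs) and stores the name under the `general.architecture` key. The function name `gguf_architecture` is made up for illustration, and whether vLLM's error is driven by exactly this field is an assumption, not something confirmed in this thread.

```python
import struct

# GGUF scalar value types mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value from the binary stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8")


def _read_value(f, vtype):
    """Read one metadata value of the given GGUF type id."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def gguf_architecture(f):
    """Return the `general.architecture` string, or None if it is absent.

    `f` is a binary file object positioned at the start of a GGUF file.
    """
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file (bad magic)")
    _read(f, "<I")             # format version (v2+ layout assumed)
    _read(f, "<Q")             # tensor count, not needed here
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value
    return None


# Usage (the path is a placeholder):
# with open("model.gguf", "rb") as fh:
#     print(gguf_architecture(fh))
```

The name it prints can then be compared against the architectures that the installed vLLM version lists as supported for GGUF loading.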
reto-wyss@reddit
You need it to be a single file (you can simply concatenate the shards) and you need to pass `--quantization gguf`, if I recall correctly. It's "supported" in the sense that there's an option to try, but that's it.
You can try `--kv-cache-dtype bfloat16`; that will slow it down, but it may improve quality. I believe it will default to float8 kv-cache if the checkpoint is FP8.
superloser48@reddit (OP)
That wasn't the issue - GGUFs are single files. It said unsupported architecture.
I am using dtype float16 (as recommended by the vLLM logs).
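For reference, the flags mentioned in this thread can be put together into one command line. This is only a sketch: the helper name `build_vllm_gguf_command`, the file path, and the tokenizer repo are placeholders made up for illustration. The `--kv-cache-dtype bfloat16` value is taken from the reply above as-is and may not be accepted by every vLLM version. None of this fixes the unsupported architecture error by itself.

```python
import shlex


def build_vllm_gguf_command(gguf_path, tokenizer=None,
                            dtype="float16", kv_cache_dtype=None):
    """Build (but do not run) a `vllm serve` command for a GGUF file."""
    cmd = ["vllm", "serve", gguf_path,
           "--quantization", "gguf",   # flag suggested in the reply
           "--dtype", dtype]           # float16, as used by the OP
    if tokenizer is not None:
        # Point vLLM at the base model's tokenizer instead of the GGUF one.
        cmd += ["--tokenizer", tokenizer]
    if kv_cache_dtype is not None:
        # Value from the reply; check that your vLLM version accepts it.
        cmd += ["--kv-cache-dtype", kv_cache_dtype]
    return cmd


if __name__ == "__main__":
    # Placeholder path and tokenizer repo, not real locations.
    command = build_vllm_gguf_command(
        "/models/placeholder-q8.gguf",
        tokenizer="placeholder-org/placeholder-base-model",
        kv_cache_dtype="bfloat16",
    )
    print(shlex.join(command))
```

Building the argument list separately makes it easy to toggle one flag at a time and see which one changes the error.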