vLLM Just Merged TurboQuant Fix for Qwen 3.5+
Posted by havenoammo@reddit | LocalLLaMA | 27 comments
Previously it was throwing a 'Not Implemented' error due to Mamba layers. Going to test it now!
swfsql@reddit
Why do they call it Mamba? Aren't the Qwen linear layers Gated Delta Nets?
trusty20@reddit
Why does it feel like TQ discussions get a bizarre amount of accounts trying to convince people not to try it?
fragment_me@reddit
Am I crazy, or are there zero perplexity and KLD benchmarks for this? Shouldn't that be standard when testing it?
Alex_L1nk@reddit
I saw some tests in TheTom's repo and it's on par with Q4_0 while being slightly slower. I honestly don't understand why people are hyping this like it's a silver bullet.
CYTR_@reddit
Unless my brain is playing tricks on me, I seem to recall seeing a post here showing that perplexity/KLD were bad, just like regular quants. It might have depended on the implementation in the publication... But still. Why do I feel that TurboQuant is overhyped? Especially since with Qwen 3.5/3.6, it doesn't seem essential.
wektor420@reddit
Because it largely is; look into the academic issues brought up in reviews.
robertpro01@reddit
Someone mind explaining to this noob?
havenoammo@reddit (OP)
Basically TurboQuant is a compression algorithm for the KV cache, so you get more context for less VRAM. vLLM didn't support it for Qwen3.5 and 3.6 models before but this patch fixes that. Pretty neat if you're running these models locally!
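If you want to poke at it, here's a minimal sketch of how KV-cache quantization gets switched on through vLLM's Python API. `kv_cache_dtype="fp8"` is the long-standing option; whatever value the TurboQuant path uses is my guess, so check the merged PR for the actual name:

```python
from vllm import LLM, SamplingParams

# Quantizing the KV cache shrinks the per-token cache footprint, so the same
# VRAM budget holds a longer context. Model name is one linked later in this thread.
llm = LLM(
    model="Lorbus/Qwen3.6-27B-int4-AutoRound",
    kv_cache_dtype="fp8",       # existing KV-cache quant knob in vLLM;
                                # the TurboQuant-specific value may differ
    max_model_len=65536,        # illustrative long context
)

out = llm.generate(["Explain KV-cache quantization in one line."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```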
Antique_Dot_5513@reddit
Big difference with or without TurboQuant?
relmny@reddit
You missed the "although there might be some losses, since its 'lossless' claim still needs to be proven" part.
quickreactor@reddit
Thanks for the explanation!
McSendo@reddit
Which model did you use and what gpu?
havenoammo@reddit (OP)
1x5090 32GB:
https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-INT4
https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound
2x3090 2x24GB:
https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound
https://huggingface.co/TheHouseOfTheDude/Qwen3.6-27B-INT8 << better KLD but missing MTP layers
Also you can see KLD of different models to choose from here: https://www.reddit.com/r/LocalLLaMA/comments/1ssyukx/qwen3627b_klds_ints_and_nvfps/
I tested with the Lorbus one, will test the others.
McSendo@reddit
Ok, ampere is not working for me with the 27B FP8 model. I'll try the other ones you tested already.
havenoammo@reddit (OP)
Yep, the official FP8 model doesn't work with Ampere, so I was looking for a good INT8 model that is W8A16. AWQ and GPTQ INT4 models work though.
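For reference, a rough sketch of how I'd load an INT4 checkpoint on Ampere. The quantization method is normally auto-detected from the checkpoint config, so passing it explicitly is just belt-and-suspenders; treat this as an illustration, not the one true setup:

```python
from vllm import LLM

# AWQ / GPTQ INT4 checkpoints run on Ampere (SM 8.x); the official FP8 weights
# want newer kernels, hence the hunt for a good W8A16 INT8 build instead.
llm = LLM(
    model="cyankiwi/Qwen3.6-27B-AWQ-INT4",  # AWQ INT4 checkpoint linked above
    quantization="awq",                     # or "gptq", matching the checkpoint
    dtype="float16",                        # Ampere-friendly activation dtype
)
```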
McSendo@reddit
You mean with TurboQuant? I've been running FP8 using the Marlin kernel.
havenoammo@reddit (OP)
Ah, I couldn't run FP8 because I have a 5090 and a 3090, and mixed GPU setups cause issues with it. Tried different backends but not sure if I tried Marlin. Will try again. Also if you use Docker images, this might not be built yet. I installed directly from the git repo.
Apart_Boat9666@reddit
TurboQuant for Qwen 3.5 will work in vLLM; they fixed the issue.
Maleficent-Ad5999@reddit
Does this mean we can have higher contexts in 24gb vram?
see_spot_ruminate@reddit
Technically correct, the best kind of correct.
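Napkin math (the numbers here are illustrative, not the actual Qwen config): the cache size per token is fixed by the architecture, so halving the bytes per element roughly doubles the tokens that fit in whatever VRAM is left after the weights.

```python
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
layers, kv_heads, head_dim = 48, 8, 128   # made-up but plausible GQA shape
cache_budget_gib = 8                      # VRAM left over after the weights

def max_context(bytes_per_element: float) -> int:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return int(cache_budget_gib * 1024**3 // per_token)

print(max_context(2.0))   # fp16 cache
print(max_context(1.0))   # 8-bit cache -> ~2x the context
print(max_context(0.5))   # 4-bit cache -> ~4x the context
```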
MasterLJ@reddit
Thank you. Is this bound for nightlies? I did peek at the PR but didn't see the tag or the plan (I probably missed it). Thanks again.
havenoammo@reddit (OP)
If you're asking about Docker nightly builds, according to https://hub.docker.com/r/vllm/vllm-openai/tags the last build was 23 hours ago so in a few hours there should be a new build including this. 🤞 I cloned their repo and installed locally rather than using Docker.
retireb435@reddit
So the performance degradation is real, and the Google paper was wrong?
No_Conversation9561@reddit
LFG!!!
queerintech@reddit
Does it help Gemma 4 31B?
ortegaalfredo@reddit
Weird, because I tried TurboQuant with Qwen 3.6 27B in vLLM 0.20 a week ago and it worked. I saw the perplexity increase somewhere in the documentation and it's quite high, except for turboquant_k4v4, but then I don't know the difference between that and the old regular fp8 KV quantization.
onyxlabyrinth1979@reddit
Nice, that Not Implemented issue was a blocker. Curious how stable it is under load though. Fixing support is one thing, but long running inference tends to surface edge cases fast. Also wondering if quantization here impacts output consistency in subtle ways or if it is mostly negligible in practice.