vLLM Just Merged TurboQuant Fix for Qwen 3.5+

Posted by havenoammo@reddit | LocalLLaMA | View on Reddit | 27 comments

Previously it was throwing a 'Not Implemented' error due to Mamba layers. Going to test it now!

https://github.com/vllm-project/vllm/pull/39931

[-]

swfsql@reddit

Why do they call it Mamba? Aren't the Qwen linear layers Gated Delta Nets?

[-]

trusty20@reddit

Why does it feel like TQ discussions get a bizarre amount of accounts trying to convince people not to try it?

[-]

fragment_me@reddit

Am I crazy or are there 0 benchmarks against perplexity and KLD done? Should that not be standard when testing this?

[-]

Alex_L1nk@reddit

I saw some test in TheTom's repo and it's on par with Q4_0 while being slightly slower. I honestly don't understand why people hyping this like it was a silver bullet.

[-]

Unless my brain is playing tricks on me, I seem to recall seeing a post here showing that perplexity/KLD were bad, just like regular quants. It might have been dependent of the implementation in the publication... But still. Why do I feel that TurboQuant is overhyped? Especially since with Qwen 3.5/3.6, it doesn't seem essential.

[-]

wektor420@reddit

Because it largly is, and look into academic issues brought up in reviews

[-]

robertpro01@reddit

Someone mind explaining to this noob?

[-]

havenoammo@reddit (OP)

Basically TurboQuant is a compression algorithm for the KV cache, so you get more context for less VRAM. vLLM didn't support it for Qwen3.5 and 3.6 models before but this patch fixes that. Pretty neat if you're running these models locally!

[-]

Antique_Dot_5513@reddit

Grosse différence avec ou sans turboquant ?

[-]

relmny@reddit

you missed the "although there might be some losses, as the "lossless" claim of it, still needs to be proved"

[-]

quickreactor@reddit

Thanks for the explanation!

[-]

McSendo@reddit

Which model did you use and what gpu?

[-]

havenoammo@reddit (OP)

1x5090 32GB:
https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-INT4
https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound

2x3090 2x24GB:
https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound
https://huggingface.co/TheHouseOfTheDude/Qwen3.6-27B-INT8 << better KLD but missing MTP layers

Also you can see KLD of different models to choose from here: https://www.reddit.com/r/LocalLLaMA/comments/1ssyukx/qwen3627b_klds_ints_and_nvfps/

I tested with Lorbus one, will test with others.

[-]

McSendo@reddit

Ok, ampere is not working for me with the 27B FP8 model. I'll try the other ones you tested already.

[-]

havenoammo@reddit (OP)

Yep, the official FP8 model doesn't work with Ampere, so I was looking for a good INT8 model that is W8A16. AWQ and GPTQ INT4 models work though.

[-]

McSendo@reddit

You mean with turboquant? I've been runnin FP8 using the marlin kernel.

[-]

havenoammo@reddit (OP)

Ah, I couldn't run FP8 because I have a 5090 and a 3090, and mixed GPU setups cause issues with it. Tried different backends but not sure if I tried Marlin. Will try again. Also if you use Docker images, this might not be built yet. I installed directly from the git repo.

[-]

Apart_Boat9666@reddit

Turboquant for qwen 3.5 will work in vllm, they fixed the issue

[-]

Maleficent-Ad5999@reddit

Does this mean we can have higher contexts in 24gb vram?

[-]

see_spot_ruminate@reddit

Technically correct, the best kind of correct.

[-]

MasterLJ@reddit

Thank you. Is this bound for nightlies? I did peak at the PR I didn't see the tag or the plan (I probably missed it). Thank you again.

[-]

havenoammo@reddit (OP)

If you're asking about Docker nightly builds, according to https://hub.docker.com/r/vllm/vllm-openai/tags the last build was 23 hours ago so in a few hours there should be a new build including this. 🤞 I cloned their repo and installed locally rather than using Docker.

[-]

retireb435@reddit

So the performance degrade is real, the Google paper was wrong?

[-]

No_Conversation9561@reddit

LFG!!!

[-]

queerintech@reddit

Does it he'll gemma 4 31b?

[-]

ortegaalfredo@reddit

Weird because I tried turboquant with qwen 3.6 27B in vllm 0.20 a week ago and it worked. I saw somewhere in the documentation the perplexity increase and its quite high except for turboquant_k4v4 but then I don't know the difference between it and the old regular fp8 kv quantization.

[-]

onyxlabyrinth1979@reddit

Nice, that Not Implemented issue was a blocker. Curious how stable it is under load though. Fixing support is one thing, but long running inference tends to surface edge cases fast. Also wondering if quantization here impacts output consistency in subtle ways or if it is mostly negligible in practice.