A First Comprehensive Study of TurboQuant: Accuracy and Performance
Posted by MajorZesty@reddit | LocalLLaMA | 53 comments
TL;DR from the article:
- FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios (a minimal example follows this list).
- TurboQuant k8v4 does not provide any significant advantage over FP8: it provides only modest extra KV-cache savings (2.4x vs 2x), which are not worth the consistent negative impact on throughput and latency metrics.
- TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
- TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.
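For reference, the FP8 default the article recommends is a one-flag change. A minimal sketch using vLLM's Python API (the model name is just a placeholder; kv_cache_dtype="fp8" mirrors the --kv-cache-dtype fp8 CLI flag):

```python
# Minimal sketch: enabling FP8 KV cache in vLLM's Python API.
# The model name below is only a placeholder example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",  # quantize KV cache to FP8 for ~2x capacity vs BF16
)
params = SamplingParams(max_tokens=64)
out = llm.generate(["What does FP8 KV-cache quantization trade off?"], params)
print(out[0].outputs[0].text)
```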
llama-impersonator@reddit
even the fp8 numbers are obviously worse. i will keep the kvcache unquantized.
logic_prevails@reddit
Not all of us have 48GB VRAM
seamonn@reddit
48GB is a state of mind. Even a 128GB SSD can be 48GB VRAM if you are patient enough.
fredandlunchbox@reddit
It's pretty tough to run an agentic coding harness at 1 t/s.
Look_0ver_There@reddit
Better pray that your SSD has a chonky ram cache, especially in this market
m31317015@reddit
Every tool call takes at least 5 min to generate
markole@reddit
LLM: Let me do a quick ls to confirm... You: Nooooooo....
Clear-Ad-9312@reddit
The DS V4 architecture is fascinating. 9GB for 1M context is crazy to me. I wish the industry would follow DS more closely because they seem to actually care about efficiency
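For intuition on why 9GB at 1M tokens is impressive: per-token KV size is basically layers x KV heads x head dim x bytes per element, twice over (K and V). A back-of-envelope sketch, where every architecture number is a made-up illustration, not DS V4's real config:

```python
# Back-of-envelope KV-cache sizing. All architecture numbers below are
# illustrative assumptions, not any real model's config.
def kv_cache_gb(layers, kv_heads, head_dim, bytes_per_elem, tokens):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * tokens / 1e9

# A typical GQA model: 48 layers, 8 KV heads, head_dim 128, BF16 (2 bytes)
print(kv_cache_gb(48, 8, 128, 2, 1_000_000))  # ~196 GB at 1M tokens

# A compressed latent-style cache stores one small vector per token per
# layer instead. Hypothetical: 96 values/token/layer at 1 byte, 48 layers:
print(96 * 48 * 1_000_000 / 1e9)  # ~4.6 GB at 1M tokens
```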
McSendo@reddit
It's actually quite close for the 27B
ParaboloidalCrest@reddit
Right answer. Quantizing model weights is already a huge compromise; it shouldn't be compounded by quantizing the KV cache too.
Anbeeld@reddit
But you can run a higher model quant if you quantize the KV cache, and/or raise the context to a usable level. BF16 is cool, but what's the point if you can't do any real tasks with it?
ParaboloidalCrest@reddit
There's always a decent enough smaller model.
Anbeeld@reddit
Name one for agentic coding that can actually help.
ParaboloidalCrest@reddit
Name your constraints. If it's one 8GB GPU then you're out of luck.
Anbeeld@reddit
RTX 3090
ParaboloidalCrest@reddit
How much context can you fit with Qwen 27b @ q4kxl?
Anbeeld@reddit
Around 60k of BF16 context with Q5_K_S in an ideal case, noticeably less if other apps are open, and barely any once you add speculative decoding.
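If anyone wants to sanity-check numbers like that, the budget math is simple enough. A sketch with purely illustrative figures (not measured for any specific model):

```python
# Rough context-budget arithmetic for a 24GB card.
# Every figure here is an illustrative assumption, not a measured value.
VRAM_GB     = 24.0
WEIGHTS_GB  = 18.0   # e.g. a ~27B model at a Q5-ish quant, rough guess
OVERHEAD_GB = 1.5    # runtime buffers, desktop compositor, etc.

# BF16 KV bytes/token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
kv_bytes_per_token = 2 * 46 * 8 * 128 * 2

free_bytes = (VRAM_GB - WEIGHTS_GB - OVERHEAD_GB) * 1e9
print(f"~{free_bytes / kv_bytes_per_token / 1000:.0f}k tokens of BF16 KV")
# -> ~24k with these assumptions; an FP8 cache would roughly double it,
#    and a 4-bit cache would roughly quadruple it
```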
seamonn@reddit
So that guy with the dog picture who hates TurboQuant was right.
Velocita84@reddit
seamonn@reddit
Got the dog picture?
dinerburgeryum@reddit
I know exactly the poster you're talking about and they're always right.
a_beautiful_rhind@reddit
It lost to FP8 cache? Holy shit, that's bad.
saqneo@reddit
From what I'm seeing in the article it trades blows or wins vs fp8 on accuracy and memory usage, while obviously being slower. Don't know why the article author took such a negative tone, definitely seems to have its place.
a_beautiful_rhind@reddit
FP8 is a horrible cache quant method so that's not really a flex. In llama.cpp it didn't show massive gains either.
Theoretically it should have been an improvement all around.
saqneo@reddit
Not saying TQ is amazing or anything, I just meant that comments from the summary or article like this don't track for me:
I don't think the data they share supports that fp8 > tq8v4 outright. Simplifying, but it's basically 20% more context for a 20% performance hit in the metrics they show, so how can they say "fp8 remains the best" and that k8v4 is "not worth the consistent negative impact"? Weird framing. Happy to see a lot of commenters here have still drawn the conclusion that TQ is "sometimes useful".
a_beautiful_rhind@reddit
VLLM won't give rotated intX caches either, last I checked. The whole project is sorta for serving at scale on datacenter GPUs.
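For anyone wondering what "rotated" caches means here: the generic trick (QuaRot/QuIP-style; not a claim about TurboQuant's exact algorithm) is to apply a random orthogonal rotation before quantizing, so outlier energy gets spread across dimensions and the low-bit quantizer needs a much smaller scale. A toy sketch:

```python
# Toy rotate-then-quantize demo (generic QuaRot-style idea; not a claim
# about TurboQuant's actual algorithm).
import numpy as np

rng = np.random.default_rng(0)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

def int4_roundtrip(x):
    scale = np.abs(x).max() / 7.0  # symmetric int4: levels -7..7
    return np.round(x / scale).clip(-7, 7) * scale

k = rng.standard_normal(d)
k[3] = 25.0  # inject an outlier, as attention keys often have

err_plain   = np.linalg.norm(k - int4_roundtrip(k))
err_rotated = np.linalg.norm(k - Q.T @ int4_roundtrip(Q @ k))
print(err_plain, err_rotated)  # rotated error is typically far lower
```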
BobbyL2k@reddit
Am I missing something? Didn’t the TQ paper say that their approach is lossless for key quantization? Why is everyone running TQ on values?
Constandinoskalifo@reddit
+1
ziphnor@reddit
I was wondering exactly the same, and surprised this is not the main topic in comments.
saqneo@reddit
I have to be reading this wrong, but I'm not sure I agree with their conclusions based on the graphs they share: TQ beats bf16+8 and even bf16 on some of the quality benchmarks? What am I missing?
techlatest_net@reddit
Solid breakdown. FP8 staying the default makes sense; good to have real-world numbers backing that up. The 4bit-nc option might still be handy for edge cases where memory is tight, but yeah, the accuracy/latency tradeoffs on the 3-bit variants seem too steep for most setups.
Anbeeld@reddit
I'm sorry, but without a comparison against Q4 the study is useless. The audience for TurboQuant is VRAM-constrained folks who can't run BF16 in any case.
Apprehensive_Win662@reddit
It's still super interesting for businesses with 144GB to 500GB of VRAM running multi-tenancy workloads, because I would never use Q4 quants for KV, but until now I didn't know whether TQs are worth it.
vLLM didn't write the blog for the GPU-poor like us private users; we are not vLLM's main audience.
Badger-Purple@reddit
Wow, snob. I have 800GB of VRAM power and I would not say something like this. I am still running that Skinny Qwen on a 24GB GPU standalone and it's grrrreat
Anbeeld@reddit
Yeah that checks out. Having 800 GB of VRAM power makes you not care about KV quants much.
Badger-Purple@reddit
Correction, I do care. I use a Q4 model, TQ4 cache, and context up to, I believe, 250k; it all fits in one 24GB GPU (using 23.7GB).
Anbeeld@reddit
Lol nah, I'm a 3090 pleb myself and I use TurboQuant.
cheabred@reddit
Now if I can just learn enough to figure out the best way to get higher context with dual 24GB 3090s... lol
Middle_Bullfrog_6173@reddit
It's great to have more numbers.
However, the slightly unrealistic part of this for most local users is that the model weights seem to have been left unquantized. I doubt many of us run a quantized KV cache with BF16 models; that only makes sense when many concurrent users are the limiting factor.
If you are using a q8 or smaller model, the situation might be different: either because errors compound into an even bigger accuracy drop, or because the memory you save on the KV cache can go toward larger model weights.
Etroarl55@reddit
Damn, the comments are pretty negative. I've been using fp8 on my system and it's been fine enough for me.
It's free 2x context that didn't exist a few months ago.
bopbop9876@reddit
I think the thing people are latching on to is that there was an impression that TurboQuant was "lossless". It is clearly not, under the conditions tested in the article. That said, it very well might be a fantastic trade-off for many people. But if you thought you were getting something for free, and it turns out there's actually a noticeable cost, well, there you go.
bopbop9876@reddit
Awesome article and awesome post. Thank you!
TheRealMasonMac@reddit
This paper is also worth reading: https://arxiv.org/abs/2604.19528
dinerburgeryum@reddit
Good on 'em for really putting it through the wringer. I had been skeptical, but yeah, 4bit-nc seems pretty all right if you're really memory strapped.
FatheredPuma81@reddit
This is definitely a good thing, but hopefully users won't take it as meaning the TurboQuant forks for llama.cpp have the same implementation and quality without someone checking/verifying first.
dinerburgeryum@reddit
Oh, absolutely, I was strictly referring to the numbers VLLM is putting up, which seem pretty good. Honestly, not sure why anyone would trust a drive-by llama.cpp fork to produce strictly "correct" results. I mean, if it's working for you, cool, but yeah, these are the numbers I'm here for.
FatheredPuma81@reddit
I'm curious how FP8 compares to Q8_0 on llama.cpp.
simotune@reddit
Good sign when quantization work measures throughput and accuracy together. Local inference needs more evals like this, not just one-number wins.
Different-Rush-2358@reddit
I've been using The Thom's fork with the experimental TurboQuant branch for quite some time now. I've been using TurboQuant 2-3 and the savings are considerable. I set up Gemma 4 with a 128k context cache, loaded a huge PDF that almost filled the window, asked it questions about the beginning, middle, and end of the conversation, and it answered them all correctly. In my particular case, TurboQuant gives me outstanding results with absurdly low VRAM consumption compared to the usual KV-cache formats. Furthermore, response times have improved about 2x compared to the standard formats.
EbbNorth7735@reddit
Under memory constraints, yep, makes sense; and if you don't have memory constraints, it shows you shouldn't use it. Gemma 4 is a lot more memory intensive than Qwen3.5 or 3.6, so it may not be needed there.
suprjami@reddit
Llama 3.3 and Qwen 3 seem pretty old and irrelevant.
Qwen 3.6 and Gemma 4 tests tell a different story: https://localbench.substack.com/p/kv-cache-quantization-benchmark
LetsGoBrandon4256@reddit
It's not the same thing as vLLM's fp8 KV cache quant.
Toooooool@reddit
3bit-nc was practically lobotomized when I tried it with qwen3.6-27b, but k8v4 works really well.