A First Comprehensive Study of TurboQuant: Accuracy and Performance

Posted by MajorZesty@reddit | LocalLLaMA | View on Reddit | 53 comments

TL;DR from the article:

FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios.
TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.
TurboQuant 4bit-nc is likely the most practical TurboQuant variant: it helps under KV-cache memory pressure, but trades the extra capacity for moderate accuracy, latency, and throughput costs. It may still be viable for edge deployments where memory is the dominant constraint.
TurboQuant k3v4-nc and 3bit-nc show meaningful accuracy drops, especially on reasoning and very long-context tasks, while also substantially degrading latency and throughput. This makes them poor candidates for production deployments.

* FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization: it provides 2x KV-cache capacity with negligible accuracy loss, while matching BF16 on most performance metrics and substantially improving them in memory-constrained serving scenarios. * TurboQuant k8v4 does not provide any significant advantage over FP8: it only provides modest KV-cache savings (2.4x vs 2x) which are not worth the consistent negative impact on throughput and latency metrics.

[-]

llama-impersonator@reddit

even the fp8 numbers are obviously worse. i will keep the kvcache unquantized.

logic_prevails@reddit

Not all of us have 48 gb vram 😭

seamonn@reddit

48GB is a state of mind. Even a 128GB SSD can be 48GB VRAM if you are patient enough.

fredandlunchbox@reddit

Its pretty tough to run an agentic coding harness at 1t/s.

Look_0ver_There@reddit

Better pray that your SSD has a chonky ram cache, especially in this market

m31317015@reddit

Every tool call takes least 5 min to gen😭

markole@reddit

LLM: Let me do a quick ls to confirm... You: Nooooooo....

Clear-Ad-9312@reddit

The DS V4 architecture is fascinating. 9GB for 1M context is crazy to me. I wish the industry would follow DS more closely because they seem to actually care about efficiency

McSendo@reddit

It's actually quite close for the 27B

ParaboloidalCrest@reddit

Right answer. Quantizing model weights is already a huge compromise and it cannot be compounded by quantizing kvcache too.

Anbeeld@reddit

But you can run higher model quant if you are quantizing KV cache and/or raise context to usable level. BF16 is cool but what's the point if you can't do any real tasks with it?

There's always a decent enough smaller model.

Name one for agentic coding that can actually help.

Name your constraints. If it's one 8GB GPU then you're out of luck.

RTX 3090

How much context can you fit with Qwen 27b @ q4kxl?

Some 60k of BF16 with Q5_K_S in an ideal case, so noticeably less if other apps are open, and barely any when adding speculative decoding.

So that guy with the dog picture that hates Turboquant was right.

Velocita84@reddit

Got the dog picture?

dinerburgeryum@reddit

I know exactly the poster you're talking about and they're always right.

a_beautiful_rhind@reddit

It lost to FP8 cache? Holy shit, that's bad.

saqneo@reddit

From what I'm seeing in the article it trades blows or wins vs fp8 on accuracy and memory usage, while obviously being slower. Don't know why the article author took such a negative tone, definitely seems to have its place.

FP8 is a horrible cache quant method so that's not really a flex. In llama.cpp it didn't show massive gains either.

Theoretically it should have been an improvement all around.

Not saying TQ is amazing or anything, I just meant that comments from the summary or article like this don't track for me:

I don't think the data they share supports that fp8 > tq8v4 outright. Simplifying, but it's basically 20% more context for 20% performance hit in the metrics they show, so how can they say fp8 remains the best and k8v4 is not worth the consistent negative impact. Weird framing.

Happy to see a lot of commenters here have still drawn the conclusion that TQ is "sometimes useful".

VLLM won't give rotated intX caches either, last I checked. The whole project is sorta for serving at scale on datacenter GPUs.

BobbyL2k@reddit

Am I missing something? Didn’t the TQ paper say that their approach is lossless for key quantization? Why is everyone running TQ on values?

Constandinoskalifo@reddit

ziphnor@reddit

I was wondering exactly the same, and surprised this is not the main topic in comments.

I have to be reading this wrong but I'm not sure if I agree with their conclusions based on the graphs they share - TQ beats bf16+8 and even bf16 in some of the quality benchmarks? What am I missing?

techlatest_net@reddit

Solid breakdown. FP8 staying the default makes sense good to have real-world numbers backing that up. The 4bit nc option might still be handy for edge cases where memory is tight but yeah the accuracy/latency tradeoffs on the 3-bit variants seem too steep for most setups.

I'm sorry but without comparison with Q4 the study is useless. The audience for TurboQuant are VRAM constrained folks who can't run BF16 in any case.

Apprehensive_Win662@reddit

It is still super interesting for businesses having 144GB to 500 GB VRAM in Multi-Tenancy workloads, because I would never use Q4 Quants for KV, but to this day I did not know, if TQs are worth it.

Vllm did not write the blog for the GPU poors like we are in private. We are not vllm's main audience.

Badger-Purple@reddit

Wow, snob. I have 800gb of VRAM power and I would not say something like this. I am still running that Skinny Qwen on a 24GB GPU standalone and its grrrreat

Yeah that checks out. Having 800 GB of VRAM power makes you not care about KV quants much.

correction, I care. I use a q4, tq4 and context up to I believe 250k all fits in 1 24gb gpu (using 23.7gb)

Lol nah I am the 3090 plebs myself and I use TurboQuant.

cheabred@reddit

now if i can just learn enough to figureout the best way to get higher context with dual 3090 24g.... lol

Middle_Bullfrog_6173@reddit

It's great to have more numbers.

However, the slightly unrealistic part of this for most local users was that model weights seem to have been unquantized. I doubt many of us run KV cache quantized with bf16 models. That only makes sense if having many concurrent users is the limiting factor.

If you are using a q8 or smaller model the situation might be different. Either because errors compound to even more of an accuracy drop or because the memory you save in KV cache can be used to run larger model weights.

Etroarl55@reddit

Danm the comments are pretty negative, I’ve been using fp8 on my system and it’s been fine enough for me a little bit.

It’s free 2x context that didn’t exist a few months ago.

bopbop9876@reddit

I think the things people are launching on to is that there was an impression that TurboQuant was "lossless". It is clearly not under the conditions tested in the article. That said, it very well might be a fantastic trade-off for many people. But if you thought you were getting something for free, and it turns out there's actually a noticeable cost, well there you go.

Awesome article and awesome post. Thank you!

TheRealMasonMac@reddit

This paper is also worth reading: https://arxiv.org/abs/2604.19528

This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empirical performance, using a reproducible, transparent, and symmetric setup. Our results show that, despite the claimed advantage of TurboQuant, TurboQuant performs worse than RaBitQ in most tested settings of inner-product estimation, nearest-neighbor search and KV cache quantization. We further find that several reported runtime and recall results in the TurboQuant paper could not be reproduced from the released implementation under the stated configuration. Overall, this note clarifies the shared structure and genuine differences between the two lines of work, while documenting reproducibility issues in the experimental results reported by the TurboQuant paper.

Good on 'em for really putting it through the wringer. I had been skeptical, but yeah, 4bit-nc seems pretty all right if you're really memory strapped.

FatheredPuma81@reddit

This is definitively a good thing but hopefully users won't try to take that as meaning the turboquant forks for llama.cpp have the same implementation and quality without someone checking/verifying first.

Oh, absolutely, I was strictly referring to the numbers VLLM is putting up, which seem pretty good. Honestly, not sure why anyone would trust a drive-by llama.cpp fork to produce strictly "correct" results. I mean, if it's working for you, cool, but yeah, these are the numbers I'm here for.

I'm curious how FP8 compares to Q8_0 on llama.cpp.

simotune@reddit

Good sign when quantization work measures throughput and accuracy together. Local inference needs more evals like this, not just one-number wins.

Different-Rush-2358@reddit

I've been using the fork of The Thom with the experimental branch of TurboQuant for quite some time now. I've been using TurboQuant 2-3 and the savings are considerable. I've installed Gemma 4 with a 128k CTX cache, loaded a huge PDF that almost filled the window, asked it questions about the beginning, middle, and end of a conversation, and it's answered them all correctly. In my particular case, TurboQuant gives me outstanding results with absurdly low VRAM consumption compared to the usual kv cache formats. Furthermore, the response time has doubled compared to standard formats.

Q8

It's not the same thing as vLLM's fp8 KV cache quant.

Toooooool@reddit

3bit-nc was practically lobotomized when i tried it with qwen3.6-27b, but k8v4 works really good.