What is the current status with Turbo Quant?
Posted by kickerua@reddit | LocalLLaMA | View on Reddit | 58 comments
It was hyped about 2 weeks ago and I remember seeing some pull requests into llama.cpp, but what is the current status now that the hype has faded?
Astrale321@reddit
The Google paper is a year old; it has probably been in use since around then.
Pleasant-Shallot-707@reddit
Right…that’s why llama.cpp and others have been racing to integrate it
Astrale321@reddit
https://arxiv.org/abs/2504.19874 Read it yourself
Pleasant-Shallot-707@reddit
I’m fucking aware of the paper, fool. They would have integrated it already and not made a big deal out of the implementation and HF wouldn’t be experiencing a flurry of TQ variants if it had been available for people to use already.
Independent-Let-8988@reddit
You clearly weren't; the technology has been out since the paper launched, just without attention
cnmoro@reddit
TheTom's repo works very well for me. Using q8 for k and turbo4 for v. It's blazing fast and uses small amounts of VRAM. I'm running qwen35ba3b with 128k context on a 5060ti 16gb very well.
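A back-of-the-envelope check of why this setup fits: KV cache size scales linearly with context length and with the per-element byte widths chosen for K and V. The layer/head dimensions below are illustrative assumptions for a 30B-A3B-class GQA model, not taken from any model card, and `turbo4` is assumed to cost roughly 4 bits per element (block scales add a little on top in practice):

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, k_bytes, v_bytes):
    """Total KV cache in GiB: one K vector and one V vector per layer per token."""
    per_token_bytes = n_layers * n_kv_heads * head_dim * (k_bytes + v_bytes)
    return ctx * per_token_bytes / 2**30

# Illustrative dimensions (assumed): 48 layers, 4 GQA KV heads, head_dim 128
L, H, D, CTX = 48, 4, 128, 128_000

full  = kv_cache_gib(CTX, L, H, D, 2.0, 2.0)   # f16 K and f16 V
mixed = kv_cache_gib(CTX, L, H, D, 1.0, 0.5)   # q8 K + 4-bit V (turbo4 assumed ~4 bpw)
print(f"f16 K/V: {full:.1f} GiB, q8 K + 4-bit V: {mixed:.1f} GiB")  # ~11.7 vs ~4.4 GiB
```

Under these assumed dimensions the mixed cache saves roughly 7 GiB at 128k context, which is what leaves room for the model weights on a 16gb card.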
dbzunicorn@reddit
3B active should already run fast on a 5060 without TurboQuant. Can we test some larger models?
hwpoison@reddit
Also works for me with a CPU-only configuration.
Wisam_Abbadi@reddit
How does it perform? And what specs are you running on?
apollo_mg@reddit
+1 for TheTom's repo, using RDNA4.
Overall-Somewhere760@reddit
Why q8 for k? Better speed or better PPL?
Dr4x_@reddit
Which quant of the model do you use for it to fit in 16gb?
cnmoro@reddit
Q3ks from byteshape
enrique-byteshape@reddit
<3
EffectiveCeilingFan@reddit
Very limited. The majority of pull requests and implementations for TurboQuant in llama.cpp are entirely vibe-coded and absolute dogshit. It’s all just hype, anyway. Google massively promoted an incremental improvement that draws almost entirely from existing techniques.
Pleasant-Shallot-707@reddit
You’re misunderstanding based on the video that that guy released.
They compared compression rates to fp32, yes. If you think the awesome thing was compression and are saying it’s hype because they should have compared to bf16, then you missed the point. It’s the fact that they had zero quality loss.
So, yeah, you’re getting 4x compression rather than 8x…you’re still not losing quality.
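The multiplier argument here is just bit-width arithmetic: a ~4-bit cache is 8x smaller than an fp32 baseline but only 4x smaller than bf16, and the claim being defended is quality at that width, not the headline ratio. A trivial sanity check (bit widths only; real formats carry a small per-block scale overhead):

```python
# Bits per stored element for each format (TQ4 assumed ~4 bpw, ignoring block scales)
FP32_BITS, BF16_BITS, TQ4_BITS = 32, 16, 4

print(FP32_BITS / TQ4_BITS)  # 8.0  -> the compression quoted against fp32
print(BF16_BITS / TQ4_BITS)  # 4.0  -> the compression against a bf16 baseline
```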
EffectiveCeilingFan@reddit
What video? I’m basing this off my own analysis of the paper and existing implementations, as well as experiments on those implementations I’ve conducted.
Several vibe-slop implementations that I’ve seen linked here aren’t even TQ, literally just a Hadamard on a Q4 KV. Results with an implementation that I do believe to be correct were unimpressive. TQ4 was in the ballpark of Q8_0 in terms of real-world performance, though results varied wildly. Hybrid models performed the worst, with very minor improvement over baseline Q4_0. My assumption is that the sparsity of the attention means that KV values encode information with a higher intrinsic dimensionality than in a traditional full-attention transformer. I lack the expertise to perform a more rigorous analysis of the exact failure mode. It could be that quantized KV interacts weirdly with SSM layers. Either way, my conclusion is that TQ is an improvement, but a very mediocre one.
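For reference, the "just a Hadamard on a Q4 KV" recipe mentioned above amounts to: rotate each K/V vector with a randomized Hadamard transform so outlier channels get smeared across all dimensions, quantize in the rotated basis, and rotate back when the values are used. This is a minimal pure-Python sketch of that idea, not code from any of the repos discussed; the dimension and the single synthetic outlier channel are invented for illustration:

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in x]

def q4_roundtrip(x):
    """Symmetric 4-bit quantize + dequantize of one vector (one scale per vector)."""
    scale = max(abs(v) for v in x) / 7.0 or 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]

random.seed(0)
DIM = 64  # assumed head_dim; must be a power of two for the plain Hadamard
signs = [random.choice((-1.0, 1.0)) for _ in range(DIM)]  # random +/-1 diagonal D

def hadamard_q4(x):
    """Rotate with H*D, quantize, rotate back (both H and D are self-inverse)."""
    rotated = fwht([v * s for v, s in zip(x, signs)])
    return [v * s for v, s in zip(fwht(q4_roundtrip(rotated)), signs)]

# One synthetic K/V vector with a single large outlier channel,
# the case where naive per-vector Q4 loses the most precision.
vec = [random.gauss(0.0, 1.0) for _ in range(DIM - 1)] + [20.0]

def mae(a, b):
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)

print("plain Q4 error:  ", mae(q4_roundtrip(vec), vec))
print("rotated Q4 error:", mae(hadamard_q4(vec), vec))  # noticeably lower
```

The rotation helps because the outlier no longer dominates the quantization scale, but it is still plain 4-bit rounding underneath, which is why an implementation that does only this is not TQ.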
Pleasant-Shallot-707@reddit
Weird how you repeat, word for word, a YT video that just came out
EffectiveCeilingFan@reddit
Why are you being so vague with your accusation? I don’t even know what video you’re talking about dude.
rmhubbert@reddit
There's a turboquant implementation on vLLM nightly now. Added today. Haven't had a chance to try it yet, though - https://github.com/vllm-project/vllm/pull/38479
the__storm@reddit
My god the discussion on the PR is hard to read. It's like 40% humans talking to each other interspersed with 60% LLM text walls lol
rmhubbert@reddit
Ha! Aye, I got lost pretty much instantly.
Conscious_Nobody9571@reddit
It was hype and the sheeple fell for it
fuckingredditman@reddit
nope it's not, i can run gemma4 31b Q4_K_M with 92k context using it (TheTom llama.cpp fork) instead of only 4k without it.
EffectiveCeilingFan@reddit
What kinda hogwash is this? If you could only fit 4k before, you’d be able to fit like 16k now with TurboQuant. These numbers are just wrong. Also, you could do that before anyway with q4 KV cache quantization. The TQ matching F16 performance is largely unproven, in particular for hybrid architectures.
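The "like 16k" figure above follows directly from bit widths: at a fixed memory budget, context scales inversely with the bits spent per KV element. A quick check, treating the quantized formats as their nominal bit widths (block scales add a small overhead on top):

```python
def ctx_at_budget(ctx_f16, k_bits, v_bits):
    """Context fitting in the same memory as ctx_f16 tokens of f16 K/V (16+16 bits)."""
    return ctx_f16 * (16 + 16) / (k_bits + v_bits)

print(ctx_at_budget(4_000, 4, 4))  # 16000.0 -> the 4x "like 16k" figure above
print(ctx_at_budget(4_000, 8, 4))  # ~10667  -> q8 K + 4-bit V for comparison
```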
fuckingredditman@reddit
tbh the original number might not be 100% accurate, because I never tuned it properly on the upstream version (it clearly wasn't usable for coding), but with 3-bit TQ I get 92k. That's really the main point I'm trying to make.
EffectiveCeilingFan@reddit
Has it outperformed traditional Q4 KV quantization? I’m surprised that fitting Gemma 4 context is so tight. Gemma 4 uses half the KV that other leading models do because the K and V are unified. I don’t know if it’s implemented in llama.cpp yet tho.
fuckingredditman@reddit
i never tried it with regular 4 bit KV quant because in my experience it performs quite poorly on any model i've tried it on so far in coding use cases
Neat_Zucchini_9938@reddit
What is your setup? I have 24gb VRAM.
fuckingredditman@reddit
i run this fork https://github.com/TheTom/llama-cpp-turboquant on that branch and occasionally rebase it on upstream if there are any interesting fixes
then just
Cferra@reddit
Not really, considering it’s working really well for me
-_Apollo-_@reddit
lol we still don’t have mtp for qwen3.5 in llama.cpp. Some things move slow.
StrikeOner@reddit
https://github.com/ggml-org/llama.cpp/discussions/20969
thank me later.
kickerua@reddit (OP)
Can I thank you now?
StrikeOner@reddit
thank-yous only get accepted after you read through each and every one of the 500+ messages there and can answer any Q&A about that thread without looking it up!
Twirrim@reddit
This is localllama, aren't we supposed to just throw the text into our locally hosted LLM to summarise?
Dany0@reddit
This is localllama, only posers throw all the text into an LLM. Real Gs know lingebra
IllegibleCheeto@reddit
My random Hadamard matrices are a little rusty. I took a numerical linear algebra class a long time ago. In addition to the regular linear algebra.
Can I still be a Real G?
maglat@reddit
You are so right. I so often forget to use my actual AI power for that kind of task ^^
BriguePalhaco@reddit
This is localllama, we don't have enough context window. :-/
Twirrim@reddit
I can only run small models, so I got mine to quickly vibe-code me a script to download the comments content, and then used aistudio / Gemini 3 to summarise.
The following is a summary of the key technical findings, implementation milestones, and performance results discussed:
### 1. The Fork Ecosystem
Multiple independent implementations emerged simultaneously, with frequent cross-collaboration:
* **TheTom (Metal/General):** Developed the primary `turboquant_plus` fork, focusing on Apple Silicon and general logic.
* **AmesianX (Independent CUDA):** Focused on high-end NVIDIA hardware (Blackwell/DGX), implementing fused kernels and robust `head_dim` detection.
* **spiritbuun & Madreag (Optimized CUDA):** Developed highly optimized CUDA kernels for Ampere and Ada architectures.
AppealSame4367@reddit
Has anyone seen a dflash pr?
MachineZer0@reddit
Saves me about 7gb on Minimax M2.7. Was able to move up from Q3 to Q4 on 128gb VRAM
https://github.com/richginsberg/llama-cpp-turboquant/tree/feature/turboquant-kv-cache
I took https://github.com/TheTom/llama-cpp-turboquant branch this weekend and merged master from https://github.com/ggml-org/llama.cpp into it.
maglat@reddit
What's your GPU setup for that?
jdiegmueller@reddit
I've been running TheTom's fork -- specifically the feature/turboquant-kv-cache branch -- for a week or so now on CUDA hardware. I've landed on using q8 for K and turbo2 for V with both qwen3.5-27B:Q6 and qwen3.5-9B:Q6.
Overall-Somewhere760@reddit
Why q8 on k ? Better speed or better PPL?
jdiegmueller@reddit
Check out TheTom's TurboQuant repo (not the llama.cpp fork; the TurboQuant one). I'm on mobile or I'd direct-link it for you.
A lot of research notes, test results, etc. Truthfully a lot of it is over my head, but all the data is there.
Ultimately my takeaway was that you can be as aggressive as you want with V without any real world penalty. On Q8 and Q6 quants I got the impression using turbo4 or even turbo3 would probably be just fine for K on qwen3.5, but turbo2 V gave me enough headroom to crank my context + parallel sessions to where I wanted so I decided to take the win for now while this stuff gets sorted out.
I do also plan to try out the vLLM PR tonight.
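The headroom tradeoff described above is easy to quantify: whatever VRAM is left after the weights, the per-token KV cost sets how much context each parallel slot gets. A sketch with invented numbers (the layer/head dimensions are assumptions, not the real qwen3.5-27B config, and turbo2 is assumed to cost roughly 2 bits per element):

```python
def ctx_per_slot(free_gib, n_slots, n_layers, n_kv_heads, head_dim, k_bytes, v_bytes):
    """KV-cache tokens available to each of n_slots parallel sessions."""
    per_token = n_layers * n_kv_heads * head_dim * (k_bytes + v_bytes)
    return int(free_gib * 2**30 / n_slots / per_token)

# Assumed dims for a 27B-dense-style model: 46 layers, 8 KV heads, head_dim 128,
# with 8 GiB free after weights, split across 4 parallel slots.
f16   = ctx_per_slot(8.0, 4, 46, 8, 128, 2.0, 2.0)   # f16 K and V
mixed = ctx_per_slot(8.0, 4, 46, 8, 128, 1.0, 0.25)  # q8 K + ~2-bit V
print(f16, mixed)  # the mixed cache gives each slot 3.2x the context
```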
Naiw80@reddit
Google is busy trying to patch gemma4 to work as expected…
b1231227@reddit
Not bad. I managed to fit in context that I previously couldn’t use, and there’s hardly any loss (K Q8_0 V turbo3).
Overall-Somewhere760@reddit
Why q8 for k ?
Porespellar@reddit
Just give Milla Jovovich some more time to get it coded.
superdariom@reddit
I'm running the ROCm version, which also includes triattention, and it is working really well. I'm using qwen 3.5 Q5 XL with over 100,000 context on a 24gb card. I think with some more upstream bug fixes even larger context may be possible. It's my daily agentic driver and I have not seen any problems at all. Wish it were merged upstream.
Blizado@reddit
I think even 2 weeks is too little time to see a proper implementation. It will take some more weeks until we get out of the experimental phase. I would guess there is still a lot of optimization left to do.
FrogsJumpFromPussy@reddit
TokForge, a new Android app built for both GGUF and MNN formats, has an experimental implementation of TurboQuant that fixes the Gemma E4B.mnn long context window. It seems to work with a 32,000 context window, but I haven't tested all of it yet.
jacek2023@reddit
Another day, another discussion about TurboQuant in llama.cpp
AdamDhahabi@reddit
For two weeks now we've had TurboQuant-like KV cache improvements: https://www.reddit.com/r/LocalLLaMA/comments/1s9nri7/attnrot_turboquantlike_kv_cache_trick_lands_in/
But now I realize there is more to come, as someone just posted here in the comments: https://github.com/ggml-org/llama.cpp/discussions/20969
norofbfg@reddit
most of what landed in llama.cpp looks experimental, so it has not really translated into stable usage yet
qwen_next_gguf_when@reddit
A bunch of people are validating their own implementations, and nothing is confirmed in the mainstream yet.