attn-rot (TurboQuant-like KV cache trick) lands in llama.cpp
Posted by Dany0@reddit | LocalLLaMA | View on Reddit | 28 comments
80% of the benefit of TQ with almost no downsides. Q8 is now ≈ F16
QuackerEnte@reddit
I still don't understand to this day: is this included in new releases automatically, or how does it work? Compiling it yourself is probably the safest way to get the latest features, but I want to know what differs between releases, if anything. E.g., at the time of writing, b8611 is the latest. Does it include this change or not? How do I turn it on/off?
_reverse@reddit
That's a good question. It appears the most recent release (b8611 at the time I'm writing this) only includes up to commit d43375f, which predates commit 744c0c7, the one with the attention-rotation changes. So you'll need to wait for the next release, or pull from master and rebuild.
PR 21038 - https://github.com/ggml-org/llama.cpp/pull/21038
Commit 744c0c7 - https://github.com/ggml-org/llama.cpp/commit/744c0c7310aad90e99a29c5739e4ee317fb6a748
Release b8611 - https://github.com/ggml-org/llama.cpp/releases/tag/b8611
Main - https://github.com/ggml-org/llama.cpp/commits/master/
QuackerEnte@reddit
Thanks. Releases are so confusing. I actually took the time this afternoon to understand them and I think now I do. It's easier to look up commit tags and see what releases have the change than vice versa. Again, thank you.
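One way to check this from a clone of llama.cpp, using the commit hash from the links above (`git tag --contains` is standard git; nothing llama.cpp-specific here):

```shell
# In a clone of llama.cpp: list the release tags (b####) that
# already contain the attn-rot commit. Empty output means no
# tagged release includes it yet.
git tag --contains 744c0c7310aad90e99a29c5739e4ee317fb6a748
```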
AdamDhahabi@reddit
Normally it takes a few hours for a merge to show up in the releases, but I just noticed this one landed 19h ago. There's a failing CI test, which I've reported. A release should follow soon.
andy2na@reddit
You have to build it yourself, but as long as you set a quantized cache type (q8, q4), rotation is on automatically.
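For anyone unsure what "define the cache" means: llama.cpp already has `--cache-type-k` / `--cache-type-v` flags for this. A minimal launch sketch (the model path and context size below are placeholders):

```shell
# Launch llama-server with a quantized KV cache; per the thread,
# the new rotation applies automatically once the cache is quantized.
# Model path and context size are placeholders, not recommendations.
./llama-server \
  -m /path/to/model.gguf \
  -c 16384 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```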
Electronic-Metal2391@reddit
Impressed by the hard work! Can't wait for this and TQ to become available to users.
e979d9@reddit
Will it reduce memory use for the KV cache like Google's TurboQuant?
ArtfulGenie69@reddit
Yeah because you won't be stuck with fp16 cache, you can use q8 with similar quality.
e979d9@reddit
I can only use Q4 :/
dinerburgeryum@reddit
Q4 will see marked improvements with the new Hadamard rotation scheme. You should get an almost immediate uplift.
rm-rf-rm@reddit
Stupid question, but we don't need to download any new weights, right?
erazortt@reddit
Correct
ArtfulGenie69@reddit
For the KV cache? You can do whatever you want. It's just set on the command line, not in the model's quant.
CircularSeasoning@reddit
Don't feel bad. AI is all about that Q4. Nvidia knows.
Dany0@reddit (OP)
Yes, it's the same core trick but a different, more conservative approach
AnonLlamaThrowaway@reddit
It's not "the same core trick", it's just ONE part of the entire TurboQuant package: attention rotation + PolarQuant + Lloyd-Max quantizer + 1-bit QLJ error correction
Dany0@reddit (OP)
Attention rotation is the core trick. Lloyd-Max isn't optimal.
dinerburgeryum@reddit
Yeah, I wouldn't say it's TurboQuant-like... in truth this is a well-established technique that has been widely used already in exllama and ik_llama.cpp. Pretty fun once you dig into it, and it's wonderful it's in mainline. But it isn't quite like a projection into polar coordinates. More like turning your KV cache into a weighted sum to smooth outliers.
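To make the outlier-smoothing intuition concrete, here's a small standalone sketch (not llama.cpp's actual code) of an orthonormal Hadamard rotation applied before a naive symmetric quantizer. One big outlier inflates the quantization scale for the whole vector; rotating first spreads its energy across all coordinates, so the scale, and the per-element error, shrinks:

```python
import numpy as np

def hadamard_rotate(x):
    """Fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is
    orthonormal (norm-preserving). Length must be a power of two."""
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_dequantize(x, bits=4):
    """Naive symmetric per-vector quantization to `bits` bits, then back.
    The scale is set by the largest-magnitude element."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
v = rng.normal(size=64)
v[3] = 40.0  # a single outlier blows up the quantization scale

plain_err = np.abs(quantize_dequantize(v) - v).mean()
rot = hadamard_rotate(v)
rot_err = np.abs(quantize_dequantize(rot) - rot).mean()
print(f"plain mean abs err: {plain_err:.3f}, rotated: {rot_err:.3f}")
```

Because the rotation is orthonormal, quantization error in the rotated space carries back unchanged (in norm) after the inverse rotation, so comparing errors in the two spaces is fair.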
Designer-Article-956@reddit
Google pr team is restless
dinerburgeryum@reddit
Yeah, honestly, TurboQuant seemed cool, but I was really waiting for better comparisons to existing techniques (Hadamard rotations included). It made quite a splash in the news tho!
-dysangel-@reddit
Weird, considering the paper is from last spring. I wonder if it was a purposeful attempt to manipulate stock/RAM prices.
waiting_for_zban@reddit
It's mainly because people outside the field have no tangible grasp of the internal developments behind such methods; everyone is going by vibes. Even people in the field, because there are so many rabbit holes.
So once in a while a big tech company comes along, copies ideas from a method published nearly 2 years ago, and starts shilling it non-stop, so that normies start parroting it.
CircularSeasoning@reddit
The news is mostly AI. Splash!
soshulmedia@reddit
The name "attn-rot" seems off - sounds like "attention rot". But as far as I understand, it is exactly what this should prevent?
alberto_467@reddit
Yeah it sounds like a weird phenomenon you'd want to monitor and avoid
CircularSeasoning@reddit
You're absolutely right.
What was I saying? Who are you and why do you bother my endless attention?
LegacyRemaster@reddit
Amazing job! Can't wait to test it!
mr_zerolith@reddit
Interesting... please weigh in if you've tried the Q8 version.