attn-rot (TurboQuant-like KV cache trick) lands in llama.cpp
Posted by Dany0@reddit | LocalLLaMA | View on Reddit | 28 comments
80% of the benefit of TQ with almost no downsides. Q8 is now ≈ F16
QuackerEnte@reddit
I still don't understand to this day: is this included in new releases automatically, or how does it work? Compiling it yourself is probably the safest way to get the latest features, but I want to know what differs between releases, if anything. E.g., at the time of writing, b8611 is the latest. Does it include this change or not? How do I turn it on/off?
_reverse@reddit
That's a good question. It appears the most recent release (b8611 at the time I'm writing this) only includes up to commit d43375f, which predates commit 744c0c7, the one with the attention-rotation changes. So you'll need to wait for the next release, or pull from master and rebuild.
PR 21038 - https://github.com/ggml-org/llama.cpp/pull/21038
Commit 744c0c7 - https://github.com/ggml-org/llama.cpp/commit/744c0c7310aad90e99a29c5739e4ee317fb6a748
Release b8611 - https://github.com/ggml-org/llama.cpp/releases/tag/b8611
Main - https://github.com/ggml-org/llama.cpp/commits/master/
QuackerEnte@reddit
Thanks. Releases are so confusing. I actually took the time this afternoon to understand them and I think now I do. It's easier to look up commit tags and see what releases have the change than vice versa. Again, thank you.
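One way to check this from a clone of llama.cpp, using the commit hash from the links above (`git tag --contains` is standard git; nothing llama.cpp-specific here):

```shell
# In a clone of llama.cpp: list the release tags (b####) that
# already contain the attn-rot commit. Empty output means no
# tagged release includes it yet.
git tag --contains 744c0c7310aad90e99a29c5739e4ee317fb6a748
```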
AdamDhahabi@reddit
Normally it takes a few hours for a merge to show up in the releases, but I just noticed this one landed 19h ago. There's a failing CI test, which I've reported. A release should follow soon.
andy2na@reddit
You have to build it yourself, but as long as you set a quantized cache type (q8, q4), rotation is on automatically.
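For anyone unsure what "define the cache" means: llama.cpp already has `--cache-type-k` / `--cache-type-v` flags for this. A minimal launch sketch (the model path and context size below are placeholders):

```shell
# Launch llama-server with a quantized KV cache; per the thread,
# the new rotation applies automatically once the cache is quantized.
# Model path and context size are placeholders, not recommendations.
./llama-server \
  -m /path/to/model.gguf \
  -c 16384 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```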
Electronic-Metal2391@reddit
Impressed by the hard work! Can't wait for this and TQ to become available to users.
e979d9@reddit
Will it reduce memory use for the KV cache like Google's TurboQuant?
ArtfulGenie69@reddit
Yeah because you won't be stuck with fp16 cache, you can use q8 with similar quality.
e979d9@reddit
I can only use Q4 :/
dinerburgeryum@reddit
Q4 will see marked improvements with the new Hadamard rotation scheme. You should get an almost immediate uplift.
rm-rf-rm@reddit
Stupid question, but we don't need to download any new weights, right?
erazortt@reddit
Correct
ArtfulGenie69@reddit
For the KV cache? You can do whatever you want. It's just set on the command line, not in the model's quant.
CircularSeasoning@reddit
Don't feel bad. AI is all about that Q4. Nvidia knows.
Dany0@reddit (OP)
Yes, it's the same core trick but a different, more conservative approach
AnonLlamaThrowaway@reddit
It's not "the same core trick", it's just ONE part of the entire TurboQuant package: attention rotation + PolarQuant + Lloyd-Max quantizer + 1-bit QLJ error correction
Dany0@reddit (OP)
Attention rotation is the core trick. Lloyd-Max isn't optimal.
dinerburgeryum@reddit
Yeah, I wouldn't say it's TurboQuant-like... in truth this is a well-established technique that has been widely used already in exllama and ik_llama.cpp. Pretty fun once you dig into it, and it's wonderful it's in mainline. But it isn't quite like a projection into polar coordinates. More like turning your KV cache into a weighted sum to smooth outliers.
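To make the outlier-smoothing intuition concrete, here's a small standalone sketch (not llama.cpp's actual code) of an orthonormal Hadamard rotation applied before a naive symmetric quantizer. One big outlier inflates the quantization scale for the whole vector; rotating first spreads its energy across all coordinates, so the scale, and the per-element error, shrinks:

```python
import numpy as np

def hadamard_rotate(x):
    """Fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is
    orthonormal (norm-preserving). Length must be a power of two."""
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_dequantize(x, bits=4):
    """Naive symmetric per-vector quantization to `bits` bits, then back.
    The scale is set by the largest-magnitude element."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
v = rng.normal(size=64)
v[3] = 40.0  # a single outlier blows up the quantization scale

plain_err = np.abs(quantize_dequantize(v) - v).mean()
rot = hadamard_rotate(v)
rot_err = np.abs(quantize_dequantize(rot) - rot).mean()
print(f"plain mean abs err: {plain_err:.3f}, rotated: {rot_err:.3f}")
```

Because the rotation is orthonormal, quantization error in the rotated space carries back unchanged (in norm) after the inverse rotation, so comparing errors in the two spaces is fair.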
Designer-Article-956@reddit
Google pr team is restless
dinerburgeryum@reddit
Yeah, honestly, TurboQuant seemed cool, but I was really waiting for better comparisons to existing techniques (Hadamard rotations included). It made quite a splash in the news tho!
-dysangel-@reddit
Weird, considering the paper is from last spring. I wonder if it was a purposeful attempt to manipulate stock/RAM prices.
waiting_for_zban@reddit
It's mainly because people outside the field have no tangible grasp of the internal developments behind such methods; everyone is going by vibes. Even people in the field, because there are so many rabbit holes.
So once in a while a big tech company comes along, copies ideas from a method published nearly 2 years ago, and starts shilling it non-stop, so that normies start parroting it.
CircularSeasoning@reddit
The news is mostly AI. Splash!
soshulmedia@reddit
The name "attn-rot" seems off - sounds like "attention rot". But as far as I understand, it is exactly what this should prevent?
alberto_467@reddit
Yeah it sounds like a weird phenomenon you'd want to monitor and avoid
CircularSeasoning@reddit
You're absolutely right.
What was I saying? Who are you and why do you bother my endless attention?
LegacyRemaster@reddit
Amazing job! Can't wait to test it!
mr_zerolith@reddit
Interesting... please weigh in if you've tried the Q8 version.