kv-cache : support attention rotation for heterogeneous iSWA by ggerganov · Pull Request #21513 · ggml-org/llama.cpp
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 16 comments
tl;dr: Fixes KV-cache rotation for hybrid-attention models like Gemma 4
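The "rotation" in the PR title refers to RoPE (rotary position embeddings): because RoPE encodes position as a rotation of key vectors, a key already sitting in the KV cache can be moved to a new position by rotating it by the position delta, with no recompute from the raw activations. A minimal sketch of that identity, assuming standard pairwise RoPE (the function name and values are illustrative, not from the PR):

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    # Apply RoPE: rotate each consecutive pair (x, y) by pos * theta_i,
    # where theta_i falls off geometrically across the head dimension.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# A cached key encoded at position 7 can be moved to position 3 by
# rotating it by the delta (-4), because 2D rotations compose additively.
raw_key = [0.5, -1.0, 2.0, 0.25]
cached = rope_rotate(raw_key, pos=7)
shifted = rope_rotate(cached, pos=-4)     # rotate in place by the delta
recomputed = rope_rotate(raw_key, pos=3)  # what a fresh forward pass gives
assert all(abs(a - b) < 1e-9 for a, b in zip(shifted, recomputed))
```

The PR's contribution is making this shift work when layers have heterogeneous attention (iSWA: some layers sliding-window, some full), so the per-layer caches can no longer all be rotated uniformly.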
BigYoSpeck@reddit
I've tested it with both the UD Q6_K_XL and bartowski Q8_0 of Gemma 4 31B
For general logic, reasoning, instruction following and creativity it seems broadly a match for a non-quantised KV cache. But for coding it's been just slightly off in the details, in ways that completely break the result
One of the tests I do is getting the model to make a Micro Machines game
Gemma 4 does a really good job of this. AI cars that drive the track, collisions, sliding physics, track limits, lap counts and race position all handled producing a perfectly playable game
With -ctk and -ctv q8_0 it gets the details just wrong enough that it all falls apart. AI driving in circles, acceleration physics off so the car zooms off screen instantly, track graphics not aligned
I've no doubt a clearer prompt could work around it, but the point of the test is to use as basic a prompt as possible, and with this the base config doesn't behave quite as well
jacek2023@reddit (OP)
could you share the game? :)
BigYoSpeck@reddit
Top-down racing game (like Micro Machines).
harlequin-coleen-1.tiiny.site
SlaveZelda@reddit
ggerganov still doing things by hand - what a legend
-Ellary-@reddit
People from SillyTavernAI always do their things by hand.
LegacyRemaster@reddit
aahahahahahahaahah
SkyFeistyLlama8@reddit
As someone who needs an AI to make sense of C++ code, I salute him. ggerganov is a legend.
soyalemujica@reddit
How can one make use of this?
BigYoSpeck@reddit
-ctv q8_0 -ctk q8_0
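Those are llama.cpp's short forms of `--cache-type-k` and `--cache-type-v`, which set the quantization type for the K and V halves of the KV cache. A typical invocation might look like this (the model filename is illustrative; note that a quantized V cache has generally required flash attention to be enabled):

```shell
# Quantize both halves of the KV cache to Q8_0 to cut its memory roughly in half.
# -fa enables flash attention, which the quantized V cache depends on.
llama-cli -m ./model.gguf -ctk q8_0 -ctv q8_0 -fa -c 8192 -p "Hello"
```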
EffectiveCeilingFan@reddit
🙏 thank you for not just calling this TurboQuant
x0wl@reddit
This is not turboquant though
-dysangel-@reddit
could call it turboquasn't
salbego5@reddit
or turboquain't
-dysangel-@reddit
much better!
jacek2023@reddit (OP)
I posted this https://www.reddit.com/r/LocalLLaMA/comments/1s9lge6/llama_rotate_activations_for_better_quantization/
later someone posted this https://www.reddit.com/r/LocalLLaMA/comments/1s9nri7/attnrot_turboquantlike_kv_cache_trick_lands_in/
as you can see, reposting the same content with "TurboQuant" in the title is what LocalLLaMA readers expect :)
ttkciar@reddit
I really appreciate that you've been sharing recent llama.cpp developments with the community. Thank you :-)