Wait, is attn rotate already enabled by default, since this release says it supports SWA attention?
Posted by Altruistic_Heat_9531@reddit | LocalLLaMA | 22 comments
For the past 2 weeks, my daily routine has included checking the main llama.cpp releases to see if attn rotate has been merged. Am I missing something? I mean, it should be there already since the core rotation PR has been merged. Is it enabled by default?
_wOvAN_@reddit
Why doesn't it work for the bf16/f16 cache types?
Altruistic_Heat_9531@reddit (OP)
Because bf16/fp16 is the native computation dtype. Rotation helps quantization reduce error relative to fp16/bf16.
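This isn't llama.cpp's actual kernel, just a toy NumPy sketch of the idea: when a vector has outliers (as transformer activations often do), quantizing after an orthogonal rotation spreads the outlier energy across all components, so the quantization grid is used more evenly. All names and sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization (round to nearest)."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

# Activation-like vector with one large outlier
x = rng.normal(size=256)
x[0] = 40.0

# Random orthogonal rotation (QR of a Gaussian matrix)
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))

# Quantize directly vs. quantize in the rotated basis, then rotate back
err_plain = np.linalg.norm(x - quantize_int4(x))
err_rotated = np.linalg.norm(x - Q.T @ quantize_int4(Q @ x))

print(f"plain:   {err_plain:.2f}")
print(f"rotated: {err_rotated:.2f}")
```

The outlier forces a coarse quantization scale on the raw vector; after rotation every component is on a similar scale, so the error drops. In a bf16/f16 cache there is no quantization step, hence nothing for the rotation to help with.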
_wOvAN_@reddit
So it should be one of the cache types then, quite misleading.
x0wl@reddit
No, because it's applied to Q8 and Q4
Dazzling_Equipment_9@reddit
Does anyone know of any existing issues with using Gemma 4 in llama.cpp? Until yesterday, I was still seeing people complaining about problems with Gemma 4 support in llama.cpp.
nickm_27@reddit
Been working great for me for multiple days now
DOAMOD@reddit
still broken
Dry-Influence9@reddit
There were tons of issues, many of which are now resolved. That's to be expected with software development moving this fast.
Dazzling_Equipment_9@reddit
The llama.cpp developers probably never imagined that supporting every new model release would turn out to be such a massive headache. At the same time, I have to say their release speed is absolutely insane—like a rocket.
x0wl@reddit
It's basically for Gemma 4
Altruistic_Heat_9531@reddit (OP)
I understand that, but what confuses me is: has attn rot been applied all this time?
OfficialXstasy@reddit
It was applied about a week ago https://github.com/ggml-org/llama.cpp/pull/21038
grandong123@reddit
So do we need to change the llama-server run command for Gemma 4? Or do we not need to change anything?
erazortt@reddit
As long as you want attn-rot enabled, no changes are needed.
grandong123@reddit
okay thank you!
ambient_temp_xeno@reddit
Subconsciously, OP can't really believe they merged it without giving it a CLI setting.
Altruistic_Heat_9531@reddit (OP)
Let me rephrase it: I understand this is specifically for models that use SWA blocks, like Gemma, but SWA is a subset of attention implementations. Therefore, there must be a previous release I missed where it was already applied to normal full attention in mainline llama.cpp. Is it enabled by default, or do I add another flag to the CLI args?
grumd@reddit
Enabled by default, and yes, you missed a release that introduced KV cache rotation.
Altruistic_Heat_9531@reddit (OP)
Ahh, I see... thanks. Is it opt-out? I mean, I'm going to use attn rot anyway, just asking since there's no CLI flag.
Special-Mistake8923@reddit
It is enabled by default.
grumd@reddit
There's an environment variable you can use to disable rotations:
LLAMA_ATTN_ROT_DISABLE
https://github.com/ggml-org/llama.cpp/pull/21038
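A usage sketch, with caveats: the thread names the variable but not the value it checks, so setting it to 1 is an assumption, and the model path and server flags are placeholders.

```shell
# Assumption: setting LLAMA_ATTN_ROT_DISABLE (here to 1) turns rotation off at launch.
# Model path is a placeholder.
LLAMA_ATTN_ROT_DISABLE=1 ./llama-server -m ./gemma-model.gguf
```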
Clear-Ad-9312@reddit
More nuanced: this is to support rotation in SWA models. It wasn't working with Gemma 4 models, but now it does.