what's the case against flash attention?

Posted by Responsible-Crew1801@reddit | LocalLLaMA

I accidentally stumbled upon the -fa (flash attention) flag in llama.cpp's llama-server. I can't speak to the speedup as I haven't properly tested it, but the memory savings are huge: an 8B F16 GGUF model with 100k context fits comfortably on a 32GB VRAM GPU with some 2-3 GB to spare.
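A rough back-of-envelope estimate suggests that figure is plausible. The sketch below assumes a Llama-3-8B-style layout (32 layers, 8 KV heads, head dimension 128) and exactly 8e9 parameters, none of which the post states, and it ignores compute buffers and runtime overhead.

```python
# Rough VRAM estimate for an 8B F16 model with a 100k-token F16 KV cache.
# Architecture numbers are ASSUMED (Llama-3-8B-like), not taken from the post.

GIB = 1024 ** 3

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to store K and V for n_ctx tokens across all layers."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * n_ctx

def weight_bytes(n_params=8.0e9, bytes_per_param=2):
    """Bytes for the weights at F16 (2 bytes per parameter)."""
    return int(n_params * bytes_per_param)

kv = kv_cache_bytes(100_000)
weights = weight_bytes()
print(f"KV cache: {kv / GIB:.1f} GiB")
print(f"Weights:  {weights / GIB:.1f} GiB")
print(f"Total:    {(kv + weights) / GIB:.1f} GiB")
```

What this leaves out is the attention score buffer that a non-flash implementation has to materialise, which grows with context length. Avoiding that buffer is where flash attention saves memory.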

A very brief search revealed that flash attention theoretically computes the same mathematical function, and in practice benchmarks show no change in the model's output quality.

So my question is: is flash attention really just a free lunch? What's the catch? Why is it not enabled by default?

LagOps91@reddit

I tried it a while back and it degraded performance for me (t/s, not output). Not sure if I did anything wrong...

512bitinstruction@reddit

what hardware were you using?

LagOps91@reddit

7900 XTX, full offload with 24GB VRAM, using Vulkan

512bitinstruction@reddit

I don't think Vulkan has great FA support. There were PRs recently in the llama.cpp repo. Maybe open an issue there for the devs to look at.

LagOps91@reddit

yeah that likely is the case

LagOps91@reddit

160 t/s pp with FA enabled on 32k context GLM-4 Q4m. I get 500-ish without FA enabled. Sure, it saves some memory, but performance just isn't great.

Wheynelau@reddit

Free lunch only for supported hardware. I don't remember it being supported on CPU, but I could be wrong. Maybe llama.cpp has a different implementation for CPU.

512bitinstruction@reddit

Flash Attention is a different and optimized way of doing the same thing. It was invented to make models run faster on GPUs.

It's basically a free lunch iff your hardware and drivers support it properly. And that is a big if. I suspect the reason it is not enabled globally is that it would break on a lot of older hardware or drivers, which would upset people.

Double_Cause4609@reddit

It's a free lunch for well-supported models; it's mathematically identical to traditional Attention, just calculated differently. Most of the memory savings come from an idea related to activation checkpointing (from training), which you can read about in the Hugging Face docs under various strategies for memory management in training.

Some models nowadays have it built into the raw PyTorch modelling files.

Not all models do well with it, as some have custom Attention implementations that don't play well with a naive implementation of FA, so they get worse speed or numerically different performance with it enabled, but almost all alternative formulations of Attention could be made to use it with an update to the inference backend.

In particular, I think early implementations of Gemma 2 and 3 didn't play well with FA for example.
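To make the "mathematically identical, just calculated differently" point concrete, here is a minimal pure-Python sketch for a single query vector. It compares ordinary softmax attention with a tiled version that keeps only a running max, a running denominator and a running output. The function names and toy data are made up for illustration. Real kernels also tile over queries and run fused on the GPU, so this is not how llama.cpp implements it.

```python
import math

def naive_attention(q, keys, values):
    """Standard attention for one query: softmax(q.K^T / sqrt(d)) @ V."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += (w / total) * x
    return out

def tiled_attention(q, keys, values, tile=2):
    """Same function, but keys/values are visited one tile at a time.

    The full score vector is never stored; earlier partial results are
    rescaled whenever a larger score is seen (the "online softmax" trick).
    """
    d = len(q)
    run_max = float("-inf")
    run_sum = 0.0
    out = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(run_max, max(scores))
        scale = math.exp(run_max - new_max)  # rescale what was accumulated so far
        run_sum *= scale
        out = [o * scale for o in out]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            run_sum += w
            for i, x in enumerate(v):
                out[i] += w * x
        run_max = new_max
    return [o / run_sum for o in out]

# Toy data, chosen arbitrarily.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [2.0, 2.0], [-3.0, 1.0]]

a = naive_attention(q, keys, values)
b = tiled_attention(q, keys, values)
max_diff = max(abs(x - y) for x, y in zip(a, b))
print("max abs difference:", max_diff)
```

The two results agree up to floating-point rounding. Any larger gap in a real backend would point to an implementation issue, not to the algorithm.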

Responsible-Crew1801@reddit (OP)

Interesting, you seem to have experimented quite a bit with this. Any tips on which models to avoid with flash attention other than Gemma / what to look for when a new model is released?

Double_Cause4609@reddit

Gemma's supported now, it's just that it used to cause weird behavior.

MLA models used to be weird, and I want to say at launch there was also weird behavior for Llama 4, but I think most of the weird behaviors have been patched out.

As for new models, I'd expect any model that follows an existing paradigm (GQA, MLA, MQA, SWA etc) should work fine, but as soon as you see a weird algorithm in the white paper I generally expect that somewhere there will be weird behavior for the first month and a half that it's out, so I tend to hold off on my judgement until I get a handle on the specific algorithm and see the active issues on projects related to it.

dinerburgeryum@reddit

Gemma’s big problem was iSWA, not FA. It also has problems with KV quantization due to the number of attention heads causing CUDA register thrashing. But I don’t believe FA was ever the explicit culprit.

Double_Cause4609@reddit

I don't believe it was, exactly, in and of itself, but anecdotally, I, and a lot of people I knew, for a long period of time, saw really weird behavior in memory usage and speed related to the Attention mechanisms of Gemma 2 and 3. It's possible FA wasn't the culprit outright, but enabling it caused a lot of weird behavior that one wouldn't expect.

You could very well be right.

lordpuddingcup@reddit

Isn’t sage just better than flash at this point?

Finanzamt_Endgegner@reddit

is there support for it in llama.cpp?

fallingdowndizzyvr@reddit

Nope, which baffles me, since in the SD world flash is passé because sage is better.

Cheap_Ship6400@reddit

Sage is basically built on Flash.

Here is a short intro to both of them:

Flash-attention 1&2: A mathematically lossless attention acceleration method, which splits big QKV matrix operations into small ones, thus improving memory efficiency (they call this tiling).

Flash-attention 3: Just for NVIDIA's Hopper GPUs, utilizing their new asynchronous features.

Sage-attention 1: Based on Flash-attention, they replace some float matrix operations with int8 ones to speed things up, so this is not mathematically lossless. Therefore, they apply adaptive quantization techniques to obtain "visually lossless" results.

Sage-attention 2: Further quantization to int4 and fp8 to utilize more low-precision (but really fast) calculations. Some smoothing algorithms are applied to reduce the loss of precision.

To summarize, Flash Attention is mathematically lossless using tiling, and Sage Attention is based on Flash Attention, using adaptive quantization to speed things up and smoothing to stay visually lossless.
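As a rough illustration of why the int8 step is not lossless, the sketch below quantizes two small vectors with a single symmetric scale each and compares the integer dot product against the float one. The vectors and the per-vector scaling scheme are assumptions for illustration only. This is not the SageAttention algorithm, which per the summary above adds adaptive quantization and smoothing.

```python
def quantize_int8(vec):
    """Symmetric per-vector int8 quantization: returns (ints, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    ints = [max(-127, min(127, round(x / scale))) for x in vec]
    return ints, scale

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example vectors standing in for one query row and one key row.
q = [0.12, -0.83, 0.45, 0.07]
k = [0.91, 0.33, -0.58, 0.26]

q_i, q_s = quantize_int8(q)
k_i, k_s = quantize_int8(k)

exact = dot(q, k)
approx = dot(q_i, k_i) * q_s * k_s  # integer dot product, then rescale to float

print(f"exact  = {exact:.6f}")
print(f"approx = {approx:.6f}")
print(f"error  = {abs(exact - approx):.6f}")
```

The error is small but nonzero, which is the trade being described: faster low-precision math in exchange for results that are only approximately equal.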

Finanzamt_Endgegner@reddit

but idk if it's as good in transformers

Finanzamt_Endgegner@reddit

yeah, sage is a game changer

fallingdowndizzyvr@reddit

Yes. I've often wondered why it's not supported as opposed to flash.

nuclearbananana@reddit

On my system (igpu + cpu) it tanks performance

HumerousGorgon8@reddit

When I enable flash attention on the SYCL llama-server variant, it tanks my performance. It's great to have quantised KV cache though.

ArtfulGenie69@reddit

Uhhh, sage?

dc740@reddit

For me it starts great, with good t/s. But then it gets slower until it produces so few tokens that the response essentially hangs. I ended up disabling it. I posted a discussion on the llama.cpp GitHub but got no replies.

chibop1@reddit

I've seen some people claiming it decreases the quality of output, so they don't use it. However, I think it's pretty negligible especially considering the benefit.

Responsible-Crew1801@reddit (OP)

A commenter pointed out that bugs were found in FA implementations. I'd recommend giving it a go after pulling the latest llama.cpp, since in my (fairly limited) testing I did not encounter such bugs.

FullOf_Bad_Ideas@reddit

Are you sure that original flash attention 2 and FA in llama.cpp are bug free?

I don't think so. It works for me, but I've heard it degrades output quality for others. I don't think perplexity with and without it is the same; I saw some discussions about it. If perplexity is not the same, it's not the same mathematically. Computers are complex, errors creep in, and flash attention is yet another thing that can break some of the time, so you should be able to not use it.
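One mundane reason two "mathematically identical" computations can give slightly different numbers: floating-point addition is not associative, and a tiled kernel accumulates its sums in a different order. A minimal illustration with plain Python floats (not llama.cpp code):

```python
# Same three numbers, two different summation orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)      # False: the results differ in the last bit
print(abs(a - b))  # tiny, on the order of 1e-16
```

Differences of this size are rounding noise. A perplexity gap well beyond that would suggest an actual implementation bug.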

Responsible-Crew1801@reddit (OP)

I've used models that turned out to be broken; Unsloth's UD quantization of SeedCoder F16 was the latest. Flash attention, on the models I tried it on (Qwen 3 14B and 32B, plus the DeepSeek distilled 8B), does not create the issues I faced with broken models.

FullOf_Bad_Ideas@reddit

Yes, and you are one person, while software should generally work for as many people as possible. Ideally it should even work on someone's Raspberry Pi Zero, and on every phone (there are a few apps running a llama.cpp-based engine on phones). FA is not necessarily compatible with every model, every type of hardware, or every other llama.cpp feature - there's usually a feature matrix, and some features break other features.

Bug: https://github.com/ggml-org/llama.cpp/issues/13430

fix: https://github.com/ggml-org/llama.cpp/pull/13438

This was less than a month ago, and FA has been in llama.cpp for around a year, meaning it's not rock solid and hasn't been for the last year. So unless things suddenly change and software becomes bug-free overnight, some people will have issues using it on their hardware.

Responsible-Crew1801@reddit (OP)

I see, so you're saying FA's downside is that it still needs some software maturity before it can be used as the default?

Chromix_@reddit

It speeds up prompt processing, usually more than doubling it for longer prompts.

It allows you to use -ctk and -ctv for KV cache quantization to save more VRAM and thus allow larger context sizes.

Enable it. It usually works. For some models it's disabled, and it might not work on cards that are not from Nvidia or not somewhat recent.
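To get a feel for what KV cache quantization via -ctk and -ctv buys, here is a small size estimate. It assumes the ggml block layouts as I understand them (32 elements per block: 64 bytes for f16, 34 for q8_0, 18 for q4_0) and a hypothetical Llama-3-8B-style model (32 layers, 8 KV heads, head dimension 128), so treat the numbers as illustrative.

```python
# Bytes per 32-element block for a few cache types (assumed ggml layouts:
# q8_0 = 2-byte scale + 32 int8 values, q4_0 = 2-byte scale + 16 packed bytes).
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
GIB = 1024 ** 3

def kv_cache_gib(cache_type, n_ctx, n_layers=32, n_kv_heads=8, head_dim=128):
    """Approximate K+V cache size in GiB when both use the same cache type."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elems / 32 * BLOCK_BYTES[cache_type] / GIB

for cache_type in ("f16", "q8_0", "q4_0"):
    print(f"{cache_type:5s} @ 32k context: {kv_cache_gib(cache_type, 32768):.3f} GiB")
```

Under these assumptions q8_0 roughly halves the cache and q4_0 cuts it to a bit over a quarter, which can go toward a longer context instead.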

FullstackSensei@reddit

I think most of the memory savings you're seeing come from the recent implementation of sliding window attention in llama.cpp. It reduces context memory consumption by some 75%.
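For a sense of where a figure like that can come from, here is a toy calculation. It assumes a hypothetical model with 48 layers where every 6th layer uses full (global) attention and the rest use a 1024-token sliding window. These numbers are made up for illustration, and the real saving depends on the model and on how the backend sizes its window cache.

```python
def cached_token_slots(n_ctx, n_layers, global_every, window):
    """Token slots kept across all layers when local layers only cache a window."""
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    return n_global * n_ctx + n_local * min(window, n_ctx)

def swa_saving(n_ctx, n_layers=48, global_every=6, window=1024):
    """Fraction of KV cache saved versus caching n_ctx tokens in every layer."""
    full = n_layers * n_ctx
    return 1.0 - cached_token_slots(n_ctx, n_layers, global_every, window) / full

for n_ctx in (8192, 32768):
    print(f"context {n_ctx}: about {swa_saving(n_ctx) * 100:.0f}% less KV cache")
```

The saving grows with context length, since the windowed layers stop growing once the context exceeds the window.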

As far as flash attention is concerned, it's mathematically identical to regular attention. Any differences you find are bugs in the implementation in llama.cpp. Otherwise, it's free lunch.

Calcidiol@reddit

AFAICT from cursory reading long ago, it seems that, at least originally, FA upstream and in downstream dependent projects was only implemented / defined for NVIDIA GPUs, and then perhaps (?) only for certain "relatively recent" architectures of those. Unsurprisingly, the primary use case / development target was enterprise-category high-end server dGPUs, with somewhat different architectural optimization domains than what applies to consumer dGPUs with "tiny" amounts of VRAM.

So I think that relative (historical?) unported status was sometimes problematic. Whether it's fully optimal for contemporary consumer-level dGPUs is also an interesting question, since IDK if that's been an optimization target upstream between when it was published / created and now.

I gather there are some downstream forked / ported implementations of it or something like it now, though, for different inference engines / platforms.