llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp

[-]

def_not_jose@reddit

Not much of a help for VRAM poors, because we already use -b and -ub 128 which saves like hundreds of megabytes

[-]

AnonLlamaThrowaway@reddit

What's the downside of reducing -b and -ub ?

[-]

MerePotato@reddit

The real downside is that for models like Gemma, your vision encoder is locked to a token resolution at or below that batch size

[-]

AnonLlamaThrowaway@reddit

That is a very important thing to know, thank you very much for the heads-up

[-]

No-Educator-249@reddit

It does reduce prompt processing speed, but it's not as bad as it sounds. Using -b and -ub 256 allows me to run a Gemma-4-31B Q3_XXS quant at 8 t/s with a 32000 context window on a 16GB VRAM GPU.

[-]

suprjami@reddit

This guy is on fire lately. llama.cpp contributor of the year.

[-]

am17an@reddit

Thank you!

[-]

Mountain_Patience231@reddit

Dear llamacpp god, I just want to say thank you for making my hardware worth every penny.

[-]

pmttyji@reddit

u/am17an help us run 100B models with just 8GB VRAM (check both thread & comments) by the end of this year. Counting on you!

[-]


BitGreen1270@reddit

Man I just have to run git pull on llama.cpp occasionally to make it faster and more efficient 😄

[-]

sagiroth@reddit

That's how most things work yeah

[-]

TheWaffleKingg@reddit

If only windows worked that way

[-]

amuhak@reddit

You can just get the binary every now and then

[-]

No_Afternoon_4260@reddit

so the enshittification is a complete myth?

[-]

suicidaleggroll@reddit

Enshittification is a problem with proprietary/paid software, not open source.

[-]

CheatCodesOfLife@reddit

Agreed. You get setbacks occasionally but they're usually bugs / mistakes that get fixed and you can always roll back.

Meanwhile the iPhone blows efuses to prevent rolling back to the last good version of the software.

[-]

Glass_Cat_4281@reddit

Yeah, you usually see it when open source software pivots to paid.

[-]

No_Afternoon_4260@reddit

True

[-]

xpnrt@reddit

What if FA is slower / unusable for me, anything?

[-]

BillDStrong@reddit

No, this is FA only. If you use a fallback, like me on a P40, it won't change anything.

[-]

FerLuisxd@reddit

Does ik_llamacpp have something similar?

[-]

Sisaroth@reddit

nice, i can fit a few more experts in VRAM

[-]

Ok_Needleworker_6431@reddit

Sounds mad! Will need to try this/test it out to make sense of this!

[-]

Hot_Turnip_3309@reddit

Hi, there is no explanation? What is this?

[-]

Remove_Ayys@reddit

One of the other maintainers here, particularly as it relates to the CUDA backend. Honestly I feel very lucky to have Aman be part of the project.

[-]

LegacyRemaster@reddit

sounds awesome

[-]

acetaminophenpt@reddit

Thanks!!

[-]

anthonyg45157@reddit

Damn this might just let me pull the image model back to GPU. I offloaded to gpu to maximize context with MTP

[-]

FormalAd7367@reddit

thanks! am17an

[-]

ParaboloidalCrest@reddit

So will -fit automatically realize that I can fit more context now?

[-]

Remove_Ayys@reddit

Yes, because the memory reduction comes from a smaller compute graph allocation which is one of the buffer sizes used for feedback.

[-]

sagiroth@reddit

Think it should pick up yes

[-]

soyalemujica@reddit

According to the merge we can save 1.2GB of vram by default now ?

[-]

goldcakes@reddit

The merged PR is a simple optimisation that removes entirely redundant RAM usage in certain scenarios. (As the title suggests, it only works when FlashAttention is used).

I'm confident the merged PR isn't responsible for your regression, it might be something else in llama.cpp?

[-]

soyalemujica@reddit

I have removed my edit from the main comment, looks like it's a problem with latest Fedora 44 Linux kernel 7.0.10 update, I will try some stuff out

[-]

No_Algae1753@reddit

Correct me if I'm wrong, but I think this only applies to MTP models, no?

This provides 1.2GB of VRAM saving at -ub 2048 and ~300Mb at -ub 512 when using MTP

[-]

am17an@reddit

No it's just 2x with MTP, should help everywhere
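For anyone wondering why the saving grows with -ub, here is a rough back-of-the-envelope sketch. It assumes the FA mask is a dense n_kv x n_ubatch tensor and that the saving is simply the difference between f32 and f16 storage. The function names and the 131072-token context are made up for illustration and are not taken from the PR, so treat the numbers as an order-of-magnitude guide only.

```python
def mask_bytes(n_kv: int, n_ubatch: int, bytes_per_elem: int) -> int:
    """Size in bytes of a dense (n_kv x n_ubatch) attention mask.

    Assumption: one mask element per (KV position, token in the micro-batch).
    """
    return n_kv * n_ubatch * bytes_per_elem


def f16_saving_mib(n_kv: int, n_ubatch: int) -> float:
    """MiB saved by storing the mask as f16 (2 bytes) instead of f32 (4 bytes)."""
    f32_size = mask_bytes(n_kv, n_ubatch, 4)
    f16_size = mask_bytes(n_kv, n_ubatch, 2)
    return (f32_size - f16_size) / (1024 * 1024)


if __name__ == "__main__":
    n_kv = 131072  # hypothetical context length, chosen only for illustration
    for n_ubatch in (128, 512, 2048):
        saved = f16_saving_mib(n_kv, n_ubatch)
        print(f"-ub {n_ubatch}: about {saved:.0f} MiB saved per mask")
```

Under these assumptions the saving is linear in both the context length and -ub, which would be consistent with small -ub values seeing little difference.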

[-]

goldcakes@reddit

This only works when you have FlashAttention working. It's a cleanup that removes redundant RAM usage with FA.

[-]

grumd@reddit

That's amazing, I've been using ub 2048 and 4096 for my moe models, this opens up a lot more context, thanks!

[-]

soyalemujica@reddit

llama.cpp is now actually using more VRAM than before. I can no longer fit 120k context for 3.6 27B on 24GB VRAM @ q5_1 / q4_1, and its performance has drastically decreased

[-]

Beamsters@reddit

and he just landed ANOTHER 1.2gb save follow-up https://github.com/ggml-org/llama.cpp/pull/23861

[-]

sagiroth@reddit

Madlad

[-]

SarcasticBaka@reddit

Was pretty excited about the prospect of saving some VRAM but after testing pre and post recompiling llama.cpp, I'm not seeing even a single MB of difference. Literally the exact same as before.

[-]

nickm_27@reddit

It depends on what model you use and its vocab size I believe, at least that's what the referenced issue says.

For me on Gemma4 26B I see 300MB of savings
