llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp
Posted by jacek2023@reddit | LocalLLaMA | 51 comments
now you can download more VRAM ;)
(by downloading new llama.cpp version)
def_not_jose@reddit
Not much of a help for VRAM poors, because we already use -b and -ub 128 which saves like hundreds of megabytes
AnonLlamaThrowaway@reddit
What's the downside of reducing -b and -ub?
MerePotato@reddit
The real downside is that for models like Gemma, your vision encoder is locked to a token resolution at or below that batch size.
AnonLlamaThrowaway@reddit
That is a very important thing to know, thank you very much for the heads-up
No-Educator-249@reddit
It does reduce prompt processing speed, but it's not as bad as it sounds. Using -b and -ub 256 allows me to run a Gemma-4-31B Q3_XXS quant at 8 t/s with a 32000 context window on a 16GB VRAM GPU.
suprjami@reddit
This guy is on fire lately. llama.cpp contributor of the year.
am17an@reddit
Thank you!
Mountain_Patience231@reddit
Dear llamacpp god, I just want to say thank you for making my hardware worth every penny.
pmttyji@reddit
u/am17an, help us run 100B models with just 8GB VRAM (check both thread & comments) by the end of this year. Counting on you!
BitGreen1270@reddit
Man I just have to run git pull on llama.cpp occasionally to make it faster and more efficient 😄
sagiroth@reddit
That's how most things work yeah
TheWaffleKingg@reddit
If only windows worked that way
amuhak@reddit
You can just get the binary every now and then
No_Afternoon_4260@reddit
So the enshittification is a complete myth?
suicidaleggroll@reddit
Enshittification is a problem with proprietary/paid software, not open source.
CheatCodesOfLife@reddit
Agreed. You get setbacks occasionally but they're usually bugs / mistakes that get fixed and you can always roll back.
Meanwhile the iPhone blows efuses to prevent rolling back to the last good version of the software.
Glass_Cat_4281@reddit
Yeah, you usually see it when open source software pivots to paid.
No_Afternoon_4260@reddit
True
xpnrt@reddit
What if FA is slower or unusable for me? Does this change anything?
BillDStrong@reddit
No, this is FA only. If you use a fallback, like me on a P40, it won't change anything.
FerLuisxd@reddit
Does ik_llamacpp have something similar?
Sisaroth@reddit
nice, i can fit a few more experts in VRAM
Ok_Needleworker_6431@reddit
Sounds mad! Will need to try this/test it out to make sense of this!
Hot_Turnip_3309@reddit
Hi, is there no explanation? What is this?
Remove_Ayys@reddit
One of the other maintainers here, particularly as it relates to the CUDA backend. Honestly I feel very lucky to have Aman be part of the project.
LegacyRemaster@reddit
sounds awesome
acetaminophenpt@reddit
Thanks!!
anthonyg45157@reddit
Damn, this might just let me pull the image model back to GPU. I offloaded it to CPU to maximize context with MTP
FormalAd7367@reddit
thanks! am17an
ParaboloidalCrest@reddit
So will -fit automatically realize that I can fit more context now?
Remove_Ayys@reddit
Yes, because the memory reduction comes from a smaller compute graph allocation which is one of the buffer sizes used for feedback.
sagiroth@reddit
Think it should pick up yes
soyalemujica@reddit
According to the merge, we can save 1.2GB of VRAM by default now?
goldcakes@reddit
The merged PR is a simple optimisation that removes entirely redundant RAM usage in certain scenarios. (As the title suggests, it only works when FlashAttention is used).
I'm confident the merged PR isn't responsible for your regression, it might be something else in llama.cpp?
soyalemujica@reddit
I have removed my edit from the main comment. It looks like a problem with the latest Fedora 44 Linux kernel 7.0.10 update; I will try some stuff out
No_Algae1753@reddit
Correct me if I'm wrong, but I think this only applies to MTP models, no?
am17an@reddit
No, it's just 2x with MTP; it should help everywhere
goldcakes@reddit
This only works when you have FlashAttention working. It's a cleanup that removes redundant RAM usage with FA.
grumd@reddit
That's amazing, I've been using ub 2048 and 4096 for my moe models, this opens up a lot more context, thanks!
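Editor's note: a rough way to see why the micro-batch size matters so much here. The sketch below assumes the FA mask is a dense matrix with one row per token in the micro-batch (-ub) and one column per context position, stored at 4 bytes per element as f32 or 2 bytes as f16. That layout is an assumption for illustration, not something taken from the PR, and `mask_bytes` and `report` are hypothetical helper names. It will not reproduce the exact savings people report in this thread, which depend on the model and setup.

```python
MIB = 1024 * 1024

def mask_bytes(n_ctx: int, n_ubatch: int, bytes_per_elem: int) -> int:
    """Bytes for one dense mask: n_ubatch rows by n_ctx columns (assumed layout)."""
    return n_ctx * n_ubatch * bytes_per_elem

def report(n_ctx: int, n_ubatch: int) -> tuple[float, float]:
    """Return (f32 size, f16 size) of the assumed mask, both in MiB."""
    f32 = mask_bytes(n_ctx, n_ubatch, 4)  # f32: 4 bytes per element
    f16 = mask_bytes(n_ctx, n_ubatch, 2)  # f16: 2 bytes per element
    return f32 / MIB, f16 / MIB

# Compare a few micro-batch sizes at a 32768-token context.
for n_ubatch in (128, 512, 2048, 4096):
    f32_mib, f16_mib = report(32768, n_ubatch)
    print(f"ub={n_ubatch:5d}  f32={f32_mib:7.1f} MiB  f16={f16_mib:7.1f} MiB")
```

Under these assumptions the mask grows linearly with -ub, so halving the element size frees far more memory at -ub 2048 or 4096 than at -ub 128, which would be consistent with the comments above.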
soyalemujica@reddit
llama.cpp is now actually using more VRAM than before. I can no longer fit 120k context for 3.6 27B on 24GB VRAM at q5_1 / q4_1, and its performance has drastically decreased
Beamsters@reddit
And he just landed ANOTHER 1.2GB-saving follow-up: https://github.com/ggml-org/llama.cpp/pull/23861
sagiroth@reddit
Madlad
SarcasticBaka@reddit
Was pretty excited about the prospect of saving some VRAM but after testing pre and post recompiling llama.cpp, I'm not seeing even a single MB of difference. Literally the exact same as before.
nickm_27@reddit
It depends on what model you use and its vocab size, I believe; at least that's what the referenced issue says.
For me on Gemma4 26B I see 300MB of savings
SarcasticBaka@reddit
Yeah perhaps it's a model issue, I'm using Qwen3.6-27B and the vram usage figures are exactly the same so it seems to have no effect.
cibernox@reddit
That guy is giving us 25k more context!
Shoddy_Bed3240@reddit
Nice fix — I’m seeing about a 5% boost in decode speed on large MoE models because of it.
redblood252@reddit
I just tried iq3 qwen3.6 27b mtp yesterday and it didn’t fit in the vram while non mtp did. This might make it work !
Pentium95@reddit
Woah! am17an is on fire!
Kahvana@reddit
Very nice!