TheaterFire

2024 was the year GGUF took off

Posted by cfahlgren1@reddit | LocalLLaMA | View on Reddit | 80 comments

https://preview.redd.it/knuzft9fptae1.png?width=1642&format=png&auto=webp&s=0e8597805e26e394d2d071e7f003830ba4b33e5c

80 Comments

the320x200@reddit

How are you guys mentally pronouncing GGUF? Are we going with "gee gee you eff" or "guh-guff"?
View on Reddit #44681772

BuriqKalipun@reddit

goo goo ga ga
View on Reddit #67206577

Hertigan@reddit

Gee-guff
View on Reddit #45191137

TheEverchooser@reddit

I just pronounce it the way you pronounce gif except with 'uf' instead of 'if' and the g part twice. Simple really.
View on Reddit #44685048

YourEvilTwine@reddit

There is no g in gif 😅
View on Reddit #44868474

Wide_Egg_5814@reddit

Gaguff
View on Reddit #44805626

RiotNrrd2001@reddit

I tend to mentally pronounce it as "jee guff".
View on Reddit #44682107

ArsNeph@reddit

I always thought this was the default. Now I'm imagining a bunch of localllama members going around and telling their friends about the guh-guff or goof that just released 😂😂
View on Reddit #44701871

Silent_Video9490@reddit

Bold of you to assume LocalLlama members in real life. And no, chatbots don't count as real friends guys.
View on Reddit #44781868

mo_fig_devOps@reddit

Me too
View on Reddit #44774207

Brahvim@reddit

Me too. And I find it cute \^-\^.
View on Reddit #44713381

Harvard_Med_USMLE267@reddit

You and me both, this is the way!
View on Reddit #44700874

lxe@reddit

Georgi Gerganov’s Ultimate File
View on Reddit #44781079

milo-75@reddit

“Jif”
View on Reddit #44779278

felipemarinho@reddit

I say gee-guff
View on Reddit #44773871

Harvard_Med_USMLE267@reddit

Gee-Guf
View on Reddit #44700837

CttCJim@reddit

Like jpeg. We have standards.
View on Reddit #44762040

Harvard_Med_USMLE267@reddit

There’s probably some weirdo here pronouncing that Jay pee ay gee, or juh-peg. Not sure what is wrong with some of these people; as a medical man I am concerned, though.
View on Reddit #44766627

SoundHole@reddit

"Guff". I just say guff. Go ahead and correct me during an irl conversation, see what happens.
View on Reddit #44707198

RadiantHueOfBeige@reddit

As if people like us have IRL conversations.
View on Reddit #44762148

Smeetilus@reddit

… you've got to ask yourself one question: Do I feel lucky?
View on Reddit #44754978

CttCJim@reddit

Gee guff
View on Reddit #44761992

SelfPromotionLC@reddit

Guh-guff. Not enough hours in the day to spell it out every time.
View on Reddit #44751949

AppearanceHeavy6724@reddit

native speaker of russian. geh-guf. 'h' silent, 'u' like in "put".
View on Reddit #44744432

NEXUSX@reddit

Good Guy UF
View on Reddit #44682065

LiteSoul@reddit

Good guy oof
View on Reddit #44741338

Big-Ad1693@reddit

Geegee you eff, yeah
View on Reddit #44737560

merotatox@reddit

Even localllama got their own "GIF" or "JIF" situation
View on Reddit #44736077

Threatening-Silence-@reddit

Guff
View on Reddit #44732477

tmflynnt@reddit

Guys, I hate to burst everybody's bubble, but the only valid way to pronounce it is in its native Bulgarian, just like the first name of its Bulgarian creator Georgi (JYORG-ee) Gerganov. Soft G then a hard G, ergo "Jyuh-GUFF". Case closed. Period. Full stop. You're all wrong. Glad I could clarify that for everybody. (But uh, yeah I just say the letters G-G-U-F.)
View on Reddit #44728143

rtlingo@reddit

I don't.
View on Reddit #44696246

Brahvim@reddit

*"Gee-gee-yuu-eff"?*
View on Reddit #44713446

rtlingo@reddit

If I were to talk about it out loud to someone I would say that, but when it comes to reading I just glaze over it and not say it in my head.
View on Reddit #44719622

aitookmyj0b@reddit

I've never said GGUF out loud, probably never will
View on Reddit #44718453

Morphix_879@reddit

I pronounce as "gee-guff"
View on Reddit #44713588

cms2307@reddit

Jee juf
View on Reddit #44706337

OrangeESP32x99@reddit

I just say the letters. That ain’t even close to a pronounceable word. If someone came up to me and said “You see that new guh-guff?” I’d call them an ambulance cause they’re having a stroke lol
View on Reddit #44686572

EPICWAFFLETAMER@reddit

You are mentally pronouncing GGUF like gee-gee-you-eff? Nah guh-guff all the way.
View on Reddit #44696702

JawGBoi@reddit

gih-guff for me
View on Reddit #44691466

Pro-editor-1105@reddit

gee guff
View on Reddit #44691263

ttkciar@reddit

I thought of it as "guff" for a long time, and then a few months ago a friend called it "guh-guff" and now I can't get that out of my head.
View on Reddit #44689108

SocialDinamo@reddit

An example of only seeing it written down lol, I spell it out every time with "gee gee you eff"
View on Reddit #44685659

fallingdowndizzyvr@reddit

Goof.
View on Reddit #44682889

kyuubi840@reddit

go-goof
View on Reddit #44682470

rbgo404@reddit

My go-to quantized version is 8-bit GGUF. Good TPS, and with the easy Hugging Face integration it’s so easy to use. I usually follow this template: https://docs.inferless.com/how-to-guides/deploy-a-Llama-3.1-8B-Instruct-GGUF-using-inferless
View on Reddit #51812021

-Adanedhel-@reddit

You mean as a replacement for GGML? Didn't everybody make the switch when llama.cpp did?
View on Reddit #44681316

SoundHole@reddit

I used to use exl2, but the speed of the gguf format has improved so markedly, I use them now. The small hit in speed is worth the ability to load larger models, imo.
View on Reddit #44707423

CheatCodesOfLife@reddit

You mean because you can offload layers to CPU? Otherwise, I think exl2 lets you squeeze more into nvidia rigs.
View on Reddit #44735976

SoundHole@reddit

Yes, offloading layers onto the CPU allows me to load models that are at least twice the size of the exl2 models I can fit on my Nvidia. The speed cost used to be too steep so I avoided ggufs, but, like I said, they've really improved in speed so I've switched.
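A rough sketch of the arithmetic behind that trade-off, assuming (as a simplification) that every layer takes an equal share of the file and that a fixed amount of VRAM is held back for the KV cache and scratch buffers. The function name and the reserve figure are illustrative, not taken from llama.cpp:

```python
def layers_on_gpu(model_gib: float, n_layers: int, vram_gib: float,
                  reserve_gib: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM; the rest would run on the CPU.

    Assumes all layers are the same size (not true in practice: embeddings
    and the output head differ) and that reserve_gib covers the KV cache.
    """
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical example: a 40 GiB quant with 80 layers on a 24 GiB card.
gpu_layers = layers_on_gpu(40.0, 80, 24.0, reserve_gib=2.0)
print(f"{gpu_layers} layers on GPU, {80 - gpu_layers} on CPU")
```

The result is the kind of number you would pass to llama.cpp's `--n-gpu-layers` option and then tune by trial, since real memory use depends on context length and backend.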
View on Reddit #44846254

fallingdowndizzyvr@reddit

Llama.cpp is not the only thing that uses GGUF now. Image/Video gen models are also GGUF'd now. There was never GGML for them.
View on Reddit #44682799

ttkciar@reddit

> It's become a standard model for AI in general.

I am really, really glad it did. It's a well-thought-out container format. I appreciate having the metadata embedded inseparably with the data, and there are some good conventions (which should be better standardized) for what metadata to include in a model, and for handling of segmented models.
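To make the "container format" point concrete, here is a minimal sketch of reading the fixed-size header that starts a GGUF file (version 2 and later): a 4-byte magic, a uint32 version, then uint64 counts of tensors and metadata key/value pairs, all little-endian. The function name and the sample counts are made up for illustration, and the demo parses a synthetic header rather than a real model file:

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata KV count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}


# Synthetic header for demonstration; the counts are arbitrary.
demo = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = read_gguf_header(demo)
print(header)
```

The metadata key/value pairs and tensor descriptions follow this header in the same file, which is what keeps the metadata inseparable from the weights.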
View on Reddit #44689404

Awkward-Economy7936@reddit

ok embarrassing - I use the llama.cpp and transformers libraries. I convert to gguf (of course) and only use the models downloaded directly from meta. what is the baseline meta llama format (e.g. ggml) and is it fp32 or fp16 at baseline?
View on Reddit #44785255

Independent_Try_6891@reddit

2023 was the year GGUF took off
View on Reddit #44722839

CheatCodesOfLife@reddit

Agreed. LLMs in general became more popular in 2024. GGUF was popular when TheBloke was making all the quants.
View on Reddit #44781682

cfahlgren1@reddit (OP)

Really cool to see GGUF grow and thrive in 2024! You can see the results above here: [https://huggingface.co/datasets/cfahlgren1/hub-stats/embed/sql-console/YpoTCDR](https://huggingface.co/datasets/cfahlgren1/hub-stats/embed/sql-console/YpoTCDR)
View on Reddit #44677151

altomek@reddit

Cool why? GGUF is one file, which is easy to manage, and it is also multi-platform; those are some good features. I started using GGUF files lately as I noticed they have improved a lot, both in inference generation time and "feel". By feel I mean that an ExLlamaV2 safetensors model had a totally different personality than the GGUF-quantized one, and I did not like how GGUF models worked. Might be personal preference. Now, as I mentioned, GGUF quants feel better, mostly the same as ExLlama ones. However... there are "some" problems with GGUF and its ecosystem. There is the safetensors format, and "safe" is there for a reason. There were attacks that used implanted pickle-format files, and it was also possible to embed exploits into GGUF files. GGUF will happily pack a pickle-format file with an exploit. Additionally, llama.cpp is one big blob of spaghetti code with everything packed inside so that it requires fewer dependencies to build and run, and... it is a core component of the majority of inference engines... it is just a huge exploit vector. So if you like GGUF, make sure to learn how to make those damn GGUF files yourself, or have a trusted quanter... Remember to check: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=GGUF from time to time. Verify your inference engine uses the latest llama.cpp code... good luck ;P
View on Reddit #44695679

daHaus@reddit

> It is also possible to make exploits embedded into GGUF files.

How so? You're right about the output quality and unreadability of the code, mathematical correctness as a concept seems to be below par for what's an exercise in computer science. This isn't specific to just that project, however. The entire AI/ML ecosystem lends itself to be dominated by academics with their own priorities who don't know any better and shortcuts by Nvidia and other OEMs only make matters worse.
View on Reddit #44724263

altomek@reddit

> > It is also possible to make exploits embedded into GGUF files.
>
> How so?

You have to dig into CVE-2024-23605, CVE-2024-23496, CVE-2024-21836, CVE-2024-21825, CVE-2024-21802 for details. All of those CVEs mention a specially crafted .gguf file that can lead to code execution.
View on Reddit #44729307

daHaus@reddit

There's also CVE-2024-41130, but aren't all of these resolved? [https://github.com/ggerganov/llama.cpp/security](https://github.com/ggerganov/llama.cpp/security)

> GGUF will happily pack pickle format files with an exploit.

I was curious what was meant by this point specifically. If you know it has issues that need fixing, that information would be useful. The code that parses pickle data is straightforward. The backends, especially for CUDA, are spaghetti code, sure. That's courtesy of the ecosystem and the coding conventions that are recommended, which happen to be similar to what Intel was doing with their compiler and that earned them an antitrust suit.
View on Reddit #44733527

Eisenstein@reddit

I think the problem they are pointing out is that ggufs are not safe because they rely on the parser implementation for safety, unlike safetensors, which are safe by design restrictions in the format itself. If you manage to pack an exploit somewhere in the gguf and trigger a buffer overflow in the parser with malformed data, you can execute code with a gguf. You can't do that with a safetensor.
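As an illustration of what "relying on the parser" means, here is a minimal sketch of defensively reading a GGUF-style string (a little-endian uint64 length followed by that many bytes). The size cap and function name are hypothetical, and Python itself cannot overflow a buffer the way C or C++ can; the point is the check a native parser has to perform before trusting a length field taken from the file:

```python
import struct

MAX_STRING_BYTES = 1 << 20  # hypothetical sanity cap, not from the GGUF spec


def read_string(buf: bytes, offset: int) -> tuple[str, int]:
    """Read a length-prefixed string and return (text, next_offset).

    Rejects length fields that point past the end of the buffer instead of
    trusting whatever the file claims.
    """
    if offset < 0 or offset + 8 > len(buf):
        raise ValueError("truncated length field")
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    if length > MAX_STRING_BYTES or length > len(buf) - start:
        raise ValueError("declared string length exceeds available data")
    end = start + length
    return buf[start:end].decode("utf-8"), end


good = struct.pack("<Q", 5) + b"llama"
text, next_offset = read_string(good, 0)
print(text, next_offset)

# A malformed field claiming 2**40 bytes is rejected rather than read.
bad = struct.pack("<Q", 2**40) + b"llama"
try:
    read_string(bad, 0)
except ValueError as err:
    print("rejected:", err)
```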
View on Reddit #44754558

searstream@reddit

Just wish it was as supported with vllm and such. Love GGUF but it's slower than other quants on almost every platform I've tested.
View on Reddit #44709848

Calcidiol@reddit

I wish everything wasn't such a "Tower of Babel" with so many independent incompatible model file formats, and then within any given model file format there's a possible set of a few model encodings / quantizations which might be used. But almost every inference SW accepts only one or a few model file formats to the exclusion of all the others. And almost every inference SW accepts only one or a few options for encodings / quantizations it will work with in some use case to the exclusion of all others.

So what file format someone publishes a model in almost 100% dictates available quantizations, and the combination of those almost 100% dictates which of 2-3 inference SW tools you can use and also which GPUs / accelerator hardware and software frameworks you can use. And most of the model file formats are not capable of preserving (or, I suppose, intended to preserve) the full fidelity of the originally published model, e.g. any available auxiliary data / metadata etc. And you cannot in practice easily convert from model file format X to model file format Y, and even worse is trying to convert quantization X to quantization Y, even if the target one is roughly equivalent to or more coarsely quantized than the origin. So, Tower of Babel.

It seems that, as an overall ecosystem that is cross OS platform, cross accelerator type, cross inference SW type, we could've standardized some kind of "container format" which at least can preserve metadata and data with fidelity about the original model and then has the flexibility to contain data resources encoded with various possible schemes to represent the quantized model, or metadata to represent a suggested / desired quantization that could be made dynamically (install time, run time, ...) from the superior / upstream model weights to enable their use in some way.

GGUF has its positive attributes of being minimalistic for what it is designed to do -- support basically one type of inference SW library, but it's a leaf node which is de facto likely going to stay incompatible with basically all other inference SW ecosystems -- onnx, HF / safetensors, et al. Larger file sizes of original upstream models deter many from just downloading them, though that is the superior choice in that it's not totally uncommon for GGUF options / conversions to need or want changing over time -- maybe you upgrade your GPU and you'd rather run a Q8 than your previous Q4 -- maybe there was an upstream model metadata / tokenizer fix and now you need a new GGUF that reflects that. So you download the new GGUF 1x, 2x, 3x, whatever, over the time you want to use it. But if it was easier (or at least more commonly done) to convert from the original model to a quantization locally or even at run time, then one would likely be able to keep more choice and currency of proper conversions and options over time for the price of having to download the true full model once.
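To put rough numbers on the Q8-versus-Q4 choice, here is a back-of-the-envelope sketch. It assumes the basic ggml block layouts, where Q8_0 and Q4_0 each store 32 weights plus one 2-byte fp16 scale per block; real GGUF files mix tensor types and carry metadata, so treat the output as an estimate only. The helper names and the 8B parameter count are illustrative:

```python
BLOCK_WEIGHTS = 32  # weights per quantization block

# Bytes per block: one fp16 scale (2 bytes) plus the packed weights.
BYTES_PER_BLOCK = {
    "Q4_0": 2 + BLOCK_WEIGHTS // 2,  # 4 bits per weight -> 16 bytes
    "Q8_0": 2 + BLOCK_WEIGHTS,       # 8 bits per weight -> 32 bytes
}


def bits_per_weight(quant: str) -> float:
    """Effective bits per weight, including the per-block scale."""
    return BYTES_PER_BLOCK[quant] * 8 / BLOCK_WEIGHTS


def approx_size_gb(n_params: float, quant: str) -> float:
    """Approximate weight storage in GB (10**9 bytes) for a uniform quant."""
    return n_params * bits_per_weight(quant) / 8 / 1e9


for quant in ("Q4_0", "Q8_0"):
    print(f"{quant}: {bits_per_weight(quant)} bpw, "
          f"~{approx_size_gb(8e9, quant):.1f} GB for 8B parameters")
```

Under these assumptions an fp16 original of the same model is about 16 GB, so keeping the full weights costs roughly as much as two or three separate quant downloads.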
View on Reddit #44723231

CheatCodesOfLife@reddit

I feel like this would be easier if VLLM supported exl2. Then (for quants), you'd have GGUF for cpu/cpu+gpu, and exl2 for GPU.
View on Reddit #44735860

Calcidiol@reddit

Yes. Agreed. I'm sure a lot of things coming down from the model makers aren't designed to be optimum for the "gpu poor" (98% of consumers / hobbyists I guess) but it's almost a non-starter for many to not have a good option that allows splitting between cpu+gpu(s) and ideally would nicely support distributed inference on top of that. So yeah, having a runtime inference framework that gave at least good gpu and then gpu+cpu options would be nice.

I've gravitated to gguf simply because of the cpu+gpu and nascent distributed support, though more options would be nice. I'm warming to hf transformers / diffusers because in many ways they have better stability / flexibility / documentation and better day 1 support from model makers, but it seems like the available quantizations vs. heterogeneous accelerator / cpu types support options may be limited (bitsandbytes, ...). I'll have to play with the distributed inference capacities to get a better feel for how well that could work.

It feels like a big missed opportunity that there's not more use of semantic encoding of the model inference itself and then just generating / optimizing code to infer a model for "whatever" accelerator and runtime ecosystem there may be. pytorch (AFAIK) seems to be most prominently encouraging / supporting / utilizing such capabilities in this age of things like IR / SPIR / WASM / LLVM et al. to define WHAT needs to be done in a high level language / framework and then have various code generation / optimization backend stages present to support targeting & optimizing the "inference program" for whatever platform you want to use. I'm sure there must be some NIH involved wrt. projects that just want to implement every single thing from scratch in their own way, and other cases where people may not understand or be intimidated by the pytorch ecosystem (it is complex but also "industrial strength" in capability / flexibility).

But we still see all these Tower of Babel things where a given inference SW doesn't support, for inference or quantization, multiple possible accelerators / runtime frameworks, e.g. vulkan, spir, ptx / cuda, rocm, cpu targeting & tuning from IR, etc., which is just silly since we've got LLVM and to some extent GCC and other frameworks / tools that can pretty much target anything relatively efficiently at runtime given a specification of what to do as input.
View on Reddit #44738474

stddealer@reddit

https://xkcd.com/927/
View on Reddit #44734622

Weird-Field6128@reddit

G G U F
View on Reddit #44737677

madaradess007@reddit

goo-goof for me. It's especially funny for a Russian, since we have a local rapper 'Goof' who faked his death, and his name became such a strong meme that dying is sometimes called 'to get goofed'. Sorry if you didn't need that info :D
View on Reddit #44719671

MrPrevedmedved@reddit

I'm glad I'm not the only one who thinks like that
View on Reddit #44737213

CheatCodesOfLife@reddit

I guess I'm the only one who says "Gee Juff"
View on Reddit #44736031

shepbryan@reddit

https://preview.redd.it/bu0p3lub3wae1.jpeg?width=1320&format=pjpg&auto=webp&s=88a8c1ec1e7e5a2568a2fe9b11de0b6897e70d98
View on Reddit #44711081

devsanbid@reddit

Which is the best llm model for mcq solving type
View on Reddit #44708903

skinnyjoints@reddit

Can anyone explain how GGUF works yet?
View on Reddit #44691081

uti24@reddit

Sorry, we haven't figured it out yet; our best research team is working on it, though.
View on Reddit #44706272

JawGBoi@reddit

Look at this google trends graph [https://trends.google.com/trends/explore?date=today%205-y&q=gguf,ggml](https://trends.google.com/trends/explore?date=today%205-y&q=gguf,ggml)
View on Reddit #44691365

MedicalScore3474@reddit

That region breakdown...
View on Reddit #44693245

FOE-tan@reddit

Japan is probably a combination of their complex language being less supported + the popularity of image generation models for generating anime stuff (I get that Stable Diffusion/FLUX can be GGUF'd too, but it's probably not seen as the primary way to run a model locally over there in the same way it is for LLMs). For China, they also like their anime art (just look at all the Chinese gacha games around these days), along with other China-specific reasons why some text models are less available. In this case, South Korea is an outlier, but they have RisuAI, which is a SillyTavern-like frontend (which is both online and local) that has a large Korean userbase, which probably makes character cards and RP more appealing to them than AI image generation, perhaps.
View on Reddit #44702122

teachersecret@reddit

1.4 billion people living there, and many of them seem very interested in this tech.
View on Reddit #44701538

ttkciar@reddit

So .. should an "Intro to Local LLM" tutorial talk about more than the "Big Three" model container formats (pytorch, safetensors, GGUF) or are the others "also-rans" which can be ignored? When I wrote an intro document for my employer a year ago, it explained those three plus GPTQ, AWQ, and GGML, but obviously that's stale information now.
View on Reddit #44690154

MoffKalast@reddit

When GGUF?   Now.
View on Reddit #44688385

LinkSea8324@reddit

I mean yeah it was specified mid 2023, no shit it needed a whole year https://github.com/ggerganov/ggml/pull/302
View on Reddit #44680766