2024 was the year GGUF took off

Posted by cfahlgren1@reddit | LocalLLaMA | View on Reddit | 80 comments

https://preview.redd.it/knuzft9fptae1.png?width=1642&format=png&auto=webp&s=0e8597805e26e394d2d071e7f003830ba4b33e5c

[-]

the320x200@reddit

How are you guys mentally pronouncing GGUF? Are we going with "gee gee you eff" or "guh-guff"?

[-]

TheEverchooser@reddit

I just pronounce it the way you pronounce gif except with 'uf' instead of 'if' and the g part twice. Simple really.

[-]

RiotNrrd2001@reddit

I tend to mentally pronounce it as "jee guff".

[-]

ArsNeph@reddit

I always thought this was the default. Now I'm imagining a bunch of localllama members going around and telling their friends about the guh-guff or goof that just released 😂😂

[-]

Silent_Video9490@reddit

Bold of you to assume LocalLlama members in real life. And no, chatbots don't count as real friends guys.

[-]

Harvard_Med_USMLE267@reddit

There’s probably some weirdo here pronouncing that Jay pee ay gee, or juh-peg. Not sure what is wrong with some of these people; as a medical man I am concerned, though.

[-]

SoundHole@reddit

"Guff". I just say guff. Go ahead and correct me during an irl conversation, see what happens.

[-]

RadiantHueOfBeige@reddit

As if people like us have IRL conversations.

[-]

Smeetilus@reddit

… you've got to ask yourself one question: Do I feel lucky?

[-]

SelfPromotionLC@reddit

Guh-guff. Not enough hours in the day to spell it out every time.

[-]

AppearanceHeavy6724@reddit

native speaker of russian. geh-guf. 'h' silent, 'u' like in "put".

[-]

merotatox@reddit

Even localllama got their own "GIF" or "JIF" situation

[-]

Guys I hate to burst everybody's bubble but the only valid way to pronounce it is in its native Bulgarian, just like the first name of its Bulgarian creator Georgi (JYORG-ee) Gerganov. Soft G then a hard G, ergo "Jyuh-GUFF". Case closed. Period. Full stop. You're all wrong. Glad I could clarify that for everybody. (But uh, yeah I just say the letters G-G-U-F.)

[-]

rtlingo@reddit

I don't.

[-]

Brahvim@reddit

*"Gee-gee-yuu-eff"?*

[-]

rtlingo@reddit

If I were to talk about it out loud to someone I would say that, but when it comes to reading I just glaze over it and not say it in my head.

[-]

aitookmyj0b@reddit

I've never said GGUF out loud, probably never will

[-]

Morphix_879@reddit

I pronounce as "gee-guff"

[-]

cms2307@reddit

Jee juf

[-]

OrangeESP32x99@reddit

I just say the letters. That ain’t even close to a pronounceable word. If someone came up to me and said “You see that new guh-guff?” I’d call them an ambulance cause they’re having a stroke lol

[-]

EPICWAFFLETAMER@reddit

You are mentally pronouncing GGUF like gee-gee-you-eff? Nah guh-guff all the way.

[-]

JawGBoi@reddit

gih-guff for me

[-]

Pro-editor-1105@reddit

gee guff

[-]

ttkciar@reddit

I thought of it as "guff" for a long time, and then a few months ago a friend called it "guh-guff" and now I can't get that out of my head.

[-]

SocialDinamo@reddit

An example of only seeing it written down lol, I spell it out every time with "gee gee you eff"

[-]

fallingdowndizzyvr@reddit

Goof.

[-]

kyuubi840@reddit

go-goof

[-]

rbgo404@reddit

My go-to is the 8-bit GGUF quant. Good TPS, and now with the easy Hugging Face integration it’s so easy to use. I usually follow this template: https://docs.inferless.com/how-to-guides/deploy-a-Llama-3.1-8B-Instruct-GGUF-using-inferless

[-]

-Adanedhel-@reddit

You mean as a replacement for GGML? Didn't everybody make the switch when llama.cpp did?

[-]

SoundHole@reddit

I used to use exl2, but the speed of the gguf format has improved so markedly, I use them now. The small hit in speed is worth the ability to load larger models, imo.

[-]

CheatCodesOfLife@reddit

You mean because you can offload layers to CPU? Otherwise, I think exl2 lets you squeeze more into nvidia rigs.

[-]

SoundHole@reddit

Yes, offloading layers onto the CPU allows me to load models that are at least twice the size of the exl2 models I can fit on my Nvidia. The speed cost used to be too steep so I avoided ggufs, but, like I said, they've really improved in speed so I've switched.
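To make the offloading trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes every layer is the same size and ignores everything except a flat reserve for the KV cache and buffers, so treat the result as a starting guess for llama.cpp's `-ngl` / `--n-gpu-layers` option, not a measurement. The function name and the example sizes are made up for illustration.

```python
def layers_on_gpu(model_bytes: int, n_layers: int, vram_bytes: int,
                  reserve_bytes: int = 0) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    per_layer = model_bytes / n_layers           # average bytes per layer
    usable = max(vram_bytes - reserve_bytes, 0)  # VRAM left after the reserve
    return min(n_layers, int(usable // per_layer))


GIB = 1024 ** 3

# Hypothetical numbers: a 40 GiB quantized model with 80 layers,
# a 24 GiB GPU, and 4 GiB held back for KV cache and scratch buffers.
print(layers_on_gpu(40 * GIB, 80, 24 * GIB, 4 * GIB))  # prints 40
```

Whatever does not fit stays on the CPU, which is where the speed cost comes from.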

[-]

fallingdowndizzyvr@reddit

Llama.cpp is not the only thing that uses GGUF now. Image/Video gen models are also GGUF'd now. There was never GGML for them.

[-]

ttkciar@reddit

> It's become a standard model for AI in general. I am really, really glad it did. It's a well-thought-out container format. I appreciate having the metadata embedded inseparably with the data, and there are some good conventions (which should be better standardized) for what metadata to include in a model, and for handling of segmented models.
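For anyone curious what "metadata embedded with the data" looks like on disk, below is a minimal sketch that reads only the fixed-size GGUF header. It assumes the version 2 or later layout (little-endian: 4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata key/value counts). The function name is made up, and this is not how llama.cpp itself does it.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor_count (uint64), metadata_kv_count (uint64)
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the start of `data`."""
    if len(data) < HEADER.size:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    if version < 2:
        # Version 1 used 32-bit counts, so this layout would misread it.
        raise ValueError("this sketch only handles GGUF version 2 or later")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header so the example runs without a model file.
fake = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
print(read_gguf_header(fake))
```

With a real file you would pass the first 24 bytes, e.g. `open(path, "rb").read(HEADER.size)`. The metadata key/value pairs and tensor descriptions follow right after this header in the same file.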

[-]

Awkward-Economy7936@reddit

ok embarrassing - I use the llama.cpp and transformers libraries. I convert to gguf (of course) and only use the models downloaded directly from meta. what is the baseline meta llama format (e.g. ggml) and is it fp32 or fp16 at baseline?

[-]

Independent_Try_6891@reddit

2023 was the year GGUF took off

[-]

CheatCodesOfLife@reddit

Agreed. LLMs in general became more popular in 2024. GGUF was popular when TheBloke was making all the quants.

[-]

cfahlgren1@reddit (OP)

Really cool to see GGUF really grow and thrive in 2024! You can view the results above here: [https://huggingface.co/datasets/cfahlgren1/hub-stats/embed/sql-console/YpoTCDR](https://huggingface.co/datasets/cfahlgren1/hub-stats/embed/sql-console/YpoTCDR)

[-]

altomek@reddit

Cool why? GGUF is one file, which is easy to manage, and it is also multi-platform; those are some good features. I started using GGUF files lately as I noticed they have improved a lot in both inference speed and "feel". By feel I mean that an ExLlamaV2 safetensors model had a totally different personality than the GGUF-quantized one, and I did not like how GGUF models worked. Might be personal preference. Now, as I mentioned, GGUF quants feel better, mostly the same as ExLlama ones. However... there are "some" problems with GGUF and its ecosystem. There is the safetensors format, and "safe" is there for a reason. There were attacks that used implanted pickle format files, and it was also possible to make exploits embedded into GGUF files. GGUF will happily pack pickle format files with an exploit. Additionally, llama.cpp is one big blob of spaghetti code with everything packed inside so that it requires fewer dependencies to build and run, and... it is a core component of the majority of inference engines... it is just a huge exploit vector. So if you like GGUF, make sure to learn how to make those damn GGUF files yourself or have a trusted quanter... Remember to check: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=GGUF from time to time. Verify your inference engine uses the latest llama.cpp code... good luck ;P

[-]

daHaus@reddit

>It is also possible to make exploits embedded into GGUF files. How so? You're right about the output quality and unreadability of the code, mathematical correctness as a concept seems to be below par for what's an exercise in computer science. This isn't specific to just that project, however. The entire AI/ML ecosystem lends itself to be dominated by academics with their own priorities who don't know any better and shortcuts by Nvidia and other OEMs only make matters worse.

[-]

altomek@reddit

>> It is also possible to make exploits embedded into GGUF files. > How so? You have to dig into CVE-2024-23605, CVE-2024-23496, CVE-2024-21836, CVE-2024-21825, CVE-2024-21802 for details. All of those CVEs mention a specially crafted .gguf file that can lead to code execution.

[-]

daHaus@reddit

There's also CVE-2024-41130 but aren't all of these resolved? [https://github.com/ggerganov/llama.cpp/security](https://github.com/ggerganov/llama.cpp/security) >GGUF will happily pack pickle format files with an exploit. I was curious what was meant about this point specifically. If you know it has issues that need fixing, that information would be useful. The code that parses pickle data is straightforward. The backends, especially for cuda, are spaghetti code, sure. That's courtesy of the ecosystem and the coding conventions that are recommended, which happen to be similar to what Intel was doing with their compiler and that earned them an anti-trust suit.

[-]

Eisenstein@reddit

I think the problem they are pointing out is that ggufs are not safe because they rely on the parser implementation for safety, unlike safetensors, which are safe by design restrictions in the format itself. If you manage to pack an exploit somewhere in the gguf and trigger a buffer overflow in the parser with malformed data, you can execute code with a gguf. You can't do that with a safetensor.
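As an illustration of "relying on the parser", here is a sketch of the kind of bounds check a reader has to do for every length-prefixed field. It assumes the GGUF string encoding (a little-endian uint64 length followed by that many UTF-8 bytes). The function name and the size cap are invented for this example, and it is not taken from llama.cpp. Python would raise instead of overflowing anyway; the checks stand in for what a C/C++ parser must do explicitly.

```python
import struct
from typing import Tuple

MAX_STRING_BYTES = 1 << 20  # arbitrary sanity cap chosen for this sketch


def read_gguf_string(buf: bytes, offset: int) -> Tuple[str, int]:
    """Read one length-prefixed string; return (text, offset after the string)."""
    if offset < 0 or offset + 8 > len(buf):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    # Never trust the declared length: check it against the cap and
    # against the bytes that are actually present.
    if length > MAX_STRING_BYTES or length > len(buf) - start:
        raise ValueError("declared string length exceeds available data")
    end = start + length
    return buf[start:end].decode("utf-8"), end


good = struct.pack("<Q", 5) + b"llama"
print(read_gguf_string(good, 0))

# A crafted field that claims a terabyte of data but carries one byte.
bad = struct.pack("<Q", 2 ** 40) + b"x"
try:
    read_gguf_string(bad, 0)
except ValueError as err:
    print("rejected:", err)
```

Miss one of these checks in a native parser and the declared length drives a read or copy past the buffer, which is the class of bug the CVEs above describe.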

[-]

searstream@reddit

Just wish it was as supported with vllm and such. Love GGUF but it's slower than other quants on almost every platform I've tested.

[-]

Calcidiol@reddit

I wish everything wasn't such a "tower of babylon" with so many independent incompatible model file formats, and then within any given model file format there's a possible set of a few model encodings / quantizations which might be used. But almost every inference SW accepts only one or a few model file formats to the exclusion of all the others. And almost every inference SW accepts only one or a few options for encodings / quantizations it will work with in some use case to the exclusion of all others. So what file format someone publishes a model in almost 100% dictates available quantizations and the combination of those almost 100% dictate which of 2-3 inference SW tools you can use and also which GPUs / accelerator hardware and software frameworks you can use. And most of the model file formats are not capable of (or I suppose intended to) preserve the full fidelity of the originally published model. e.g. any available auxiliary data / metadata etc. And you cannot in practice easily convert from model file format X to model file format Y and even worse is trying to convert quantization X to quantization Y even if the target one is roughly equivalent as or more coarsely quantized than the origin. So, tower of babylon. It seems as an overall ecosystem that is cross OS platform, cross accelerator type, cross inference SW type we could've standardized some kind of "container format" which at least can preserve metadata and data with fidelity about the original model and then has the flexibility to contain data resources encoded with various possible schemes to represent the quantized model, or metadata to represent a suggested / desired quantization that could be made dynamically (install time, run time, ...) from the superior / upstream model weights to enable their use in some way. 
GGUF has its positive attributes of being minimalistic for what it is designed to do -- support basically one type of inference SW library, but it's a leaf node which is de facto likely going to stay incompatible with basically all other inference SW ecosystems -- onnx, HF / safetensors, et. al. Larger file sizes of original upstream models deter many from just downloading them, though that is the superior choice in that it's not totally uncommon for GGUF options / conversions to require or desire to be changed over time -- maybe you upgrade your GPU and you'd rather run a Q8 than your previous Q4 -- maybe there was an upstream model metadata / tokenizer fix and now you need a new GGUF that reflects that. So you download the new GGUF 1x, 2x, 3x whatever over the time you want to use it. But if it was easier (or at least more commonly done) to convert from the original model to a quantization locally or even at run time then one would likely be able to keep more choice and currency of proper conversions and options over time for the price of having to download the true full model once.
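A small sketch of why going back to the original weights matters: block-wise 8-bit quantization is a lossy round trip, so every re-quantization of an already quantized file stacks error on top of error. The scheme below is a simplified stand-in loosely modeled on 8-bit block quantization with one scale per block of 32 values; it keeps the scale as a Python float and is not the actual ggml code.

```python
BLOCK_SIZE = 32


def quantize_q8_block(values):
    """Quantize one block to signed 8-bit integers sharing a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Clamp in case floating-point division lands a hair outside the range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the integers and the block scale."""
    return [scale * x for x in q]


# Made-up weights for one block.
block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
scale, q = quantize_q8_block(block)
restored = dequantize_block(scale, q)

max_err = max(abs(a - b) for a, b in zip(block, restored))
print("scale:", scale)
print("max round-trip error:", max_err)
```

The per-value error is bounded by half the block scale, which is small at 8 bits but grows as the bit width shrinks, so a Q4 made from the full-precision weights is generally preferable to one made from a Q8.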

[-]

CheatCodesOfLife@reddit

I feel like this would be easier if VLLM supported exl2. Then (for quants), you'd have GGUF for cpu/cpu+gpu, and exl2 for GPU.

[-]

Calcidiol@reddit

Yes. Agreed. I'm sure a lot of things coming down from the model makers aren't designed to be optimum for the "gpu poor" (98% of consumers / hobbyists I guess) but it's almost a non-starter for many to not have a good option that allows splitting between cpu+gpu(s) and ideally would nicely support distributed inference on top of that. So yeah having a runtime inference framework that gave at least good gpu and then gpu+cpu options would be nice. I've gravitated to gguf simply because of the cpu+gpu and nascent distributed support though more options would be nice. I'm warming to hf transformers / diffusers because in many ways they have better stability / flexibility / documentation and better day 1 support from model makers but it seems like the available quantizations vs. heterogeneous accelerator / cpu types support options may be limited (bitsandbytes, ...). I'll have to play with the distributed inference capacities to get a better feel for how well that could work. It feels like a big missed opportunity that there's not more use of semantic encoding of the model inference itself and then just generating / optimizing code to infer a model for "whatever" accelerator and runtime ecosystem there may be. pytorch (AFAIK) seems to be most prominently encouraging / supporting / utilizing such capabilities in this age of things like IR / SPIR / WASM / LLVM et. al. to define WHAT needs to be done in a high level language / framework and then have various code generation / optimization backend stages present to support targeting & optimizing the "inference program" for whatever platform you want to use. I'm sure there must be some NIH involved wrt. projects that just want to implement every single thing from scratch in their own way, and other cases where people may not understand or be intimidated by the pytorch ecosystem (it is complex but also "industrial strength" in capability / flexibility). 
But we still see all these tower of babylon things where a given inference SW doesn't support for inference or quantization multiple possible accelerators / runtime frameworks e.g. vulkan, spir, ptx / cuda, rocm, cpu targeting & tuning from IR, etc. etc. which is just silly since we've got LLVM and to some extent GCC and other frameworks / tools that can pretty much target anything relatively efficiently at runtime given a specification of what to do as input.

[-]

stddealer@reddit

https://xkcd.com/927/

[-]

Weird-Field6128@reddit

G G U F

[-]

madaradess007@reddit

goo-goof for me. It's especially funny for a Russian, since we have a local rapper 'Goof' who faked his death, and his name became such a strong meme that dying is sometimes called 'to get goofed'. Sorry if you didn't need that info :D

[-]

MrPrevedmedved@reddit

I'm glad I'm not the only one who thinks like that

[-]

CheatCodesOfLife@reddit

I guess I'm the only one who says "Gee Juff"

[-]

shepbryan@reddit

https://preview.redd.it/bu0p3lub3wae1.jpeg?width=1320&format=pjpg&auto=webp&s=88a8c1ec1e7e5a2568a2fe9b11de0b6897e70d98

[-]

devsanbid@reddit

Which LLM is best for solving MCQ-type (multiple-choice) questions?

[-]

skinnyjoints@reddit

Can anyone explain how GGUF works yet?

[-]

uti24@reddit

Sorry, we haven't figured it out yet; our best research team is working on it though.

[-]

JawGBoi@reddit

Look at this google trends graph [https://trends.google.com/trends/explore?date=today%205-y&q=gguf,ggml](https://trends.google.com/trends/explore?date=today%205-y&q=gguf,ggml)

[-]

MedicalScore3474@reddit

That region breakdown...

[-]

FOE-tan@reddit

Japan is probably a combination of their complex language being less supported + the popularity of image generation models for generating anime stuff (I get that Stable Diffusion/FLUX can be GGUF'd too, but it's probably not seen as the primary way to run a model locally over there in the same way it is for LLMs). For China, they also like their anime art (just look at all the Chinese gacha games around these days), along with other China-specific reasons why some text models are less available. In this case, South Korea is an outlier, but they have RisuAI, which is a SillyTavern-like frontend (both online and local) with a large Korean userbase, which probably makes character cards and RP more appealing to them than AI image generation, perhaps.

[-]

teachersecret@reddit

1.4 billion people living there, and many of them seem very interested in this tech.

[-]

ttkciar@reddit

So .. should an "Intro to Local LLM" tutorial talk about more than the "Big Three" model container formats (pytorch, safetensors, GGUF) or are the others "also-rans" which can be ignored? When I wrote an intro document for my employer a year ago, it explained those three plus GPTQ, AWQ, and GGML, but obviously that's stale information now.

[-]

MoffKalast@reddit

When GGUF?   Now.

[-]

LinkSea8324@reddit

I mean yeah it was specified mid 2023, no shit it needed a whole year https://github.com/ggerganov/ggml/pull/302