For those who don't know what different model formats (GGUF, GPTQ, AWQ, EXL2, etc.) mean ↓

Posted by nderstand2grow@reddit | LocalLLaMA | View on Reddit | 10 comments

GGML and GGUF refer to the same concept, with GGUF being the newer version of the format that stores additional metadata about the model. This allows better support for multiple architectures and lets the file carry its prompt template. A GGUF model can run entirely on the CPU or be partially or fully offloaded to a GPU. With K-quants, GGUF models range from 2 bits to 8 bits.

GPTQ was the earlier GPU-only quantization method. It has since been surpassed by AWQ, which is approximately twice as fast. The latest advancement in this area is EXL2, which offers even better performance. These quantization methods are typically used at 4 bits.

Safetensors and PyTorch bin files usually hold the raw, unquantized float16 weights. They are used primarily for continued fine-tuning. A .pth file can also include Python (PyTorch) code for inference, and a TensorFlow (TF) model file includes the complete static graph.
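To make the bit widths concrete, here is a rough sketch of how bits per weight translate into file size. The function name and the 7B parameter count are illustrative assumptions, and the estimate ignores metadata and any tensors a real quantizer keeps at higher precision, so actual files will come out somewhat larger.

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (hypothetical helper).

    Ignores file metadata and any tensors stored at higher precision.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    n_params = 7e9  # assumed parameter count for a "7B" model
    for bpw in (2, 4, 8, 16):
        print(f"{bpw:>2} bits/weight: ~{quantized_size_gib(n_params, bpw):.1f} GiB")
```

Halving the bits per weight halves the estimate, which is why a 4-bit quant of a model takes roughly a quarter of the space of its float16 original.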

10 Comments

Boogeeb@reddit

Is there much of a performance boost with EXL2 vs fully-offloading to a GPU with GGUF?

bebopkim1372@reddit

I haven't measured it exactly, but exllamav2 feels around 3 or 4 times faster than GGUF.

schmorp@reddit

It's important to know that exl2 uses much lower-quality quantizations than gguf, so while it may be faster for the same model size, it's also much lower quality.

gladic_hl2@reddit

No, that's not true, of course. Like gguf, exl2 comes in many quantized versions.

mO4GV9eywMPMw3Xr@reddit

Do you know of any measurements that explored that claim? Mind that the Q-numbers are not equal to bpw (bits per weight): Q3_K_M is more like 3.75 bpw (it may vary). Also, exl2 supports an 8-bit cache, which halves the memory needed for context, and AFAIK gguf-based loaders don't yet. So the comparison becomes messy with long contexts or with models that inherently need a lot of kB/token, like 20B frankenmodels, which need 1240 kB/t with a 16-bit cache vs. 128 kB/t for Mistral 7B-based models, including Mixtral. **Edit:** until proof is posted I would avoid jumping to such conclusions. I have also heard anecdotes claiming, with no proof, the opposite: that exl2 has higher quality at the same memory use.
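The kB/token figures can be reproduced with a back-of-the-envelope calculation: each layer stores one key and one value vector per KV head for every token. The sketch below uses Mistral 7B's published shape (32 layers, 8 KV heads, head dimension 128). The 20B shape (62 layers, 40 KV heads, head dimension 128, i.e. stacked Llama-2-13B-style blocks) is an assumption chosen because it matches the 1240 kB/t figure, not something stated in the thread, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_element: int = 2) -> int:
    """Bytes of KV cache needed per token of context.

    The factor 2 covers keys and values; bytes_per_element is 2 for a
    16-bit cache and 1 for an 8-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


if __name__ == "__main__":
    # Mistral 7B: grouped-query attention with 8 KV heads.
    mistral = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    # Assumed 20B frankenmodel shape: 62 layers of Llama-2-13B-style blocks.
    franken = kv_cache_bytes_per_token(n_layers=62, n_kv_heads=40, head_dim=128)

    print(f"Mistral 7B, 16-bit cache: {mistral / 1024:.0f} KiB/token")
    print(f"20B frankenmodel, 16-bit cache: {franken / 1024:.0f} KiB/token")
    print(f"20B frankenmodel, 8-bit cache: "
          f"{kv_cache_bytes_per_token(62, 40, 128, bytes_per_element=1) / 1024:.0f} KiB/token")
```

Most of the gap comes from grouped-query attention: the Mistral-style models keep far fewer KV heads per layer, so their cache grows much more slowly with context length.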

EXL2 is faster, and by a wide margin, since it is built for GPUs; quality is harder to judge. Most likely it is about the same at equal memory use.

It's very substantial; I'd say the difference is several times over.

Judtoff@reddit

But is there something like GPTQ that runs well on older Pascal cards like the P40? GGUF runs well on P40s, and I'd imagine something GPU/CUDA-specific would work even better, but it would need to take advantage of integer compute, since fp16 performance is really bad on the P40.

Thedudely1@reddit

I feel like GGUF might be the best we're gonna get for these older GPUs. I'm chugging along with my 1080 Ti here, running 14B-parameter models at Q4, so I'm not too upset.

3rdchromosome21@reddit

Thanks for training the RAG
