Qwen2.5 7B Chat GGUF quantization evaluation results

Posted by AaronFeng47@reddit | LocalLLaMA | 39 comments

**This is the Qwen2.5 7B Chat model, NOT coder**

| Model | Size | Computer science (MMLU-Pro, %) |
|:-|:-|:-|
| qwen2.5:7b-instruct-q8_0 | 8.1 GB | 56.59 |
| iMat-Qwen2.5-7B-Instruct-Q6_K | 6.3 GB | 58.54 |
| qwen2.5:7b-instruct-q6_K | 6.3 GB | 57.80 |
| iMat-Qwen2.5-7B-Instruct-Q5_K_L | 5.8 GB | 56.59 |
| iMat-Qwen2.5-7B-Instruct-Q5_K_M | 5.4 GB | 55.37 |
| qwen2.5:7b-instruct-q5_K_M | 5.4 GB | 57.80 |
| iMat-Qwen2.5-7B-Instruct-Q5_K_S | 5.3 GB | 57.32 |
| qwen2.5:7b-instruct-q5_K_S | 5.3 GB | 58.78 |
| iMat-Qwen2.5-7B-Instruct-Q4_K_L | 5.1 GB | 56.10 |
| iMat-Qwen2.5-7B-Instruct-Q4_K_M | 4.7 GB | 58.54 |
| qwen2.5:7b-instruct-q4_K_M | 4.7 GB | 54.63 |
| iMat-Qwen2.5-7B-Instruct-Q3_K_XL | 4.6 GB | 56.59 |
| iMat-Qwen2.5-7B-Instruct-Q4_K_S | 4.5 GB | 53.41 |
| qwen2.5:7b-instruct-q4_K_S | 4.5 GB | 55.12 |
| iMat-Qwen2.5-7B-Instruct-IQ4_XS | 4.2 GB | 56.59 |
| iMat-Qwen2.5-7B-Instruct-Q3_K_L | 4.1 GB | 56.34 |
| qwen2.5:7b-instruct-q3_K_L | 4.1 GB | 51.46 |
| iMat-Qwen2.5-7B-Instruct-Q3_K_M | 3.8 GB | 54.39 |
| qwen2.5:7b-instruct-q3_K_M | 3.8 GB | 53.66 |
| iMat-Qwen2.5-7B-Instruct-Q3_K_S | 3.5 GB | 51.46 |
| qwen2.5:7b-instruct-q3_K_S | 3.5 GB | 51.95 |
| iMat-Qwen2.5-7B-Instruct-IQ3_XS | 3.3 GB | 52.20 |
| iMat-Qwen2.5-7B-Instruct-Q2_K | 3.0 GB | 49.51 |
| qwen2.5:7b-instruct-q2_K | 3.0 GB | 44.63 |
| --- | --- | --- |
| llama3.1-8b-Q8_0 | 8.5 GB | 46.34 |

- Static GGUF: [https://www.ollama.com/](https://www.ollama.com/)
- iMatrix-calibrated GGUF using an English dataset (iMat-): [https://huggingface.co/bartowski](https://huggingface.co/bartowski)
- Backend: [https://www.ollama.com/](https://www.ollama.com/)
- Evaluation tool: [https://github.com/chigkim/Ollama-MMLU-Pro](https://github.com/chigkim/Ollama-MMLU-Pro)
- Evaluation config: [https://pastebin.com/YGfsRpyf](https://pastebin.com/YGfsRpyf)
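As a quick way to read the table, here is a sketch that pairs each iMat quant with the static quant of the same type and computes the score difference in percentage points. The scores are transcribed by hand from the table; `paired_deltas` is a hypothetical helper for illustration, not part of the evaluation tool.

```python
# MMLU-Pro computer science scores (%) transcribed from the table above.
# Keys are quant types; only types present in both variants get paired.
static = {
    "Q6_K": 57.80, "Q5_K_M": 57.80, "Q5_K_S": 58.78, "Q4_K_M": 54.63,
    "Q4_K_S": 55.12, "Q3_K_L": 51.46, "Q3_K_M": 53.66, "Q3_K_S": 51.95,
    "Q2_K": 44.63,
}
imat = {
    "Q6_K": 58.54, "Q5_K_L": 56.59, "Q5_K_M": 55.37, "Q5_K_S": 57.32,
    "Q4_K_L": 56.10, "Q4_K_M": 58.54, "Q3_K_XL": 56.59, "Q4_K_S": 53.41,
    "IQ4_XS": 56.59, "Q3_K_L": 56.34, "Q3_K_M": 54.39, "Q3_K_S": 51.46,
    "IQ3_XS": 52.20, "Q2_K": 49.51,
}


def paired_deltas(static_scores, imat_scores):
    """Return {quant: imat - static} in percentage points for shared quant types."""
    shared = [q for q in static_scores if q in imat_scores]
    return {q: round(imat_scores[q] - static_scores[q], 2) for q in shared}


if __name__ == "__main__":
    deltas = paired_deltas(static, imat)
    for quant, d in deltas.items():
        print(f"{quant:8s} {d:+.2f} pp")
    wins = sum(1 for d in deltas.values() if d > 0)
    print(f"iMat ahead in {wins} of {len(deltas)} shared quant types")
```

With these numbers iMat comes out ahead in 5 of the 9 shared quant types (Q6_K, Q4_K_M, Q3_K_L, Q3_K_M, Q2_K), so on this single run the picture is mixed rather than one-sided.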

39 Comments

Xhehab_@reddit

Sorted in descending order:

| Model | Size | Computer science (MMLU-Pro, %) |
|------------------------------|---------|-----------------------------|
| Qwen2.5 32B Q4_K_M | 18.5 GB | 71.46 |
| Qwen2.5 14B Q4_K_S | 8.57 GB | 63.90 |
| q5_K_S | 5.3 GB | 58.78 |
| iMat-Q6_K | 6.3 GB | 58.54 |
| iMat-Q4_K_M | 4.7 GB | 58.54 |
| q6_K | 6.3 GB | 57.80 |
| q5_K_M | 5.4 GB | 57.80 |
| iMat-Q5_K_S | 5.3 GB | 57.32 |
| iMat-Q5_K_L | 5.8 GB | 56.59 |
| q8_0 | 8.1 GB | 56.59 |
| iMat-Q3_K_XL | 4.6 GB | 56.59 |
| iMat-IQ4_XS | 4.2 GB | 56.59 |
| Mistral Small-Q4_K_M | 13.34 GB | 56.59 |
| iMat-Q3_K_L | 4.1 GB | 56.34 |
| iMat-Q4_K_L | 5.1 GB | 56.10 |
| iMat-Q5_K_M | 5.4 GB | 55.37 |
| q4_K_S | 4.5 GB | 55.12 |
| q4_K_M | 4.7 GB | 54.63 |
| iMat-Q3_K_M | 3.8 GB | 54.39 |
| q3_K_M | 3.8 GB | 53.66 |
| iMat-Q4_K_S | 4.5 GB | 53.41 |
| iMat-IQ3_XS | 3.3 GB | 52.20 |
| q3_K_S | 3.5 GB | 51.95 |
| q3_K_L | 4.1 GB | 51.46 |
| iMat-Q3_K_S | 3.5 GB | 51.46 |
| glm4-9b-chat-q8_0 | 10.0 GB | 51.22 |
| iMat-Q2_K | 3.0 GB | 49.51 |
| Mistral NeMo 2407 12B Q5_K_M | 8.73 GB | 46.34 |
| llama3.1-8b-Q8_0 | 8.5 GB | 46.34 |
| q2_K | 3.0 GB | 44.63 |

ResearchCrafty1804@reddit

My MacBook M2 Pro 16GB will love the Qwen2.5 14B Q4_K_S on the go!!

ServeAlone7622@reddit

Mine too!

sammcj@reddit

FYI, I ran Qwen2.5 32B Q6_K (w/ iMatrix) through the same test today:

```
Total, 296/410, 72.20%
Random Guess Attempts, 0/410, 0.00%
Correct Random Guesses, division by zero error
Adjusted Score Without Random Guesses, 296/410, 72.20%
Finished the benchmark in 19 minutes 29 seconds.
Total, 296/410, 72.20%
Token Usage:
Prompt tokens: min 1449, average 1575, max 1906, total 97633, tk/s 83.48
Completion tokens: min 76, average 301, max 806, total 18644, tk/s 15.94
Markdown Table:
| overall | computer science |
| ------- | ---------------- |
| 72.20 | 72.20 |
```
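The "division by zero error" line reads like the tool computing the hit rate of random guesses as correct guesses divided by guess attempts, which is 0/0 when the model never needed a fallback guess. Below is a minimal sketch of a summary that guards that case. It assumes (not confirmed from the tool's source) that the adjusted score removes guessed questions from both numerator and denominator, which is consistent with the 296/410 figure above since there were no guesses. `summarize` is a hypothetical helper.

```python
def summarize(correct, total, guess_attempts, correct_guesses):
    """Return summary lines shaped like the benchmark output above.

    Hypothetical helper: `correct` includes any questions that were answered
    correctly by a random fallback guess (counted in `correct_guesses`).
    """
    lines = [
        f"Total, {correct}/{total}, {correct / total:.2%}",
        f"Random Guess Attempts, {guess_attempts}/{total}, {guess_attempts / total:.2%}",
    ]
    if guess_attempts:
        rate = correct_guesses / guess_attempts
        lines.append(f"Correct Random Guesses, {correct_guesses}/{guess_attempts}, {rate:.2%}")
    else:
        # Zero guesses makes the hit rate 0/0; report it instead of dividing.
        lines.append("Correct Random Guesses, n/a (no random guesses)")
    # Assumed definition: drop the guessed questions and rescore the rest.
    adj_correct = correct - correct_guesses
    adj_total = total - guess_attempts
    if adj_total:
        lines.append(
            f"Adjusted Score Without Random Guesses, {adj_correct}/{adj_total}, "
            f"{adj_correct / adj_total:.2%}"
        )
    else:
        lines.append("Adjusted Score Without Random Guesses, n/a (all guessed)")
    return lines


for line in summarize(296, 410, 0, 0):
    print(line)
```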

open-listings@reddit

It is likely that the quality of this comment is low. Your profile credibility and other users' experiences may be negatively impacted by this. --- ^I am a bot. | 🤖 | [Source](https://safe-text-api.com/) | [Author](https://www.reddit.com/user/open-listings/)

t98907@reddit

I plotted each model's size against its MMLU-Pro score. Take a look: https://preview.redd.it/k6ttyd9pmfqd1.png?width=1979&format=png&auto=webp&s=977435f0570a23c5a997deabc0b77755cbecd9a5

Depending on your memory constraints, choosing among 32B, 14B, iMat-Q6_K, q5_K_S, or iMat-Q4_K_M seems like a good idea.
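Picking by memory budget can be made concrete by computing the size/score Pareto frontier: sort by file size and keep a model only if it beats everything smaller. This is a sketch under my own assumptions: file size stands in for memory (ignoring context and KV-cache overhead), only the Qwen2.5 rows of the sorted table are used, and `pareto_frontier`/`best_under` are hypothetical helpers.

```python
# (name, file size in GB, MMLU-Pro computer science score in %),
# transcribed from the Qwen2.5 rows of the sorted table in this thread.
MODELS = [
    ("Qwen2.5 32B Q4_K_M", 18.5, 71.46), ("Qwen2.5 14B Q4_K_S", 8.57, 63.90),
    ("q8_0", 8.1, 56.59), ("iMat-Q6_K", 6.3, 58.54), ("q6_K", 6.3, 57.80),
    ("iMat-Q5_K_L", 5.8, 56.59), ("iMat-Q5_K_M", 5.4, 55.37), ("q5_K_M", 5.4, 57.80),
    ("iMat-Q5_K_S", 5.3, 57.32), ("q5_K_S", 5.3, 58.78), ("iMat-Q4_K_L", 5.1, 56.10),
    ("iMat-Q4_K_M", 4.7, 58.54), ("q4_K_M", 4.7, 54.63), ("iMat-Q3_K_XL", 4.6, 56.59),
    ("iMat-Q4_K_S", 4.5, 53.41), ("q4_K_S", 4.5, 55.12), ("iMat-IQ4_XS", 4.2, 56.59),
    ("iMat-Q3_K_L", 4.1, 56.34), ("q3_K_L", 4.1, 51.46), ("iMat-Q3_K_M", 3.8, 54.39),
    ("q3_K_M", 3.8, 53.66), ("iMat-Q3_K_S", 3.5, 51.46), ("q3_K_S", 3.5, 51.95),
    ("iMat-IQ3_XS", 3.3, 52.20), ("iMat-Q2_K", 3.0, 49.51), ("q2_K", 3.0, 44.63),
]


def pareto_frontier(models):
    """Return the size/score frontier, smallest first: each kept model
    scores strictly higher than every model kept before it."""
    # Smallest size first; at equal size, the higher score comes first.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier, best = [], float("-inf")
    for name, size, score in ordered:
        if score > best:
            frontier.append((name, size, score))
            best = score
    return frontier


def best_under(frontier, budget_gb):
    """Highest-scoring frontier model whose file fits in budget_gb, or None."""
    fitting = [m for m in frontier if m[1] <= budget_gb]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    front = pareto_frontier(MODELS)
    for name, size, score in front:
        print(f"{name:20s} {size:5.2f} GB {score:.2f}")
    print(best_under(front, 6.0))
```

On these numbers the strict frontier is iMat-Q2_K, iMat-IQ3_XS, iMat-Q3_K_M, iMat-Q3_K_L, iMat-IQ4_XS, iMat-Q4_K_M, q5_K_S, 14B and 32B. iMat-Q6_K drops out because q5_K_S is smaller and scores slightly higher, so a strict frontier and an eyeballed shortlist can differ.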

swagonflyyyy@reddit

Those numbers are not bad at all. The 18T tokens are really paying off.

always_posedge_clk@reddit

Thank you for this eval. It would be interesting to know how DeepSeek-V2-Lite (16B MoE) compares to Qwen2.5 14B. Has anyone compared them already?

_yustaguy_@reddit

I haven't compared them yet, but I can tell you that DeepSeek-V2-Lite gets absolutely cooked. Qwen2.5 14B is more comparable to the bigger DeepSeek-V2.

noobgolang@reddit

Qwen is just too good to be true. Is there any catch at all? OMG, I'm so hyped.

Mart-McUH@reddit

The catch is that it is probably optimized for benchmarks. That said, it is still a great model; just don't expect it to be that much better in real use cases.

Healthy-Nebula-3603@reddit

I've tested it a lot so far. It is not optimised for benchmarks; the models are just that good. Hard to believe, but it's true: it easily solves advanced math and is good at reasoning and coding (much better than Llama 3.1 8B or Gemma 9B).

Some_Endian_FP17@reddit

Which size? Gemma 9B is my go-to for reasoning and RAG now. I keep Llama 3.1 8B as a baseline and for function calling. I've got a damned zoo of models.

blockpapi@reddit

I guess you are referring to the 14B model. Could you tell me which quantisation you're using?

thehealer1010@reddit

Why is Qwen getting so much attention? Is it because of the license?

3-4pm@reddit

I'd be curious how well it handles world history events such as Tiananmen Square.

Bannedlife@reddit

I'm quite curious why this gets downvoted, can anyone help me out?

nero10579@reddit

Look at the bots or whatever praising qwen all the time. Go figure.

Shoddy-Tutor9563@reddit

I can only guess that some of the Chinese brothers here saw a hint of irony or sarcasm in this question (https://en.m.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre), given the sensitivity of the topic.

_supert_@reddit

It answered me straight, no bs.

ma3gl1n@reddit

It wouldn't be effective to use "small" local LLMs for facts anyway.

murlakatamenka@reddit

> This is the Qwen2.5 7B Chat model, NOT coder
>
> Computer science (MMLU PRO)

Doesn't that feel off? Naturally, I'd expect the Coder variant to be more competent in the CS domain.

---

> iMatrix calibrated GGUF using English dataset(iMat-): https://huggingface.co/bartowski

I didn't find any iMatrix GGUFs under `bartowski`. Why is that link referenced? I'd expect to see a link to an iMatrix GGUF repo 🤷 There are:

- https://huggingface.co/legraphista/Qwen2.5-7B-Instruct-IMat-GGUF
- https://huggingface.co/duyntnet/Qwen2.5-7B-Instruct-imatrix-GGUF

AaronFeng47@reddit (OP)

Read the model card: https://imgur.com/a/ezktgAw

murlakatamenka@reddit

Thanks, helpful reply! OP could use a direct URL for the models used:

- https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF

instead of just

- https://huggingface.co/bartowski

AaronFeng47@reddit (OP)

The Coder model is for code writing and code completion, whereas MMLU just asks multiple-choice questions and the model has to pick the right answer.
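To illustrate the "pick the right answer" part: MMLU-Pro questions have up to ten options (A to J), and a harness has to pull the chosen letter out of the model's free-text reply before comparing it with the key. The sketch below is modeled on the regex approach of the MMLU-Pro reference evaluation scripts as I understand them; the exact patterns and fallback behaviour in Ollama-MMLU-Pro may differ, and `extract_answer`/`grade` are hypothetical names. The random fallback is what a "Random Guess Attempts" counter like the one in the benchmark output in this thread would track.

```python
import random
import re

# MMLU-Pro options are labeled A through J (up to 10 choices).
LETTERS = "ABCDEFGHIJ"


def extract_answer(text):
    """Return the first option letter found by the patterns below, or None."""
    # Primary pattern: "... the answer is (C)" or "the answer is C".
    m = re.search(r"answer is \(?([A-J])\)?", text)
    if m:
        return m.group(1)
    # Fallback pattern: "Answer: B" / "answer: B".
    m = re.search(r"[aA]nswer:\s*([A-J])", text)
    if m:
        return m.group(1)
    return None


def grade(text, correct_letter, n_options=10, rng=random):
    """Return (is_correct, was_random_guess) for one model response."""
    pred = extract_answer(text)
    guessed = pred is None
    if guessed:
        # No parsable answer: fall back to a uniform random choice.
        pred = rng.choice(LETTERS[:n_options])
    return pred == correct_letter, guessed


print(grade("Let's think step by step... the answer is (C).", "C"))
```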

Feztopia@reddit

Do you have something similar for arcee-ai/Llama-3.1-SuperNova-Lite? It's an awesome model that I'm using right now, and it's close to Qwen on the leaderboard (I think it's the best 8B model on the leaderboard). I don't think it's talked about much here, but it's impressive.

DinoAmino@reddit

5_K_S and 4_K_M out in front, eh?

AaronFeng47@reddit (OP)

This eval is for checking when "brain damage" truly kicks in during quantization, not for determining which quant is the best.
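One way to make "when does brain damage kick in" concrete is to measure each static quant against the q8_0 run and flag drops beyond a cut-off. This is a sketch only: the 4 percentage-point threshold is an arbitrary, hypothetical choice, `flag_degradation` is a hypothetical helper, and since q8_0 itself scored below several smaller quants here, the baseline carries noise as well.

```python
# Static (non-iMatrix) 7B scores (%) from the OP's table, baseline first.
STATIC = [
    ("q8_0", 56.59), ("q6_K", 57.80), ("q5_K_M", 57.80), ("q5_K_S", 58.78),
    ("q4_K_M", 54.63), ("q4_K_S", 55.12), ("q3_K_L", 51.46), ("q3_K_M", 53.66),
    ("q3_K_S", 51.95), ("q2_K", 44.63),
]


def flag_degradation(scores, threshold_pp=4.0):
    """Compare every entry with the first (baseline) entry.

    Returns (name, delta_pp, flagged) tuples, where flagged means the score
    fell more than threshold_pp percentage points below the baseline.
    The 4.0 pp default is an arbitrary, hypothetical cut-off.
    """
    _, base = scores[0]
    out = []
    for name, score in scores[1:]:
        delta = round(score - base, 2)
        out.append((name, delta, delta < -threshold_pp))
    return out


for name, delta, flagged in flag_degradation(STATIC):
    print(f"{name:7s} {delta:+6.2f} pp {'<- flagged' if flagged else ''}")
```

With these numbers and a 4 pp cut-off, q3_K_L, q3_K_S and q2_K are flagged, while q3_K_M (-2.93 pp) is not.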

keepthepace@reddit

But I don't see a clear cliff there, more of a gradual descent. I agree that some of it may be noise; still, it is surprising to see q8 so low. Also, is there a reason to run the computer science task on the chat model rather than the coder one? Is it to provide a more apples-to-apples comparison with other models?

No_Afternoon_4260@reddit

You should run more samples, but I suspect you'll find more instability below Q5_K_M.

AaronFeng47@reddit (OP)

Electricity costs money, and one sample per quant is good enough for spotting brain damage. In this 7B's case, I think it starts at Q3 and becomes more obvious at Q2.
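For a sense of how much a single run can move, a rough binomial standard error treats each of the 410 computer science questions as an independent pass/fail trial. This is a simplification: it ignores that every quant sees the same questions (a paired test such as McNemar's would be tighter) and says nothing about sampling temperature, so read it as an order-of-magnitude figure. The helper names are hypothetical.

```python
import math


def binomial_se_pp(accuracy, n):
    """Standard error of an accuracy estimate over n questions, in percentage points."""
    return 100 * math.sqrt(accuracy * (1 - accuracy) / n)


def ci95_pp(accuracy, n):
    """Normal-approximation 95% interval (in %) for a single run of n questions."""
    se = binomial_se_pp(accuracy, n)
    center = 100 * accuracy
    return center - 1.96 * se, center + 1.96 * se


# 410 computer science questions per run (from the 296/410 output in this thread).
N = 410
se = binomial_se_pp(0.5659, N)
lo, hi = ci95_pp(0.5659, N)
print(f"SE at 56.59%: {se:.2f} pp")
print(f"95% interval: {lo:.1f}% to {hi:.1f}%")
# For two independent runs, the SE of their difference grows by sqrt(2).
print(f"SE of a difference between two runs: {math.sqrt(2) * se:.2f} pp")
```

At 56.59% this gives a standard error of about 2.45 percentage points, a 95% interval of roughly 51.8% to 61.4%, and about 3.46 pp for the difference between two independent runs.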

hedonihilistic@reddit

Try something like vLLM and batch your requests. A single sample is probably why there is this weird parabolic curve for the scores against the quants.

Professional-Bear857@reddit

Was the Q4_K_M for the 32B model an iMatrix quant? I'm using an iMatrix variant and wondered whether the one you used was too; I'd imagine the iMatrix variant would perform slightly better.

ddavidkov@reddit

It's a superb model, but I guess the takeaway here is that you'll probably get better results with Qwen2.5 14B IQ4_XS-iMat-EN (8.12 GB), which gets 65.85 on the same test.

Maykey@reddit

q8_0 shows surprisingly bad results

ResearchCrafty1804@reddit

According to this benchmark, Qwen2.5 7B is SOTA for its size (and even for slightly bigger sizes). The same trend has been observed for the other model sizes in the Qwen2.5 family using OP's benchmarks. I am really excited about this release, and I expect the other companies in the open-weight community to update their models to surpass Qwen in the coming weeks. The local LLM community progresses at a fast pace, and this is truly amazing. Kudos to the Qwen team!

AaronFeng47@reddit (OP)

- Qwen2.5 14B: https://www.reddit.com/r/LocalLLaMA/comments/1flqwzw/qwen25_14b_gguf_quantization_evaluation_results/
- 32B: https://www.reddit.com/r/LocalLLaMA/comments/1fkm5vd/qwen25_32b_gguf_evaluation_results/

pablogabrieldias@reddit

Thank you very much for all these evaluations you make