Qwen3 on Dubesor Benchmark

Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 10 comments

[https://dubesor.de/benchtable.html](https://dubesor.de/benchtable.html)

One of the few benchmarks that tested Qwen with thinking both on and off.

https://preview.redd.it/eim5m35nxqye1.png?width=1265&format=png&auto=webp&s=cd814d571735444429331c73b4cd17a066497907

> Small-scale manual performance comparison benchmark I made for myself. This table showcases the results I recorded of various AI models across different personal tasks I encountered over time (currently 83). I use a **weighted rating system** and calculate the difficulty for each task by incorporating the results of all models. This is particularly relevant in scoring when failing easy questions or passing hard ones.
>
> **NOTE THAT THIS IS JUST ME SHARING THE RESULTS FROM MY OWN SMALL-SCALE PERSONAL TESTING. YMMV! OBVIOUSLY THE SCORES ARE JUST THAT AND MIGHT NOT REFLECT YOUR OWN PERSONAL EXPERIENCES OR OTHER WELL-KNOWN BENCHMARKS.**
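
For anyone wondering what "calculate the difficulty for each task by incorporating the results of all models" could look like in practice, here is a rough sketch. This is not Dubesor's actual formula, which the quote does not spell out. It assumes a task's difficulty is simply the share of models that failed it, that passing a hard task earns a bonus, and that failing an easy task costs a penalty. The names `task_difficulty`, `weighted_scores`, `bonus`, `penalty`, and the model and task labels are all made up for illustration.

```python
from typing import Dict

Results = Dict[str, Dict[str, bool]]  # model name -> {task name -> passed?}


def task_difficulty(results: Results) -> Dict[str, float]:
    """Assumed definition: difficulty = fraction of models that failed the task."""
    tasks = {task for per_model in results.values() for task in per_model}
    n_models = len(results)
    return {
        task: sum(not per_model.get(task, False) for per_model in results.values()) / n_models
        for task in tasks
    }


def weighted_scores(results: Results, bonus: float = 1.0, penalty: float = 1.0) -> Dict[str, float]:
    """Score each model so hard passes count for more and easy fails cost more."""
    difficulty = task_difficulty(results)
    scores: Dict[str, float] = {}
    for model, per_model in results.items():
        total = 0.0
        for task, d in difficulty.items():
            if per_model.get(task, False):
                total += 1.0 + bonus * d          # harder task -> bigger reward
            else:
                total -= penalty * (1.0 - d)      # easier task -> bigger penalty
        scores[model] = total
    return scores


# Hypothetical toy data: two models, two tasks.
toy_results = {
    "model_a": {"task_1": True, "task_2": True},
    "model_b": {"task_1": True, "task_2": False},
}
toy_scores = weighted_scores(toy_results)
```

In this toy case `task_1` has difficulty 0.0 (everyone passed) and `task_2` has difficulty 0.5, so `model_a` gets extra credit for the harder task while `model_b` loses half a point for failing it. A real weighting scheme would likely differ in the details.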

10 Comments

MLDataScientist@reddit

Qwen3-235B-A22B Thinking fp8 is impressive: 11th place, at a much lower cost than any model above it.

RickyRickC137@reddit

Great effort! Are you planning to keep updating these rankings as new models come out?

Healthy-Nebula-3603@reddit

Qwen3-32B gets 44% on reasoning and Llama 3.3 70B gets 49%?? LOL

Cool-Chemical-5629@reddit

So Qwen3-4B (thinking) beats the old much bigger Qwen2.5-32B-Instruct (non-thinking), Qwen2.5-14B-Instruct as well as Qwen3-8B (non-thinking).

Qwen3-14B (non-thinking) model unsurprisingly beats Qwen3-4B (thinking), but also lands just below the older Qwen2-72B-Instruct.

Qwen3-30B-A3B (non-thinking) tails R1-Distill-Qwen-32B (thinking only) which is pretty impressive since it should mean that it is able to deliver comparable quality without thinking, but more importantly Qwen3-30B-A3B (non-thinking) also beats the older Qwen2-72B-Instruct.

QwQ-32B (thinking only) lands just above Qwen3-32B (non-thinking), but far below Qwen3-32B (thinking).

Interestingly Qwen3-14B (thinking) and Qwen3-8B (thinking) both beat the old big Qwen2.5-Plus (non-thinking, API only) model.

And finally, Qwen3-30B-A3B (thinking) tails the old biggest Qwen2.5-Max (non-thinking, API only) which is only beaten by Qwen3-32B (thinking) and the current biggest Qwen3-235B-A22B in both thinking and non-thinking modes.

All in all, it looks as though the Qwen3-30B-A3B in non-thinking mode is a decent sweet spot somewhere in the middle and with thinking enabled it's a very competent contender, all with the higher inference speed thanks to MoE architecture as a bonus.

ResearchCrafty1804@reddit

I suggest you test Qwen3-30B-A3B fp8 as well. I noticed that, due to its small number of activated parameters, this particular model is more sensitive to quantization than the rest of the Qwen3 series.

Impressive_Half_2819@reddit

Unsloth is a life changer.

MustBeSomethingThere@reddit

You're currently using Q4\_K\_M. You might want to try Unsloth UD-Q4\_K\_XL (for example, Qwen3-32B-UD-Q4\_K\_XL.gguf) to see if it makes a difference.

AaronFeng47@reddit (OP)

Yeah, but it's good enough for comparing Qwen2.5 and Qwen3 (btw I'm not Dubesor)

AaronFeng47@reddit (OP)

GLM vs Qwen https://preview.redd.it/bhkx6tei5rye1.png?width=1217&format=png&auto=webp&s=80d7a0aadaca853cb67561c533615fcf8f8f848b

AaronFeng47@reddit (OP)

https://preview.redd.it/kpq8hvll5rye1.png?width=1207&format=png&auto=webp&s=9160bad3af651ecae917ac176a572fc1f1f243dc GLM vs Qwen (thinking)