Qwen 2.5 on official LiveCodeBench leaderboard
Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 63 comments
32B is really punching above its weight
gabe_dos_santos@reddit
Claude in seventh place seems doubtful. It's the best model by far when it comes to coding tasks.
cyan2k@reddit
What do you mean "doubtful"? You can literally run the benchmark yourself. You can see the questions Claude gets asked and how the evaluation is built.
ambient_temp_xeno@reddit
I can't code so I can't comment, but one guy said that Mistral Large 2 really came through for some complicated problem when claude and gpt4o didn't.
s1fro@reddit
I've tried all three for coding tasks. The biggest difference in my experience is when you try to use a less popular language with custom code and libraries. My general experience looks like this.
4o: will attempt to estimate what using the code would look like. Tries using syntax that doesn't exist, calls methods and functions that aren't there. If you tell it to try again with extra info it will spiral and rewrite everything longer and usually even further from what you want
Large 2: will assume it knows the language and libraries well. It gives roughly the right structure but nothing will be quite right. When given extra information to rewrite and fix issues it will refuse to use it and spit out the same thing over and over
Sonnet: will fail on the first try and usually use syntax from outdated versions. It is also prone to big rewrites and spaghetti code. The biggest difference I've seen is that when given enough context it will eventually get it right. Sometimes you literally have to give it the whole code, pages of docs and explain what changed in new versions but in my experience it is by far the most likely to actually use what you give it to come up with the right answer
eaerdiablosios@reddit
u/s1fro which one are you using in your daily/weekly work then, Sonnet? Do you see this as your companion?
From what I have tested so far with 4o, for instance, I found myself spending heaps of time correcting it and testing; it seemed like more work to get the code "fixed" through multiple iterations of prompts until it got it right. Next thing you know, I'd spent 1-2 hrs just prompting back and forth and checking the output :/
s1fro@reddit
A lot of times it really is more hassle than it's worth. For programming I default to Sonnet. In my experience it is the most likely to solve it eventually + artifacts are nice to have. I don't really rely on any model too much. Sometimes I'll go on lmarena to try other models. I got some good results with new Gemini, new Qwen and GPT models.
I use it mostly if I want to do something new that I don't know a lot about to avoid stupid early mistakes or to debug. It is hit or miss. Sometimes it gets it on the first try and sometimes you just have to figure it out yourself.
eaerdiablosios@reddit
Cool, I'll give Sonnet a go. I tried Gemini and it was not that great; it seemed to lag behind or get some logic wrong.
CheatCodesOfLife@reddit
Mistral-Large has done this for me sometimes. I tend to swap between Large2, Sonnet, 4o and Qwen2.5 with occasional calls to o1 when I can't figure something out. They all have their strengths, it's a shame Large2 has a non-commercial license.
eaerdiablosios@reddit
I haven't tested it yet, only used 4o for coding and was happy with it for small Python scripts. Do you use it within Visual Studio or anything like that?
FullstackSensei@reddit
It depends on which programming language you're using the LLM for, what type of coding tasks, and how you are prompting the model. Claude can probably answer less detailed prompts because Anthropic (like OpenAI) has gathered tons of feedback from users and can use that to tune responses probabilistically even for vague requests.
The fact that a 32B model is even remotely close to a model 10x or more larger shows how good Qwen is, and how much headroom is left for 30-70B models.
talk_nerdy_to_m3@reddit
Yea it really goes to show how differently things can be measured. I totally agree with you, Claude is way ahead of everyone.
ThaisaGuilford@reddit
I wish I could use it locally
sasik520@reddit
It indicates a huge gap between o1-mini and o1-preview. Honestly speaking, I haven't noticed THAT big a difference. I see o1-mini is better on some tasks, but it's a small improvement over GPT-4 Turbo. Not 1.6x better, maybe 1.1x better.
Sanket_1729@reddit
My experience with o1-mini is great. There is another leaderboard Aider LLM Leaderboards which for some reason ranks o1-mini way too low.
Durian881@reddit
Would it be right to say that Q8 version of Qwen2.5-32B will likely outperform Q4/4bit version of Qwen2.5-72B?
zotero-chatpdf@reddit
Can Qwen2.5-32B or a quantized version be run locally?
me1000@reddit
I'd suggest you try it yourself. Empirically I found 32B roughly on par with, maybe a tiny bit better than, 72B for a couple of code tasks, but 72B is too slow for me to run regularly. I ran both at Q4.
Also, I believe the published benchmarks had the 32B model performing slightly better than the 72B model.
But models are weird, everyone’s needs are different.
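If you want to try it locally, here's a rough sketch of what running a Q4 GGUF looks like with llama-cpp-python (the filename is just an example; grab whichever quant fits your VRAM):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Example filename only -- point this at whatever Qwen2.5-32B-Instruct GGUF you downloaded
    llm = Llama(
        model_path="qwen2.5-32b-instruct-q4_k_m.gguf",
        n_gpu_layers=-1,   # offload all layers to GPU if they fit
        n_ctx=8192,        # context window
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}]
    )
    print(out["choices"][0]["message"]["content"])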
Healthy-Nebula-3603@reddit
Nope... not even close... Q4_K_M 72B will be far better than 32B Q8.
Q4_K_M gives very little performance loss on big models.
Q4_K_M is not actually a pure 4-bit model.
Running llama.cpp with Qwen 72B Q4_K_M I see:
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type q5_0: 40 tensors
llama_model_loader: - type q8_0: 40 tensors
llama_model_loader: - type q4_K: 401 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 41 tensors
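You can check the mix yourself with the gguf Python package. A minimal sketch (the path is a placeholder for whatever GGUF you have, and I'm assuming the package's GGUFReader here):

    from collections import Counter
    from gguf import GGUFReader  # pip install gguf

    # Placeholder path -- point at your own Q4_K_M file
    reader = GGUFReader("qwen2.5-72b-instruct-q4_k_m.gguf")

    # Count how many tensors use each quantization type;
    # a "Q4_K_M" file is really a mix of q4_K, q5_K, q6_K, q8_0, f32, ...
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    for qtype, n in counts.items():
        print(qtype, n)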
AaronFeng47@reddit (OP)
Nah, for large models a Q4 quant shows minimal quality loss; 72B Q4 would still be better than 32B Q8.
ObnoxiouslyVivid@reddit
What's more interesting is that the new Sonnet is performing worse than the old Sonnet 3.5, with a noticeable decrease on medium/hard problems.
2016YamR6@reddit
I haven't even tried using o1-mini because I assumed o1-preview/Sonnet was better... Is everyone else finding o1-mini to be the best at coding?
Salty-Garage7777@reddit
Yeah. Provided you know a bit about how to state what you need very precisely. It's way better logically, but much, much worse if your English is bad, so maybe that's the problem for some devs who speak very bad English.
TheDreamWoken@reddit
Yeah, mini is OK at coding, fails at other things.
Educational_Gap5867@reddit
I think o1-mini just has more data it's trained on, given the limitations of o1-preview. My guess is that o1-preview just doesn't run all the tests, and the ones it didn't run got marked as failures.
Photoperiod@reddit
Still out here waiting for 2.5 coder 33b. 72b instruct is crazy good tho.
TheDreamWoken@reddit
Wow
RipKip@reddit
I had my hopes up this post would be announcing the coder version. Just gotta wait a bit more
carchengue626@reddit
Can you share the Link?
mahiatlinux@reddit
That's crazy... The funny thing is that this isn't even the coder variant, which will hopefully be released soon!
balianone@reddit
i made this https://huggingface.co/spaces/llamameta/Qwen2.5-Chat-Assistant
Diegam@reddit
When Qwen 2.5 was released and everyone was praising it, I thought they were just saying that for political reasons. But when I tried it, I realized that the Qwen 2.5 models are the best to date.
abhiiiiiiiiiiii@reddit
Off-topic question: can one train the Qwen 2.5 models after loading them up via LLM Studio?
Aymanfhad@reddit
Imagine if there were a model with a size of 400B parameters; it might even outperform o1
YuriPortela@reddit
This is what i was thinking, i wanna see a giant monster version of it
JohnDotOwl@reddit
I really love how wide a variety of model sizes they provide. Supports multiple VRAM sizes.
AaronFeng47@reddit (OP)
Yeah, their 14B model achieved a really good balance of speed and quality; it's my favourite local model for text stuff like translation and summarization.
koloved@reddit
Is o1-mini the best for coding?
Healthy-Nebula-3603@reddit
Like you see... LiveBench is the best right now for testing, as the tests are changed each month.
fantomechess@reddit
LiveCodeBench also adds new questions each month and lets you choose the date range. Also, if you look at LiveBench and show the subcategories for coding, you see o1-mini ahead there too on code generation, which is what the default LiveCodeBench page is testing. o1-mini doesn't do as well on code completion, and that's why its overall score is down on LiveCodeBench.
In my experience though, a model that does really well on code generation benchmarks isn't necessarily the best for programming in general. o1-preview, as shown by the Aider benchmarks, is better at planning a solution to coding problems, and I think that's because it's just a larger, smarter model in general. So even if o1-mini does better on general programming challenges, its weaker ability to keep track of enough details in more complex code makes it worse than o1-preview and Claude.
MechanicLeading2210@reddit
How do we know that some models aren't trained or finetuned on contaminated data?
AaronFeng47@reddit (OP)
Because it's LIVE code bench, they constantly update their benchmark dataset
positivitittie@reddit
Anyone using these with Cline? More importantly, are you able to completely replace Claude with comparable quality?
Nyao@reddit
Is there any API where I can test Qwen?
AaronFeng47@reddit (OP)
HuggingChat has Qwen2.5 72B.
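If you want to hit it from code instead, most hosts that serve Qwen2.5 (and local servers like vLLM or llama.cpp's server) expose an OpenAI-compatible API, so a sketch like this should work (the base URL and key are placeholders for whatever provider or local server you use):

    from openai import OpenAI

    # Placeholders: point base_url at whichever OpenAI-compatible host you use
    # (a hosted provider, or e.g. a local vLLM / llama.cpp server)
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": "Explain Python's GIL in two sentences."}],
    )
    print(resp.choices[0].message.content)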
netikas@reddit
Qwen seems great on the surface, but it outputs Chinese in 10% of prompts that I send into it. Used 32B for synthetic data generation, was really disappointed. Also, hf chat version is kinda broken I guess.
Anybody feel the same?
Healthy-Nebula-3603@reddit
What are you talking about? 10% Chinese output? Show me the source for your claims. I'm using the 32B version daily and haven't noticed 10% Chinese output. It's fully multilingual, knows even rare languages, and follows instructions very well.
netikas@reddit
https://imgur.com/a/tgd6eB2
Here are samples from French, Spanish, Russian and German generations which contain Chinese characters. This is the output of Qwen2.5-32B-Instruct running in vLLM on an A100. Before that there was a system prompt in English and a 10-shot instruction in the target language.
Keep in mind, I am generating detoxifications, but the model might still generate outputs which are offensive.
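For reference, the setup is roughly like this (just a sketch; the messages below are placeholders, not the actual English system prompt plus 10-shot detox instruction):

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-32B-Instruct")

    # Placeholder messages -- the real run uses an English system prompt
    # plus a 10-shot detoxification instruction in the target language
    messages = [
        {"role": "system", "content": "You rewrite toxic sentences politely."},
        {"role": "user", "content": "Reescribe esta frase de forma educada: ..."},
    ]

    out = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=256))
    print(out[0].outputs[0].text)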
netikas@reddit
Also, just asked it in the HF Chat to help me debug a memory leak -- it started repeating itself and generated 8 "check for leaks" suggestions. It shouldn't work like that, should it?
Charuru@reddit
Yes it’s great, smack between 4o and sonnet in quality. My only issue is that I find 4o useless as heck, sonnet is the first one that actually works for me, so I don’t know if beating 4o reeeaaallly means anything.
zap0011@reddit
new sonnet drives me nuts
TheRealGentlefox@reddit
I don't get the complaints about the new version. It's chill as hell for me now, using its own judgments and understanding of the rules rather than shutting down completely. It uses more humor, and is more outgoing.
zap0011@reddit
for code?
TheRealGentlefox@reddit
Ah, haven't given the new one any testing with code. Did some regex with it which didn't go perfectly, but I think that had more to do with me not explaining the results I needed better.
TheHippoGuy69@reddit
It's so stubborn it's insane
Healthy-Nebula-3603@reddit
Qwen 2.5 models are monsters ...
DarkArtsMastery@reddit
They are. I was, and still am, kind of baffled by what they managed to pull off with 2.5. And if the rumours are correct, 3.0 should really be something. Maybe the go-to SOTA open-source LLM for everyone!
Meanwhile, OpenAI showcases its latest version of blackbox to a select few.
first2wood@reddit
What? Deepseek coder v2 is a 16B, right?
matteogeniaccio@reddit
No. It's DeepSeek Coder V2 (base), a 236B MoE.
first2wood@reddit
Lite is 16B. But still, it's a MoE with 21B active.
ThaisaGuilford@reddit
Llama 3.1 on top?
Ylsid@reddit
In practice I find Qwen is missing some domain knowledge but still performs very similarly to 4o, as the chart shows
No_Dig_7017@reddit
Trading blows with the big guys!
AaronFeng47@reddit (OP)
Link: https://livecodebench.github.io/leaderboard.html
And this is the time window of the screenshot I shared, no contamination