Qwen 2.5 on official LiveCodeBench leaderboard
Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 63 comments
32B is really punching above its weight
gabe_dos_santos@reddit
Claude in seventh place seems doubtful. It's the best model by far when it comes to coding tasks.
cyan2k@reddit
What do you mean "doubtful"? You can literally run the benchmark yourself. You can see the questions Claude gets asked and how the evaluation is built.
ambient_temp_xeno@reddit
I can't code so I can't comment, but one guy said that Mistral Large 2 really came through for some complicated problem when claude and gpt4o didn't.
s1fro@reddit
I've tried all three for coding tasks. The biggest difference in my experience is when you try to use a less popular language with custom code and libraries. My general experience looks like this.
4o: will attempt to estimate what using the code would look like. Tries using syntax that doesn't exist, calls methods and functions that aren't there. If you tell it to try again with extra info it will spiral and rewrite everything longer and usually even further from what you want
Large 2: will assume it knows the language and libraries well. It gives roughly the right structure but nothing will be quite right. When given extra information to rewrite and fix issues it will refuse to use it and spit out the same thing over and over
Sonnet: will fail on the first try and usually use syntax from outdated versions. It is also prone to big rewrites and spaghetti code. The biggest difference I've seen is that when given enough context it will eventually get it right. Sometimes you literally have to give it the whole code, pages of docs and explain what changed in new versions but in my experience it is by far the most likely to actually use what you give it to come up with the right answer
eaerdiablosios@reddit
u/s1fro which one are you using in your daily/weekly work then, Sonnet? Do you see this as your companion?
From what I have tested so far with 4o, for instance, I found myself spending heaps of time correcting it and testing; it seemed like more work to get the code "fixed" through multiple iterations of prompts until it got it right. Next thing you know, I'd spent 1-2 hrs just prompting back and forth and checking the output :/
s1fro@reddit
A lot of times it really is more hassle than it's worth. For programming I default to Sonnet. In my experience it is the most likely to solve it eventually + artifacts are nice to have. I don't really rely on any model too much. Sometimes I'll go on lmarena to try other models. I got some good results with new Gemini, new Qwen and GPT models.
I use it mostly if I want to do something new that I don't know a lot about to avoid stupid early mistakes or to debug. It is hit or miss. Sometimes it gets it on the first try and sometimes you just have to figure it out yourself.
eaerdiablosios@reddit
Cool, I'll give Sonnet a go. I tried Gemini and it was not that great; it seemed to lag behind or get some logic wrong.
CheatCodesOfLife@reddit
Mistral-Large has done this for me sometimes. I tend to swap between Large2, Sonnet, 4o and Qwen2.5 with occasional calls to o1 when I can't figure something out. They all have their strengths, it's a shame Large2 has a non-commercial license.
eaerdiablosios@reddit
I haven't tested it yet, only used 4o for coding and was happy with it for small Python scripts. Do you use it within Visual Studio or anything like that?
FullstackSensei@reddit
It depends on which programming language you're using the LLM for, what type of coding tasks, and how you are prompting the model. Claude can probably answer less detailed prompts because Anthropic (like OpenAI) has gathered tons of feedback from users and can use that to tune responses probabilistically even for vague requests.
The fact that a 32B model is even remotely close to a model 10x or more larger shows how good Qwen is, and how much headroom is left for 30-70B models.
talk_nerdy_to_m3@reddit
Yea it really goes to show how differently things can be measured. I totally agree with you, Claude is way ahead of everyone.
ThaisaGuilford@reddit
I wish I could use it locally
sasik520@reddit
It indicates a huge gap between o1-mini and o1-preview. Honestly speaking, I haven't noticed THAT big a difference. I see o1-mini is better on some tasks, but it's a small improvement over GPT-4 Turbo. Not 1.6x better, maybe 1.1x better.
Sanket_1729@reddit
My experience with o1-mini is great. There is another leaderboard Aider LLM Leaderboards which for some reason ranks o1-mini way too low.
Durian881@reddit
Would it be right to say that Q8 version of Qwen2.5-32B will likely outperform Q4/4bit version of Qwen2.5-72B?
zotero-chatpdf@reddit
Can Qwen2.5-32B or a quantized version be run locally?
me1000@reddit
I'd suggest you try it yourself. Empirically I found 32B roughly on par with, maybe a tiny bit better than, 72B for a couple of code tasks, but 72B is too slow for me to run regularly. I ran both at Q4.
Also, I believe the published benchmarks had the 32B model performing slightly better than the 72B model.
But models are weird, everyone’s needs are different.
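If you want to try it locally, here's a rough sketch of what running a Q4 GGUF looks like with llama-cpp-python (the filename is just an example; grab whichever quant fits your VRAM):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Example filename only -- point this at whatever Qwen2.5-32B-Instruct GGUF you downloaded
    llm = Llama(
        model_path="qwen2.5-32b-instruct-q4_k_m.gguf",
        n_gpu_layers=-1,   # offload all layers to GPU if they fit
        n_ctx=8192,        # context window
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}]
    )
    print(out["choices"][0]["message"]["content"])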
Healthy-Nebula-3603@reddit
Nope... not even close... Q4_K_M 72B will be far better than 32B Q8.
Q4_K_M gives very little performance loss on big models.
Q4_K_M is not actually a pure 4-bit model.
Running llama.cpp with Qwen 72B Q4_K_M I see:
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type q5_0: 40 tensors
llama_model_loader: - type q8_0: 40 tensors
llama_model_loader: - type q4_K: 401 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 41 tensors
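You can check the mix yourself with the gguf Python package. A minimal sketch (the path is a placeholder for whatever GGUF you have, and I'm assuming the package's GGUFReader here):

    from collections import Counter
    from gguf import GGUFReader  # pip install gguf

    # Placeholder path -- point at your own Q4_K_M file
    reader = GGUFReader("qwen2.5-72b-instruct-q4_k_m.gguf")

    # Count how many tensors use each quantization type;
    # a "Q4_K_M" file is really a mix of q4_K, q5_K, q6_K, q8_0, f32, ...
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    for qtype, n in counts.items():
        print(qtype, n)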
AaronFeng47@reddit (OP)
Nah, for large models a Q4 quant shows minimal quality loss; 72B Q4 would still be better than 32B Q8.
ObnoxiouslyVivid@reddit
What's more interesting is that the new Sonnet is performing worse than the old Sonnet 3.5, with a noticeable decrease on medium/hard problems.
2016YamR6@reddit
I haven't even tried using o1-mini because I assumed o1-preview/Sonnet was better... Is everyone else finding o1-mini to be the best at coding?
Salty-Garage7777@reddit
Yeah. Provided you know a bit about how to state what you need very precisely. It's way better logically, but much, much worse if your English is bad, so maybe that's the problem for some devs who speak very bad English.
TheDreamWoken@reddit
Yeah, mini is OK at coding, fails at other things.
Educational_Gap5867@reddit
I think o1-mini just has more data it's trained on, given the limitations of o1-preview. My guess is that o1-preview just doesn't run all the tests, and the ones it didn't run got marked as failures.
Photoperiod@reddit
Still out here waiting for 2.5 coder 33b. 72b instruct is crazy good tho.
TheDreamWoken@reddit
Wow
RipKip@reddit
I had my hopes up this post would be announcing the coder version. Just gotta wait a bit more
carchengue626@reddit
Can you share the Link?
mahiatlinux@reddit
That's crazy... The funny thing is that this isn't even the coder variant, which will hopefully be released soon!
balianone@reddit
i made this https://huggingface.co/spaces/llamameta/Qwen2.5-Chat-Assistant
Diegam@reddit
When Qwen 2.5 was released and everyone was praising it, I thought they were just saying that for political reasons. But when I tried it, I realized that the Qwen 2.5 models are the best to date.
abhiiiiiiiiiiii@reddit
Off-topic question: can one train the Qwen 2.5 models after loading them up via LLM Studio?
Aymanfhad@reddit
Imagine if there were a model with a size of 400B parameters; it might even outperform o1
YuriPortela@reddit
This is what i was thinking, i wanna see a giant monster version of it
JohnDotOwl@reddit
I really love how wide a variety of model sizes they provide. Supports multiple VRAM sizes.
AaronFeng47@reddit (OP)
Yeah, their 14B model achieved a really good balance of speed and quality; it's my favourite local model for text stuff like translation and summarization.
koloved@reddit
Is o1-mini the best for coding?
Healthy-Nebula-3603@reddit
Like you see... LiveBench is the best right now for testing, as the tests are changed each month.
fantomechess@reddit
LiveCodeBench also adds new questions each month and lets you choose the date range. Also, if you look at LiveBench and show the subcategories for coding, you see o1-mini ahead there too on code generation, which is what the default LiveCodeBench page is testing. o1-mini doesn't do as well on code completion, and that's why its overall score is down on LiveCodeBench.
In my experience though, a model that does really well on code generation benchmarks isn't necessarily the best for programming in general. o1-preview, as shown by the Aider benchmarks, is better at planning a solution to coding problems, and I think that's because it's just a larger, smarter model in general. So even if o1-mini does better on general programming challenges, its weaker ability to keep track of enough details in more complex code makes it worse than o1-preview and Claude.
MechanicLeading2210@reddit
How do we know that some models aren't trained or finetuned on contaminated data?
AaronFeng47@reddit (OP)
Because it's LIVE code bench, they constantly update their benchmark dataset
positivitittie@reddit
Anyone using these with Cline? More importantly, are you able to completely replace Claude with comparable quality?
Nyao@reddit
Is there any API where I can test Qwen?
AaronFeng47@reddit (OP)
HuggingChat has Qwen2.5 72B.
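If you want to hit it from code instead, most hosts that serve Qwen2.5 (and local servers like vLLM or llama.cpp's server) expose an OpenAI-compatible API, so a sketch like this should work (the base URL and key are placeholders for whatever provider or local server you use):

    from openai import OpenAI

    # Placeholders: point base_url at whichever OpenAI-compatible host you use
    # (a hosted provider, or e.g. a local vLLM / llama.cpp server)
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": "Explain Python's GIL in two sentences."}],
    )
    print(resp.choices[0].message.content)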
netikas@reddit
Qwen seems great on the surface, but it outputs Chinese in 10% of prompts that I send into it. Used 32B for synthetic data generation, was really disappointed. Also, hf chat version is kinda broken I guess.
Anybody feel the same?
Healthy-Nebula-3603@reddit
What are you talking about? 10% Chinese output? Show me the source for your claims. I'm using the 32B version daily and haven't noticed 10% Chinese output. It's fully multilingual, knows even rare languages, and follows instructions very well.
netikas@reddit
https://imgur.com/a/tgd6eB2
Here are samples from French, Spanish, Russian and German generations which contain Chinese characters. This is the output of Qwen2.5-32B-Instruct running in vLLM on an A100. Before that there was a system prompt in English and a 10-shot instruction in the target language.
Keep in mind, I am generating detoxifications, but the model might still generate outputs which are offensive.
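For reference, the setup is roughly like this (just a sketch; the messages below are placeholders, not the actual English system prompt plus 10-shot detox instruction):

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-32B-Instruct")

    # Placeholder messages -- the real run uses an English system prompt
    # plus a 10-shot detoxification instruction in the target language
    messages = [
        {"role": "system", "content": "You rewrite toxic sentences politely."},
        {"role": "user", "content": "Reescribe esta frase de forma educada: ..."},
    ]

    out = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=256))
    print(out[0].outputs[0].text)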
netikas@reddit
Also, just asked it in the HF Chat to help me debug a memory leak -- it started repeating itself and generated 8 "check for leaks" suggestions. It shouldn't work like that, should it?
Charuru@reddit
Yes it’s great, smack between 4o and sonnet in quality. My only issue is that I find 4o useless as heck, sonnet is the first one that actually works for me, so I don’t know if beating 4o reeeaaallly means anything.
zap0011@reddit
new sonnet drives me nuts
TheRealGentlefox@reddit
I don't get the complaints about the new version. It's chill as hell for me now, using its own judgments and understanding of the rules rather than shutting down completely. It uses more humor, and is more outgoing.
zap0011@reddit
for code?
TheRealGentlefox@reddit
Ah, haven't given the new one any testing with code. Did some regex with it which didn't go perfectly, but I think that had more to do with me not explaining the results I needed better.
TheHippoGuy69@reddit
It's so stubborn it's insane
Healthy-Nebula-3603@reddit
Qwen 2.5 models are monsters ...
DarkArtsMastery@reddit
They are. I was, and still am, kind of baffled by what they managed to pull off with 2.5. And if the rumours are correct, 3.0 should really be something. Maybe the go-to SOTA open-source LLM for everyone!
Meanwhile, OpenAI showcases its latest version of blackbox to a select few.
first2wood@reddit
What? Deepseek coder v2 is a 16B, right?
matteogeniaccio@reddit
No. It's DeepSeek Coder V2 (base), a 236B MoE.
first2wood@reddit
Lite is 16B. But still, it's a MoE with 21B active.
ThaisaGuilford@reddit
Llama 3.1 on top?
Ylsid@reddit
In practice I find Qwen is missing some domain knowledge but still performs very similarly to 4o, as the chart shows
No_Dig_7017@reddit
Trading blows with the big guys!
AaronFeng47@reddit (OP)
Link: https://livecodebench.github.io/leaderboard.html
And this is the time window of the screenshot I shared, no contamination