Automatic-Ad-9530@reddit
Not impressed. We should be impressed if it can solve new problems, a.k.a. "thinking," when we prompt it with a problem that is not available in the model's training set.
_cooder@reddit
Of course it passes: too many solutions are in open source, and GitHub pages have some of them too, so whatever is open = free (for AI).
SomeOddCodeGuy@reddit
I've been toying around with the 32b coder, and the 72b instruct, and I'm getting a pretty good feel for them.
In general, I have found that the 32b is as good of a code writer as, or possibly better than, the 72b. I've also found that the reading comprehension and general contextual understanding of the 72b far exceeds the 32b.
Here's one example: I recently was working on some code where I had a json field "delimiter" and was reading it in some python code. In the python code, I spelled it "delimeter", not noticing that error. I spelled it everywhere in the code like that.
When it came time to run the code, of course nothing was working; it was driving me nuts. I copied the code (but not the json) into Qwen2.5 coder 32b and asked it to help me figure out what was going on. It happily started trying to solve code problems, wondering if maybe the code (which looked fine to it) was just acting weird due to an interpreter, etc. I had it re-solve in new chats, with new prompts, over and over and each time it tried some new innovative way to rewrite the code to force it to work.
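A minimal sketch of how a bug like that stays silent (a hypothetical reconstruction, not the commenter's actual code): the JSON carries "delimiter" while the Python side consistently asks for "delimeter", so dict.get() quietly returns None instead of raising anything.

```python
# Hypothetical reconstruction of the bug described above (not the actual code):
# the JSON says "delimiter", the Python consistently says "delimeter".
import json

config = json.loads('{"delimiter": ",", "columns": 3}')

delimeter = config.get("delimeter")  # typo: returns None, no exception raised
print(delimeter)                     # None, so everything downstream silently misbehaves
```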
The 72b, alternatively, just straight up asked if I spelled it right in the json cause it's definitely misspelled here. Boom, fixed in 1 minute lol.
I've had a few situations like that. The recurring theme is that Qwen 32b coder will do an amazing job writing exactly the code you ask for and solving exactly the problem you describe, while the 72b will think through the problem more deeply and may catch things that you yourself are mistaken about.
Between the two, the 72b still holds more value to me for that reason, even if it doesn't quite code as well. I personally need an assistant that can catch me when I'm being dumb or help me work through a problem a bit more. But if you know exactly what you want to write, or exactly how to solve your issue? The 32b is the better option in terms of speed and quality, I think.
Training_Pudding9338@reddit
and where can I test this model at a reasonable price, because it probably won't work on my computer
PrettyGeek30@reddit
use hugging face chat
synn89@reddit
DeepInfra does a good job keeping up with the latest good models.
GoogleOpenLetter@reddit
https://huggingface.co/spaces/Qwen/Qwen2.5-Coder-demo
z_3454_pfk@reddit
OpenRouter, it's like $0.50 per 1M tokens. It's free on HuggingChat and the HuggingChat API website.
mevsgame@reddit
There's solid research where changing just the questions ever so slightly makes the LLM accuracy drop by 50% or even more.
The most recent paper was by Apple, I think: https://link.springer.com/article/10.1007/s10849-023-09409-x
So yes. Neural Nets have memory. Good LLMs get their answers mostly through just repeating what they were trained on.
This research was not conducted on Sonnet 3.5 or o1 though. Qwen is amazing. But as people said, it was very likely just trained on this.
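To make that kind of test concrete, here is a rough sketch of a perturbation check; ask_model(), the questions, and the rewording are all made up for illustration. You score the model on the original questions and on lightly reworded ones, then compare.

```python
# Rough sketch of a perturbation-robustness check.
# ask_model() is a hypothetical stand-in for whatever inference API you use.
def ask_model(prompt: str) -> str:
    return "43"  # dummy response so the sketch runs end to end; replace with a real call

def accuracy(questions, answers):
    correct = sum(1 for q, a in zip(questions, answers)
                  if a.strip().lower() in ask_model(q).strip().lower())
    return correct / len(questions)

original = ["What is 17 + 26?"]
reworded = ["Ann had 17 marbles and found 26 more. How many does she have now?"]
answers  = ["43"]

# A model that merely memorized the original phrasing can score well on `original`
# yet drop sharply on `reworded`, which is the effect the paper describes.
print(accuracy(original, answers), accuracy(reworded, answers))
```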
Whotea@reddit
If you actually read the paper, you'd see o1-preview and o1-mini aren't affected by this unless the question is changed to add irrelevant information, and even then preview still gets it right 77% of the time.
Solved by a simple prompt, getting a perfect 10/10: https://andrewmayne.com/2024/10/18/can-you-dramatically-improve-results-on-the-latest-large-language-model-reasoning-benchmark-with-a-simple-prompt/
Humans often fall for the same trap: https://en.m.wikipedia.org/wiki/List_of_cognitive_biases
Example: trick questions like "spell silk", so you spell "S-I-L-K", and then they ask "what do cows drink?" and of course the answer is milk. Except it's not milk; cows drink water.
Americans deciding whether or not they support price controls: https://x.com/USA_Polling/status/1832880761285804434
A federal law limiting how much companies can raise the price of food/groceries: +15% net favorability
A federal law establishing price controls on food/groceries: -10% net favorability
mevsgame@reddit
So, to follow up on your interesting point: what counts as irrelevant information? Is there a chance there will be irrelevant information in the context window in a real-life query?
If we follow the overfitting argument, the neural net traverses the optimization energy horizon via the overfitted path, exactly to the global minimum.
Any divergence in the input increases the likelihood of taking a different path and just arriving at some other local minimum.
So, my counterargument is that any divergence in the input acts as irrelevant information. Attention helps to arrive at some okish result, but it's not even immune to reordering of tokens.
Whotea@reddit
You can read the paper lol
Same for humans. And o1 does better than most other models
mevsgame@reddit
So what you are saying is that, in a live scenario, it is possible to provide 100% information and 0% noise. Cool.
Thireus@reddit
Would be interesting to see if the same LLM can rephrase the question before answering it.
Whotea@reddit
they don’t need to
firemeaway@reddit
That's because problem solving using an LLM is akin to running an LLM on RAM. Sure... you could do it, but it isn't really doing it.
Verypowafoo@reddit
I dont think I like that analogy sir.
Barry_Jumps@reddit
Problem solving with an LLM is akin to eating a baseball with yum yum sauce. You could do it, but you're not really doing it.
Coresce@reddit
What about problem solving using an LLM running on RAM?
gy0p4k@reddit
aren't we all? 🤔
futuristsalon@reddit
💯% in the training data
GodCREATOR333@reddit
LeetCode problems might have been in the dataset.
AdhesivenessRich960@reddit
That is why the only credible rankings are the ones using private evals. Scale's leaderboard is one such ranking. I hope some university or research institute makes a similar leaderboard using private evals.
https://scale.com/leaderboard
randombsname1@reddit
Their coding leaderboard is WAY off.
Livebench is the only benchmark that shows how bad the o1 models are at code completion.
If you went by that benchmark, you would think o1 mini is the best coding model, but in both livebench and aider, they both get smoked by the new Sonnet.
This is verifiable by anyone making more than a simple script, too.
I.e., any actually useful coding project of decent length.
I agree with your general framing, but Scale has pretty big flaws in how they measure domains.
OmarBessa@reddit
o1 mini (properly prompted) is a very unique model with amazing performance and capabilities
stillnoguitar@reddit
FYI, o1-mini is SOTA on aider with 85% as the code editor with o1-preview as the architect. https://aider.chat/2024/09/26/architect.html
randombsname1@reddit
That didn't take into account new Sonnet. Which came out late October.
Sonnet "3.6" had a big boost on both aider and livebench leaderboards.
pkmxtw@reddit
Also for some reason their llama-3.2 90B scores much higher than their 3.1 70B (in some cases there are 2-3x difference of the 95% CI). In theory, they should be the same for text-only evaluations sans some errors.
That really makes me wonder how they have been running those benchmarks.
Evening_Ad6637@reddit
There is no comparison between llama 3.2 90B and llama 3.1 70B
Each time you see llama 3.2 90B, the corresponding 70B model is llama 3, not llama 3.1
Therefore the differences totally make sense to me
randomqhacker@reddit
Wouldn't llama 3 70B score lower than llama 3.1 70B?
Evening_Ad6637@reddit
Yes exactly, Llama 3 70B should score lower than llama 3.1 70B.
Llama 3 is known to score worse than llama 3.1 of the same parameter size. And this finding is also consistent with the results from this benchmark.
If we assume that llama 3.1 70B and llama 3.2 90B achieve the same scores, then llama 3.2 90B is more or less representative of llama 3.1 70B
Now, in each of the specific benchmark categories where both llama 3.2 90B and llama 3 70B appear, we see that llama 3 70B scores significantly lower than its 3.2 90B sibling. By implication, that is representative of saying: in each category where both llama 3.1 70B and llama 3 70B appear, llama 3 70B scores significantly worse than its 3.1 70B sibling.
But yeah, slightly confusing with the naming and versioning xD
Xandrmoro@reddit
It's funny how benchmarks don't really reflect reality when it comes to more specialized use cases. L3.1 is so, so horribad at RP, and L3 is actually damn amazing, for example. And people tend to forget that benchmarks are not all that matters :p
Expensive-Apricot-25@reddit
The new sonnet is lowkey creepy with how good it is…
I remember the other day I was using it and asked it to write code, and it caught its own bug within the same response. Which is insane. There have been other unique things it does, but this one stands out.
If you understand how LLMs work, that goes against how an autoregressive model should behave. Even if it recognized the bug, it would be more likely to just continue as if it were correct (this is human nature too; think of a Stack Overflow post/response).
My guess is they are experimenting with some RL witchcraft (not the RL involved in turning it from a prediction model into a chatbot).
tucnak@reddit
It's not just coding; Spanish looks suspicious to me, and in our Ukrainian evals o1 is just horrible, behind some open-source models, not to mention the latest Geminis. Their eval might not be as good as they think.
Ok-Scarcity-7875@reddit
How are these private if they're running gpt-4o, gemini and so on on them? If they want to remain private, they can only allow locally hosted LLMs to be benchmarked. Otherwise companies might include or accidentally spread the datasets anywhere.
HiddenoO@reddit
Private ones are inherently not credible precisely because they're private: nobody can ascertain they're representative of anything, or even real, to begin with.
TechnoByte_@reddit
Private evals can't be trusted, we need a way to verify the results or it's not credible at all.
There's nothing stopping private evals from just lying about the results, since no one can verify it
That's like publishing a paper without citing sources.
HiddenoO@reddit
You don't even have to assume malicious intent (like being paid to lie): there's no strong motivator for the company to spend resources ensuring a representative dataset (negligence), nor can we be sure they're even competent enough to build representative datasets.
Whotea@reddit
Livebench avoids that by updating their questions every month
SuperChewbacca@reddit
It's too bad Scale doesn't test more models.
jzn21@reddit
Indeed, where are the Qwens?
Verypowafoo@reddit
Well they better stop being lazy and do it.
Status_Contest39@reddit
Scale doesn't include DeepSeek and Qwen at all, 😂
Billy462@reddit
This approach also has problems. There is so much money/vested interests in AI now that I would trust a closed benchmark even less than something like livebench.
Also, the questions are not guaranteed to be private to the API endpoint, even if the testers are 100% reliable and completely beyond reproach.
In my mind there is just no 100% reliable way to compare models.
[Note: I am not saying anything about the above benchmark posted, just some general comments.]
tucnak@reddit
Thank you! Hopefully, as the private eval scene grows larger and stronger, it will dispense with the embarrassment that is the latest crop of Chinese models and their derivatives that have been trained on public evals & arena. The same should be said about OpenAI: probably the biggest offender in the West.
On a different note, surprised to see o1-preview so high in Spanish.
Status_Contest39@reddit
It's just a guess. I tried the 7b, 14b, and 32b with the same leetcode tests, but only the 32b could solve them all. I think all three use the same training datasets, so given my test results I seriously doubt what you're saying 😕
TheFrenchSavage@reddit
Which is really great.
Anything that will bring this forsaken platform down is great.
CarefulGarage3902@reddit
I'm honestly fine with it in this case because I could maybe use its help explaining the problems to me. Of course we want LLMs to be able to solve unseen problems, though.
xSnoozy@reddit
has qwen mentioned when their training data cutoff is?
InterestingAnt8669@reddit
Please stop trying to prove the worth of any LLM with data that has been out there for years. Give them brand new problems and see how they do.
Sudden-Lingonberry-8@reddit
I've been trying to get them to output Typst syntax (since they probably haven't seen it), and most models fail at it miserably; gemini 1114 is pretty decent at it tho.
Whotea@reddit
So would most humans
Sudden-Lingonberry-8@reddit
Not really, since I give them example syntax that they choose to completely ignore.
Whotea@reddit
Guess you’ve never been a TA before
TheLogiqueViper@reddit (OP)
Yes, you are right. I posted it, and now that I'm testing it on AtCoder and recent Codeforces problems, it is struggling and giving the same code again and again.
JakoDel@reddit
yep, that's about how far most (all?) LLMs can go right now. The only useful tests nowadays are those that use private prompts; there's little point otherwise when they just know all the replies by memory.
Whotea@reddit
So why can’t gpt 3.5 do it? It was trained on the internet too and it’s much bigger (175 billion params)
martinerous@reddit
Can it do ARC-AGI too? :)
tucnak@reddit
Shock: Chinese model trained on Leetcode problems can solve them.
Elsewhere: the sky is the colour blue.
Whotea@reddit
So why can’t gpt 3.5 do it? It was trained on the internet too
HiddenoO@reddit
Because not every model trained on specific data can fully represent that data, but it is comparatively easy to make a model fully represent that data. Also, "trained on the internet" doesn't mean anything. For all we know, their crawler might not have been able to crawl leetcode in a way that the model would have properly formatted question/response pairs for that site.
Whotea@reddit
So why can this one do it and not other ones? They all want to be #1 right?
xseson23@reddit
Is that the case? I thought leetcode kept these hard problems close to its chest and didn't share much. (asking genuinely)
artificial_simpleton@reddit
Most leetcode problems are in publicly available datasets with solutions etc. It is extremely easy to scrape the remaining ones for your instruction tuning.
Whotea@reddit
So why can’t gpt 3.5 do it? It was trained on the internet too and it’s much bigger (175 billion params)
artificial_simpleton@reddit
The fact it was present in the training data of the base model doesn't mean much. It is much more relevant that the data is present in the finetuning set many times over, in a format that would allow the model to reproduce the solution.
Whotea@reddit
And you think they didn’t do this for other models? Why can’t grok do this? Or Command R+?
FriedGil@reddit
Every leetcode problem has publicly available solutions on the site, not to mention a plethora of solutions scattered elsewhere around the web.
Whotea@reddit
why isn’t gpt 3.5 able do it? It was trained on the internet too and it’s much bigger (175 billion params)
tucnak@reddit
Don't worry about nuances like that; you can tell your dad Qwen 2.5 32b CODER is WINNING! 🏆
Suspicious-Trick4528@reddit
You should test the model on the newest hard problems.
0x5f3759df-i@reddit
Amazing what you can solve if the answers are in your training set.
Whotea@reddit
How’s gpt 3.5’s performance? It’s much bigger and it has that in the dataset too
WashHead744@reddit
Algorithms aren't an issue for LLMs
GROTOK3000@reddit
This is getting stupid now. Why even post this? Yes, model creators train on basic problems, leetcode questions, 1000 ways to do basic programming tasks; it shows nothing at all other than the fact that it was in the dataset.
rabbotz@reddit
Ehhh. I gave qwen 2.5 Coder 32B a basic SQL question: "this query isn't running, tell me what's wrong with it". It was basic stuff, including an unmatched parenthesis, a mismatched table alias, and a wrong cast of a timestamp to a date. It correctly identified 1 of the 3 issues and made up an issue that didn't exist. Claude easily solved the problem.
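For illustration, a hypothetical query with those three kinds of mistakes (made-up table and column names, not the commenter's actual SQL):

```python
# Hypothetical example of the kind of broken query described above
# (made-up table/column names, not the commenter's actual SQL).
broken_query = """
SELECT o.id,
       CAST(o.created_at AS INTEGER) AS order_date   -- wrong cast: the timestamp should be cast to DATE
FROM orders o
JOIN customers c ON ord.customer_id = c.id           -- mismatched table alias: 'ord' is never defined
WHERE (o.total > 100 AND c.country = 'US';           -- unmatched parenthesis: the '(' is never closed
"""
```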
ortegaalfredo@reddit
This might have been on the training set but Qwen Coder is legit. It's better than Qwen-72B and even the latest mistral at coding. Not super good at anything else, though.
CheatCodesOfLife@reddit
Yep, it's very good at writing code and turning your instructions into code which runs first shot.
Not so good at interpreting logs, planning things, explaining what large bespoke code files do, etc. But it's great for what it was built to do.
extopico@reddit
Sorry, but in real use cases, with identical prompts and challenges, Qwen 2.5 32B (Q8) is not even in the same ballpark as Claude 3.5. It just isn't, and posts like these are not helping anyone.
no_witty_username@reddit
IMO, the benchmarks we need to start using should come from real-life sources and amateur operators who don't know anything about the field. For example, someone who knows nothing about coding or programming should have one of these models try to finish any of the bounties on Upwork. If an LLM or an agent is able to finish any of those jobs, can verify its work to know the job is done to the expected specifications, and actually makes money because the bounty was paid, then I would be very impressed. The reason I specifically said the operator should be ignorant is that many people don't understand how much of a role they themselves play in these things: someone experienced in programming could steer the LLM for that last 5%, or they would have the knowledge to understand the job is done and verify that it's good. Those things skew the results by a lot.
SadWolverine24@reddit
Does Qwen follow a ~5 month release cadence? So Qwen 3.0 would be around January?
NewExplor3r@reddit
I've tested qwencoder on the latest leetcode problems, the ones that are not in the training data. I've found that most easy problems are solved on the first shot, more than half of the medium problems, and none of the hard ones.
In my experience it’s just below 4o. (4o still surpasses it)
g33khub@reddit
Free problems with solutions and discussions are most likely in the training set. Try premium hard problems, or change the problem structure and naming scheme. Claude 3.5 and gpt-4 fail far more often than they pass.
boxingdog@reddit
sorry but this is a worthless benchmark
Few_Painter_5588@reddit
Just to clarify, qwen 2.5 32b coder, or the general instruct model?
TheLogiqueViper@reddit (OP)
instruct
CarefulGarage3902@reddit
do you know if the model was quantized? and I don’t really know the difference between regular and instruct yet
Few_Painter_5588@reddit
Man, qwen 32b just can't stop winning.
sebastianmicu24@reddit
I'm hoping for a Qwen 2.5 close to 100b that will come out soon.
Few_Painter_5588@reddit
They had a 100b model in qwen 1.5, but it sucked. It's plausible that they made one for qwen 2.5, and then put it behind an API.
ScrapEngineer_@reddit
What site is this?
crazzydriver77@reddit
LeetCode
balianone@reddit
what is leetcode?
CarefulGarage3902@reddit
It's basically a website software developers/coders use to practice for technical interviews. They practice algorithms and stuff.
TheLogiqueViper@reddit (OP)
leetcode -> problems -> shortest path -> hard problems
Anjz@reddit
If you like 32B instruct, 32B coder is even better.
Double_Sherbert3326@reddit
overfit.
charmander_cha@reddit
So let's see if we can get the data from these private evals to create our fine-tuning datasets.
Psychedelic_Traveler@reddit
Which quant are you using?
Ok_Maize_3709@reddit
I would appreciate some more details on what we are looking at
TheLogiqueViper@reddit (OP)
Leetcode -> problems -> shortest path -> hard problems
lordpuddingcup@reddit
Leetcode tests