Automatic-Ad-9530@reddit
Not impressed. We should be impressed if it can solve new problems, a.k.a. "thinking," when we prompt it with a problem that is not available in the model's training set.
_cooder@reddit
Of course it passes: too many solutions are in open source, and GitHub pages have some of them too, so whatever is open = free (for AI).
SomeOddCodeGuy@reddit
I've been toying around with the 32b coder, and the 72b instruct, and I'm getting a pretty good feel for them.
In general, I have found that the 32b is as good of a code writer as, or possibly better than, the 72b. I've also found that the reading comprehension and general contextual understanding of the 72b far exceeds the 32b.
Here's one example: I recently was working on some code where I had a json field "delimiter" and was reading it in some python code. In the python code, I spelled it "delimeter", not noticing that error. I spelled it everywhere in the code like that.
When it came time to run the code, of course nothing was working; it was driving me nuts. I copied the code (but not the json) into Qwen2.5 coder 32b and asked it to help me figure out what was going on. It happily started trying to solve code problems, wondering if maybe the code (which looked fine to it) was just acting weird due to an interpreter, etc. I had it re-solve in new chats, with new prompts, over and over and each time it tried some new innovative way to rewrite the code to force it to work.
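A minimal sketch of how a bug like that stays silent (a hypothetical reconstruction, not the commenter's actual code): the JSON carries "delimiter" while the Python side consistently asks for "delimeter", so dict.get() quietly returns None instead of raising anything.

```python
# Hypothetical reconstruction of the bug described above (not the actual code):
# the JSON says "delimiter", the Python consistently says "delimeter".
import json

config = json.loads('{"delimiter": ",", "columns": 3}')

delimeter = config.get("delimeter")  # typo: returns None, no exception raised
print(delimeter)                     # None, so everything downstream silently misbehaves
```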
The 72b, alternatively, just straight up asked if I spelled it right in the json cause it's definitely misspelled here. Boom, fixed in 1 minute lol.
I've had a few situations like that. The recurring theme is that Qwen 32b coder will do an amazing job writing exactly the code you ask for and solving exactly the problem you describe, while the 72b will think through the problem more deeply and may catch things that you yourself are mistaken about.
Between the two, the 72b still holds more value to me for that reason, even if it doesn't quite code as well. I personally need an assistant that can catch me when I'm being dumb or help me work through a problem a bit more. But if you know exactly what you want to write, or exactly how to solve your issue? The 32b is the better option in terms of speed and quality, I think.
Training_Pudding9338@reddit
and where can I test this model at a reasonable price, because it probably won't work on my computer
PrettyGeek30@reddit
use hugging face chat
synn89@reddit
DeepInfra does a good job keeping up with the latest good models.
GoogleOpenLetter@reddit
https://huggingface.co/spaces/Qwen/Qwen2.5-Coder-demo
z_3454_pfk@reddit
OpenRouter, it's like $0.50 per 1M tokens. It's free on HuggingChat and the HuggingChat API website.
mevsgame@reddit
There's solid research where changing just the questions ever so slightly makes the LLM accuracy drop by 50% or even more.
The most recent paper was by Apple, I think: https://link.springer.com/article/10.1007/s10849-023-09409-x
So yes. Neural Nets have memory. Good LLMs get their answers mostly through just repeating what they were trained on.
This research was not conducted on Sonnet 3.5 or o1 though. Qwen is amazing. But as people said, it was very likely just trained on this.
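To make that kind of test concrete, here is a rough sketch of a perturbation check; ask_model(), the questions, and the rewording are all made up for illustration. You score the model on the original questions and on lightly reworded ones, then compare.

```python
# Rough sketch of a perturbation-robustness check.
# ask_model() is a hypothetical stand-in for whatever inference API you use.
def ask_model(prompt: str) -> str:
    return "43"  # dummy response so the sketch runs end to end; replace with a real call

def accuracy(questions, answers):
    correct = sum(1 for q, a in zip(questions, answers)
                  if a.strip().lower() in ask_model(q).strip().lower())
    return correct / len(questions)

original = ["What is 17 + 26?"]
reworded = ["Ann had 17 marbles and found 26 more. How many does she have now?"]
answers  = ["43"]

# A model that merely memorized the original phrasing can score well on `original`
# yet drop sharply on `reworded`, which is the effect the paper describes.
print(accuracy(original, answers), accuracy(reworded, answers))
```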
Whotea@reddit
If you actually read the paper, you'd see o1-preview and o1-mini aren't affected by this unless the question is changed to add irrelevant information, and even then preview still gets it right 77% of the time.
Solved by a simple prompt, getting a perfect 10/10: https://andrewmayne.com/2024/10/18/can-you-dramatically-improve-results-on-the-latest-large-language-model-reasoning-benchmark-with-a-simple-prompt/
Humans often fall for the same trap: https://en.m.wikipedia.org/wiki/List_of_cognitive_biases
Example: trick questions like "spell silk", so you spell "S-I-L-K", and then they ask "what do cows drink?" and of course the answer is milk. Except it's not milk; cows drink water.
Americans deciding whether or not they support price controls: https://x.com/USA_Polling/status/1832880761285804434
A federal law limiting how much companies can raise the price of food/groceries: +15% net favorability
A federal law establishing price controls on food/groceries: -10% net favorability
mevsgame@reddit
So, to follow up on your interesting point: what counts as irrelevant information? Is there a chance there will be irrelevant information in the context window in a real-life query?
If we follow the overfitting argument, the neural net traverses the optimization energy horizon via the overfitted path, exactly to the global minimum.
Any divergence in the input increases the likelihood of taking a different path and just arriving at some other local minimum.
So, my counterargument is that any divergence in the input acts as irrelevant information. Attention helps to arrive at some okish result, but it's not even immune to reordering of tokens.
Whotea@reddit
You can read the paper lol
Same for humans. And o1 does better than most other models
mevsgame@reddit
So what you are saying is that, in a live scenario, it is possible to provide 100% information and 0% noise. Cool.
Thireus@reddit
Would be interesting to see if the same LLM can rephrase the question before answering it.
Whotea@reddit
they don’t need to
firemeaway@reddit
That's because problem solving using an LLM is akin to running an LLM on RAM. Sure... you could do it, but it isn't really doing it.
Verypowafoo@reddit
I dont think I like that analogy sir.
Barry_Jumps@reddit
Problem solving with an LLM is akin to eating a baseball with yum yum sauce. You could do it, but you're not really doing it.
Coresce@reddit
What about problem solving using an LLM running on RAM?
gy0p4k@reddit
aren't we all? 🤔
futuristsalon@reddit
💯% in the training data
GodCREATOR333@reddit
LeetCode problems might have been in the dataset.
AdhesivenessRich960@reddit
That is why the only credible rankings are the ones using private evals. Scale's leaderboard is one such ranking. I hope some university or research institute makes a similar leaderboard using private evals.
https://scale.com/leaderboard
randombsname1@reddit
Their coding leaderboard is WAY off.
Livebench is the only benchmark that shows how bad the o1 models are at code completion.
If you went by that benchmark, you would think o1 mini is the best coding model, but in both livebench and aider, they both get smoked by the new Sonnet.
This is verifiable by anyone making more than a simple script, too.
I.e., any actually useful coding project of decent length.
I agree with your general framing, but Scale has pretty big flaws in how they measure domains.
OmarBessa@reddit
o1 mini (properly prompted) is a very unique model with amazing performance and capabilities
stillnoguitar@reddit
FYI, o1-mini is SOTA on aider with 85% as the code editor with o1-preview as the architect. https://aider.chat/2024/09/26/architect.html
randombsname1@reddit
That didn't take into account new Sonnet. Which came out late October.
Sonnet "3.6" had a big boost on both aider and livebench leaderboards.
pkmxtw@reddit
Also for some reason their llama-3.2 90B scores much higher than their 3.1 70B (in some cases there are 2-3x difference of the 95% CI). In theory, they should be the same for text-only evaluations sans some errors.
That really makes me wonder how they have been running those benchmarks.
Evening_Ad6637@reddit
There is no comparison between llama 3.2 90B and llama 3.1 70B
Each time you see llama 3.2 90B, the corresponding 70B model is llama 3, not llama 3.1
Therefore the differences totally make sense to me
randomqhacker@reddit
Wouldn't llama 3 70B score lower than llama 3.1 70B?
Evening_Ad6637@reddit
Yes exactly, Llama 3 70B should score lower than llama 3.1 70B.
Llama 3 is known to score worse than llama 3.1 of the same parameter size. And this finding is also consistent with the results from this benchmark.
If we assume that llama 3.1 70B and llama 3.2 90B achieve the same scores, then llama 3.2 90B is more or less representative of llama 3.1 70B
Now, in each of the specific benchmark categories where both llama 3.2 90B and llama 3 70B appear, we see that llama 3 70B scores significantly lower than its 3.2 90B sibling. By implication, that is representative of saying: in each category where both llama 3.1 70B and llama 3 70B appear, llama 3 70B scores significantly worse than its 3.1 70B sibling.
But yeah, slightly confusing with the naming and versioning xD
Xandrmoro@reddit
It's funny how benchmarks don't really reflect reality when it comes to more specialized use cases. L3.1 is so, so horribad at RP, and L3 is actually damn amazing, for example. And people tend to forget that benchmarks are not all that matters :p
Expensive-Apricot-25@reddit
The new sonnet is lowkey creepy with how good it is…
I remember the other day I was using it and asked it to write code, and it caught its own bug within the same response. Which is insane. There have been other unique things it does, but this one stands out.
If you understand how LLMs work, that goes against how an autoregressive model should behave. Even if it recognized the bug, it would be more likely to just continue as if it were correct (this is human nature too; think of a Stack Overflow post/response).
My guess is they are experimenting with some RL witchcraft (not the RL involved in turning it from a prediction model into a chatbot).
tucnak@reddit
It's not just coding; Spanish looks suspicious to me, and in our Ukrainian evals o1 is just horrible, behind some open-source models, not to mention the latest Geminis. Their eval might not be as good as they think.
Ok-Scarcity-7875@reddit
How are these private if they're running gpt-4o, gemini and so on on them? If they want to remain private, they can only allow locally hosted LLMs to be benchmarked. Otherwise companies might include or accidentally spread the datasets anywhere.
HiddenoO@reddit
Private ones are inherently not credible precisely because they're private: nobody can ascertain they're representative of anything, or even real, to begin with.
TechnoByte_@reddit
Private evals can't be trusted, we need a way to verify the results or it's not credible at all.
There's nothing stopping private evals from just lying about the results, since no one can verify it
That's like publishing a paper without citing sources.
HiddenoO@reddit
You don't even have to assume malicious intent (like being paid to lie): there's no strong motivator for the company to spend resources ensuring a representative dataset (negligence), nor can we be sure they're even competent enough to build representative datasets.
Whotea@reddit
Livebench avoids that by updating their questions every month
SuperChewbacca@reddit
It's too bad Scale doesn't test more models.
jzn21@reddit
Indeed, where are the Qwens?
Verypowafoo@reddit
Well they better stop being lazy and do it.
Status_Contest39@reddit
Scale doesn't include DeepSeek and Qwen at all, 😂
Billy462@reddit
This approach also has problems. There is so much money/vested interests in AI now that I would trust a closed benchmark even less than something like livebench.
Also, the questions are not guaranteed to be private to the API endpoint, even if the testers are 100% reliable and completely beyond reproach.
In my mind there is just no 100% reliable way to compare models.
[Note: I am not saying anything about the above benchmark posted, just some general comments.]
tucnak@reddit
Thank you! Hopefully, as the private eval scene grows larger and stronger, it will dispense with the embarrassment that is the latest crop of Chinese models and their derivatives that have been trained on public evals & arena. The same should be said about OpenAI: probably the biggest offender in the West.
On a different note, surprised to see o1-preview so high in Spanish.
Status_Contest39@reddit
It's just a guess. I tried the 7b, 14b, and 32b with the same leetcode tests, but only the 32b could solve them all. I think all three use the same training datasets, so given my test results I seriously doubt what you're saying 😕
TheFrenchSavage@reddit
Which is really great.
Anything that will bring this forsaken platform down is great.
CarefulGarage3902@reddit
I'm honestly fine with it in this case because I could maybe use its help explaining the problems to me. Of course we want LLMs to be able to solve unseen problems, though.
xSnoozy@reddit
has qwen mentioned when their training data cutoff is?
InterestingAnt8669@reddit
Please stop trying to prove the worth of any LLM with data that has been out there for years. Give them brand new problems and see how they do.
Sudden-Lingonberry-8@reddit
I've been trying to get them to output Typst syntax (since they probably haven't seen it), and most models fail at it miserably; gemini 1114 is pretty decent at it tho.
Whotea@reddit
So would most humans
Sudden-Lingonberry-8@reddit
Not really, since I give them example syntax that they choose to completely ignore.
Whotea@reddit
Guess you’ve never been a TA before
TheLogiqueViper@reddit (OP)
Yes, you are right. I posted it, and now that I'm testing it on AtCoder and recent Codeforces problems, it is struggling and giving the same code again and again.
JakoDel@reddit
yep, that's about how far most (all?) LLMs can go right now. The only useful tests nowadays are those that use private prompts; there's little point otherwise when they just know all the replies by memory.
Whotea@reddit
So why can’t gpt 3.5 do it? It was trained on the internet too and it’s much bigger (175 billion params)
martinerous@reddit
Can it do ARC-AGI too? :)
tucnak@reddit
Shock: Chinese model trained on Leetcode problems can solve them.
Elsewhere: the sky is the colour blue.
Whotea@reddit
So why can’t gpt 3.5 do it? It was trained on the internet too
HiddenoO@reddit
Because not every model trained on specific data can fully represent that data, but it is comparatively easy to make a model fully represent that data. Also, "trained on the internet" doesn't mean anything. For all we know, their crawler might not have been able to crawl leetcode in a way that the model would have properly formatted question/response pairs for that site.
Whotea@reddit
So why can this one do it and not other ones? They all want to be #1 right?
xseson23@reddit
Is that the case? I thought leetcode kept these hard problems close to its chest and didn't share much. (asking genuinely)
artificial_simpleton@reddit
Most leetcode problems are in publicly available datasets with solutions etc. It is extremely easy to scrape the remaining ones for your instruction tuning.
Whotea@reddit
So why can’t gpt 3.5 do it? It was trained on the internet too and it’s much bigger (175 billion params)
artificial_simpleton@reddit
The fact it was present in the training data of the base model doesn't mean much. It is much more relevant that the data is present in the finetuning set many times over, in a format that would allow the model to reproduce the solution.
Whotea@reddit
And you think they didn’t do this for other models? Why can’t grok do this? Or Command R+?
FriedGil@reddit
Every leetcode problem has publicly available solutions on the site, not to mention a plethora of solutions scattered elsewhere around the web.
Whotea@reddit
why isn’t gpt 3.5 able do it? It was trained on the internet too and it’s much bigger (175 billion params)
tucnak@reddit
Don't worry about nuances like that; you can tell your dad Qwen 2.5 32b CODER is WINNING! 🏆
Suspicious-Trick4528@reddit
You should test the model on the newest hard problems.
0x5f3759df-i@reddit
Amazing what you can solve if the answers are in your training set.
Whotea@reddit
How’s gpt 3.5’s performance? It’s much bigger and it has that in the dataset too
WashHead744@reddit
Algorithms aren't an issue for LLMs
GROTOK3000@reddit
This is getting stupid now. Why even post this? Yes, model creators train on basic problems, leetcode questions, 1000 ways to do basic programming tasks; it shows nothing at all other than the fact that it was in the dataset.
rabbotz@reddit
Ehhh. I gave qwen 2.5 Coder 32B a basic SQL question: "this query isn't running, tell me what's wrong with it". It was basic stuff, including an unmatched parenthesis, a mismatched table alias, and a wrong cast of a timestamp to a date. It correctly identified 1 of the 3 issues and made up an issue that didn't exist. Claude easily solved the problem.
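For illustration, a hypothetical query with those three kinds of mistakes (made-up table and column names, not the commenter's actual SQL):

```python
# Hypothetical example of the kind of broken query described above
# (made-up table/column names, not the commenter's actual SQL).
broken_query = """
SELECT o.id,
       CAST(o.created_at AS INTEGER) AS order_date   -- wrong cast: the timestamp should be cast to DATE
FROM orders o
JOIN customers c ON ord.customer_id = c.id           -- mismatched table alias: 'ord' is never defined
WHERE (o.total > 100 AND c.country = 'US';           -- unmatched parenthesis: the '(' is never closed
"""
```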
ortegaalfredo@reddit
This might have been on the training set but Qwen Coder is legit. It's better than Qwen-72B and even the latest mistral at coding. Not super good at anything else, though.
CheatCodesOfLife@reddit
Yep, it's very good at writing code and turning your instructions into code which runs first shot.
Not so good at interpreting logs, planning things, explaining what large bespoke code files do, etc. But it's great for what it was built to do.
extopico@reddit
Sorry, but in real use cases, with identical prompts and challenges, Qwen 2.5 32B (Q8) is not even in the same ballpark as Claude 3.5. It just isn't, and posts like these are not helping anyone.
no_witty_username@reddit
IMO, the benchmarks we need to start using should come from real-life sources and amateur operators who don't know anything about the field. For example, someone who knows nothing about coding or programming should have one of these models try to finish any of the bounties on Upwork. If an LLM or an agent is able to finish any of those jobs, can verify its work to know the job is done to the expected specifications, and actually makes money because the bounty was paid, then I would be very impressed. The reason I specifically said the operator should be ignorant is that many people don't understand how much of a role they themselves play in these things: someone experienced in programming could steer the LLM for that last 5%, or they would have the knowledge to understand the job is done and verify that it's good. Those things skew the results by a lot.
SadWolverine24@reddit
Does Qwen follow a ~5 month release cadence? So Qwen 3.0 would be around January?
NewExplor3r@reddit
I've tested qwencoder on the latest leetcode problems, the ones that are not in the training data. I've found that most easy problems are solved on the first shot, more than half of the medium problems, and none of the hard ones.
In my experience it’s just below 4o. (4o still surpasses it)
g33khub@reddit
Free problems with solutions and discussions are most likely in the training set. Try premium hard problems, or change the problem structure and naming scheme. Claude 3.5 and gpt-4 fail far more often than they pass.
boxingdog@reddit
sorry but this is a worthless benchmark
Few_Painter_5588@reddit
Just to clarify, qwen 2.5 32b coder, or the general instruct model?
TheLogiqueViper@reddit (OP)
instruct
CarefulGarage3902@reddit
do you know if the model was quantized? and I don’t really know the difference between regular and instruct yet
Few_Painter_5588@reddit
Man, qwen 32b just can't stop winning.
sebastianmicu24@reddit
I'm hoping for a Qwen 2.5 close to 100b that will come out soon.
Few_Painter_5588@reddit
They had a 100b model in qwen 1.5, but it sucked. It's plausible that they made one for qwen 2.5, and then put it behind an API.
ScrapEngineer_@reddit
What site is this?
crazzydriver77@reddit
LeetCode
balianone@reddit
what is leetcode?
CarefulGarage3902@reddit
It's basically a website software developers/coders use to practice for technical interviews. They practice algorithms and stuff.
TheLogiqueViper@reddit (OP)
leetcode -> problems -> shortest path -> hard problems
Anjz@reddit
If you like 32B instruct, 32B coder is even better.
Double_Sherbert3326@reddit
overfit.
charmander_cha@reddit
So let's see if we can get the data from these private evals to create our fine-tuning datasets.
Psychedelic_Traveler@reddit
Which quant are you using?
Ok_Maize_3709@reddit
I would appreciate some more details on what we are looking at
TheLogiqueViper@reddit (OP)
Leetcode -> problems -> shortest path -> hard problems
lordpuddingcup@reddit
Leetcode tests