Another coding model that achieves strong performance on software engineering tasks, including a 37.2% resolve rate on SWE-Bench Verified.
Posted by Ornery_Local_6814@reddit | LocalLLaMA | View on Reddit | 18 comments
coding_workflow@reddit
This is not a new model, only a fine-tune based on Qwen Coder, so it has the same context limits.
Fine-tuning can improve models a bit and make them look better on benchmarks, but I have serious doubts about real-world use.
Wonderful_Second5322@reddit
The proliferation of models claiming superiority over QwQ or Qwen Coder 32B, or even R1 itself (not the distills), at comparable parameter counts is, frankly, untenable. Furthermore, assertions of outperforming o1-mini with a mere 32B-parameter model are nothing more than hot air. Let me reiterate: the benchmarks proffered by these entities are largely inconsequential and lack substantive merit. Unless such benchmarks demonstrably exhibit performance exceeding that of 4o-mini, they deserve no credence.
reginakinhi@reddit
You know... I really enjoy being specific and concise with proper terminology, but this 'Sphinx being given a thesaurus and then failing to socialize while using it' thing you are doing really isn't working.
YearnMar10@reddit
Fancy words. Where’d you learn those?
HokkaidoNights@reddit
!remindme 2 weeks
DinoAmino@reddit
Would be nice to see evals comparing against the Qwen Coder model they fine-tuned on top of. IFEval usually takes a big hit after fine-tuning an instruct model. And math scores shed light on general reasoning abilities.
audioen@reddit
They left the comparison to the base model out, probably because the base model is as good as or better than their own work.
Wonderful_Second5322@reddit
Nope, this isn't better than mistral. Fucking shit
Trojblue@reddit
Probably makes sense to think of it as DeepSeek V3 distilled on OpenHands tasks.
ResearchCrafty1804@reddit
I am very curious how this model would score on other coding benchmarks like LiveCodeBench.
Good scores across many benchmarks would give us confidence that the model wasn't trained on one benchmark's data to cheat its score.
CockBrother@reddit
It's not just an LLM. It's a fine-tuned model plus an agent framework, so the benchmarks aren't really apples to apples. Could be good.
CockBrother@reddit
Can it code a competent game of Snake though? My company is running on Snake written in COBOL, with some of the original code from the 1970s still kicking. We haven't been able to replace this system due to the high development costs.
SWE-Bench? Fah. Snake is the real benchmark. I know because it's all I see in YouTube videos.
Iron-Over@reddit
!remindMe 1 week
Charuru@reddit
It's not a model... it's a scaffolding.
Accomplished_Yard636@reddit
Remind me when it can vibe code a rocket by itself
Unlucky-Message8866@reddit
I would be happy if I could just prompt "fix the mess you created"
ConiglioPipo@reddit
!remindMe 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2025-04-07 21:16:44 UTC to remind you of this link