Another coding model that achieves strong performance on software engineering tasks, including a 37.2% resolve rate on SWE-Bench Verified.
Posted by Ornery_Local_6814@reddit | LocalLLaMA | View on Reddit | 18 comments
coding_workflow@reddit
This is not a new model, only a fine-tune based on Qwen Coder, so it has the same context limits.
Fine-tuning can improve models a bit and make them look better on benchmarks, but I have serious doubts about real-world use.
Wonderful_Second5322@reddit
The proliferation of models claiming superiority over QwQ or Qwen Coder 32B, or even R1 itself (not the distills), at comparable parameter counts is, frankly, untenable. Furthermore, assertions of outperforming o1-mini with a mere 32B-parameter model are nothing more than hot air. Let me reiterate: the benchmarks proffered by these entities are largely inconsequential and lack substantive merit. Unless such benchmarks demonstrably exhibit performance exceeding that of 4o-mini, they deserve no credence.
reginakinhi@reddit
You know... I really enjoy being specific and concise with proper terminology, but this 'Sphinx being given a thesaurus and then failing to socialize while using it' thing you are doing really isn't working.
YearnMar10@reddit
Fancy words. Where’d you learn those?
HokkaidoNights@reddit
!remindme 2 weeks
DinoAmino@reddit
Would be nice to see evals comparing against the Qwen Coder model they fine-tuned on top of. IFEval usually takes a big hit after fine-tuning an instruct model. And math scores shed light on general reasoning abilities.
audioen@reddit
They left the comparison to the base model out, probably because the base model is as good as or better than their own work.
Wonderful_Second5322@reddit
Nope, this isn't better than mistral. Fucking shit
Trojblue@reddit
Probably makes sense to think of it as DeepSeek V3 distilled on OpenHands tasks.
ResearchCrafty1804@reddit
I am very curious how this model would score on other coding benchmarks like LiveCodeBench.
Good scores across many benchmarks would give us confidence that the model wasn't trained on one benchmark's data to cheat its score.
CockBrother@reddit
It's not just an LLM. It's a fine-tuned model plus an agent framework, so the benchmarks aren't really apples to apples. Could be good.
CockBrother@reddit
Can it code a competent game of Snake though? My company is running on Snake written in COBOL, with some of the original code from the 1970s still kicking. We haven't been able to replace this system due to the high development costs.
SWE-Bench? Fah. Snake is the real benchmark. I know because it's all I see in YouTube videos.
Iron-Over@reddit
!remindMe 1 week
Charuru@reddit
It's not a model... it's a scaffolding.
Accomplished_Yard636@reddit
Remind me when it can vibe code a rocket by itself
Unlucky-Message8866@reddit
I would be happy if I could just prompt "fix the mess you created"
ConiglioPipo@reddit
!remindMe 1 week
RemindMeBot@reddit
I will be messaging you in 7 days on 2025-04-07 21:16:44 UTC to remind you of this link