FINAL-Bench/Darwin-36B-Opus · Hugging Face
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 20 comments
https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF
Darwin-36B-Opus is a 36-billion-parameter mixture-of-experts (MoE) language model produced by the Darwin V7 evolutionary breeding engine from two publicly available parents:
- Father: Qwen/Qwen3.6-35B-A3B — the foundation MoE with hybrid attention and 256 routed experts.
- Mother: hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled — a Claude Opus 4.6 reasoning-distilled variant of the same Father.
Darwin V7 recombines these two parents into a single descendant that preserves the Mother's distilled chain-of-thought behavior while retaining the structural fidelity of the Father's expert topology. The breeding process is fully automated and produces a deployable bfloat16 checkpoint in under an hour on a single GPU.
On the GPQA Diamond benchmark — 198 graduate-level questions in physics, chemistry, and biology — Darwin-36B-Opus achieves 88.4%, establishing it as the highest-performing model in the Darwin family and extending the series' record of producing state-of-the-art open models through evolution rather than retraining.
Chromix_@reddit
This looks more like creative benchmarking than a model improvement.
The model card reports 88.4% on GPQA Diamond, putting it on par with Qwen3.5-397B-A17B and ahead of Kimi-K2.5. What the benchmark table doesn't mention is that the original Qwen 3.6 35B A3B has a reported score of 86% on that benchmark, and yet the Darwin model somehow scores better. Looking at the aggregate results, the Darwin model has a baseline of just 73.2%. If it answers incorrectly, it gets at least one more retry with a majority vote over 8 runs. It's widely known that throwing more inference time at a benchmark improves results, so comparing these numbers against model results achieved without retry-on-fail seems rather unfair. The Kimi K2.5 score is an average of 8 runs, done simply to reduce variance. And GPQA Diamond only has 198 questions, so retry-on-failure has a meaningful chance of producing inflated results: a single question answered correctly after a failed first attempt is worth 0.5% of the final score.
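To put rough numbers on that: here's a toy Monte Carlo of the inflation effect. Everything in it is a hypothetical stand-in — each attempt is modeled as an independent draw at the 73.2% base rate, and the majority-of-8 detail is ignored, so this overstates the boost (real failures correlate on hard questions), but it shows the direction clearly:

```python
import random

def simulate(n_questions=198, p_correct=0.732, retries=1, trials=5_000):
    """Average benchmark score when each wrong answer gets extra attempts.

    Toy model: every attempt is an independent Bernoulli(p_correct) draw.
    That's optimistic (hard questions stay hard on retry), but it shows
    how retry-on-fail inflates a score on a 198-question benchmark.
    """
    total = 0.0
    for _ in range(trials):
        score = 0
        for _ in range(n_questions):
            attempts = 1 + retries
            if any(random.random() < p_correct for _ in range(attempts)):
                score += 1
        total += score / n_questions
    return total / trials

base = simulate(retries=0)     # plain single-shot accuracy, ~0.732
boosted = simulate(retries=1)  # one retry on failure
print(f"single-shot ~ {base:.3f}, retry-on-fail ~ {boosted:.3f}")
```

Under these (too generous) independence assumptions, one retry alone pushes a 73.2% model past 90% — which is exactly why comparing retry numbers against single-shot numbers is meaningless.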
AvidCyclist250@reddit
Rigorous thinking award. Thanks for the legwork.
AfternoonOk5482@reddit
How is this not just a merge?
IrisColt@reddit
It's sus, sorry.
YouCantMissTheBear@reddit
CLAUDE X HAPSBURG
IrisColt@reddit
Charwin II of Spain, heh
MmmmMorphine@reddit
Inbreeding for everyone!
denoflore_ai_guy@reddit
I love it when “check this out” posts lead to links where big words are used to make it seem like it’s not full of shit, but anyone human knows it very likely is. Will edit with a post-checkout update if I’m wrong on first impressions.
denoflore_ai_guy@reddit
The “Darwin V7 evolutionary breeding engine” is just SLERP at alpha 0.84. Their own commit history says so before the marketing pass scrubbed it. You can find a one-line mergekit config that does this in any HF discussion thread from 2024.
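For reference, "SLERP at alpha 0.84" is one function per tensor pair, not an engine. A minimal sketch (function name and the flat-vector treatment are my own illustration, not their code):

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between two weight tensors.

    Treats each tensor as a flat vector; falls back to a plain linear
    blend when the two vectors are nearly parallel (sin(omega) ~ 0).
    """
    a, b = w_a.ravel(), w_b.ravel()
    # angle between the two weight vectors
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < 1e-6:
        merged = (1 - alpha) * a + alpha * b  # degenerate case
    else:
        merged = (np.sin((1 - alpha) * omega) * a
                  + np.sin(alpha * omega) * b) / np.sin(omega)
    return merged.reshape(w_a.shape)

# the whole "breeding engine": this, tensor by tensor, at alpha = 0.84
child = slerp(np.random.randn(4, 4), np.random.randn(4, 4), alpha=0.84)
```

That's the entire trick. mergekit wraps it in a few lines of YAML; nothing about it involves training or evolution.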
The Mother was LoRA-SFT’d on 14,233 Claude Opus 4.6 reasoning traces, which is the exact shit Anthropic’s ToS explicitly says “don’t do” — and then they put “claude-opus” right in the tags and named the kid “Opus” anyway.
Bold strategy. I guess Anthropic only cares about shit and bans you if they think your acct was used by a kid or you try and use it to make a thermonuclear device or something. Anyways.
Mother’s OWN card flags her MMLU-Pro eval as a 70-question smoke test, author’s words, not release-quality.
Darwin then promotes her into “robust reasoning donor whose trajectories we preserve.”
Evidence laundering w/ extra steps for street cred. Nice.
denoflore_ai_guy@reddit
The HF eval record is 213 bytes. No raw outputs, no seeds, no prompts, no extraction script. Not auditable. Just a number on a sticky note.
Spec on the card is 24 Q heads + 4 KV.
The ACTUAL config.json says 16 and 2.
Soooooooo they couldn’t even paste their own architecture in correctly, buuuuuut I’m supposed to trust the eval protocol?
Lol gtfo can’t be serious.
And FINAL-Bench is a benchmark org grading their own model on their own leaderboard against a stack of comparison entries half of which I can’t independently verify in one sitting.
They put themselves at rank 3, TIED w/ a 397B model.
TOTALLY not weird about that at all. Like that world’s-smartest-South-Korean guy whose “IQ” was vouched for by his business partner, a theologian, on a test they made themselves. Right. Moving on.
denoflore_ai_guy@reddit
Here’s the thing that pisses me off. This is the local LLM equivalent of those ab-zapper belts from late-night TV. You strap it on, it twitches your muscles a little, the box says “scientifically proven,” and people who don’t know any better think they’re getting fit on the couch. They’re not. They’re getting shocked. The marketing exists because the truth (lift weights, eat less) is boring and effortful.
Same energy here. The truth is “we SLERP’d two Qwen variants, one of which was trained on Claude outputs we shouldn’t have used, and the merge regressed the model so we papered over it with 24-shot best-of-N.” That’s a paragraph. Nobody clicks on that paragraph.
So instead you get “Darwin V7 Evolutionary Breeding Engine.” Father × Mother. Hybrid Vigor. Proto-AGI tag. 88.4% GPQA Diamond. Tied with 397B. Apache 2.0. Hero badges. Genealogy charts. Compute capacity dressed up as cognitive density. It doesn’t physically hurt anyone. Bartowski’s quants still work. The Father underneath is still a real model. Worst case you download 21GB of Q4_K_M and find out it’s mid.
But it wastes everyone’s fucking time. It pollutes leaderboards ppl actually look at.
It teaches new people in the space that this is what a model release looks like, so the next ten merges show up wearing the same costume.
It makes the model maker feel important because there’s a bar chart with their name at rank 3.
And it forces anyone who actually cares about the field to spend an afternoon writing posts like this one to explain why no, your cousin shouldn’t actually pull this for “AGI at home.”
This is why we can’t have nice things.
Dumb people fall for it because the words sound right.
Lazy people fall for it because checking is work.
Honest people lose because honest doesn’t make a hero badge.
And my AuDHD justice-sensitive “fuck this fake bullshit in particular” ass pathologically has to expose it, because my brain doesn’t allow me to do anything else or I physically feel it. God, I hate my brain as much as I hate karma-farming assholes who make fake shit.
New_Spray_7886@reddit
“But it wastes everyone’s fucking time. It pollutes leaderboards ppl actually look at.”
I downloaded a couple of quants of the 3.5 version of this. It was one of the worst varieties of Qwen-3.5 I tried: completely unusable, it looped endlessly. I usually put models I’ll seldom use into long-term storage on my NAS, considering the bandwidth they take to download — this one I just deleted, since it was a complete waste.
MmmmMorphine@reddit
How dare you insult my magic rectum rock!
But yes, this seems like a lot of bullshit. My hopes came to a screeching halt at the "train it in a hour!" part.
Unfortunately. Seems like an interesting potential strategy but evolutionary algorithms are usually pretty damn slow and inefficient, if they're relevant at all. Usually they aren't, with far more "intelligent" statistical approaches available.
Now if I had a server farm and could run this sort of thing on RYS style duplicate layers, per-tensor quantization, and the new kid on the block, selectable attention-mechanisms per layer (see superapriel)... I'd still use a different mechanism
denoflore_ai_guy@reddit
merges ARE that fast, because no training is happening lol. The card pretending that’s a breakthrough instead of a Tuesday Python script is the whole tell.
And yeah, evolutionary on weights is dumb. The search space is too big and there’s no real fitness signal; they just ran SLERP four times and picked the highest GPQA. That’s not evolution, that’s A/B testing with extra steps.
RYS + per-tensor quant + Superapriel-style selectable attention is the actually interesting stack. If anyone was gonna do evolutionary on top of that the only version I’d respect is fitness-on-routing not fitness-on-weights. Evolve which attention fires per layer.
Weights you train, routing you can maybe search.
This ain’t that tho. This is mergekit in a poncho.
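For what the fitness-on-routing version could look like, here's a toy sketch. Everything is hypothetical: the "fitness" is a made-up stand-in for a real benchmark run, and the genome is just a binary choice of attention mechanism per layer — the point is only that you evolve routing decisions, not weights:

```python
import random

N_LAYERS = 12
# pretend-optimal routing the search is trying to find; in reality the
# fitness would come from actually evaluating the model, not a lookup
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

def fitness(routing):
    # stand-in for "run an eval with this per-layer attention choice"
    return sum(r == t for r, t in zip(routing, TARGET)) / N_LAYERS

def evolve(pop_size=20, generations=50, mut_rate=0.1, seed=0):
    """Elitist mutation-only search over per-layer routing bits."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_LAYERS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the top half
        children = [[1 - g if rng.random() < mut_rate else g
                     for g in parent]             # bit-flip mutation
                    for parent in survivors]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

Even this toy makes the contrast obvious: a discrete genome with a real fitness signal is searchable; a billion-dimensional weight space evaluated by "highest GPQA out of four merges" is not.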
Monkey_1505@reddit
There is no opus chain of thought to distill in the first place.
FullOf_Bad_Ideas@reddit
Stochastic retry evaluation does not seem valid, especially if other models don't get the same chance (and nothing indicates they do). Sharding questions by GPU also seems weird; I don't know what that means. Do non-Darwin-Opus models get the same treatment, or is the evaluation on them fairer? This seems like a way to boost scores, but it makes the comparison just not fair.
Cool-Chemical-5629@reddit
Took me a while to realize it said "Mother: hesamation..." and not "Mother: insemination", I guess I need to find my glasses. 😂
CalligrapherFar7833@reddit
Too many NSFW models will make you say that :D
Kodix@reddit
Love experiments like this. Can't wait to check it out, hopefully it performs well, but either way - thoroughly interesting. Thank you for sharing!