FINAL-Bench/Darwin-36B-Opus · Hugging Face
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 20 comments
https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF
Darwin-36B-Opus is a 36-billion-parameter mixture-of-experts (MoE) language model produced by the Darwin V7 evolutionary breeding engine from two publicly available parents:
- Father: Qwen/Qwen3.6-35B-A3B — the foundation MoE with hybrid attention and 256 routed experts.
- Mother: hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled — a Claude Opus 4.6 reasoning-distilled variant of the same Father.
Darwin V7 recombines these two parents into a single descendant that preserves the Mother's distilled chain-of-thought behavior while retaining the structural fidelity of the Father's expert topology. The breeding process is fully automated and produces a deployable bfloat16 checkpoint in under an hour on a single GPU.
On the GPQA Diamond benchmark — 198 graduate-level questions in physics, chemistry, and biology — Darwin-36B-Opus achieves 88.4%, establishing it as the highest-performing model in the Darwin family and extending the series' record of producing state-of-the-art open models through evolution rather than retraining.
Chromix_@reddit
This looks more like creative benchmarking than a model improvement.
The model card reports 88.4% on GPQA Diamond, putting it on par with Qwen3.5-397B-A17B and ahead of Kimi-K2.5. What the benchmark table doesn't mention is that the original Qwen 3.6 35B A3B has a reported score of 86% on that benchmark, and yet the Darwin model somehow scores better. Looking at the aggregate results, the Darwin model has a baseline of just 73.2%. If it answers incorrectly, it gets at least one more retry with a majority vote over 8 runs. It's widely known that throwing more inference time at a benchmark improves results, so comparing these numbers against model results achieved without retry-on-fail seems rather unfair. The Kimi K2.5 score is an average of 8 runs, done simply to reduce variance. And GPQA Diamond only has 198 questions, so retry-on-failure has a meaningful chance of producing inflated results: a single question answered correctly after a failed first attempt is worth 0.5% of the final score.
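To put rough numbers on that: here's a toy Monte Carlo of the inflation effect. Everything in it is a hypothetical stand-in — each attempt is modeled as an independent draw at the 73.2% base rate, and the majority-of-8 detail is ignored, so this overstates the boost (real failures correlate on hard questions), but it shows the direction clearly:

```python
import random

def simulate(n_questions=198, p_correct=0.732, retries=1, trials=5_000):
    """Average benchmark score when each wrong answer gets extra attempts.

    Toy model: every attempt is an independent Bernoulli(p_correct) draw.
    That's optimistic (hard questions stay hard on retry), but it shows
    how retry-on-fail inflates a score on a 198-question benchmark.
    """
    total = 0.0
    for _ in range(trials):
        score = 0
        for _ in range(n_questions):
            attempts = 1 + retries
            if any(random.random() < p_correct for _ in range(attempts)):
                score += 1
        total += score / n_questions
    return total / trials

base = simulate(retries=0)     # plain single-shot accuracy, ~0.732
boosted = simulate(retries=1)  # one retry on failure
print(f"single-shot ~ {base:.3f}, retry-on-fail ~ {boosted:.3f}")
```

Under these (too generous) independence assumptions, one retry alone pushes a 73.2% model past 90% — which is exactly why comparing retry numbers against single-shot numbers is meaningless.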
AvidCyclist250@reddit
Rigorous thinking award. Thanks for the legwork.
AfternoonOk5482@reddit
How is this not just a merge?
IrisColt@reddit
It's sus, sorry.
YouCantMissTheBear@reddit
CLAUDE X HAPSBURG
IrisColt@reddit
Charwin II of Spain, heh
MmmmMorphine@reddit
Inbreeding for everyone!
denoflore_ai_guy@reddit
I love it when “check this out” posts lead to links where big words are used to make it seem like it’s not full of shit, but anyone human knows it very likely is. Will edit with a post-checkout update if I’m wrong on first impressions.
denoflore_ai_guy@reddit
The “Darwin V7 evolutionary breeding engine” is just SLERP at alpha 0.84. Their own commit history says so before the marketing pass scrubbed it. You can find a one-line mergekit config that does this in any HF discussion thread from 2024.
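For reference, "SLERP at alpha 0.84" is one function per tensor pair, not an engine. A minimal sketch (function name and the flat-vector treatment are my own illustration, not their code):

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between two weight tensors.

    Treats each tensor as a flat vector; falls back to a plain linear
    blend when the two vectors are nearly parallel (sin(omega) ~ 0).
    """
    a, b = w_a.ravel(), w_b.ravel()
    # angle between the two weight vectors
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < 1e-6:
        merged = (1 - alpha) * a + alpha * b  # degenerate case
    else:
        merged = (np.sin((1 - alpha) * omega) * a
                  + np.sin(alpha * omega) * b) / np.sin(omega)
    return merged.reshape(w_a.shape)

# the whole "breeding engine": this, tensor by tensor, at alpha = 0.84
child = slerp(np.random.randn(4, 4), np.random.randn(4, 4), alpha=0.84)
```

That's the entire trick. mergekit wraps it in a few lines of YAML; nothing about it involves training or evolution.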
The Mother was LoRA-SFT’d on 14,233 Claude Opus 4.6 reasoning traces, which is the exact shit Anthropic’s ToS explicitly says “don’t do” — and then they put “claude-opus” right in the tags and named the kid “Opus” anyway.
Bold strategy. I guess Anthropic only cares about shit and bans you if they think your acct was used by a kid or you try and use it to make a thermonuclear device or something. Anyways.
Mother’s OWN card flags her MMLU-Pro eval as a 70-question smoke test, author’s words, not release-quality.
Darwin then promotes her into “robust reasoning donor whose trajectories we preserve.”
Evidence laundering w/ extra steps for street cred. Nice.
denoflore_ai_guy@reddit
The HF eval record is 213 bytes. No raw outputs, no seeds, no prompts, no extraction script. Not auditable. Just a number on a sticky note.
Spec on the card is 24 Q heads + 4 KV.
The ACTUAL config.json says 16 and 2.
Soooooooo they couldn’t even paste their own architecture in correctly, buuuuuut I’m supposed to trust the eval protocol?
Lol gtfo can’t be serious.
And FINAL-Bench is a benchmark org grading their own model on their own leaderboard against a stack of comparison entries half of which I can’t independently verify in one sitting.
They put themselves at rank 3, TIED w/ a 397B model.
TOTALLY not weird about that at all. Like that world’s-smartest-South-Korean guy whose “IQ” was vouched for by his business partner, a theologian, on a test they made themselves. Right. Moving on.
denoflore_ai_guy@reddit
Here’s the thing that pisses me off. This is the local LLM equivalent of those ab-zapper belts from late-night TV. You strap it on, it twitches your muscles a little, the box says “scientifically proven,” and people who don’t know any better think they’re getting fit on the couch. They’re not. They’re getting shocked. The marketing exists because the truth (lift weights, eat less) is boring and effortful.
Same energy here. The truth is “we SLERP’d two Qwen variants, one of which was trained on Claude outputs we shouldn’t have used, and the merge regressed the model so we papered over it with 24-shot best-of-N.” That’s a paragraph. Nobody clicks on that paragraph.
So instead you get “Darwin V7 Evolutionary Breeding Engine.” Father × Mother. Hybrid Vigor. Proto-AGI tag. 88.4% GPQA Diamond. Tied with 397B. Apache 2.0. Hero badges. Genealogy charts. Compute capacity dressed up as cognitive density. It doesn’t physically hurt anyone. Bartowski’s quants still work. The Father underneath is still a real model. Worst case you download 21GB of Q4_K_M and find out it’s mid.
But it wastes everyone’s fucking time. It pollutes leaderboards ppl actually look at.
It teaches new people in the space that this is what a model release looks like, so the next ten merges show up wearing the same costume.
It makes the model maker feel important because there’s a bar chart with their name at rank 3.
And it forces anyone who actually cares about the field to spend an afternoon writing posts like this one to explain why no, your cousin shouldn’t actually pull this for “AGI at home.”
This is why we can’t have nice things.
Dumb people fall for it because the words sound right.
Lazy people fall for it because checking is work.
Honest people lose because honest doesn’t make a hero badge.
And my AuDHD justice-sensitive “fuck this fake bullshit in particular” ass pathologically has to expose it, because my brain doesn’t allow me to do anything else or I physically feel it. God, I hate my brain as much as I hate karma-farming assholes who make fake shit.
New_Spray_7886@reddit
“But it wastes everyone’s fucking time. It pollutes leaderboards ppl actually look at.”
I downloaded a couple of quants of the 3.5 version of this. It was one of the worst varieties of Qwen-3.5 I tried: completely unusable, it looped endlessly. I usually put models I’ll seldom use into long-term storage on my NAS, considering the bandwidth they take to download — this one I just deleted, since it was a complete waste.
MmmmMorphine@reddit
How dare you insult my magic rectum rock!
But yes, this seems like a lot of bullshit. My hopes came to a screeching halt at the "train it in a hour!" part.
Unfortunately. Seems like an interesting potential strategy but evolutionary algorithms are usually pretty damn slow and inefficient, if they're relevant at all. Usually they aren't, with far more "intelligent" statistical approaches available.
Now if I had a server farm and could run this sort of thing on RYS style duplicate layers, per-tensor quantization, and the new kid on the block, selectable attention-mechanisms per layer (see superapriel)... I'd still use a different mechanism
denoflore_ai_guy@reddit
merges ARE that fast, because no training is happening lol. The card pretending that’s a breakthrough instead of a Tuesday Python script is the whole tell.
And yeah, evolutionary on weights is dumb. The search space is too big and there’s no real fitness signal; they just ran SLERP four times and picked the highest GPQA. That’s not evolution, that’s A/B testing with extra steps.
RYS + per-tensor quant + Superapriel-style selectable attention is the actually interesting stack. If anyone was gonna do evolutionary on top of that the only version I’d respect is fitness-on-routing not fitness-on-weights. Evolve which attention fires per layer.
Weights you train, routing you can maybe search.
This ain’t that tho. This is mergekit in a poncho.
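For what the fitness-on-routing version could look like, here's a toy sketch. Everything is hypothetical: the "fitness" is a made-up stand-in for a real benchmark run, and the genome is just a binary choice of attention mechanism per layer — the point is only that you evolve routing decisions, not weights:

```python
import random

N_LAYERS = 12
# pretend-optimal routing the search is trying to find; in reality the
# fitness would come from actually evaluating the model, not a lookup
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

def fitness(routing):
    # stand-in for "run an eval with this per-layer attention choice"
    return sum(r == t for r, t in zip(routing, TARGET)) / N_LAYERS

def evolve(pop_size=20, generations=50, mut_rate=0.1, seed=0):
    """Elitist mutation-only search over per-layer routing bits."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_LAYERS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the top half
        children = [[1 - g if rng.random() < mut_rate else g
                     for g in parent]             # bit-flip mutation
                    for parent in survivors]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

Even this toy makes the contrast obvious: a discrete genome with a real fitness signal is searchable; a billion-dimensional weight space evaluated by "highest GPQA out of four merges" is not.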
Monkey_1505@reddit
There is no opus chain of thought to distill in the first place.
FullOf_Bad_Ideas@reddit
Stochastic retry evaluation does not seem valid, especially if other models don't get the same chance (and nothing indicates they do). Sharding questions by GPU also seems weird; I don't know what that means. Do non-Darwin-Opus models get the same treatment, or is the evaluation on them fairer? This seems like a way to boost scores, but it makes the comparison just not fair.
Cool-Chemical-5629@reddit
Took me a while to realize it said "Mother: hesamation..." and not "Mother: insemination", I guess I need to find my glasses. 😂
CalligrapherFar7833@reddit
Too many NSFW models will make you say that :D
Kodix@reddit
Love experiments like this. Can't wait to check it out, hopefully it performs well, but either way - thoroughly interesting. Thank you for sharing!