Distills of Opus 4.6: real improvements or hype?
Posted by StupidScaredSquirrel@reddit | LocalLLaMA | View on Reddit | 30 comments
I've been seeing all these models on Hugging Face finetuned with synthetic data from Opus 4.6 to get them to structure output like it. Is there any merit to any of them, or are they just chasing downloads?
Cool-Chemical-5629@reddit
The Gemma 4 Opus distill is terribly broken, yet no one seems to mention that and it has lots of likes. I suspect people started liking models without ever testing them. I admit I use that heart icon as a "try later" feature, because it shows up in my activity and I can get back to it later when I have more time, but once I test a model and find it's not useful for me, it loses that heart icon on my end. Don't get me wrong, I do appreciate the effort people put into making these models, but if a model turns out to be more hype than usability, I have no reason to tell the authors I like it, because it wouldn't be honest and it would only lead to misleading stats, pushing useless models to the top, and that's not cool.
arman-d0e@reddit
As someone who is trying to figure out Gemma 4, it's been a mess lol. Sorry for any troubles you had with those. If you have issues, definitely no need to leave a like. I know the MoE has a lot more issues with Unsloth finetuning; if I'm not mistaken, a PR should be merged soon that actually targets the proper layers, which will hopefully fix the artifacts and jankiness with the model.
a_beautiful_rhind@reddit
If you are a good tuner and you use Opus to make good data, the model will theoretically improve. Unfortunately, most of what's out there is just grift.
Zc5Gwu@reddit
What do the grifters gain? Are huggingface downloads somehow worth something?
bolmer@reddit
I have seen grifters get jobs, we all have
a_beautiful_rhind@reddit
Reputation and "credibility", in theory, kinda like all those people shilling vibecoded projects.
ilintar@reddit
Hype. I benchmarked one and it scored a few percentage points below the original, which is unsurprising since the chain-of-thought formats don't match and the finetune is a small LoRA run with very few epochs, which is unlikely to make the model better in any meaningful way.
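For context, "a small LoRA finetune with very few epochs" is roughly the setup sketched below; the hyperparameters, file name, and base model are made up for illustration, and it assumes recent datasets/peft/trl versions, not any specific distill's actual recipe.

```python
# Sketch only: illustrative hyperparameters and paths, not a real distill's config.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical distill dataset with a "messages" column of Opus-generated chats.
dataset = load_dataset("json", data_files="opus_distill.jsonl", split="train")

peft_config = LoraConfig(
    r=16,                                  # low rank: only a tiny adapter gets trained
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # a small subset of the model's layers
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",      # stand-in base model
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="out", num_train_epochs=2, per_device_train_batch_size=1),
)
trainer.train()  # a couple of epochs on a small adapter; the base weights stay frozen
```

A run like this can shift surface style (e.g. the CoT format) but rarely moves benchmark scores upward.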
Myrkkeijanuan@reddit
Anthropic summarizes Claude's reasoning with another model tasked with poisoning and redacting the output you see.
Those who can cryptanalyze and then reverse-engineer the signatures don't leak even a hint that they could try.
Thrumpwart@reddit
Largely no. I tried an early revision - Qwen 3 14B high reasoning I believe it was called. Did not impress me.
Tried one of the newer distills and I was not impressed.
Stick with your wife.
Blindax@reddit
I tried the Qwen 3.5 27b version. From my tests, the model was slightly better at searching on the web and summarising information, but on reasoning tasks, the distilled version was much much worse.
ForsookComparison@reddit
It is 100% hype; they are significantly worse than their base models.
FatheredPuma81@reddit
I'm actually shocked you aren't being downvoted to hell and back lol. I once said you shouldn't trust a finetune that makes bold claims and doesn't show benchmarks, and got downvoted to hell.
ForsookComparison@reddit
Nobody joining these threads has bothered to try these models so they have no leg to stand on anyways
MrBIMC@reddit
So I've been running qwopus-27b in an iq4-nl quant throughout this morning. It's connected to Roo and was given a task to set up a Hermes agent in a container and connect it via gateway to OpenWebUI, with my inference setup folder mounted as the home directory of that Hermes container.
From what I see, it's quite decent, but I wouldn't call it better than the baseline Qwen 27B.
What I found is that while the chain of thought is shorter, it still completes tasks more slowly, because qwopus tends to switch to ask mode after the architect step and uses that ask mode simply to reiterate the conclusion the architect already reached.
So a full-flow task has a lot of redundant steps, even though it thinks less on each one. I wouldn't call it a win.
GrungeWerX@reddit
I've tested a couple. They're all hype. I got worse results.
"Never argue with an idiot. They will drag you down to their level, then beat you with experience."
valkarias@reddit
I've commented this before. I've seen no benchmarks or comparisons on these distills.
This ByteDance paper (please read it, it's fire)
https://arxiv.org/html/2601.06002v1
states that summarized CoT WILL degrade the performance of base models.
It's safe to assume that most CoT distill datasets on HF are summarized. This is true for Gemini, Claude, and probably any other closed-source model.
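To make "summarized" concrete, here's a made-up example of what such a distill record tends to look like; the "reasoning" field is a provider-written summary, not the raw chain of thought the paper is talking about.

```python
# Made-up illustration of a typical CoT-distill record; not from any real dataset.
record = {
    "prompt": "How many primes are there below 20?",
    "reasoning": "Listed the primes below 20 and counted them.",  # summarized trace exposed by the API
    "answer": "There are 8: 2, 3, 5, 7, 11, 13, 17, 19.",
}
```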
COAGULOPATH@reddit
Plus the real reasoning is usually ugly/messy/rambling. Not something you want in front of a customer.
FinBenton@reddit
I tried a couple for writing. I think they wrote slightly differently but made a few more mistakes. I'd say just hype; I deleted them and went back to unsloth/bartowski.
Altruistic_Heat_9531@reddit
Yes, but it's more like putting truffles on a dish; the technique and the rest of the details matter. OmniCoder by Tesslate genuinely improved my code search / summarization: no weird looping on tool calls, no getting stuck on a task. https://huggingface.co/Tesslate/OmniCoder-9B
This is my actual everyday model, running 24 hours a day.
supermazdoor@reddit
One word: real! The Qwen 27B one killed it. At Q6 it runs for 15–20 minutes, looping non-stop, accomplishing tasks that even minimal m2.5 struggles with.
FatheredPuma81@reddit
A lot of them are fakes, but some are actually real. Look at what dataset they're trained on, because a LOT of them are trained on an Opus 4.5 dataset that has like 200 conversations, which is just not enough imo. There are others trained on hundreds of thousands of agentic tool calls; those are real and should help improve quality.
quantum_splicer@reddit
You'd want to follow Apple's approach, "simple self-distillation".
If you took Apple's approach but used multiple high-quality datasets, you'd probably be able to improve these models' coding ability and tool calling. You'd essentially be increasing precision, suppressing distractor tails where precision matters, while preserving useful diversity where exploration matters (rough sketch of the idea below).
A good analogy is cognitive rigidity vs. cognitive flexibility: the two need to be balanced, and to do well at tasks that flexibility needs to stay within a certain range. On one end you have the equivalent of ADHD-style erroneous exploration, on the other obsessive-compulsive disorder.
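Here's a rough sketch of a self-distillation data pass in that spirit; every name in it (model.generate, task.verify, the task list) is a placeholder I'm making up, not Apple's actual recipe or any real library's API.

```python
# Hedged sketch of a simple self-distillation data pass; all interfaces are placeholders.
import random

def self_distill_dataset(model, tasks, n_samples=8):
    """Build a finetuning set from the model's own filtered outputs."""
    kept = []
    for task in tasks:
        candidates = [model.generate(task.prompt) for _ in range(n_samples)]
        if task.verifiable:
            # Precision mode: keep only outputs that pass a hard check
            # (unit tests, exact-match answer, valid tool-call schema, ...),
            # suppressing the "distractor tail" of wrong completions.
            good = [c for c in candidates if task.verify(c)]
            kept += [(task.prompt, c) for c in good]
        else:
            # Exploration mode: no hard check exists, so keep a small,
            # deduplicated, diverse subset instead of collapsing to one answer style.
            unique = list(dict.fromkeys(candidates))
            kept += [(task.prompt, c) for c in random.sample(unique, min(2, len(unique)))]
    return kept
```

Mixing several high-quality task sets into `tasks` is where the "multiple datasets" part would come in.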
quantum_splicer@reddit
I am certain that when I reviewed the source code for Claude Code, it had anti-distilling measures in it (https://news.ycombinator.com/item?id=47585239).
So I'm wondering whether the training data would be poisoned.
dash_bro@reddit
I think they're good improvements.
Between going straight to /nothink versions and overthinking, the 4.6 opus finetunes on qwen3.5 are probably the best middle ground.
My 27B qwen3.5 distilled with opus beats everything in its weight class and a tier above, up to the 80B-A3B qwen coder next. Unless qwen3.6 ships with the overthinking fixed, the distilled finetunes definitely perform better.
traveddit@reddit
I tried at least five of these things, up to v3 and the GLM Flash version that used 4.5, and they're all trash. For conversation, maybe the model sounds "more like Claude", but to be fair, if you throw Claude's system prompt at any of the base models you'd get something just as close, if not better.
Ok_Try_877@reddit
I used the original and the best-selling opus-tuned one for Qwen 3.5 27B, and for my stuff it was nowhere near as good... My guess is it probably makes it better at benchmarks, or maybe at the common things people do... but for stuff that was likely outside its training, the logic was better on the original for me.
realmosai@reddit
Not too good in my experience.
I tried qwopus and the opus distill v2 and both have a looping problem in agentic use. Unsloth's quants work great and don't have this problem.
Crampappydime@reddit
They can be better; I have my Ornstein and Harmonic ones you might come across. My issue with some/most is that the data quality itself is spotty. Some of the opus data isn't actually opus data, and yeah, it can be a lot of quantity over quality in terms of data.
The qwopus stuff looks to be distills mostly, and they say they do some data cleaning, although it's not clear to me what the criteria are. They changed their approach on the gemopus model to a less aggressive style. Although the reasoning is already strong on Gemma, so how much those improved I'm not sure.
velcroenjoyer@reddit
For models like Qwen3.5, the "Opus Distil" finetunes help slim the chain of thought down (so less time is spent reasoning), which is great when you can only run a 9B at 8 tk/s. Other than that it's probably better to use the original models, unless you specifically enjoy the Claude writing style.
I did see that Qwopus claims better benchmarks, so it's possible it could be better, but I haven't really tested it that much.
qwen_next_gguf_when@reddit
They all look the same.