Do the "*Claude-4.6-Opus-Reasoning-Distilled" models really bring something new over the original models?
Posted by Historical-Crazy1831@reddit | LocalLLaMA | View on Reddit | 33 comments
No offense to the fine-tune model providers, just curious. IMO the original models were already trained on massive amounts of high-quality data, so why bother with this fine-tune? Just to make the model's language style sound like Claude? Or does it really reshape the chain of thought?
Bootes-sphere@reddit
Good question! Fine-tuned distilled models typically add value in three ways: they're smaller/faster (useful for local deployment), they're cheaper to run at scale, and they can specialize in specific reasoning patterns or domains. The original models are amazing, but they're often over-parameterized for specific tasks. Distillation captures the "useful knowledge" in a leaner package. Whether it's worth it depends on your use case: if you're running inference locally or at high volume, the performance/cost gains are real. If you're just doing one-off API calls with unlimited budget, the original models probably stay ahead.
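For context on what "distillation captures the useful knowledge" means mechanically: the student is usually trained to match the teacher's full output distribution, not just its top token. A minimal sketch, assuming a standard temperature-scaled soft-label setup (the logit values and temperature here are illustrative, not from any of the models discussed):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T flattens the distribution,
    # exposing the teacher's "dark knowledge" about near-miss tokens.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions:
    # zero when the student exactly matches the teacher, positive otherwise.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])   # toy next-token logits
student = np.array([3.5, 1.2, 0.4])
print(kd_loss(teacher, student))
```

In the Opus-distill case there are no teacher logits available, so these fine-tunes fall back to plain supervised training on sampled teacher text, which is a much weaker signal than the loss above.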
Historical-Crazy1831@reddit (OP)
Thanks! I guess your answer applies to fine-tuned distilled models in general, whereas I am curious about the specific "Opus-4.6-reasoning-distilled" variants, which seem to be general-purpose rather than task-specific.
I understand that people think Opus 4.6 has better chain of thought and they want to 'inject' that advanced performance into Qwen models. But IMO this level of general-purpose dataset is ordinary for the Qwen lab (if not, I cannot believe they could train such good models). But I could be wrong. So I am curious whether this level of fine-tune (thousands of records) really reshapes the way the model organizes its reasoning. If the answer is yes, that would be huge. It would mean an individual AI researcher could customize a model's reasoning behavior with an affordable dataset.
I was thinking of fine-tuning the model's CoT so that when it writes articles, it reasons sentence by sentence and considers the logical structure instead of generating typical AI language. But I ended up just writing that into my prompt instead of fine-tuning.
CalligrapherFar7833@reddit
No, because their distillation datasets have too few data points to have any meaningful positive impact on the models.
Glittering-Call8746@reddit
This. The real question is nothing else but: how many data points is the minimum for distillation to work?
kyr0x0@reddit
Qwopus has shorter reasoning time and more hallucination in most tasks.
Witty_Mycologist_995@reddit
Opus distills on huggingface are 90% slop.
bonobomaster@reddit
Meh, I'll go against the grain here.
I'm using Jackrong's Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF in Q8_0 for classification, date extraction and renaming of scanned conventional paper mail (invoices, receipts, tax stuff, insurance letters etc.) for paperless archival. In my personal experience, the distilled variant is much better at getting the gist of a document's contents and gives better naming suggestions than the normal Q8 variant of the same model.
Text is extracted with PyMuPDF beforehand.
The 2B and 4B versions, no matter if Opus distilled or not, were useless.
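For anyone curious about the shape of that pipeline, here is a minimal sketch of the step after text extraction: folding a model-suggested document type and a date found in the text into an archive-friendly filename. The function and field names are hypothetical, not from PyMuPDF or any tool mentioned above, and a real setup would let the LLM pick the relevant date rather than taking the first match:

```python
import re

def suggest_filename(doc_type: str, raw_text: str, fallback="unknown-date"):
    # Hypothetical helper: find the first ISO-style (2025-02-03) or
    # German-style (03.02.2025) date in the extracted text.
    m = re.search(r"\d{4}-\d{2}-\d{2}", raw_text)
    if m:
        date = m.group(0)
    else:
        m = re.search(r"(\d{2})\.(\d{2})\.(\d{4})", raw_text)
        date = f"{m.group(3)}-{m.group(2)}-{m.group(1)}" if m else fallback
    # Sanitize the type the model suggested so it is filesystem-safe.
    slug = re.sub(r"[^a-z0-9]+", "-", doc_type.lower()).strip("-")
    return f"{date}_{slug}.pdf"

print(suggest_filename("Insurance Letter", "Rechnung vom 03.02.2025"))
# → 2025-02-03_insurance-letter.pdf
```

The LLM only has to do the fuzzy part (classify the document, pick the right date); the deterministic renaming stays in plain code, which is why even a 9B model can handle this task.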
sagiroth@reddit
I personally doubt it. No offence to the people who fine-tune it, but it can't be this dramatically better than what the OG creators already make. Wouldn't make much sense to me.
Hydroskeletal@reddit
In my own benchmarks I saw improvements in some cases and catastrophic regressions in others. Caveat emptor.
cmndr_spanky@reddit
It’s a bunch of fking noise. Ignore them
leonbollerup@reddit
Coding test with qwopus is better in all my tests than the original
AlwaysLateToThaParty@reddit
meh
redmctrashface@reddit
Not at all
pigeon57434@reddit
v3.5 specifically is not that bad, but even it is really not going to make the model any smarter, if that's what you were hoping. In the absolute best case it might deliver equal performance slightly more efficiently, but in all likelihood it will be worse.
sine120@reddit
I've looked over some of the datasets and they're often obviously full of junk. If they were more curated they might be more interesting, but until someone runs full benchmarks to see how these compare to the original, I'm not interested.
Monkey_1505@reddit
They do, specifically by making them worse.
You see, there are no claude public reasoning traces. They are training these models on reasoning summaries. Which is not remotely the same thing, and not at all helpful to model cognition.
Dany0@reddit
Yes, they do make them worse overall, but no, the datasets are from actual reasoning traces captured before Anthropic started summarising them.
Hence why no Opus 4.7 reasoning dataset exists
Monkey_1505@reddit
The datasets I looked at from these were summaries.
Dany0@reddit
For example per https://github.com/anthropics/claude-code/issues/42796
The summarised cot feature was rolled out progressively
Usually those datasets were collected via api calls (bedrock for example) not through claude code etc
API allowed you to select abbreviated vs summarised cot
If anyone produced summarised cot datasets they're foolish and no one should take them seriously. Is it perhaps possible the datasets you saw rather used low/medium effort instead of summarised cot? That would make sense...
Monkey_1505@reddit
The only unsummarized, or raw-ish looking, Claude 4.6 dataset I saw had a mere 3000 pairs.
https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered
So if the gate was open in some way, it wasn't much, a pretty non-useful volume.
The ones I saw originally were definitely summarized. Like "I should look into this" with no other detail, and then "I should think about these elements" never completed, etc. But yeah, apparently there is something that looks like raw reasoning traces out there, just not enough of it, AFAIK, to be useful.
Dany0@reddit
Yep, and mind you, these datasets are still relatively okay.
When I started looking into fine-tuning I saw people heap praise at rStarCoder. Go take a look at it, it's infuriating
Rn the best dataset that is both large and has provably been used to train an at least somewhat useful model is the Nemotron dataset, but that one is mostly derivative, and IIRC a lot of it is generated by Gemini 3.1 and Qwen coder 400b.
It seems that good datasets either don't exist or are gatekept
A great majority of the coding datasets I looked at look exactly like what I would produce if I were an adversary TRYING to make everyone struggle to train their models to be useful at coding.
So much Python slop. Legit, if you took unannotated Fortran/Python code with single-letter variable names, written by math people forced to learn to code, it would be a better dataset than these abominations.
And the CoT datasets are all arse too... I don't know, sometimes I see a Hermes trace that doesn't seem all that awful, or like a good place to start, but ALL of them have that Amazon Mechanical Turk/OpenAI vibe of "crowdworker paid $0.05 per prompt to write something that looks like reasoning".
It has NOTHING to do with what actually useful chain of thought looks like and everything to do with ticking formal boxes that make it LOOK like CoT happened. Information density of a rock. Poor clankers
Tormeister@reddit
In my experience, Qwen 3.5 27B frequently had looping, unnecessarily long thinking chains, and harness flow interruptions; these variants eliminated those issues (surely at a small "intelligence" cost).
Now that Qwen 3.6 27B does not have the same issues, I haven't felt the need to use such variants. For this specific model, I'd say the use case is offering a middle ground between really long reasoning and having reasoning disabled.
srigi@reddit
Exactly my experience with Qwen3.5-27B. JackRong's fine-tune helped a lot with tool calls in OpenClaw. Now the vanilla model (3.6) from Unsloth is good at the task, so no fine-tune variant is needed.
OpenEvidence9680@reddit
In my own private benchmarks, which I am running right now on all my models to cut off the dead weight, the Opus ones performed a tad better than the regular ones on the specific tasks I am testing (which are very specific to the use cases I will need them for), but those were testing runs.
I am now starting the "real" testing with the smallest models, but if the earlier tests were correct, I'd say they might be a bit better than or equal to the original model.
i_like_brutalism@reddit
A lot of the Chinese models already distilled (parts of) Claude better than we ever could, IMO. But as always with LLMs, this is just my personal experience using "finetuned" models.
Pleasant-Shallot-707@reddit
They’re using so few distillation queries that it’s not super useful
sunychoudhary@reddit
They can feel smarter in narrow cases, but I’d be careful calling it real Opus level reasoning. Distillation usually transfers behavior patterns better than deep reliability. So you may get similar looking reasoning on common tasks, but weaker consistency on edge cases.
lemon07r@reddit
Yeah, what it brings to the table is mindless sheep hearting a model on HF because it has "Opus" in its name, despite it managing to be significantly worse than the parent model. Hopefully a lesson to the community to be a little more skeptical and critical.
iMil@reddit
Loops. It brings loops.
aeroumbria@reddit
Maybe it will help a little bit for projects heavily infested with Claudism in its agent files, but otherwise I don't see how this can help anything. If it were helpful, they would have done so already in training. If they didn't do it in training, they must have a very good reason.
ps5cfw@reddit
I did try using them in IRL .NET + JavaScript scenarios. The ugly truth is that they think far less than their regular counterparts, and while they sometimes even seem to go in the right direction with their thinking, in the end they just can't reach the right conclusion / find the potential culprit (in the case of bugfixes at least).
AdventurousSwim1312@reddit
It makes them more efficient, but also dumber; chain-of-thought length is a requirement for preserving model intelligence at these model sizes.
Maybe check the Omnicoder models from Tesslate; they are much more experienced with model distillation (their UIGEN series was incredibly useful), so they will most likely yield better results.
z_3454_pfk@reddit
no