These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade
Posted by BuffMcBigHuge@reddit | LocalLLaMA | View on Reddit | 122 comments
Time and time again I find posts about these fine tunes that promise increased intelligence and reasoning over the base models, and I continuously try them, realize they're botched, and delete them shortly after. I sometimes resort to a lower quant since they are bigger (in this case, a 40B variant of Qwen 3.5 27B), but they always seem to let me down. I've resorted to not downloading any model with "Claude Opus 4.6" in the name.
Kudos to everyone who tries to make the foundation models more intelligent, but imo, it never works.
Note that this example is anecdotal evidence on a single prompt, but it's always a case of decreased intelligence overall when using them with a local agent setup + llama.cpp in WSL2. This is irrespective of the quant as well; I've tried many.
One thing to notice, however: the reasoning/thinking is significantly shorter. Perhaps that's part of the problem.
Have any of you ever found these better than base?
The attached screenshots are:
./llama-server -hf mradermacher/Qwen3.5-27B-heretic-GGUF:Q4_K_S --temp 1.0 --top-p 0.8 --top-k 20 --min-p 0.00 --fit on --alias default --jinja --flash-attn on --ctx-size 262144 --ctx-checkpoints 256 --cache-ram -1 --cache-type-k q4_0 --cache-type-v q4_0 --threads 8 --threads-batch 16 --no-mmap --sleep-idle-seconds 600
cd ~/llama.cpp/build/bin && ./llama-server -hf mradermacher/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-i1-GGUF:i1-Q3_K_S --temp 1.0 --top-p 0.8 --top-k 20 --min-p 0.00 --fit on --alias default --jinja --flash-attn on --ctx-size 131072 --ctx-checkpoints 256 --cache-ram -1 --cache-type-k q4_0 --cache-type-v q4_0 --threads 8 --threads-batch 16 --no-mmap --sleep-idle-seconds 600
arbv@reddit
"We have Opus at home" moment
Alexi_Popov@reddit
TL;DR
I mean, fine-tuning on 3k-100k samples, what can you expect other than repeating the format of the dataset?
Some industry-standard knowledge from the past 3 years can help you understand why they are not "intelligent" like Opus.
Generally, to induce heavy intelligence density in <70B models, you do massive SFT on 10T+ tokens of mixed data: maths, reasoning, coding, agentic abilities. What you used to do earlier was KD (knowledge distillation). Now, before you say "hey, this IS knowledge distillation from Opus outputs to a smaller model, so why doesn't it perform?": the short answer is nope, totally wrong. That falls under SFT, and SFT by itself has to be in the trillions of tokens to teach the model intelligence, since you are now making the model adapt to a certain standard instead of all the possible random, less-formatted base data used in pre-training. KD proper means using a superior teacher model's output *logits*. This is what creates the confusion: logits != the output sequence; they are the set of possible outputs at each token position. So for a model that said "Hey" in a reply, what other possible words did it consider after "Hey"? Say "Hello", "Hi", etc. Logit-based KD is no longer available from many inference providers because of "cannot train on our model outputs" terms, or the logits are reduced to the top 5 tokens per call, totally diminishing the results. The process of building large sets of input/output responses from a better model is still called distillation, but it's largely not based on logit fetching; it's tons of samples created from frontier model outputs across various domains.
What you do today is SFT over long contexts with objectivised multi-step reasoning, function calling, and so forth, so the model becomes adept at its trained abilities. For that you really need at least 1M rows (if compute is available, go for at least 10M rows of long context, >8k tokens; trust me, without this you will not see results in "intelligence", only in "response coherence" with respect to the training data). Trust me, all the frontier labs are doing this, and what makes them different is the scale and variety of domains; you can assume the model sees at least 30-40T tokens of data across all its training phases (they even do this at mid-training and pre-training).
It does work, but generally not if you post-train on an existing instruction-tuned model (as in, a finished production model), because its optimized policy (in plain words, the output format it chooses) might not be aligned with your new SFT dataset. (A large one, as I said; if it's a small one, like <100k samples, you are doing nothing but training it to respond in a certain format, which makes it rigid if the hyperparameters are set too high, which they generally are for small datasets.) This is where the difference comes from: if you train on an already post-trained model, it will go back and forth between its original post-training and your post-training contents, so the model becomes unstable and at times "dumber".
So to crunch it up:
Nah, 3k or even 100k samples at ~4k tokens of nothing but reasoning and output content from a larger, better model is not going to yield the intelligence you are assuming; more or less you get a simply-formatted response in coherence with the SFT samples you gave.
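For anyone unclear on what "logit-based KD" actually computes: it is roughly a per-token KL divergence between the teacher's and student's distributions over the vocabulary. A minimal numpy sketch (toy array shapes, not a training loop; the temperature-squared rescaling follows the standard Hinton et al. convention):

```python
import numpy as np

def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the vocab axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean per-token KL(teacher || student) over (batch, seq, vocab) arrays.

    The temperature softens both distributions so the student learns the
    teacher's full ranking over the vocabulary, not just its top-1 token.
    """
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    # temperature^2 rescaling keeps gradient magnitudes comparable across temps
    return kl.mean() * temperature ** 2
```

This is exactly the signal you cannot get from a text-only API, which is why SFT on sampled outputs is not KD.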
toothpastespiders@reddit
I'd make an exception for models that had a lot of reasoning data in their training but never got that final push to finalize it in the official instruct release. Which is obviously a very rare niche case, but not totally unheard of. Though it probably will never happen again in any significant way, now that reasoning has exploded in popularity.
gurilagarden@reddit
I don't even know why I keep burning my internet cap and drive space on jack's finetunes, but... meh... one can dream. I downloaded 2 last night just to see what happens, but I already know. Every time I test them head-to-head with Qwen base models, I'm left... dissatisfied, but at this point at least my expectations are so low I'm not disappointed.
toothpastespiders@reddit
Honestly I just find it fun to see what a shove and some moderate brain damage can produce. The recent trend OP is talking about isn't very interesting to me, so I've yet to even bother trying them. But just in general I think it's fun to check out how someone's weird theory works out in practice. Amazing finds are rare but they can happen. Someone's little-talked-about tune of the first Mistral Small base model wound up as my swiss army knife choice for a really long time.
bobeeeeeeeee8964@reddit
I agree with that. Almost all the 'Claude' distilled models perform like someone who only has surface-level knowledge. They're heavily overfitted or just overhyped. You can see tons of ads for them on X, but once you actually use them, you run into a bunch of problems.
Aaaaaaaaaeeeee@reddit
I haven't looked into them, but when they say Distilled they make no distinction as to whether it's a fine-tune or an actual type of distillation, which has a formal definition. The term came with the first small thinking fine-tuned DeepSeek models. Don't think distillation is all bad just because of the models named after it.
ResidentPositive4122@reddit
None of these are distillations in the ML sense, because none of them have access to the teacher model. They use SFT with text generations from the "teacher", a.k.a. poor man's distillation. Applied correctly it can improve certain aspects of a model, but they're not true distillations.
nickl@reddit
I mean, OpenAI, Anthropic and Google call it distillation...
https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
> "We also know that DeepSeek employees developed code to access U.S. AI models and obtain outputs for distillation in programmatic ways," the memo added.
https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/
de4dee@reddit
i guess there are two types of distillation nowadays: distillation using logits, or bare outputs.
the first one only the LLM holder can do. the second one, everybody that can talk to the LLM can do.
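The second, black-box flavor really is just collecting teacher text as SFT pairs. A sketch in a few lines of Python (`teacher_generate` here is a hypothetical stand-in for any chat API call):

```python
def build_sft_rows(prompts, teacher_generate):
    """Collect (prompt, completion) pairs from a black-box teacher model."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# Stand-in for a real API call; in practice this would hit the teacher model
rows = build_sft_rows(["What is 2+2?"], lambda prompt: "4")
print(rows)  # [{'prompt': 'What is 2+2?', 'completion': '4'}]
```

No logits ever leave the provider, which is the whole distinction.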
Aaaaaaaaaeeeee@reddit
Yes, there are newer "black-box" methods with no logits requirements, but this is what has potentially led to Anthropic's call for preventing it.
Ok_Helicopter_2294@reddit
According to their description, the model was fine-tuned using LoRA on a distilled Opus dataset.
Aaaaaaaaaeeeee@reddit
So the same process as usual.
Arcee-ai models and a couple of 3rd-party research DS-R distills went through a different process, which had much better results than DeepSeek's own distills, which were finetunes despite the naming.
Ok_Helicopter_2294@reddit
I agree with that view. Other companies may have gone through additional reinforcement learning, used better-curated datasets, cleaned their base datasets more thoroughly, and developed their own teacher–student training pipelines.
stoppableDissolution@reddit
None of them are distills to begin with
MoffKalast@reddit
Besides, there are Claude distill already, they're called Deepseek, Kimi and MiniMax lmao.
stoppableDissolution@reddit
Yeah, no, they are not either. No one but Anthropic can distill Claude because they do not provide raw logits.
florinandrei@reddit
Genuine Rolex replicas.
Far-Low-4705@reddit
If you read the descriptions in one of the more popular ones, they measure “output efficiency” in character count instead of token count
Hydroskeletal@reddit
I have my own benchmark suite and I'd find that compared to the base model it might be slightly better on some tests but that's overshadowed by big drops on others.
I think this is one of those things where if you don't compare the models on the same task you're just missing the subtle failure modes and it "feels" better but isn't.
henk717@reddit
To me it's always been logical: you aren't getting what makes Claude good, which is that massive knowledge base inside the model. It's going to be, by its nature, a style transfer based on however many examples they managed to distill. Which I bet cover topics that are typically benchmarked.
Roleplay tuners doing it because they wish to copy Opus's writing style makes sense, since that is all about flair. But actual intelligence you aren't going to copy with, for example, the Opus-4.6-Reasoning-3300x dataset. I can't imagine how, with only 3000 examples (even though that is many), you'd train something more intelligent than what the massive proprietary datasets from the big commercial teams produce.
But it's very good at tricking people into thinking it's better, from the name alone: they see that this model was made to be like a massive model they hear good things about. It may fix a thinking-loop bug, and then it's the best thing ever popularity-wise. Until the newness wears off and people notice that in actual usage it's not better, like you show with your post.
WolfeheartGames@reddit
It's worse than that. Small models are extremely sensitive to data outside their learned distribution. Take Gemma 3 1B/2B and the larger parent model it's distilled from. Take the Opus reasoning-trace training data, make a second dataset that is the same data but rewritten by the parent Gemma to be in its distribution, and then train two copies of the smaller Gemma, one on each dataset: the one trained on the more in-distribution data does much better.
The problem isn't the semantic information in the reasoning traces. It's the word distribution articulating it.
Odd_Science@reddit
Those 3000 examples could be useful for an initial reasoning finetune so the model learns the "form" of a reasoning trace, just sufficient so you can then spend millions of GPU-hours on reinforcement learning to actually make the model capable.
LA_rent_Aficionado@reddit
Exactly, not to mention anything under the hood at the other end of the API. It may not just be a simple model API; it's entirely possible there are more services behind the veil, be it RAG, vetting steps, etc.
de4dee@reddit
i think most of the time the small amount of fine-tuning material is forced into training with a higher learning rate, or higher rank, or higher alpha to make an impact, ending up ruining the general intelligence of the model.
what should have been done: more samples, a lower learning rate, and lower rank and alpha to preserve the smoothness of the original model. you cannot force your tokens onto it, but you can use lots of tokens to make a proper/smooth impact.
the fine-tuner also maybe used much shorter reasoning traces, hence the model learned that shorter reasoning as a habit.
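To see why a high alpha forces a big impact: in LoRA the update added to a frozen weight matrix is scaled by alpha/r, so cranking alpha (or shrinking r without lowering alpha) linearly inflates the perturbation. A toy numpy illustration (toy sizes; real LoRA initializes B to zero, it's nonzero here purely to make the scale visible):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8                            # toy hidden size and LoRA rank
A = rng.normal(0.0, 0.02, size=(r, d))   # LoRA "A" matrix
B = rng.normal(0.0, 0.02, size=(d, r))   # LoRA "B" matrix (toy, nonzero)

def update_norm(alpha, r=r):
    # LoRA adds (alpha / r) * B @ A on top of the frozen weight matrix
    return np.linalg.norm((alpha / r) * (B @ A))

print(update_norm(16))    # modest nudge to the frozen weights
print(update_norm(128))   # same adapter weights, 8x larger perturbation
```

Same adapter, same data: only the scaling changed, and the push into the base model's weights is 8x larger.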
Kagemand@reddit
Does Qwen3.5-27B pass the car wash question each time you ask it? Sometimes I think there might just be randomness to it.
Did you try Qwopus3.5-27B-v3?
reery7@reddit
I tried the MLX Qwopus3.5 v3 and the response was very decent. Even the thought process showed no flaw in logic.
BuffMcBigHuge@reddit (OP)
I turned off reasoning and it failed. So something is to be said about reasoning.
Global_Persimmon_469@reddit
Isn't Qwopus trained based on Claude Opus thinking patterns? If you disable reasoning of course you are going to lose all that
BuffMcBigHuge@reddit (OP)
Ran it back with full reasoning, and even a Q6_K quant, and it still failed the test with Qwopus3.5.
spky-dev@reddit
By the very definition of how a model works, there will be randomness in it.
_qeternity_@reddit
This is not true at all. The model is deterministic.
Sampling may not be, and to a lesser extent, kernel launches may not be.
But it has nothing to do with the definition of how a model works.
spky-dev@reddit
No, a model is not deterministic, it’s probabilistic.
Dependent_Ad948@reddit
Sure, the vectors are by definition probabilistic, but if you and I download the same model and prompt it with the same prompt and the same non-random seed, what happens?
erizon@reddit
every probabilistic algorithm is deterministic when you fix the seed (and hardware/software). different seed -> different outcome is how randomness is universally understood
Spectrum1523@reddit
Can you explain what you mean here? If you use the same seed and same settings you'll get exactly the same output every time.
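The point being argued can be shown in a few lines: sampling is probabilistic, but with a fixed seed it is fully reproducible. A toy sketch (toy logits, not a real model):

```python
import numpy as np

def sample_token(logits, temperature=1.0, seed=None):
    """Temperature sampling: fully determined by (logits, temperature, seed)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()                          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_token(logits, seed=42) == sample_token(logits, seed=42))  # True
```

Real inference stacks add wrinkles (non-deterministic kernel launch order, batching effects), but the sampling step itself behaves exactly like this.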
po_stulate@reddit
If it were really that easy (doable by a random independent contributor) to make Qwen3.5-27B (or any other model) smarter, then Qwen (or whoever publishes the model) would've already done it.
It's more realistic to finetune the model to suit your specific use cases, not to increase its general smartness.
Global_Persimmon_469@reddit
This model is more intended to improve the performance on logical/analytical/coding tasks, at the cost of general knowledge and performance
Potential-Leg-639@reddit
I tried it recently and it was really good
Fabulous_Fact_606@reddit
can't you just add something like this to the main context "Before answering any question involving a decision, choice, or recommendation: pause and identify what object or entity must physically be where, what constraints exist beyond what was explicitly stated, and whether your initial answer satisfies all of them. If it doesn't, correct before responding."
kuhunaxeyive@reddit
One could add a more explicit prompt, but then it's not testing the intelligence of the model. The intelligence of the model shows when it can better guess what you intended, or when it can understand the problem without extra details. That is super important, as we might not fully understand the nature of our own question; the smarter the model, the more it can work out the logic by itself and help us get the right answer.
Fabulous_Fact_606@reddit
So the question is: does the experiment need to use the naked LLM or the API? We can't really compare to the foundational models because we don't know what 20K of main-context injection is given to the model before we ask the question.
ViRROOO@reddit
Because it’s just introducing noise into the weights. Complete waste of time and compute
KriosXVII@reddit
Model collapse.
quantum_splicer@reddit
Agreed!!! Once you read Apple's paper "Simple Self-Distillation" it becomes obvious that just feeding the model data is no good, because you need to
"suppress distractor tails where precision matters while preserving useful diversity where exploration matters."
I'm probably missing some points, but intuitively this makes sense. It's a shame it took a research paper, and this long, to find the answer.
UnknownLesson@reddit
Any other important papers?
RevolutionaryLime758@reddit
I have a feeling nothing became obvious to you in particular by reading it. Why don’t you tell us exactly why that applies to this particular distillation topic lol fuggin idiot
9897969594938281@reddit
Damn, a genuine tough guy
Spectrum1523@reddit
hmm my money is on a *claw told to be aggressive
ViRROOO@reddit
Strong sentiments buddy. A quick look at your history shows a pattern, maybe do some mindfulness or look for professional help instead of lashing on people that enjoy running their llms on the weekends on their 3090s :)
Recoil42@reddit
Looks like this is it, for anyone needing a link:
https://arxiv.org/abs/2604.01193
BuffMcBigHuge@reddit (OP)
Yeah - I'm supposing the fine tunes are kinda like merging LoRAs for image/video models. They promote certain types of content while hindering the rest of the model.
Long_War8748@reddit
LoRAs for Image and Video models are actually useful though.
stoppableDissolution@reddit
Thats kinda the purpose of it, no?
MoffKalast@reddit
If your plan is to reduce guardrails yes, otherwise no.
Noob_Krusher3000@reddit
LoRAs are easy to switch out for very specific styles. A language model that smells like Claude, but struggles to do anything else useful because it's been lobotomized isn't very productive.
stoppableDissolution@reddit
I'm not saying this particular one is productive, but the purpose of finetuning (lora or not) is to make a model better in a narrow scope with little or no regard to the rest of its capabilities
No-Refrigerator-1672@reddit
Correct. Yet those "opus" models claim to make it better at all tasks - and that's what a simple finetune on a synthetic dataset can't do; you need to be as smart as the original creator team to make the model better across the board.
Velocita84@reddit
I wish all this compute was spent on training these models for alternative use cases instead of hopelessly trying to beat qwen at making their own models better with logic/code/tools
Long_comment_san@reddit
Yeah, like general chat for example. I believe at this point coding should be passed to adapters.
stoppableDissolution@reddit
~~Just donate it to me, please?~~
ASTRdeca@reddit
That seems a bit handwavy? You could write off any distilled model with that
ViRROOO@reddit
Why would I write a full explanation in a reddit comment when more competent people with more than 5 million total comp wrote better papers, available for free on arXiv?
openSourcerer9000@reddit
Qwopus 27 v2 was actually a banger (v3 seemed the same as qwen reasoning). Best model for a 24gb card, running it right now actually
CucumberAccording813@reddit
A big reason for this is that many Opus-distilled fine-tunes cause the model to think less than it would by default. Since Claude Opus itself doesn't over-reason, the distilled models inherit that same pattern and performance takes a hit because of it. That said, I'd still take a fast Claude-distilled Qwen model that thinks concisely over a "smarter" undistilled model that burns 10k tokens second-guessing itself on a single question.
nickl@reddit
The Jackrong models work pretty well in my testing (8% better than the base models on my agentic benchmark).
I don't see any way that a 40B model based on a 27B model is going to be better unless the trainer's compute budget is ~Qwen's. That's just 13B undertrained parameters.
grumd@reddit
Confirmed by my testing on your benchmark, Qwopus v3 9B is much much better than Qwen 3.5 9B, it's not even close
nickl@reddit
glad you found it useful!
grumd@reddit
Yes, thanks, and please make temperature configurable at least with CLI!
qubridInc@reddit
Most “Opus-style” fine-tunes trade real reasoning for style mimicry so yeah, base models usually outperform them in actual agent workflows.
frozen_tuna@reddit
Opus already has reasoning. I remember reading even back in 2023 that fine-tuning was a bad way to add knowledge but a good way to enforce a style of response. Back when we used early versions of PEFT. Is that still a thing?
Traditional-Gap-3313@reddit
The R1 paper showed you can SFT reasoning into a non-reasoning model. Remember all the DeepSeek R1 8B models, which were Llama finetunes (SFT). However, they did it with ~200k reasoning traces from R1, not 3000.
frozen_tuna@reddit
yes, and? isn't adding {content} to the beginning of a response a style of response? or was there something more to it?
charles25565@reddit
Yes, and training small models on Gemini distillation datasets causes the AI to just hallucinate catchy reasoning summary headers like "Formulating the response" or "Analyzing query".
--Spaci--@reddit
Most people really just can't do it right, and like 1000 samples is essentially just noise. I think the idea is fine, and the easier finetuning gets, the more you will see this, btw.
jacek2023@reddit
More finetunes is not a bad thing; this is how the community lives.
But more reviews like this are the most important thing, because nobody has time to try all the finetunes, so we need some kind of knowledge of what can be really useful.
TheThoccnessMonster@reddit
This one is more of a rough one though.
Disastrous_Hope_9373@reddit
Not just 4.6-opus distill models, but also REAP and abliterated models.
They're all so bad, and they show benchmarks that mislead you about their performance (well at least abliterated gives you a cool side benefit).
I hate them all, and for the opus distill models, I think people should just learn to prompt better with the models from the original lab. Learn the prompt style for whatever new model you downloaded. Stop trying to make everything claude, you can't do that without lobotomizing the model (unless you're smart enough to work in a chinese lab).
Downvote me, just do it, just bite :)
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
PhantomGaming27249@reddit
It's because you can't just feed the model Claude outputs and expect it to get better. You need to filter and curate the dataset, then tune it with specific parameters and make sure the finetune or LoRA adapter you're making is actually working. Then you need to genuinely test and validate. Fine-tuning does work and can improve a model a decent bit over its base; doing it carelessly, though, will hurt performance rather than help it.
Jeidoz@reddit
Why did you decide to compare a Q4 27B vs a Q3_i 40B? There are literally 10+ 27B Claude-distilled models, but for some reason you decided to make a somewhat unfair comparison, using Q4 vs a bit-lobotomized Q3_i?
JohnsGimpyHand@reddit
Uh but that answer is entirely right and many commercial models fail at it
PromptInjection_@reddit
Claude's intelligence most likely doesn't stem from its reasoning patterns being so great.
If you now slap those onto another model, it's like saying, "Let me phrase sentences a bit like Einstein." But that doesn't turn you into Einstein - just a poor linguistic clone who still can't come up with a theory of relativity.
Potential-Gold5298@reddit
This may improve the style of creative writing, but almost always leads to a deterioration in intelligence.
llitz@reddit
Considering my recent Claude interactions... I felt just as frustrated with them later... So... Good copy, but a downgrade nonetheless.
34574rd@reddit
people who think 3000 low quality, general pairs is enough to steer a model are so dumb, what makes you think alibaba and google would have not already done the same if the results would have been substantial?
PhilippeEiffel@reddit
Even if it is 3000 high-quality samples, it's SO small compared to the original training material. You'd have to be a Q1-quantized human to think it will improve some big LLM.
stoppableDissolution@reddit
(spoiler: they are doing it, just better)
rpkarma@reddit
(And at way larger scales)
charles25565@reddit
2,000 high-quality rows on a high-quality ~120M model can easily outperform models that are trained on tens of thousands of rows.
lemon07r@reddit
I've been trying to tell people this. People really need to be more critical of this stuff, because I still see these models among the most-hearted finetunes simply because they have "opus" in their names.
lumos675@reddit
The answer is not correct though? You must be with the car so you can wash it, and the whole answer was pointing to this fact. No?
srigi@reddit
Your comparison is not scientifically accurate because you compared heretic with heretic + "so-called distill". Now you have two variables in the system.
The correct way would be a comparison of a vanilla GGUF from Unsloth or Bartowski with the distill GGUF. No heretic, no abliteration, no unrestriction. I believe these uncensorings do more damage than the opus fine-tunes.
Long_comment_san@reddit
To my understanding these finetunes "match the output" and that's it. They're not trained in depth. Basically what is fed to them are some cases, and the model tries to mimic those in responses and reasoning. It's not really radically changing the thought process. In simple terms, it's just changing the coat to a jacket, I guess.
shing3232@reddit
you need RL after the finetune to ensure generalization
weiyong1024@reddit
For agent work specifically, the overfit is way worse than in plain chat. I noticed this running OpenClaw with a couple of the Qwen-Claude-ish fine tunes versus vanilla Qwen3.5-27B: tool-call accuracy on the tuned versions dropped noticeably on anything that required chaining two tools. My guess is the distillation is optimizing for chat cadence and quietly erodes the structured JSON tool-use patterns the base model already had.
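For anyone wanting to measure this themselves, a minimal check of whether a model's tool-call output is even structurally valid, i.e. parseable JSON naming a known tool with a dict of arguments (a hypothetical sketch of such a format, not OpenClaw's actual schema):

```python
import json

def is_valid_tool_call(raw, allowed_tools):
    """Does the model's raw output parse as JSON and name a known tool?"""
    try:
        call = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return (isinstance(call, dict)
            and call.get("name") in allowed_tools
            and isinstance(call.get("arguments"), dict))

print(is_valid_tool_call('{"name": "search", "arguments": {"q": "x"}}', {"search"}))  # True
print(is_valid_tool_call('Sure! I will call search(q="x")', {"search"}))              # False
```

Running a few hundred agent prompts through base vs tune and counting how often this returns False makes the "erodes structured tool-use" claim testable instead of vibes.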
Su1tz@reddit
"Here's a Gemma fine tune of 1000 opus traces because Google is incompetent and couldn't think about this idea by themselves"
Ok_Helicopter_2294@reddit
Rather than improving intelligence and reasoning ability, it's closer to imitating the thought process. It wasn't completely useless to me, because it had the advantage of saving context tokens.
However, to clearly increase intelligence and reasoning ability, a reinforcement-learning model with a good dataset would be better than SFT.
Yu2sama@reddit
I don't think we should expect every fine-tune to be good. The issue with fine-tunes is that, until you try them, you can't know if they are broken af. Quite different from a LoRA on an image-generation model, where you can directly see the results easily; in text it requires some discernment and extra work to test.
Two extra points relevant to this subject:
- Qwen 3.5 seems to be very sensitive, so I will expect more fine-tunes to suck than to work unless the fine-tuner works, tests and tries to fix until they find the correct version/sauce.
- That's a DavidAU fine-tune, bro has cooked some interesting stuff but his models are, most of the time, broken af.
Qwen seems to be very sensitive anyway; the only fine-tune I have tried that works on par with base Qwen is Qwen3.5-9B-Aggressive, at least from my own tests.
Borkato@reddit
I wonder if we should have an RP benchmark with example outputs given different prompts. Would be cool to see qwen’s writing style vs Gemma’s vs etc
Yu2sama@reddit
It's funny you say that; I have been slowly working on my own RP benchmark with A/B tests and few-shot tests for different fields (spatial understanding, intelligence, character adherence, etc.). Nothing crazy, mostly for personal use, to test my models and see which ones to delete and which ones to keep lol.
On the topic of Qwen vs Gemma...
I only use 9B, E4B, and a bit of the 26B MoE (the prompt processing kills my soul on the last one, hence why I don't use it as much atm). From my tests, the Gemma family has better prose, better understanding, and they use the character card very well; they will surprise you often with certain details of the character. Though, from my experience, they SUCK at style adherence, and a bit at character adherence; they tend to be more realistic and homogeneous, which can be a good or a bad thing depending on who you ask.
Meanwhile Qwen has amazing instruction following; prose is not as good as Gemma's, but its style adherence is superior by a lot. Character adherence tends to be very good too, maybe a bit better than Gemma's.
Borkato@reddit
I find a similar thing true in my tests! I did the same lol, I used python to just get a bunch of writing tests, then I read a few responses from each model without knowing which is which and dump any that sink to the bottom of the list. It turns out ALL Gemma 3 finetunes almost always get a negative score from me, it’s hilarious, I became able to predict it. I’d be like “oh this SUCKS… wait is this Gemma” and it always was LOL
BUT Gemma 4 seems MUCH MUCH MUCH better so I’m happy about that.
Now we just need a model with perfect instruction AND style adherence! 😂
Yu2sama@reddit
Gemma 3 was so bad... no fine-tune could help it. Never understood the people that liked it haha. I don't expect such a perfect model to exist, but if one appears I would be pleasantly surprised!
Borkato@reddit
People LOVED Gemma 3 big tiger though! I hated it so much 😭 I hope a new one comes out with 4 haha, bigger tiger!
robberviet@reddit
I never use those distill models. I doubt anyone outside a big lab can do anything useful without huge computing power, high-quality data and, most of all, talent.
And we already got the distills of Opus! It's called DeepSeek.
FatheredPuma81@reddit
Yea, please don't use David's models (the HF user that always makes these upscaled, buzzword-stuffed ones).
Due-Memory-6957@reddit
That "puzzle" is nonsensical to begin with. There's no right answers to bullshit questions.
letsgoiowa@reddit
I found qwen 3.5 by default to be insanely verbose and kind of unhinged. I found these distills to be faster and more coherent.
lolwutdo@reddit
Fine tunes in general are a downgrade
Noob_Krusher3000@reddit
It's almost as if distilling a model on another's output, with little or no other additional work, *doesn't* give Chinese Open Weight labs an unfair edge over their honest and hardworking American counterparts...
Tell that to the media.
Top-Rub-4670@reddit
LOL
a_beautiful_rhind@reddit
Probably makes it sound like claude at best. Deepseek "distills" weren't deepseek either. Add in the tuners probably being grifty and it's over.
somerussianbear@reddit
I'm curious whether the people who build these share an open-source repo with benchmarks they ran on these Opus-inspired models vs base. I've never seen one, just screenshots of benchmarks.
If in deterministic systems we have a strong collaboration rule in the OSS ecosystem, "you add/keep test coverage", in the non-deterministic world this doesn't seem to matter, which is at least funny given how hard it is to get reproducibility of behavior and quality gates.
wazymandias@reddit
seen this repeatedly. fine-tunes nail the claude voice but lose the reasoning. multi-step tasks fall apart because the model is optimizing for style over substance. base model is more reliable for actual work.
somerussianbear@reddit
1min14sec of reasoning and 4K tokens vs 5sec, 200 tokens, and wrong.
Not sure what I dislike the most.
aeroumbria@reddit
My thought is that if distilling Claude helped, the base-model makers would have done it already. If they haven't, then the training data and process differ enough between Claude and the target model that a relatively small fine-tuning set is probably just going to push the model out of its comfort zone.
Weird-Consequence366@reddit
Ah yes. The single question benchmark with no tool calls. Conclusive.
BuffMcBigHuge@reddit (OP)
I did mention it was anecdotal evidence for the purpose of this post. But my main use of llama.cpp with hermes-agent shows decreased intelligence overall compared to the base heretic.
ansmo@reddit
Data might show them performing differently per usecase and settings. I'm not saying you're wrong, but it would give us more to talk about.
ketosoy@reddit
Matches peak hours opus results.
sine120@reddit
Obviously one question does not a benchmark make, but I do wish people had a more standardized method of testing their fine tunes. Maybe someone here with spare compute will be able to run a standardized set of tests on various quants/finetunes to get a picture of how they compare to the normal base models. KL divergence is cool, Qwopus and OmniCoder are cool names, but how do they compare to the original on LiveCodeBench, etc.?
True_Requirement_891@reddit
Initially my impression of these models was very good, but slowly I realised that they were in fact a downgrade in intelligence.