These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade
Posted by BuffMcBigHuge@reddit | LocalLLaMA | View on Reddit | 122 comments
Time and time again I find posts about these fine tunes that promise increased intelligence and reasoning over the base models, and I continuously try them, realize they're botched, and delete them shortly after. I sometimes resort to a lower quant since they are bigger (in this case, a 40B variant of Qwen 3.5 27B), but they always seem to let me down. I've resorted to not downloading any model with "Claude Opus 4.6" in the name.
Kudos to everyone who tries to make the foundation models more intelligent, but imo, it never works.
Note that this example is anecdotal evidence on a single prompt, but it's always a case of decreased intelligence overall when using them with a local agent setup + llama.cpp in WSL2. This is irrespective of the quant as well; I've tried many.
One thing to notice, however: the reasoning/thinking is significantly shorter. Perhaps that's part of the problem.
Have any of you ever found these better than base?
The attached screenshots are:
./llama-server -hf mradermacher/Qwen3.5-27B-heretic-GGUF:Q4_K_S --temp 1.0 --top-p 0.8 --top-k 20 --min-p 0.00 --fit on --alias default --jinja --flash-attn on --ctx-size 262144 --ctx-checkpoints 256 --cache-ram -1 --cache-type-k q4_0 --cache-type-v q4_0 --threads 8 --threads-batch 16 --no-mmap --sleep-idle-seconds 600
cd ~/llama.cpp/build/bin && ./llama-server -hf mradermacher/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-i1-GGUF:i1-Q3_K_S --temp 1.0 --top-p 0.8 --top-k 20 --min-p 0.00 --fit on --alias default --jinja --flash-attn on --ctx-size 131072 --ctx-checkpoints 256 --cache-ram -1 --cache-type-k q4_0 --cache-type-v q4_0 --threads 8 --threads-batch 16 --no-mmap --sleep-idle-seconds 600
arbv@reddit
"We have Opus at home" moment
Alexi_Popov@reddit
TL;DR
I mean, fine-tuning on 3k-100k samples, what can you expect other than repeating the format of the dataset?
Some industry-standard knowledge from the past 3 years can help you understand why they are not "intelligent" like Opus.
Generally, to induce heavy intelligence density in <70B models, you do massive SFT on 10T+ tokens of mixed data: maths, reasoning, coding, agentic abilities. What you used to do earlier was KD (knowledge distillation). Now, before you say "hey, this IS knowledge distillation from Opus outputs to a smaller model, so why doesn't it perform?": the short answer is nope, totally wrong. That falls under SFT, and SFT by itself has to be in the trillions of tokens to teach the model intelligence, since you are now making the model adapt to a certain standard instead of all the possible random, less-formatted base data used in pre-training. KD proper means using a superior teacher model's output *logits*. This is what creates the confusion: logits != the output sequence; they are the set of possible outputs at each token position. So for a model that said "Hey" in a reply, what other possible words did it consider after "Hey"? Say "Hello", "Hi", etc. Logit-based KD is no longer available from many inference providers because of "cannot train on our model outputs" terms, or the logits are reduced to the top 5 tokens per call, totally diminishing the results. The process of building large sets of input/output responses from a better model is still called distillation, but it's largely not based on logit fetching; it's tons of samples created from frontier model outputs across various domains.
What you do today is SFT over long contexts with objectivised multi-step reasoning, function calling, and so forth, so the model becomes adept at its trained abilities. For that you really need at least 1M rows (if compute is available, go for at least 10M rows of long context, >8k tokens; trust me, without this you will not see results in "intelligence", only in "response coherence" with respect to the training data). Trust me, all the frontier labs are doing this, and what makes them different is the scale and variety of domains; you can assume the model sees at least 30-40T tokens of data across all its training phases (they even do this at mid-training and pre-training).
It does work, but generally not if you post-train on an existing instruction-tuned model (as in, a finished production model), because its optimized policy (in plain words, the output format it chooses) might not be aligned with your new SFT dataset. (A large one, as I said; if it's a small one, like <100k samples, you are doing nothing but training it to respond in a certain format, which makes it rigid if the hyperparameters are set too high, which they generally are for small datasets.) This is where the difference comes from: if you train on an already post-trained model, it will go back and forth between its original post-training and your post-training contents, so the model becomes unstable and at times "dumber".
So to crunch it up:
Nah, 3k or even 100k samples at ~4k tokens of nothing but reasoning and output content from a larger, better model is not going to yield the intelligence you are assuming; more or less you get a simply-formatted response in coherence with the SFT samples you gave.
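For anyone unclear on what "logit-based KD" actually computes: it is roughly a per-token KL divergence between the teacher's and student's distributions over the vocabulary. A minimal numpy sketch (toy array shapes, not a training loop; the temperature-squared rescaling follows the standard Hinton et al. convention):

```python
import numpy as np

def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the vocab axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean per-token KL(teacher || student) over (batch, seq, vocab) arrays.

    The temperature softens both distributions so the student learns the
    teacher's full ranking over the vocabulary, not just its top-1 token.
    """
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    # temperature^2 rescaling keeps gradient magnitudes comparable across temps
    return kl.mean() * temperature ** 2
```

This is exactly the signal you cannot get from a text-only API, which is why SFT on sampled outputs is not KD.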
toothpastespiders@reddit
I'd make an exception for models that had a lot of reasoning data in their training but never got that final push to finalize it in the official instruct release. Which is obviously a very rare niche case, but not totally unheard of. Though it probably will never happen again in any significant way, now that reasoning has exploded in popularity.
gurilagarden@reddit
I don't even know why I keep burning my internet cap and drive space on jack's finetunes, but... meh... one can dream. I downloaded 2 last night just to see what happens, but I already know. Every time I test them head-to-head with Qwen base models, I'm left... dissatisfied, but at this point at least my expectations are so low I'm not disappointed.
toothpastespiders@reddit
Honestly I just find it fun to see what a shove and some moderate brain damage can produce. The recent trend OP is talking about isn't very interesting to me, so I've yet to even bother trying them. But just in general I think it's fun to check out how someone's weird theory works out in practice. Amazing finds are rare but they can happen. Someone's little-talked-about tune of the first Mistral Small base model wound up as my swiss army knife choice for a really long time.
bobeeeeeeeee8964@reddit
I agree with that. Almost all the 'Claude' distilled models perform like someone who only has surface-level knowledge. They're heavily overfitted or just overhyped. You can see tons of ads for them on X, but once you actually use them, you run into a bunch of problems.
Aaaaaaaaaeeeee@reddit
I haven't looked into them, but when they say Distilled they make no distinction as to whether it's a fine-tune or an actual type of distillation, which has a formal definition. The term came with the first small thinking fine-tuned DeepSeek models. Don't think distillation is all bad just because of the models named after it.
ResidentPositive4122@reddit
None of these are distillations in the ML sense, because none of them have access to the teacher model. They use SFT with text generations from the "teacher", a.k.a. poor man's distillation. Applied correctly it can improve certain aspects of a model, but they're not true distillations.
nickl@reddit
I mean, OpenAI, Anthropic and Google call it distillation...
https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
> "We also know that DeepSeek employees developed code to access U.S. AI models and obtain outputs for distillation in programmatic ways," the memo added.
https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/
de4dee@reddit
i guess there are two types of distillation nowadays: distillation using logits, or bare outputs.
the first one only the LLM holder can do. the second one, everybody that can talk to the LLM can do.
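The second, black-box flavor really is just collecting teacher text as SFT pairs. A sketch in a few lines of Python (`teacher_generate` here is a hypothetical stand-in for any chat API call):

```python
def build_sft_rows(prompts, teacher_generate):
    """Collect (prompt, completion) pairs from a black-box teacher model."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# Stand-in for a real API call; in practice this would hit the teacher model
rows = build_sft_rows(["What is 2+2?"], lambda prompt: "4")
print(rows)  # [{'prompt': 'What is 2+2?', 'completion': '4'}]
```

No logits ever leave the provider, which is the whole distinction.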
Aaaaaaaaaeeeee@reddit
Yes, there are newer "black-box" methods with no logits requirements, but this is what has potentially led to Anthropic's call for preventing it.
Ok_Helicopter_2294@reddit
According to their description, the model was fine-tuned using LoRA on a distilled Opus dataset.
Aaaaaaaaaeeeee@reddit
So the same process as usual.
Arcee-ai models and a couple of 3rd-party research DS-R distills went through a different process, which had much better results than DeepSeek's own distills, which were finetunes despite the naming.
Ok_Helicopter_2294@reddit
I agree with that view. Other companies may have gone through additional reinforcement learning, used better-curated datasets, cleaned their base datasets more thoroughly, and developed their own teacher–student training pipelines.
stoppableDissolution@reddit
None of them are distills to begin with
MoffKalast@reddit
Besides, there are Claude distill already, they're called Deepseek, Kimi and MiniMax lmao.
stoppableDissolution@reddit
Yeah, no, they are not either. No one but Anthropic can distill Claude because they do not provide raw logits.
florinandrei@reddit
Genuine Rolex replicas.
Far-Low-4705@reddit
If you read the descriptions in one of the more popular ones, they measure “output efficiency” in character count instead of token count
Hydroskeletal@reddit
I have my own benchmark suite and I'd find that compared to the base model it might be slightly better on some tests but that's overshadowed by big drops on others.
I think this is one of those things where if you don't compare the models on the same task you're just missing the subtle failure modes and it "feels" better but isn't.
henk717@reddit
To me it's always been logical: you aren't getting what makes Claude good, which is that massive knowledge base inside the model. It's going to be, by its nature, a style transfer based on however many examples they managed to distill. Which I bet cover topics that are typically benchmarked.
Roleplay tuners doing it because they wish to copy Opus's writing style makes sense, since that is all about flair. But actual intelligence you aren't going to copy with, for example, the Opus-4.6-Reasoning-3300x dataset. I can't imagine how, with only 3000 examples (even though that is many), you'd train something more intelligent than what the massive proprietary datasets from the big commercial teams produce.
But it's very good at tricking people into thinking it's better, from the name alone: they see that this model was made to be like a massive model they hear good things about. It may fix a thinking-loop bug, and then it's the best thing ever popularity-wise. Until the newness wears off and people notice that in actual usage it's not better, like you show with your post.
WolfeheartGames@reddit
It's worse than that. Small models are extremely sensitive to data outside their learned distribution. Take Gemma 3 1B/2B and the larger parent model it's distilled from. Take the Opus reasoning-trace training data, make a second dataset that is the same data but rewritten by the parent Gemma to be in its distribution, and then train two copies of the smaller Gemma, one on each dataset: the one trained on the more in-distribution data does much better.
The problem isn't the semantic information in the reasoning traces. It's the word distribution articulating it.
Odd_Science@reddit
Those 3000 examples could be useful for an initial reasoning finetune so the model learns the "form" of a reasoning trace, just sufficient so you can then spend millions of GPU-hours on reinforcement learning to actually make the model capable.
LA_rent_Aficionado@reddit
Exactly, not to mention anything under the hood at the other end of the API. It may not just be a simple model API; it's entirely possible there are more services behind the veil, be it RAG, vetting steps, etc.
de4dee@reddit
i think most of the time the small amount of fine-tuning material is forced into training with a higher learning rate, or higher rank, or higher alpha to make an impact, ending up ruining the general intelligence of the model.
what should have been done: more samples, a lower learning rate, and lower rank and alpha to preserve the smoothness of the original model. you cannot force your tokens onto it, but you can use lots of tokens to make a proper/smooth impact.
the fine-tuner also maybe used much shorter reasoning traces, hence the model learned that shorter reasoning as a habit.
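To see why a high alpha forces a big impact: in LoRA the update added to a frozen weight matrix is scaled by alpha/r, so cranking alpha (or shrinking r without lowering alpha) linearly inflates the perturbation. A toy numpy illustration (toy sizes; real LoRA initializes B to zero, it's nonzero here purely to make the scale visible):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8                            # toy hidden size and LoRA rank
A = rng.normal(0.0, 0.02, size=(r, d))   # LoRA "A" matrix
B = rng.normal(0.0, 0.02, size=(d, r))   # LoRA "B" matrix (toy, nonzero)

def update_norm(alpha, r=r):
    # LoRA adds (alpha / r) * B @ A on top of the frozen weight matrix
    return np.linalg.norm((alpha / r) * (B @ A))

print(update_norm(16))    # modest nudge to the frozen weights
print(update_norm(128))   # same adapter weights, 8x larger perturbation
```

Same adapter, same data: only the scaling changed, and the push into the base model's weights is 8x larger.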
Kagemand@reddit
Does Qwen3.5-27B pass the car wash question each time you ask it? Sometimes I think there might just be randomness to it.
Did you try Qwopus3.5-27B-v3?
reery7@reddit
I tried the MLX Qwopus3.5 v3 and the response was very decent. Even the thought process showed no flaw in logic.
BuffMcBigHuge@reddit (OP)
I turned off reasoning and it failed. So something is to be said about reasoning.
Global_Persimmon_469@reddit
Isn't Qwopus trained based on Claude Opus thinking patterns? If you disable reasoning of course you are going to lose all that
BuffMcBigHuge@reddit (OP)
Ran it back with full reasoning, and even a Q6_K quant, and it still failed the test with Qwopus3.5.
spky-dev@reddit
By the very definition of how a model works, there will be randomness in it.
_qeternity_@reddit
This is not true at all. The model is deterministic.
Sampling may not be, and to a lesser extent, kernel launches may not be.
But it has nothing to do with the definition of how a model works.
spky-dev@reddit
No, a model is not deterministic, it’s probabilistic.
Dependent_Ad948@reddit
Sure, the vectors are by definition probabilistic, but if you and I download the same model and prompt it with the same prompt and the same non-random seed, what happens?
erizon@reddit
every probabilistic algorithm is deterministic when you fix the seed (and hardware/software). different seed -> different outcome is how randomness is universally understood
Spectrum1523@reddit
Can you explain what you mean here? If you use the same seed and same settings you'll get exactly the same output every time.
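The point being argued can be shown in a few lines: sampling is probabilistic, but with a fixed seed it is fully reproducible. A toy sketch (toy logits, not a real model):

```python
import numpy as np

def sample_token(logits, temperature=1.0, seed=None):
    """Temperature sampling: fully determined by (logits, temperature, seed)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()                          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_token(logits, seed=42) == sample_token(logits, seed=42))  # True
```

Real inference stacks add wrinkles (non-deterministic kernel launch order, batching effects), but the sampling step itself behaves exactly like this.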
po_stulate@reddit
If it were really that easy (doable by a random independent contributor) to make Qwen3.5-27B (or any other model) smarter, then Qwen (or whoever publishes the model) would've already done it.
It's more realistic to finetune the model to suit your specific use cases, not to increase its general smartness.
Global_Persimmon_469@reddit
This model is more intended to improve the performance on logical/analytical/coding tasks, at the cost of general knowledge and performance
Potential-Leg-639@reddit
I tried it recently and it was really good
Fabulous_Fact_606@reddit
can't you just add something like this to the main context "Before answering any question involving a decision, choice, or recommendation: pause and identify what object or entity must physically be where, what constraints exist beyond what was explicitly stated, and whether your initial answer satisfies all of them. If it doesn't, correct before responding."
kuhunaxeyive@reddit
One could add a more explicit prompt, but then it's not testing the intelligence of the model. The intelligence of the model shows when it can better guess what you intended, or when it can understand the problem without extra details. That is super important, as we might not fully understand the nature of our own question; the smarter the model, the more it can work out the logic by itself and help us get the right answer.
Fabulous_Fact_606@reddit
So the question is: does the experiment need to use the naked LLM or the API? We can't really compare to the foundational models because we don't know what 20K of main-context injection is given to the model before we ask the question.
ViRROOO@reddit
Because it’s just introducing noise into the weights. Complete waste of time and compute
KriosXVII@reddit
Model collapse.
quantum_splicer@reddit
Agreed!!! Once you read Apple's paper "Simple Self-Distillation" it becomes obvious that just feeding the model data is no good, because you need to
"suppress distractor tails where precision matters while preserving useful diversity where exploration matters."
I'm probably missing some points, but intuitively this makes sense. It's a shame it took a research paper, and this long, to find the answer.
UnknownLesson@reddit
Any other important papers?
RevolutionaryLime758@reddit
I have a feeling nothing became obvious to you in particular by reading it. Why don’t you tell us exactly why that applies to this particular distillation topic lol fuggin idiot
9897969594938281@reddit
Damn, a genuine tough guy
Spectrum1523@reddit
hmm my money is on a *claw told to be aggressive
ViRROOO@reddit
Strong sentiments buddy. A quick look at your history shows a pattern, maybe do some mindfulness or look for professional help instead of lashing on people that enjoy running their llms on the weekends on their 3090s :)
Recoil42@reddit
Looks like this is it, for anyone needing a link:
https://arxiv.org/abs/2604.01193
BuffMcBigHuge@reddit (OP)
Yeah - I'm supposing the fine tunes are kinda like merging LoRAs for image/video models. They promote certain types of content while hindering the rest of the model.
Long_War8748@reddit
LoRAs for Image and Video models are actually useful though.
stoppableDissolution@reddit
Thats kinda the purpose of it, no?
MoffKalast@reddit
If your plan is to reduce guardrails yes, otherwise no.
Noob_Krusher3000@reddit
LoRAs are easy to switch out for very specific styles. A language model that smells like Claude, but struggles to do anything else useful because it's been lobotomized isn't very productive.
stoppableDissolution@reddit
I'm not saying this particular one is productive, but the purpose of finetuning (lora or not) is to make a model better in a narrow scope with little or no regard to the rest of its capabilities
No-Refrigerator-1672@reddit
Correct. Yet those "opus" models claim to make it better at all tasks - and that's what a simple finetune on a synthetic dataset can't do; you need to be as smart as the original creator team to make the model better across the board.
Velocita84@reddit
I wish all this compute was spent on training these models for alternative use cases instead of hopelessly trying to beat qwen at making their own models better with logic/code/tools
Long_comment_san@reddit
Yeah, like general chat for example. I believe at this point coding should be passed to adapters.
stoppableDissolution@reddit
~~Just donate it to me, please?~~
ASTRdeca@reddit
That seems a bit handwavy? You could write off any distilled model with that
ViRROOO@reddit
Why would I write a full explanation in a reddit comment when more competent people with more than 5 million total comp wrote better papers, available for free on arXiv?
openSourcerer9000@reddit
Qwopus 27 v2 was actually a banger (v3 seemed the same as qwen reasoning). Best model for a 24gb card, running it right now actually
CucumberAccording813@reddit
A big reason for this is that many Opus-distilled fine-tunes cause the model to think less than it would by default. Since Claude Opus itself doesn't over-reason, the distilled models inherit that same pattern and performance takes a hit because of it. That said, I'd still take a fast Claude-distilled Qwen model that thinks concisely over a "smarter" undistilled model that burns 10k tokens second-guessing itself on a single question.
nickl@reddit
The Jackrong models work pretty well in my testing (8% better than the base models on my agentic benchmark).
I don't see any way that a 40B model based on a 27B model is going to be better unless the trainer's compute budget is ~Qwen's. That's just 13B undertrained parameters.
grumd@reddit
Confirmed by my testing on your benchmark, Qwopus v3 9B is much much better than Qwen 3.5 9B, it's not even close
nickl@reddit
glad you found it useful!
grumd@reddit
Yes, thanks, and please make temperature configurable at least with CLI!
qubridInc@reddit
Most “Opus-style” fine-tunes trade real reasoning for style mimicry so yeah, base models usually outperform them in actual agent workflows.
frozen_tuna@reddit
Opus already has reasoning. I remember reading even back in 2023 that fine-tuning was a bad way to add knowledge but a good way to enforce a style of response. Back when we used early versions of PEFT. Is that still a thing?
Traditional-Gap-3313@reddit
The R1 paper showed you can SFT reasoning into a non-reasoning model. Remember all the DeepSeek R1 8B models, which were Llama finetunes (SFT). However, they did it with ~200k reasoning traces from R1, not 3000.
frozen_tuna@reddit
yes, and? isn't adding {content} to the beginning of a response a style of response? or was there something more to it?
charles25565@reddit
Yes, and training small models on Gemini distillation datasets causes the AI to just hallucinate catchy reasoning summary headers like "Formulating the response" or "Analyzing query".
--Spaci--@reddit
Most people really just can't do it right, and like 1000 samples is essentially just noise. I think the idea is fine, and the easier finetuning gets, the more you will see this, btw.
jacek2023@reddit
More finetunes is not a bad thing; this is how the community lives.
But more reviews like this are the most important thing, because nobody has time to try all the finetunes, so we need some kind of knowledge of what can be really useful.
TheThoccnessMonster@reddit
This one is more of a rough one though.
Disastrous_Hope_9373@reddit
Not just 4.6-opus distill models, but also REAP and abliterated models.
They're all so bad, and they show benchmarks that mislead you about their performance (well at least abliterated gives you a cool side benefit).
I hate them all, and for the opus distill models, I think people should just learn to prompt better with the models from the original lab. Learn the prompt style for whatever new model you downloaded. Stop trying to make everything claude, you can't do that without lobotomizing the model (unless you're smart enough to work in a chinese lab).
Downvote me, just do it, just bite :)
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
PhantomGaming27249@reddit
It's because you can't just feed the model Claude outputs and expect it to get better. You need to filter and curate the dataset, then tune it with specific parameters and make sure the finetune or LoRA adapter you're making is actually working. Then you need to genuinely test and validate. Fine-tuning does work and can improve a model a decent bit over its base; doing it carelessly, though, will hurt performance rather than help it.
Jeidoz@reddit
Why did you decide to compare a Q4 27B vs a Q3_i 40B? There are literally 10+ 27B Claude-distilled models, but for some reason you decided to make a somewhat unfair comparison, using Q4 vs a bit-lobotomized Q3_i?
JohnsGimpyHand@reddit
Uh but that answer is entirely right and many commercial models fail at it
PromptInjection_@reddit
Claude's intelligence most likely doesn't stem from its reasoning patterns being so great.
If you now slap those onto another model, it's like saying, "Let me phrase sentences a bit like Einstein." But that doesn't turn you into Einstein - just a poor linguistic clone who still can't come up with a theory of relativity.
Potential-Gold5298@reddit
This may improve the style of creative writing, but almost always leads to a deterioration in intelligence.
llitz@reddit
Considering my recent Claude interactions... I felt just as frustrated with them later... So... Good copy, but a downgrade nonetheless.
34574rd@reddit
people who think 3000 low quality, general pairs is enough to steer a model are so dumb, what makes you think alibaba and google would have not already done the same if the results would have been substantial?
PhilippeEiffel@reddit
Even if it is 3000 high-quality samples, it's SO small compared to the original training material. You'd have to be a Q1-quantized human to think it will improve some big LLM.
stoppableDissolution@reddit
(spoiler: they are doing it, just better)
rpkarma@reddit
(And at way larger scales)
charles25565@reddit
2,000 high-quality rows on a high-quality ~120M model can easily outperform models that are trained on tens of thousands of rows.
lemon07r@reddit
I've been trying to tell people this. People really need to be more critical of this stuff, because I still see these models among the most-hearted finetunes simply because they have "opus" in their names.
lumos675@reddit
The answer is not correct though? You must be with the car so you can wash it, and the whole answer was pointing to this fact. No?
srigi@reddit
Your comparison is not scientifically accurate because you compared heretic with heretic + "so-called distill". Now you have two variables in the system.
The correct way would be a comparison of a vanilla GGUF from Unsloth or Bartowski with the distill GGUF. No heretic, no abliteration, no unrestriction. I believe these uncensorings do more damage than the opus fine-tunes.
Long_comment_san@reddit
To my understanding these finetunes "match the output" and that's it. They're not trained in depth. Basically what is fed to them are some cases, and the model tries to mimic those in responses and reasoning. It's not really radically changing the thought process. In simple terms, it's just changing the coat to a jacket, I guess.
shing3232@reddit
you need RL after the finetune to ensure generalization
weiyong1024@reddit
For agent work specifically, the overfit is way worse than in plain chat. I noticed this running OpenClaw with a couple of the Qwen-Claude-ish fine tunes versus vanilla Qwen3.5-27B: tool-call accuracy on the tuned versions dropped noticeably on anything that required chaining two tools. My guess is the distillation is optimizing for chat cadence and quietly erodes the structured JSON tool-use patterns the base model already had.
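For anyone wanting to measure this themselves, a minimal check of whether a model's tool-call output is even structurally valid, i.e. parseable JSON naming a known tool with a dict of arguments (a hypothetical sketch of such a format, not OpenClaw's actual schema):

```python
import json

def is_valid_tool_call(raw, allowed_tools):
    """Does the model's raw output parse as JSON and name a known tool?"""
    try:
        call = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return (isinstance(call, dict)
            and call.get("name") in allowed_tools
            and isinstance(call.get("arguments"), dict))

print(is_valid_tool_call('{"name": "search", "arguments": {"q": "x"}}', {"search"}))  # True
print(is_valid_tool_call('Sure! I will call search(q="x")', {"search"}))              # False
```

Running a few hundred agent prompts through base vs tune and counting how often this returns False makes the "erodes structured tool-use" claim testable instead of vibes.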
Su1tz@reddit
"Here's a Gemma fine tune of 1000 opus traces because Google is incompetent and couldn't think about this idea by themselves"
Ok_Helicopter_2294@reddit
Rather than improving intelligence and reasoning ability, it's closer to imitating the thought process. It wasn't completely useless to me, because it had the advantage of saving context tokens.
However, to clearly increase intelligence and reasoning ability, a reinforcement-learning model with a good dataset would be better than SFT.
Yu2sama@reddit
I don't think we should expect every fine-tune to be good. The issue with fine-tunes is that, until you try them, you can't know if they are broken af. Quite different from a LoRA on an image-generation model, where you can directly see the results easily; in text it requires some discernment and extra work to test.
Two extra points relevant to this subject:
- Qwen 3.5 seems to be very sensitive, so I will expect more fine-tunes to suck than to work unless the fine-tuner works, tests and tries to fix until they find the correct version/sauce.
- That's a DavidAU fine-tune, bro has cooked some interesting stuff but his models are, most of the time, broken af.
Qwen seems to be very sensitive anyway; the only fine-tune I have tried that works on par with base Qwen is Qwen3.5-9B-Aggressive, at least from my own tests.
Borkato@reddit
I wonder if we should have an RP benchmark with example outputs given different prompts. Would be cool to see qwen’s writing style vs Gemma’s vs etc
Yu2sama@reddit
It's funny you say that; I have been slowly working on my own RP benchmark with A/B tests and few-shot tests for different fields (spatial understanding, intelligence, character adherence, etc.). Nothing crazy, mostly for personal use, to test my models and see which ones to delete and which ones to keep lol.
On the topic of Qwen vs Gemma...
I only use 9B, E4B, and a bit of the 26B MoE (the prompt processing kills my soul on the last one, hence why I don't use it as much atm). From my tests, the Gemma family has better prose, better understanding, and they use the character card very well; they will surprise you often with certain details of the character. Though, from my experience, they SUCK at style adherence, and a bit at character adherence; they tend to be more realistic and homogeneous, which can be a good or a bad thing depending on who you ask.
Meanwhile Qwen has amazing instruction following; prose is not as good as Gemma's, but its style adherence is superior by a lot. Character adherence tends to be very good too, maybe a bit better than Gemma's.
Borkato@reddit
I find a similar thing true in my tests! I did the same lol, I used python to just get a bunch of writing tests, then I read a few responses from each model without knowing which is which and dump any that sink to the bottom of the list. It turns out ALL Gemma 3 finetunes almost always get a negative score from me, it’s hilarious, I became able to predict it. I’d be like “oh this SUCKS… wait is this Gemma” and it always was LOL
BUT Gemma 4 seems MUCH MUCH MUCH better so I’m happy about that.
Now we just need a model with perfect instruction AND style adherence! 😂
Yu2sama@reddit
Gemma 3 was so bad... no fine-tune could help it. Never understood the people that liked it haha. I don't expect such a perfect model to exist, but if one appears I would be pleasantly surprised!
Borkato@reddit
People LOVED Gemma 3 big tiger though! I hated it so much 😭 I hope a new one comes out with 4 haha, bigger tiger!
robberviet@reddit
I never use those distill models. I doubt anyone outside a big lab can do anything useful without huge computing power, high-quality data and, most of all, talent.
And we already got the distills of Opus! It's called DeepSeek.
FatheredPuma81@reddit
Yea, please don't use David's models (the HF user that always makes these upscaled, buzzword-stuffed ones).
Due-Memory-6957@reddit
That "puzzle" is nonsensical to begin with. There's no right answers to bullshit questions.
letsgoiowa@reddit
I found qwen 3.5 by default to be insanely verbose and kind of unhinged. I found these distills to be faster and more coherent.
lolwutdo@reddit
Fine tunes in general are a downgrade
Noob_Krusher3000@reddit
It's almost as if distilling a model on another's output, with little or no other additional work, *doesn't* give Chinese Open Weight labs an unfair edge over their honest and hardworking American counterparts...
Tell that to the media.
Top-Rub-4670@reddit
LOL
a_beautiful_rhind@reddit
Probably makes it sound like claude at best. Deepseek "distills" weren't deepseek either. Add in the tuners probably being grifty and it's over.
somerussianbear@reddit
I'm curious whether the people who build these share an open-source repo with benchmarks they ran on these Opus-inspired models vs base. I've never seen one, just screenshots of benchmarks.
If in deterministic systems we have a strong collaboration rule in the OSS ecosystem, "you add/keep test coverage", in the non-deterministic world this doesn't seem to matter, which is at least funny given how hard it is to get reproducibility of behavior and quality gates.
wazymandias@reddit
seen this repeatedly. fine-tunes nail the claude voice but lose the reasoning. multi-step tasks fall apart because the model is optimizing for style over substance. base model is more reliable for actual work.
somerussianbear@reddit
1min14sec of reasoning and 4K tokens vs 5sec, 200 tokens, and wrong.
Not sure what I dislike the most.
aeroumbria@reddit
My thought is that if distilling Claude helped, the base-model makers would have done it already. If they haven't, then the training data and process differ enough between Claude and the target model that a relatively small fine-tuning set is probably just going to push the model out of its comfort zone.
Weird-Consequence366@reddit
Ah yes. The single question benchmark with no tool calls. Conclusive.
BuffMcBigHuge@reddit (OP)
I did mention it was anecdotal evidence for the purpose of this post. But my main use of llama.cpp with hermes-agent shows decreased intelligence overall compared to the base heretic.
ansmo@reddit
Data might show them performing differently per usecase and settings. I'm not saying you're wrong, but it would give us more to talk about.
ketosoy@reddit
Matches peak hours opus results.
sine120@reddit
Obviously one question does not a benchmark make, but I do wish people had a more standardized method of testing their fine tunes. Maybe someone here with spare compute will be able to run a standardized set of tests on various quants/finetunes to get a picture of how they compare to the normal base models. KL divergence is cool, Qwopus and OmniCoder are cool names, but how do they compare to the original on LiveCodeBench, etc.?
True_Requirement_891@reddit
Initially my impression of these models was very good, but slowly I realised that they were in fact a downgrade in intelligence.