One year’s benchmark progress: comparing Sonnet 3.5 with open weight 2025 non-thinking models

AppearanceHeavy6724@reddit

When will that site die already? All they do is meta-benchmarking, and their results are very misleading.

AppearanceHeavy6724@reddit

Do you really believe Llama 4 Maverick is on par with GPT 4.1 and Deepseek V3 0324? This benchmark says that.

Reply

[-]

I’d say maverick is slightly behind both of them which is what the benchmark says. The idea that Maverick is trash is something I could never understand. A bit lame for being the flagship of a major lab, yes. But a decent middle of the road non-reasoning and super fast model. But, there’s no reason why benchmarks would align perfectly with your personal experiences and preferences. Are you saying that artificial analysis is bad at benchmarking? If so could you clarify why you think they’re bad? Are they getting wrong scores, or choosing the wrong benchmarks? It must be one of these but you haven’t given any hint.

AppearanceHeavy6724@reddit

> Are you saying that artificial analysis is bad at benchmarking? If so could you clarify why you think they’re bad?

I am saying their benchmark is worthless. The synthetic score they produce does not reflect the real performance of the model, as simple as that.

> I’d say maverick is slightly behind both of them which is what the benchmark says

Did you actually try it? It is awful at coding (massively worse than DS V3 0324 and GPT 4.1), worse at math than DeepSeek (I checked), and terrible, abysmal at creative fiction. So in what way is it "slightly behind both of them"?

nomorebuttsplz@reddit (OP)

So you’re saying their scores are incorrect? So it should be easy to find an example of them giving model score x on a test, but another bencher giving score y on the same test.

AppearanceHeavy6724@reddit

> messing around with research agents

Have zero idea what that means.

perelmanych@reddit

I don't understand why you have been downvoted. Pretty much everyone agrees that, because of the AI race between labs, benchmarks have become useless due to benchmaxxing. If AA scores are metascores, then obviously we have garbage in, garbage out, with the difference that now we have only a really vague idea what these metascores are even supposed to measure.

AppearanceHeavy6724@reddit

I think people here hate the idea of models stagnating, all these teenagers here dream about 1.5B Claude Opus models.

perelmanych@reddit

Honestly, I don't think that models are stagnating. It is not so difficult to create a model better in some specific area by using a better dataset. The problem is that now each new model beats all previous models in almost everything, which is obviously BS.

UnionCounty22@reddit

Research agents. Think deep research. Think really deep about this

nomorebuttsplz@reddit (OP)

Because this is the first time you said that in this convo idiot.

Prestigious_Scene971@reddit

They trained on benchmark-like data. It is as simple as that.

jovialfaction@reddit

Is there a good alternative to generally compare LLMs?

Mkengine@reddit

For me, the dubesor benchmark mostly matches my own experience: https://dubesor.de/benchtable.html

jovialfaction@reddit

What a great little corner of the internet. Thanks for sharing

AppearanceHeavy6724@reddit

no. general comparisons make no sense, as LLM uses are various.

nuclearbananana@reddit

I still use sonnet 3.5 daily. It was something special

nomorebuttsplz@reddit (OP)

How does it compare to 4.0 and other newer models in your experience?

nuclearbananana@reddit

It has worse world knowledge and is a little worse at coding, complex explanations, and generating a lot of text (capped at 8192, though I've never hit that). But it's much better at paying attention to what you say in your chats (i.e. outside the prompt), to the point where it sometimes seems to read my mind. Much better at rp/stories (no absurd positivity bias, but it has some annoying quirks), much better at concise answers. I've also found it a bit better at emotional intelligence and pleasantness in general chatting.

AppearanceHeavy6724@reddit

Actually you are right. These folks (https://research.trychroma.com/context-rot) have shown that below 8k context 3.5 is the best at context handling, compared to newer models.

nuclearbananana@reddit

huh interesting. Though this seems to be mainly for a dummy task of repeating a bunch of words with one changed

AppearanceHeavy6724@reddit

No, not only; they have a variety of tasks.

lly0571@reddit

I think Sonnet 3.5 is Llama-405B level in anything besides coding. Kimi K2 and Deepseek-0324 should be better than Sonnet 3.5 overall. Qwen3-235B-Inst-2507 could be better than claude-3.5 if you regard gpt-4o as a model on par with claude-3.5, as it vibes just like a gpt-4o with much better coding/math capability. I think L4 Maverick can serve as a fair Llama4-80B (bad at coding like L3, with world knowledge close to L3-405B and much improved multimodal performance), but it is still slightly worse than 4o in multilingual or multimodal tasks. However, Scout is bad overall: worse than Qwen3-32B, sometimes worse than Gemma3-27B, and nowhere close to L3.3-70B.

a_beautiful_rhind@reddit

> at least in benchmarks

Models definitely got better at code but worse at chat. I did not need charts for this.

TheRealMasonMac@reddit

I still think 3.5 is better than all current open models for chat. Kimi is in the middle, but it honestly kind of feels undertrained.

a_beautiful_rhind@reddit

The closed models are falling off too. Massive trend of parroting, summarizing and expanding instead of actually *replying*. In RP-RP a bit of *you do that* in the message is ok. In pure conversation it sticks out badly. 3.5 sonnet/opus? Newer claude downgraded too. Granted, I never tried new opus, too rich for my blood and never got a proxy with it.

Down_The_Rabbithole@reddit

New Opus is superior to old Opus in creative writing, understanding nuance and understanding your inherent intent behind whatever your prompt is.

TheRealMasonMac@reddit

> Massive trend of parroting, summarizing and expanding instead of actually *replying*.

Now that you mention it, LLMs are becoming like Microsoft Clippy.

nomorebuttsplz@reddit (OP)

Kimi k2 is amazing for chat as long as you are casually discussing your phd thesis.

mindful_maven_25@reddit

Well, I can't comment on the chat. Haven't used chat a lot. But coding is getting better maybe because everyone realized that there's lots of money in coding and it is possibly easier to get it right than reasoning?

noage@reddit

It does seem like a lot of models have somehow agreed that GPTisms are the best idea ever and have colluded to put them into all responses.

nomorebuttsplz@reddit (OP)

These modern models have become geniuses but unwilling/unable to let loose and have some fun.

Traditional-Gap-3313@reddit

In what universe is Scout or Maverick smarter than Sonnet 3.5 in *anything*?

nomorebuttsplz@reddit (OP)

keep in mind this is a version of sonnet 3.5 from a time when an early, now inferior version of 4o was SOTA

nomorebuttsplz@reddit (OP)

the link doesn’t seem to work right on mobile. it’s supposed to compare sonnet 3.5 (June 24 version only) with current open weight models, sort of like this: [https://imgur.com/a/NN1oKJl](https://imgur.com/a/NN1oKJl)
