One year’s benchmark progress: comparing Sonnet 3.5 with open weight 2025 non-thinking models

Posted by nomorebuttsplz@reddit | LocalLLaMA | 36 comments

AI did not hit a plateau, at least in benchmarks. Pretty impressive with one year’s hindsight. Of course benchmarks aren’t everything. They aren’t nothing either.

36 Comments

AppearanceHeavy6724@reddit

When will that site die already? All they do is metabenchmarking, and their results are very misleading.

nomorebuttsplz@reddit (OP)

How are they misleading? 

AppearanceHeavy6724@reddit

Do you really believe Llama 4 Maverick is on par with GPT 4.1 and Deepseek V3 0324? This benchmark says that.

nomorebuttsplz@reddit (OP)

I’d say Maverick is slightly behind both of them, which is what the benchmark says. The idea that Maverick is trash is something I could never understand. A bit lame for the flagship of a major lab, yes. But it is a decent, middle-of-the-road, super fast non-reasoning model. Still, there’s no reason why benchmarks would align perfectly with your personal experiences and preferences. Are you saying that Artificial Analysis is bad at benchmarking? If so, could you clarify why you think they’re bad? Are they getting the wrong scores, or choosing the wrong benchmarks? It must be one of these, but you haven’t given any hint.

AppearanceHeavy6724@reddit

> Are you saying that artificial analysis is bad at benchmarking? If so could you clarify why you think they’re bad?

I am saying their benchmark is worthless. The synthetic score they produce does not reflect the real performance of the model, simple as that.

> I’d say maverick is slightly behind both of them which is what the benchmark says

Did you actually try it? It is awful at coding (massively worse than DS V3 0324 and GPT 4.1), worse at math than DeepSeek (I checked), and terrible, abysmal at creative fiction. So in what way is it "slightly behind both of them"?

nomorebuttsplz@reddit (OP)

So you’re saying their scores are incorrect? So it should be easy to find an example of them giving model score x on a test, but another bencher giving score y on the same test.

AppearanceHeavy6724@reddit

> messing around with research agents

I have zero idea what that means.

perelmanych@reddit

I don't understand why you have been downvoted. Pretty much everyone agrees that, because of the labs' AI race, benchmarks have become useless due to benchmaxing. If AA scores are metascores, then obviously we have garbage in, garbage out, with the difference that now we have only a really vague idea of what these metascores are even supposed to measure.

AppearanceHeavy6724@reddit

I think people here hate the idea of models stagnating, all these teenagers here dream about 1.5B Claude Opus models.

perelmanych@reddit

Honestly, I don't think that models are stagnating. It is not so difficult to create a model better in some specific area by using a better dataset. The problem is that now each new model beats all previous models in almost everything, which is obviously BS.

UnionCounty22@reddit

Research agents. Think deep research. Think really deep about this

nomorebuttsplz@reddit (OP)

Because this is the first time you said that in this convo, idiot.

Prestigious_Scene971@reddit

They trained on benchmark-like data. It is as simple as that.

jovialfaction@reddit

Is there a good alternative to generally compare LLMs?

Mkengine@reddit

For me, the dubesor benchmark mostly matches my own experience: https://dubesor.de/benchtable.html

jovialfaction@reddit

What a great little corner of the internet. Thanks for sharing

AppearanceHeavy6724@reddit

No. General comparisons make no sense, since LLM use cases vary.

nuclearbananana@reddit

I still use sonnet 3.5 daily. It was something special

nomorebuttsplz@reddit (OP)

How does it compare to 4.0 and other newer models in your experience?

nuclearbananana@reddit

It has worse world knowledge and is a little worse at coding, complex explanations, and generating a lot of text (capped at 8192, though I've never hit that). But it's much better at paying attention to what you say in your chats (i.e., outside the prompt), to the point where it sometimes seems to read my mind. It is much better at rp/stories (no absurd positivity bias, though it has some annoying quirks) and much better at concise answers. I've also found it a bit better at emotional intelligence and pleasantness in general chatting.

AppearanceHeavy6724@reddit

Actually you are right. These folks (https://research.trychroma.com/context-rot) have shown that below 8k context 3.5 is the best at context handling, compared to newer models.

nuclearbananana@reddit

Huh, interesting. Though this seems to be mainly for a dummy task of repeating a bunch of words with one changed.

AppearanceHeavy6724@reddit

No, not only; they have a variety of tasks.

lly0571@reddit

I think Sonnet 3.5 is Llama-405B level in anything besides coding. Kimi K2 and Deepseek-0324 should be better than Sonnet 3.5 overall. Qwen3-235B-Inst-2507 could be better than claude-3.5 if you regard gpt-4o as a model on par with claude-3.5, as it vibes just like a gpt-4o with much better coding/math capability. I think L4 Maverick can serve as a fair Llama4-80B (as bad at coding as L3, with world knowledge close to L3-405B and much improved multimodal performance), but it is still slightly worse than 4o in multilingual or multimodal tasks. However, Scout is bad overall: worse than Qwen3-32B, sometimes worse than Gemma3-27B, and nowhere close to L3.3-70B.

a_beautiful_rhind@reddit

> at least in benchmarks

Models definitely got better at code but worse at chat. I did not need charts for this.

TheRealMasonMac@reddit

I still think 3.5 is better than all current open models for chat. Kimi is in the middle, but it honestly kind of feels undertrained.

a_beautiful_rhind@reddit

The closed models are falling off too. Massive trend of parroting, summarizing and expanding instead of actually *replying*. In RP-RP a bit of *you do that* in the message is ok. In pure conversation it sticks out badly. 3.5 sonnet/opus? Newer claude downgraded too. Granted, I never tried new opus, too rich for my blood and never got a proxy with it.

Down_The_Rabbithole@reddit

New Opus is superior to old Opus in creative writing, understanding nuance and understanding your inherent intent behind whatever your prompt is.

TheRealMasonMac@reddit

> Massive trend of parroting, summarizing and expanding instead of actually *replying*.

Now that you mention it, LLMs are becoming like Microsoft Clippy.

nomorebuttsplz@reddit (OP)

Kimi k2 is amazing for chat as long as you are casually discussing your phd thesis. 

mindful_maven_25@reddit

Well, I can't comment on the chat. Haven't used chat a lot. But coding is getting better maybe because everyone realized that there's lots of money in coding and it is possibly easier to get it right than reasoning?

noage@reddit

It does seem like a lot of models have somehow agreed that GPTisms are the best idea ever and have colluded to put them into all responses.

nomorebuttsplz@reddit (OP)

These modern models have become geniuses but unwilling/unable to let loose and have some fun.

Traditional-Gap-3313@reddit

In what universe is Scout or Maverick smarter than Sonnet 3.5 in *anything*?

nomorebuttsplz@reddit (OP)

Keep in mind this is a version of Sonnet 3.5 from a time when an early, now-inferior version of 4o was SOTA.

nomorebuttsplz@reddit (OP)

The link doesn’t seem to work right on mobile. It’s supposed to compare Sonnet 3.5 (June 2024 version only) with current open weight models, sort of like this: https://imgur.com/a/NN1oKJl