TheaterFire

New LiveBench results just released. Sonnet 3.7 reasoning now tops the charts and Sonnet 3.7 is also top non-reasoning model

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 59 comments

teachersecret@reddit

It’s substantially better than o1 pro and o3 mini high in my testing. Amazing.
View on Reddit #49553220

MikeyTheGuy@reddit

Yeah, coding is the only thing I care about, and LiveBench is saying o1-mini is still substantially ahead of 3.7 in coding, but anecdotally it seems like people are refuting that. Why does o1-mini have such a high score?
View on Reddit #49556680

YouIsTheQuestion@reddit

3.7 thinking blew me away today. I have a 1.5k-line bash script where I needed to rework how params are handled. I threw it in, asked it to refactor and add some additional validation, and it spat out 1.7k lines in a single shot that worked flawlessly. The amount of output it can put out is amazing.
View on Reddit #49728679

MikeyTheGuy@reddit

Yeah I've been having great success with it so far. I've had a few flubs if it has to do anything too mathy, but otherwise I've literally been directing it for what I want done, and it does it in one shot with zero errors. Very cool model.
View on Reddit #49747107

Zulfiqaar@reddit

OpenAI reasoning models are really good at code generation and architecting "make me a script that does this", but Anthropic models are really good at code modification and agentic edits "change these files to have this feature". In my experience that more or less checks out - I use sonnet on larger codebases with multiple files, but o1 anytime I want a single one-off script for a specific use.
View on Reddit #49567375

hapliniste@reddit

O3 mini is real good at competitive coding. Sonnet is more about real work
View on Reddit #49559159

ForsookComparison@reddit

Benchmarks, even when played fairly, only test how well a model does on that benchmark. Claude has been defying the benchmarks for some time now
View on Reddit #49557793

edgan@reddit

I still find O1 noticeably better, but it has more usage restrictions unless you have Pro. 3.5 = 3.7 for me, and 3.7 thinking is maybe 10% better than 3.5/3.7. I have been using 3.5 for weeks, and 3.7 all day.
View on Reddit #49561126

lc19-@reddit

Why is grok-3-thinking missing a lot of evals?
View on Reddit #49564732

jd_3d@reddit (OP)

No API access yet. They manually benched one category
View on Reddit #49588262

lc19-@reddit

Ok thanks
View on Reddit #49626383

Proud_Fox_684@reddit

Still amazed at the performance of Deepseek-R1.
View on Reddit #49621636

Roshlev@reddit

I feel like we are topping out when it comes to raw model strength. We need more efficient usage of these models. Faster t/s, better hardware, better usage of current hardware, etc.
View on Reddit #49552225

SpecificTeaching8918@reddit

Topping out? Are you high? Look back just 4 months, we had no o1 full. Now o3 full exists, which is miles ahead of o1 preview or whatever. Like guys, if there isn’t progress for legit like 1-2 months, it doesn't mean we are topping out.
View on Reddit #49609152

danielv123@reddit

No progress for 1-2 months? We have had new records by both frontier models and cheap models in the last few weeks.
View on Reddit #49614923

SpecificTeaching8918@reddit

I don't mean we have had no progress. I just said that even if there was no progress for 1-2 months, it doesn't mean we have hit a plateau. I worded it a bit poorly.
View on Reddit #49615227

Short_Ad_8841@reddit

You mean like deepseek and gemini models ?
View on Reddit #49583105

gzzhongqi@reddit

how can its math score be so high? I thought it got a pretty bad score in AIME in the official benchmark from Anthropic.
View on Reddit #49551115

Noak3@reddit

Probably because Anthropic just does a better job than anyone else about being super tryhard at not overfitting to benchmarks
View on Reddit #49592098

Thomas-Lore@reddit

It got a low score with thinking disabled; with thinking enabled it did OK.
View on Reddit #49561261

TheActualStudy@reddit

Aider leaderboard shows 3.7 being 8.8 percentage points ahead of 3.5 (and 23% more tokens needed) for the polyglot leaderboard. Coding is why I give Anthropic money, so this looks generally positive.
View on Reddit #49549370

GodComplecs@reddit

Not to rain on your Anthropic (glazing) parade, but in general Claude is garbage for coding projects. I've made many, many full stack projects and it's always the worst and goes off rails. I always wonder why on Reddit it is suggested so much when even basic chatgpt 3.5 was better... Not even mentioning R1 or local Qwen 32b...
View on Reddit #49561431

Paradigmind@reddit

Nice try Sam..
View on Reddit #49561674

GodComplecs@reddit

Altman? If I have higher regards for R1 and Qwen? You can't even read or comprehend, so 0,5B parameter of you.
View on Reddit #49569996

Paradigmind@reddit

That's what Sam would say!
View on Reddit #49577739

Biggest_Cans@reddit

Sam's just here because he loves it.
View on Reddit #49586044

Evening_Ad6637@reddit

Enemy of your enemy?
View on Reddit #49574447

FUS3N@reddit

It was the best for coding for so long and still is, because it understands the task you give it. No model is good at full-on projects; none was good if you asked for anything other than basic games or things that would already be in their dataset. But for straightforward tasks, if the developer understands their own codebase, they can prompt it in a way that makes things work, and it has always worked really well that way where GPT-4o and similar models struggled. R1 was similarly good this way, but it was a reasoning model.
View on Reddit #49570426

animealt46@reddit

(Most) consumers: Give us 3.5 Sonnet but better! Anthro: Ok here's the model but better. Easy layup tbh.
View on Reddit #49550316

Blolbly@reddit

Is there a place where humans can take the same test? I want to see how I compare
View on Reddit #49582076

Narrow-Ad6201@reddit

sonnet thinking is locked behind a paywall and gemini 2 flash still beats 3.7 sonnet.
View on Reddit #49549593

Thomas-Lore@reddit

> gemini 2 flash still beats 3.7 sonnet

As much as I like Flash, they are not even comparable.
View on Reddit #49561399

Narrow-Ad6201@reddit

i mean idk what your use case is but i don't do any coding whatsoever so i do actually find them pretty comparable. in fact the longer responses of flash are infinitely more useful to me than the somewhat abbreviated claude answers that i get.
View on Reddit #49581345

DefNattyBoii@reddit

Yes but the api is heavily restricted and besides chatting it's hard to use with any integrations.
View on Reddit #49559883

extopico@reddit

And it is all true, not gamed, and even if you don't use the API you have MCPs that make it insanely more powerful and useful than anything else.
View on Reddit #49574498

alw9@reddit

why is o1 pro never in these tables? is it o1 high
View on Reddit #49568726

Sporeboss@reddit

It got stuck in a loop trying to fix a bug for me. Despite 4 tries, it exported the same output. Seems 5000 lines is too much for 3.7.
View on Reddit #49562668

iamnotdeadnuts@reddit

I am not tired of the notifications, "This model just dethroned OPENAI" xD
View on Reddit #49562244

rcparts@reddit

Why is Grok ranked above R1 if it does not have a global average and if we consider its only score, it should be 2 positions below?
View on Reddit #49561257

edgan@reddit

I found Claude 3.7 to be just like 3.5 in Cursor. I found Claude 3.7 thinking in Cursor better. Claude 3.7 thinking has two annoying behaviors. One, it is extremely verbose and sometimes gets stuck repeating itself. Two, it has this annoying "Oh, but here is an extra idea on top of the main idea" habit. I understand that is somewhat to be expected with thinking, but it comes across more as if they said "almost always give the user two thoughts" as part of the prompt. So it comes across as scripted.
View on Reddit #49560992

jd_3d@reddit (OP)

Full list is here: [https://livebench.ai/](https://livebench.ai/) Also interesting here is they used 64k thinking tokens for the evaluation. Not sure if they are going to re-try with the 128k max, but I'd be interested to see if it improves the score.
View on Reddit #49547624

coder543@reddit

Claude 3.5 Sonnet generated about 44 tokens per second according to Artificial Analysis… 64k tokens would be 24 minutes for a single response. 128k would be 48 minutes. Not much “live” about these latencies.
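
The arithmetic behind those figures, as a minimal sketch. It assumes "64k" and "128k" mean 64,000 and 128,000 tokens, that the 44 tokens/sec rate quoted for 3.5 Sonnet carries over unchanged, and that the whole thinking budget is actually used. The function name `generation_minutes` is made up for illustration.

```python
def generation_minutes(tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to emit `tokens` at a constant `tokens_per_second`."""
    return tokens / tokens_per_second / 60


# Assumed budgets: 64k and 128k read as 64,000 and 128,000 tokens.
for budget in (64_000, 128_000):
    minutes = generation_minutes(budget, 44)
    print(f"{budget} tokens at 44 t/s: {minutes:.1f} min")
```

This gives roughly 24 and 48 minutes, in line with the numbers above. Real latency would differ if the model stops thinking early or the serving speed is not constant.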
View on Reddit #49549166

ihexx@reddit

On a side note, this is why I'm so happy about deepseek going open source, because companies like SambaNova, who build ultra-fast compute infra like groq, can pick it up and serve it at 198 tokens/sec: [https://pressreleasehub.pa.media/article/sambanova-launches-the-fastest-deepseek-r1-671b-with-the-highest-efficiency-38402.html](https://pressreleasehub.pa.media/article/sambanova-launches-the-fastest-deepseek-r1-671b-with-the-highest-efficiency-38402.html)

Reasoning models have a terrible UX because of latency, and I hope this sort of shift to infra catches on with other competitors and scales up as we go to longer and longer reasoning chains.
View on Reddit #49560138

mxforest@reddit

You are assuming they run on the same hardware and have the same size/quantization.
View on Reddit #49555677

tengo_harambe@reddit

So when is Claude releasing the weights so we can run this locally?
View on Reddit #49550986

nuclearbananana@reddit

Unlike some companies, Anthropic has never pretended to be open. So probably never. I bet you'll see a half dozen open models trained on it soon enough though
View on Reddit #49551816

ForsookComparison@reddit

The API cost is insanely high. That's an expensive synthetic dataset right there
View on Reddit #49557774

nuclearbananana@reddit

Ppl do it. Qwen models were clearly trained extensively on claude
View on Reddit #49557902

ForsookComparison@reddit

Qwen3 will be lit
View on Reddit #49558028

ForsookComparison@reddit

I just spent an hour or two on a brand new project as well as modifying and extending an existing project. This is the real deal. Only one error the entire time, and it was a silly import issue that it quickly corrected.
View on Reddit #49557707

mehyay76@reddit

The ultimate test is pelican riding a bicycle in SVG🚲 🦢
View on Reddit #49551129

ninjasaid13@reddit

> The ultimate test is pelican riding a bicycle in SVG🚲 🦢

The ultimate test should never be a static test.
View on Reddit #49557577

teachersecret@reddit

https://claude.site/artifacts/38085b40-ed21-4c31-a809-ab4344db4330 Here’s Claude pro 3.7 with extended thinking giving it a shot.
View on Reddit #49553147

TreeAlight@reddit

I have no idea what prompt you used, but here's this with the prompt: "Create an SVG illustration of a pelican riding a bicycle." [https://claude.site/artifacts/2576efda-d23e-4304-85b6-2a6e062cb7bb](https://claude.site/artifacts/2576efda-d23e-4304-85b6-2a6e062cb7bb)
View on Reddit #49551565

mehyay76@reddit

Thank you! Didn't realize that the prompt is not shared when sharing. My prompt was pretty close to yours:

> SVG file of a pelican riding on a bicycle
View on Reddit #49552235

bot_exe@reddit

I find the SWE bench improvement more interesting than the coding in this benchmark. https://preview.redd.it/ie4su26027le1.jpeg?width=1920&format=pjpg&auto=webp&s=484af6150cc2df30676771a773af12ef6c3cfd90
View on Reddit #49549638

soulhacker@reddit

This is from Anthropic so …
View on Reddit #49552433

jd_3d@reddit (OP)

Yes, but until it's independently verified I don't trust it. Why didn't they submit it to the official leaderboard? Or maybe it just hasn't been updated yet...
View on Reddit #49550310

ihaag@reddit

It’s a hybrid model, there is thinking and reasoning… just when you’re happy with open source, Claude smashes it… wish it was open source or at least they shared more technical information.
View on Reddit #49547285