New LiveBench results just released. Sonnet 3.7 reasoning now tops the charts and Sonnet 3.7 is also top non-reasoning model

teachersecret@reddit

It’s substantially better than o1 pro and o3 mini high in my testing. Amazing.

[-]

Yeah, coding is the only thing I care about, and LiveBench is saying o1-mini is still substantially ahead of 3.7 in coding, but anecdotally it seems like people are refuting that. Why is o1-mini's score so much higher?

YouIsTheQuestion@reddit

3.7 thinking blew me away today. I have a 1.5k-line bash script where I needed to rework how params are handled. I threw it in, asked it to refactor and add some additional validation, and it spat out 1.7k lines in a single shot that worked flawlessly. The amount of output it can put out is amazing.

MikeyTheGuy@reddit

Yeah I've been having great success with it so far. I've had a few flubs if it has to do anything too mathy, but otherwise I've literally been directing it for what I want done, and it does it in one shot with zero errors. Very cool model.

Zulfiqaar@reddit

OpenAI reasoning models are really good at code generation and architecting "make me a script that does this", but Anthropic models are really good at code modification and agentic edits "change these files to have this feature". In my experience that more or less checks out - I use sonnet on larger codebases with multiple files, but o1 anytime I want a single one-off script for a specific use.

hapliniste@reddit

O3 mini is real good at competitive coding. Sonnet is more about real work

ForsookComparison@reddit

Benchmarks, even when played fairly, only test how well a model does on that benchmark. Claude has been defying the benchmarks for some time now

edgan@reddit

I still find O1 noticeably better, but it has more usage restrictions unless you have Pro. 3.5 = 3.7 for me, and 3.7 thinking is maybe 10% better than 3.5/3.7. I have been using 3.5 for weeks, and 3.7 all day.

lc19-@reddit

Why is grok-3-thinking missing a lot of evals?

jd_3d@reddit (OP)

No API access yet. They manually benched one category

lc19-@reddit

Ok thanks

Proud_Fox_684@reddit

Still amazed at the performance of Deepseek-R1.

Roshlev@reddit

I feel like we are topping out when it comes to raw model strength. We need more efficient usage of these models. Faster t/s, better hardware, better usage of current hardware, etc.

SpecificTeaching8918@reddit

Topping out? Are you high? Look back just 4 months, we had no o1 full. Now o3 full exists, which is miles ahead of o1 preview or whatever. Like guys, if there isn’t progress for legit like 1-2 months, it doesn’t mean we are topping out.

danielv123@reddit

No progress for 1-2 months? We have had new records by both frontier models and cheap models in the last few weeks.

SpecificTeaching8918@reddit

I don't mean we have had no progress. I just said that even if there was no progress for 1-2 months, it doesn't mean we have hit a plateau. I worded it a bit poorly.

Short_Ad_8841@reddit

You mean like deepseek and gemini models?

gzzhongqi@reddit

how can its math score be so high? I thought it got a pretty bad score in AIME in the official benchmark from Anthropic.

Noak3@reddit

Probably because Anthropic just does a better job than anyone else about being super tryhard at not overfitting to benchmarks

Thomas-Lore@reddit

It got a low score with thinking disabled; with thinking enabled it did OK.

TheActualStudy@reddit

Aider leaderboard shows 3.7 being 8.8 percentage points ahead of 3.5 (and 23% more tokens needed) for the polyglot leaderboard. Coding is why I give Anthropic money, so this looks generally positive.

GodComplecs@reddit

Not to rain on your Anthropic (glazing) parade, but in general Claude is garbage for coding projects. I've made many, many full stack projects and it's always the worst and goes off rails. I always wonder why on Reddit it is suggested so much when even basic chatgpt 3.5 was better... Not even mentioning R1 or local Qwen 32b...

Paradigmind@reddit

Nice try Sam..

GodComplecs@reddit

Altman? If I have higher regards for R1 and Qwen? You can't even read or comprehend, so 0,5B parameter of you.

Paradigmind@reddit

That's what Sam would say!

Biggest_Cans@reddit

Sam's just here because he loves it.

Evening_Ad6637@reddit

Enemy of your enemy?

FUS3N@reddit

It was the best for coding for so long and still is, because it understands the task you give it. No model is good at full-on projects; none was good if you asked for anything other than basic games or things that would already be in their dataset. But for straightforward tasks, if the developer understands their own codebase, they can prompt it in a way that makes things work. It has always worked really well that way, where gpt4o and similar models struggled. r1 was similarly good this way, but it was a reasoning model.

animealt46@reddit

(Most) consumers: Give us 3.5 Sonnet but better! Anthro: Ok here's the model but better. Easy layup tbh.

Blolbly@reddit

Is there a place where humans can take the same test? I want to see how I compare

Narrow-Ad6201@reddit

sonnet thinking is locked behind a paywall and gemini 2 flash still beats 3.7 sonnet.

Thomas-Lore@reddit

> gemini 2 flash still beats 3.7 sonnet

As much as I like Flash, they are not even comparable.

Narrow-Ad6201@reddit

i mean idk what your use case is but i don't do any coding whatsoever so i do actually find them pretty comparable. in fact the longer responses of flash are infinitely more useful to me than the somewhat abbreviated claude answers that i get.

DefNattyBoii@reddit

Yes but the api is heavily restricted and besides chatting it's hard to use with any integrations.

extopico@reddit

And it is all true, not gamed, and even if you don't use the API you have MCPs that make it insanely more powerful and useful than anything else.

alw9@reddit

why is o1 pro never in these tables? is it o1 high?

Sporeboss@reddit

It got stuck in a loop trying to fix a bug for me. Despite 4 tries, it exported the same output each time. Seems 5000 lines is too much for 3.7.

iamnotdeadnuts@reddit

I am not tired of the notifications, "This model just dethroned OPENAI" xD

rcparts@reddit

Why is Grok ranked above R1 if it does not have a global average and if we consider its only score, it should be 2 positions below?

edgan@reddit

I found Claude 3.7 to be just like 3.5 in Cursor. I found Claude 3.7 thinking in Cursor better. Claude 3.7 thinking has two annoying behaviors. One, it is extremely verbose and sometimes gets stuck repeating itself. Two, it has this annoying "Oh, but here is an extra idea on top of the main idea" habit. I understand that is somewhat to be expected with thinking, but it comes across more as if the prompt told it to almost always give the user two thoughts. So it comes across as scripted.

jd_3d@reddit (OP)

Full list is here: [https://livebench.ai/](https://livebench.ai/) Also interesting here is they used 64k thinking tokens for the evaluation. Not sure if they are going to re-try with the 128k max, but I'd be interested to see if it improves the score.

coder543@reddit

Claude 3.5 Sonnet generated about 44 tokens per second according to Artificial Analysis… 64k tokens would be 24 minutes for a single response. 128k would be 48 minutes. Not much “live” about these latencies.
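
For anyone who wants to redo that latency math with other numbers, here is a rough sketch. It assumes "64k" and "128k" mean 64,000 and 128,000 tokens, that generation speed stays constant for the whole response, and that the 44 tokens/s figure quoted for 3.5 Sonnet carries over to 3.7 thinking, which may not hold. The function name `response_minutes` is made up for illustration.

```python
def response_minutes(tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to generate `tokens` at a constant `tokens_per_second`."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second / 60


# 44 tokens/s is the Claude 3.5 Sonnet figure quoted in the comment;
# applying it to 3.7 thinking output is an assumption.
for budget in (64_000, 128_000):
    print(f"{budget:,} tokens -> {response_minutes(budget, 44):.1f} min")
```

This prints roughly 24.2 and 48.5 minutes, matching the 24 and 48 minute estimates. Swapping in a different tokens/s value shows how much faster serving would shrink the wait.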

ihexx@reddit

on a side note, this is why I'm so happy about deepseek going open source, because companies like SambaNova who build ultra-fast compute infra like groq can pick it up and serve it at 198 tokens/sec: [https://pressreleasehub.pa.media/article/sambanova-launches-the-fastest-deepseek-r1-671b-with-the-highest-efficiency-38402.html](https://pressreleasehub.pa.media/article/sambanova-launches-the-fastest-deepseek-r1-671b-with-the-highest-efficiency-38402.html)

reasoning models have a terrible ux because of latency, and I hope this sort of shift to infra catches on with other competitors and scales up as we go to longer and longer reasoning chains

mxforest@reddit

You are assuming they run on the same hardware and have the same size/quantization.

tengo_harambe@reddit

So when is Claude releasing the weights so we can run this locally?

nuclearbananana@reddit

Unlike some companies, Anthropic has never pretended to be open. So probably never. I bet you'll see a half dozen open models trained on it soon enough though

ForsookComparison@reddit

The API cost is insanely high. That's an expensive synthetic dataset right there

nuclearbananana@reddit

Ppl do it. Qwen models were clearly trained extensively on claude

ForsookComparison@reddit

Qwen3 will be lit

ForsookComparison@reddit

I just spent an hour or two on a brand new project as well as modifying and extending an existing project. This is the real deal. Only one error the entire time, and it was a silly import issue that it quickly corrected.

mehyay76@reddit

The ultimate test is pelican riding a bicycle in SVG🚲 🦢

ninjasaid13@reddit

> The ultimate test is pelican riding a bicycle in SVG🚲 🦢

the ultimate test should never be a static test.

teachersecret@reddit

https://claude.site/artifacts/38085b40-ed21-4c31-a809-ab4344db4330 Here’s Claude pro 3.7 with extended thinking giving it a shot.

TreeAlight@reddit

I have no idea what prompt you used, but here's this with the prompt: "Create an SVG illustration of a pelican riding a bicycle." [https://claude.site/artifacts/2576efda-d23e-4304-85b6-2a6e062cb7bb](https://claude.site/artifacts/2576efda-d23e-4304-85b6-2a6e062cb7bb)

mehyay76@reddit

Thank you! I didn't realize the prompt isn't included when you share. My prompt was pretty close to yours:

> SVG file of a pelican riding on a bicycle

bot_exe@reddit

I find the SWE bench improvement more interesting than the coding in this benchmark. https://preview.redd.it/ie4su26027le1.jpeg?width=1920&format=pjpg&auto=webp&s=484af6150cc2df30676771a773af12ef6c3cfd90

soulhacker@reddit

This is from Anthropic so …

jd_3d@reddit (OP)

Yes, but until it's independently verified I don't trust it. Why didn't they submit it to the official leaderboard? Or maybe it just hasn't been updated yet...

ihaag@reddit

It’s a hybrid model, there is thinking and reasoning… just when you’re happy with open source, Claude smashes it… wish it was open source or at least they shared more technical information.

59 Comments