Yeah, coding is the only thing I care about, and LiveBench is saying o1-mini is still substantially ahead of 3.7 in coding, but anecdotally it seems like people are refuting that. Why does o1-mini score so much higher?
3.7 thinking blew me away today. I have a 1.5k-line bash script where I needed to rework how params are handled. I threw it in, asked it to refactor and add some additional validation, and it spat out 1.7k lines in a single shot that worked flawlessly. The amount of output it can produce is amazing.
Yeah I've been having great success with it so far. I've had a few flubs if it has to do anything too mathy, but otherwise I've literally been directing it for what I want done, and it does it in one shot with zero errors. Very cool model.
OpenAI reasoning models are really good at code generation and architecting ("make me a script that does this"), but Anthropic models are really good at code modification and agentic edits ("change these files to have this feature").
In my experience that more or less checks out - I use sonnet on larger codebases with multiple files, but o1 anytime I want a single one-off script for a specific use.
I still find o1 noticeably better, but it has more usage restrictions unless you have Pro.
3.5 = 3.7 for me, and 3.7 thinking is maybe 10% better than 3.5/3.7. I have been using 3.5 for weeks, and 3.7 all day.
I feel like we are topping out when it comes to raw model strength. We need more efficient usage of these models. Faster t/s, better hardware, better usage of current hardware, etc.
Topping out? Are you high? Look back just 4 months: we had no o1 full. Now o3 full exists, which is miles ahead of o1-preview or whatever. Like guys, if there isn't progress for legit like 1-2 months, it doesn't mean we are topping out.
I don't mean we have had no progress. I just said that even if there was no progress for 1-2 months, that doesn't mean we have hit a plateau. I worded it a bit poorly.
Aider leaderboard shows 3.7 being 8.8 percentage points ahead of 3.5 (and 23% more tokens needed) for the polyglot leaderboard. Coding is why I give Anthropic money, so this looks generally positive.
Not to rain on your Anthropic (glazing) parade, but in general Claude is garbage for coding projects. I've made many, many full-stack projects and it's always the worst and goes off the rails. I always wonder why it is suggested so much on Reddit when even basic ChatGPT 3.5 was better... Not even mentioning R1 or local Qwen 32b...
It was the best for coding for so long, and still is, because it understands the task you give it. No model is good at full-on projects; none was ever good if you asked for anything other than basic games or things that would already be in their dataset. But for straightforward tasks, if the developer understands their own codebase, they can prompt it in a way that makes things work, and it has always worked really well that way, where GPT-4o and similar models struggled. R1 was similarly good this way, but it was a reasoning model.
I mean, idk what your use case is, but I don't do any coding whatsoever, so I do actually find them pretty comparable. In fact, the longer responses of Flash are infinitely more useful to me than the somewhat abbreviated Claude answers that I get.
I found Claude 3.7 to be just like 3.5 in Cursor. I found Claude 3.7 thinking in Cursor better.
Claude 3.7 thinking has two annoying behaviors. One, it is extremely verbose and sometimes gets stuck repeating itself. Two, it has this annoying "oh, but here is an extra idea on top of the main idea" habit. I understand that is somewhat to be expected with thinking, but it comes across more as if the prompt tells it to almost always give the user two thoughts. So it comes across as scripted.
Full list is here: [https://livebench.ai/](https://livebench.ai/)
Also interesting here is they used 64k thinking tokens for the evaluation. Not sure if they are going to re-try with the 128k max, but I'd be interested to see if it improves the score.
Claude 3.5 Sonnet generated about 44 tokens per second according to Artificial Analysis… 64k tokens would be 24 minutes for a single response. 128k would be 48 minutes. Not much “live” about these latencies.
On a side note, this is why I'm so happy about DeepSeek going open source, because companies like SambaNova, who build ultra-fast compute infra like Groq, can pick it up and serve it at 198 tokens/sec
[https://pressreleasehub.pa.media/article/sambanova-launches-the-fastest-deepseek-r1-671b-with-the-highest-efficiency-38402.html](https://pressreleasehub.pa.media/article/sambanova-launches-the-fastest-deepseek-r1-671b-with-the-highest-efficiency-38402.html)
Reasoning models have a terrible UX because of latency, and I hope this sort of shift to infra catches on with other competitors and scales up as we go to longer and longer reasoning chains.
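For anyone who wants to check the latency math in this part of the thread, here's a rough sketch. It assumes "64k" and "128k" mean 64,000 and 128,000 tokens, that the decode rate stays constant for the whole response, and that the 198 tokens/sec figure (quoted for DeepSeek R1 on SambaNova) can be compared directly with the 44 tokens/sec figure for Claude 3.5 Sonnet. The function name `generation_minutes` is made up for this example.

```python
def generation_minutes(num_tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to generate num_tokens at a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second / 60


# Figures quoted in the thread; the exact token counts are an assumption.
BUDGETS = {"64k": 64_000, "128k": 128_000}
RATES = {"44 tok/s": 44.0, "198 tok/s": 198.0}

for budget_name, tokens in BUDGETS.items():
    for rate_name, rate in RATES.items():
        minutes = generation_minutes(tokens, rate)
        print(f"{budget_name} thinking budget at {rate_name}: {minutes:.1f} min")
```

At 44 tokens/sec this comes out to roughly 24 and 48 minutes, matching the numbers above; at 198 tokens/sec the same budgets would take roughly 5 and 11 minutes. It ignores time to first token and any queueing, so treat it as a best-case estimate rather than a measured latency.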
Unlike some companies, Anthropic has never pretended to be open. So probably never.
I bet you'll see a half dozen open models trained on it soon enough though
I just spent an hour or two on a brand new project as well as modifying and extending an existing project.
This is the real deal. Only one error the entire time, and it was a silly import issue that it quickly corrected.
I have no idea what prompt you used, but here's this with the prompt: "Create an SVG illustration of a pelican riding a bicycle."
[https://claude.site/artifacts/2576efda-d23e-4304-85b6-2a6e062cb7bb](https://claude.site/artifacts/2576efda-d23e-4304-85b6-2a6e062cb7bb)
I find the SWE bench improvement more interesting than the coding in this benchmark.
https://preview.redd.it/ie4su26027le1.jpeg?width=1920&format=pjpg&auto=webp&s=484af6150cc2df30676771a773af12ef6c3cfd90
Yes, but until it's independently verified I don't trust it. Why didn't they submit it to the official leaderboard? Or maybe it just hasn't been updated yet...
It’s a hybrid model: there is thinking and reasoning… just when you’re happy with open source, Claude smashes it… wish it was open source, or at least that they shared more technical information.