Two weeks after release, old Sonnet 3.5 still beats new Sonnet on LiveCodeBench

[-]

afajohn@reddit

I have a complex prompt for news translation and editing in Turkish. Style, ad. phrases to omit, picking valuable information etc. There are a bunch of few-shot examples and style guidelines. They are engineered through a lot of iterations and the outputs are perfect so far. Whenever I switch to v2, performance degrades drastically. The prompts need a lot of re-engineering and mostly still fails. I hope they will not deprecate this version any time soon.

[-]

Substantial-Thing303@reddit

Why is o1-mini crushing everything else in this bench, even o1-preview, in this test, when it is 10th in the Aider Leaderboard, with o1-preview and New Sonnet clearly on top, as expected?

[-]

ObnoxiouslyVivid@reddit (OP)

o1 has been trained to one-shot problems. If it can't write the whole code in 1 answer it doesn't perform well. Aider Leaderboard, on the other hand, is all about editing existing code files, which o1 completely fails at understading.

[-]

slumdogbi@reddit

Still doesn’t makes any sense. O1 mini is a complete trash comparing to sonnet

[-]

AryanEmbered@reddit

Newer one is clearly a smaller model, designed for the new copilot integration

[-]

Plums_Raider@reddit

what annoyed me the most about old sonnet was the repetitive talking style. this feels better with the new sonnet to me.

[-]

wakigatameth@reddit

That's funny because New Sonnet solved a nagging programming issue that neither GPT-4o nor Old Sonnet could solve, and I spent DAYS with them trying to solve it.

It involved rewriting my renderer from using DirectX9 sprite interface, into a vertex-based renderer. The resulting image was always corrupt, with blurry sprites and seams, and nothing seemed to fix it.

Then new Claude came, and fixed it in one try. The blitting is now pixel-perfect.

[-]

estebansaa@reddit

New version is way better, but it needs more precise prompting.

[-]

ObnoxiouslyVivid@reddit (OP)

Do you have any examples of what works and what doesn't?

[-]

estebansaa@reddit

a better way to solve it will be if you paste a prompt you used, and then I can try to give you one that works.

[-]

Armym@reddit

Can I use old sonnet? I

[-]

ObnoxiouslyVivid@reddit (OP)

Yes, it's guaranteed available until June 2025: Model Deprecations - Anthropic

[-]

no3ther@reddit

Coding is my core LLM use case, and this tracks with what I've experienced, based on personal use.

Surprising since 10.22 beats 06.20 on SWE-bench.

Sus IMO.

[-]

ObnoxiouslyVivid@reddit (OP)

Is this a case of classic overfitting on the benchmarks?

[-]

Orolol@reddit

1022 also beat on old sonnet on aider leaderboard, lmarena and on livebench

https://livebench.ai/

https://aider.chat/docs/leaderboards/

https://lmarena.ai/?leaderboard

By personal experience, new Sonnet is far better than old.

[-]

no3ther@reddit

Interesting. For coding? This is really not my experience.

What's your workflow like / what language do you use?

[-]

Zaratsu_Daddy@reddit

Anecdotally I find that it’s often impressive in ways that old sonnet wasn’t, and also often stupid in ways old sonnet wasn’t.

Purely vibe based

[-]

knvn8@reddit

I find that it's good at writing code but bad at reflecting on whether it's writing the right code

[-]

dhrumil-@reddit

Thank god someone bought this up, i realised this the first day but was scared of being bombarded by downvotes by hype bros in the beginning lol

[-]

johnnyXcrane@reddit

or maybe theres differences in use cases? One benchmark means nothing. For me the new Sonnet performs noticeably better and I am not the only one. https://aider.chat/docs/leaderboards/

[-]

PlantFlat4056@reddit

You mean hype bots

[-]

AloHiWhat@reddit

It depends on task, I do not know what they test. I found sonnet super impressive but chatgpt also does well. Haiku made errors, not impressed

[-]

Temsirolimus555@reddit

hobby coder here, new sonnet blows the old one out of the water for me. Then again, I am no where near doing this professionally so take that for what its worth.

[-]

Evening_Ad6637@reddit

Yes, your observation is consistent with the bench results and observations of other users. The new sonnet seems to score better for simple tasks. But I can also confirm that the new sonnet frustrated me quite often when it came to more difficult and complex tasks, which is why I returned to the old dude. What I have also noticed is that the new sonnet's answers are much more concise. This can be good or bad and depends on personal preferences, but also on the task itself.

[-]

Formal-Narwhal-1610@reddit

Bring back our Beloved Old Sonnet 3.5 and Apologise for Haiku 3.5 price increase!

[-]

AcanthaceaeNo5503@reddit

Yeah, I use it a lot for coding and I can feel it performs worse actually. But the good thing is at least the speed is improved

[-]

FaultInteresting3856@reddit

OpenAI coincidentally released a product called Swarm right around the o1 releases. The o1 models clearly have 'something' in them that is allowing them to straight up massacre every other model in existence. It is not CoT, other models employ this too. There is only one other logical conclusion as to what (x) could be in this instance.

[-]

SpecialistStory336@reddit

Why is this getting downvoted? Swarm agents are a real thing that OpenAI has been working on recently and is worth looking into.

[-]

Sudden-Lingonberry-8@reddit

because this is local llama and that ain't local

[-]

dydhaw@reddit

Swarm is "An educational framework exploring ergonomic, lightweight multi-agent orchestration.". The code is on github. o1 was custom trained, likely using RL. That's the secret sauce, not some generic agent pattern.

[-]

FaultInteresting3856@reddit

People are sheep and 99% of people are too lazy to learn simple mathematics, yet a lot more than 1% of people want to discuss AI.

[-]

SpecialistStory336@reddit

Yep. I've been working with CoT recently and I definitely think OpenAI has some other stuff going on beneath the hood to make their models perform well. That is the entire reason they are not showing the CoT process. They want to gatekeep their lead, so it is important to crack the code and figure out how they are doing it so that we can implement it ourselves with local models.

[-]

FaultInteresting3856@reddit

I agree. That is why I open sourced literally everything related to swarm algorithms a few months before OpenAI was kind enough to open source 'Swarm' out of the sheer kindness of their hearts. I have also been looking into Entropix recently. I also open sourced literally everything I possibly could around High Dimensional Computing (HDC) and Transformers last month. I expect OpenAI to gift that to the world next month too.

[-]

SpecialistStory336@reddit

Can you post the links for this?

[-]

Master-Meal-77@reddit

Which is?

[-]

FaultInteresting3856@reddit

Swarm.Algorithms.

[-]

QD____@reddit

What is a swarm algorithm? As someone who has read hundreds of research papers in the past 2 years, this doesn't mean anything to me other than it maybe being a nebulous market tactic to get VCs hyped for a funding round.

[-]

FaultInteresting3856@reddit

That is because swarm algorithms are 20 years old. It is exactly what it sounds like, a swarm of agents. Like an ant swarm. They are kind of stupid, they do not have a brain. This modern technology came out in the last few years that is basically a brain and a mouth. Coincidentally, when you couple it with a couple thousand 'ants', it is like the queen. OpenAI literally released a product called Swarm, that you all just flat out ignore.

[-]

dydhaw@reddit

That has absolutely nothing to do with what OpenAI's swarm is lol. You can actually read the code, you know. It's on github. Besides, OpenAI released some details on how they trained o1. We know they are using RL to train the model. They may be using agentic patterns but that's not really groundbreaking stuff on its own.

[-]

FaultInteresting3856@reddit

They released a research paper you know. I don't debate, 'yeah this isn't what you said..' Drop a link or GTFO.

[-]

dydhaw@reddit

Oh a research paper you say? You used their code? Wow, incredible. You must be a super genius OpenAI insider. Strange that you didn't read the first sentence of the o1 release post.

We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.

[-]

FaultInteresting3856@reddit

You must be a stalker or extremely dense to be this semantic and unscientific about this topic. I am going to assume it is the former and move on. " If I have to research who you are I sue you. That's my new motto."

[-]

dydhaw@reddit

I'm sorry if I've been rude, but I promise, I'm not stalking you, merely replying to this thread. It seems you're going through a lot right now. I don't know what kind of support network you have but talking to a close friend or family member or even a mental health professional can help when you're having intense thoughts and feelings. It might be helpful to step away from the screen for a bit. Sometimes a little rest or a change of environment can give things some clarity.

[-]

ObnoxiouslyVivid@reddit (OP)

New Sonnet 3.5 is performing better on easy problems and worse on medium ones

[-]