Two weeks after release, old Sonnet 3.5 still beats new Sonnet on LiveCodeBench
Posted by ObnoxiouslyVivid@reddit | LocalLLaMA | 56 comments

afajohn@reddit
I have a complex prompt for news translation and editing in Turkish: style, ad phrases to omit, picking out valuable information, etc. There are a bunch of few-shot examples and style guidelines, engineered through a lot of iterations, and the outputs have been perfect so far. Whenever I switch to v2, performance degrades drastically. The prompts need a lot of re-engineering and mostly still fail. I hope they will not deprecate this version any time soon.
Substantial-Thing303@reddit
Why is o1-mini crushing everything else in this bench, even o1-preview, when it is 10th on the Aider leaderboard, with o1-preview and New Sonnet clearly on top, as expected?
ObnoxiouslyVivid@reddit (OP)
o1 has been trained to one-shot problems. If it can't write the whole solution in one answer, it doesn't perform well. The Aider leaderboard, on the other hand, is all about editing existing code files, which o1 completely fails at understanding.
slumdogbi@reddit
Still doesn't make any sense. o1-mini is complete trash compared to Sonnet.
AryanEmbered@reddit
The newer one is clearly a smaller model, designed for the new Copilot integration.
Plums_Raider@reddit
What annoyed me the most about old Sonnet was the repetitive talking style. This feels better with the new Sonnet to me.
wakigatameth@reddit
That's funny because New Sonnet solved a nagging programming issue that neither GPT-4o nor Old Sonnet could solve, and I spent DAYS with them trying to solve it.
It involved rewriting my renderer from the DirectX9 sprite interface to a vertex-based renderer. The resulting image was always corrupt, with blurry sprites and seams, and nothing seemed to fix it.
Then new Claude came, and fixed it in one try. The blitting is now pixel-perfect.
estebansaa@reddit
New version is way better, but it needs more precise prompting.
ObnoxiouslyVivid@reddit (OP)
Do you have any examples of what works and what doesn't?
estebansaa@reddit
A better way to solve it would be for you to paste a prompt you used; then I can try to give you one that works.
Armym@reddit
Can I still use old Sonnet?
ObnoxiouslyVivid@reddit (OP)
Yes, it's guaranteed available until June 2025: Model Deprecations - Anthropic
no3ther@reddit
Coding is my core LLM use case, and this tracks with my personal experience.
Surprising since 10.22 beats 06.20 on SWE-bench.
Sus IMO.
ObnoxiouslyVivid@reddit (OP)
Is this a case of classic overfitting on the benchmarks?
Orolol@reddit
1022 also beats old Sonnet on the Aider leaderboard, LMArena, and LiveBench:
https://livebench.ai/
https://aider.chat/docs/leaderboards/
https://lmarena.ai/?leaderboard
By personal experience, new Sonnet is far better than old.
no3ther@reddit
Interesting. For coding? This is really not my experience.
What's your workflow like / what language do you use?
Zaratsu_Daddy@reddit
Anecdotally I find that it’s often impressive in ways that old sonnet wasn’t, and also often stupid in ways old sonnet wasn’t.
Purely vibe based
knvn8@reddit
I find that it's good at writing code but bad at reflecting on whether it's writing the right code
dhrumil-@reddit
Thank god someone brought this up. I realised this on the first day but was scared of being bombarded with downvotes by hype bros in the beginning lol
johnnyXcrane@reddit
Or maybe there are differences in use cases? One benchmark means nothing. For me the new Sonnet performs noticeably better, and I am not the only one: https://aider.chat/docs/leaderboards/
PlantFlat4056@reddit
You mean hype bots
AloHiWhat@reddit
It depends on the task; I do not know what they test. I found Sonnet super impressive, but ChatGPT also does well. Haiku made errors, not impressed.
Temsirolimus555@reddit
Hobby coder here, new Sonnet blows the old one out of the water for me. Then again, I am nowhere near doing this professionally, so take that for what it's worth.
Evening_Ad6637@reddit
Yes, your observation is consistent with the bench results and observations of other users. The new sonnet seems to score better for simple tasks. But I can also confirm that the new sonnet frustrated me quite often when it came to more difficult and complex tasks, which is why I returned to the old dude. What I have also noticed is that the new sonnet's answers are much more concise. This can be good or bad and depends on personal preferences, but also on the task itself.
Formal-Narwhal-1610@reddit
Bring back our Beloved Old Sonnet 3.5 and Apologise for the Haiku 3.5 price increase!
AcanthaceaeNo5503@reddit
Yeah, I use it a lot for coding and I can feel it actually performs worse. The good thing is that at least the speed has improved.
FaultInteresting3856@reddit
OpenAI coincidentally released a product called Swarm right around the o1 releases. The o1 models clearly have 'something' in them that is allowing them to straight up massacre every other model in existence. It is not CoT; other models employ that too. There is only one other logical conclusion as to what (x) could be in this instance.
SpecialistStory336@reddit
Why is this getting downvoted? Swarm agents are a real thing that OpenAI has been working on recently and is worth looking into.
Sudden-Lingonberry-8@reddit
because this is local llama and that ain't local
dydhaw@reddit
Swarm is "An educational framework exploring ergonomic, lightweight multi-agent orchestration.". The code is on github. o1 was custom trained, likely using RL. That's the secret sauce, not some generic agent pattern.
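For what it's worth, the whole "generic agent pattern" Swarm demonstrates fits in a few lines. Here is a minimal sketch of that handoff idea (hypothetical code, not OpenAI's actual implementation; the `Agent`/`run` names are made up): an agent is just a function, and control moves by returning the next agent.

```python
class Agent:
    """A named agent: handle(message) -> (reply, next_agent or None)."""
    def __init__(self, name, handle):
        self.name = name
        self.handle = handle

def code_help(msg):
    # Specialist agent: terminates the handoff chain with a final reply.
    return f"[coder] Here's a fix for: {msg}", None

def triage(msg):
    # Router agent: hands coding questions off to the specialist.
    if "code" in msg.lower():
        return "Routing to the coding agent.", coder
    return f"[triage] {msg}", None

coder = Agent("coder", code_help)
triage_agent = Agent("triage", triage)

def run(agent, msg, max_hops=5):
    """Drive the conversation, following handoffs until an agent answers."""
    for _ in range(max_hops):
        reply, nxt = agent.handle(msg)
        if nxt is None:
            return reply
        agent = nxt
    return reply

print(run(triage_agent, "My code crashes"))  # answered by the coding agent
```

Nothing here requires a specially trained model, which is the point: the orchestration layer is trivial compared to whatever RL went into o1 itself.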
FaultInteresting3856@reddit
People are sheep and 99% of people are too lazy to learn simple mathematics, yet a lot more than 1% of people want to discuss AI.
SpecialistStory336@reddit
Yep. I've been working with CoT recently and I definitely think OpenAI has some other stuff going on under the hood to make their models perform well. That is the entire reason they are not showing the CoT process. They want to gatekeep their lead, so it is important to crack the code and figure out how they are doing it so that we can implement it ourselves with local models.
FaultInteresting3856@reddit
I agree. That is why I open sourced literally everything related to swarm algorithms a few months before OpenAI was kind enough to open source 'Swarm' out of the sheer kindness of their hearts. I have also been looking into Entropix recently. I also open sourced literally everything I possibly could around High Dimensional Computing (HDC) and Transformers last month. I expect OpenAI to gift that to the world next month too.
SpecialistStory336@reddit
Can you post the links for this?
Master-Meal-77@reddit
Which is?
FaultInteresting3856@reddit
Swarm.Algorithms.
QD____@reddit
What is a swarm algorithm? As someone who has read hundreds of research papers in the past two years, this doesn't mean anything to me other than maybe being a nebulous marketing tactic to get VCs hyped for a funding round.
FaultInteresting3856@reddit
That is because swarm algorithms are 20 years old. It is exactly what it sounds like: a swarm of agents, like an ant swarm. They are kind of stupid; they do not have a brain. Then this modern technology came out in the last few years that is basically a brain and a mouth. Coincidentally, when you couple it with a couple thousand 'ants', it is like the queen. OpenAI literally released a product called Swarm, which you all just flat out ignore.
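For anyone asking what a classic swarm algorithm even is: particle swarm optimization (Kennedy & Eberhart, 1995) is the textbook example of the "many brainless agents" idea. A minimal sketch (the function and parameter names here are my own, not from any framework):

```python
import random

def pso(f, dim=2, n_particles=30, iters=200, bounds=(-5.0, 5.0)):
    """Minimize f over a box with classic particle swarm optimization."""
    lo, hi = bounds
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, and social weights
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # each particle's best-known position
    pbest_val = [f(p) for p in pos]
    gbest = pbest[min(range(n_particles), key=lambda i: pbest_val[i])][:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Each particle is pulled toward its own best and the swarm's best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < f(gbest):
                    gbest = pos[i][:]
    return gbest

random.seed(0)
# The swarm converges on the minimum of a sphere function at the origin
best = pso(lambda p: sum(x * x for x in p))
```

No individual particle knows anything; the optimization emerges from the swarm's shared best position. That is the 20-year-old pattern, and it is unrelated to how o1 was trained.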
dydhaw@reddit
That has absolutely nothing to do with what OpenAI's swarm is lol. You can actually read the code, you know. It's on github. Besides, OpenAI released some details on how they trained o1. We know they are using RL to train the model. They may be using agentic patterns but that's not really groundbreaking stuff on its own.
FaultInteresting3856@reddit
They released a research paper you know. I don't debate, 'yeah this isn't what you said..' Drop a link or GTFO.
dydhaw@reddit
Oh a research paper you say? You used their code? Wow, incredible. You must be a super genius OpenAI insider. Strange that you didn't read the first sentence of the o1 release post.
FaultInteresting3856@reddit
You must be a stalker or extremely dense to be this semantic and unscientific about this topic. I am going to assume it is the former and move on. " If I have to research who you are I sue you. That's my new motto."
dydhaw@reddit
I'm sorry if I've been rude, but I promise, I'm not stalking you, merely replying to this thread. It seems you're going through a lot right now. I don't know what kind of support network you have but talking to a close friend or family member or even a mental health professional can help when you're having intense thoughts and feelings. It might be helpful to step away from the screen for a bit. Sometimes a little rest or a change of environment can give things some clarity.
Master-Meal-77@reddit
You're not making any sense frankly
FaultInteresting3856@reddit
OpenAI released a product called Swarm, then they released o1. You don't even know what swarm algorithms do or how they work. So now you want to yap at me about not making sense about this. What don't you understand now?
Master-Meal-77@reddit
You must be having a bad day. Sorry buddy
FaultInteresting3856@reddit
So what you're saying is, this is academically clearly out of your league and you cannot comment on the topic at hand, but you're American so you need to give your opinion so everyone can hear it. 10-4! Do you!
Master-Meal-77@reddit
Roger that
Ill_Yam_9994@reddit
Wombo?
FaultInteresting3856@reddit
Yes.
_meaty_ochre_@reddit
I hate nuSonnet. It sends me emojis.
TheRealGentlefox@reddit
I like its personality a lot more, and I never see anyone else mentioning it.
It's funnier, more personable, and way less strict about the rules. I've told it about certain refusals by the old Sonnet and it made fun of it for them.
ObnoxiouslyVivid@reddit (OP)
New Sonnet 3.5 is performing better on easy problems and worse on medium ones
MemoryEmptyAgain@reddit
All my problems are easy problems I'm just too stupid (or don't have enough time) to solve, so I'm winning I guess...
OrangeESP32x99@reddit
I’ve got 99 problems, but Claude usage limits are definitely one.
zmanning@reddit
Yeah, we rolled back our upgrade after accuracy issues with the new version.