Two weeks after release, old Sonnet 3.5 still beats new Sonnet on LiveCodeBench
Posted by ObnoxiouslyVivid@reddit | LocalLLaMA | 36 comments
no3ther@reddit
Coding is my core LLM use case, and this tracks with my own experience.
Surprising since 10.22 beats 06.20 on SWE-bench.
Sus IMO.
ObnoxiouslyVivid@reddit (OP)
Is this a case of classic overfitting on the benchmarks?
Orolol@reddit
1022 also beats old Sonnet on the aider leaderboard, lmarena, and livebench
https://livebench.ai/
https://aider.chat/docs/leaderboards/
https://lmarena.ai/?leaderboard
In my personal experience, new Sonnet is far better than the old one.
FaultInteresting3856@reddit
OpenAI coincidentally released a product called Swarm right around the o1 releases. The o1 models clearly have 'something' in them that is allowing them to straight up massacre every other model in existence. It is not CoT, other models employ this too. There is only one other logical conclusion as to what (x) could be in this instance.
SpecialistStory336@reddit
Why is this getting downvoted? Swarm agents are a real thing that OpenAI has been working on recently and are worth looking into.
Sudden-Lingonberry-8@reddit
because this is local llama and that ain't local
dydhaw@reddit
Swarm is "An educational framework exploring ergonomic, lightweight multi-agent orchestration." The code is on github. o1 was custom trained, likely using RL. That's the secret sauce, not some generic agent pattern.
FaultInteresting3856@reddit
People are sheep and 99% of people are too lazy to learn simple mathematics, yet a lot more than 1% of people want to discuss AI.
SpecialistStory336@reddit
Yep. I've been working with CoT recently and I definitely think OpenAI has some other stuff going on beneath the hood to make their models perform well. That is the entire reason they are not showing the CoT process. They want to gatekeep their lead, so it is important to crack the code and figure out how they are doing it so that we can implement it ourselves with local models.
FaultInteresting3856@reddit
I agree. That is why I open sourced literally everything related to swarm algorithms a few months before OpenAI was kind enough to open source 'Swarm' out of the sheer kindness of their hearts. I have also been looking into Entropix recently. I also open sourced literally everything I possibly could around High Dimensional Computing (HDC) and Transformers last month. I expect OpenAI to gift that to the world next month too.
SpecialistStory336@reddit
Can you post the links for this?
Master-Meal-77@reddit
Which is?
FaultInteresting3856@reddit
Swarm.Algorithms.
QD____@reddit
What is a swarm algorithm? As someone who has read hundreds of research papers in the past 2 years, this doesn't mean anything to me other than maybe being a nebulous marketing tactic to get VCs hyped for a funding round.
FaultInteresting3856@reddit
That is because swarm algorithms are 20 years old. It is exactly what it sounds like, a swarm of agents. Like an ant swarm. They are kind of stupid, they do not have a brain. This modern technology came out in the last few years that is basically a brain and a mouth. Coincidentally, when you couple it with a couple thousand 'ants', it is like the queen. OpenAI literally released a product called Swarm, that you all just flat out ignore.
dydhaw@reddit
That has absolutely nothing to do with what OpenAI's swarm is lol. You can actually read the code, you know. It's on github. Besides, OpenAI released some details on how they trained o1. We know they are using RL to train the model. They may be using agentic patterns but that's not really groundbreaking stuff on its own.
FaultInteresting3856@reddit
They released a research paper you know. I don't debate, 'yeah this isn't what you said..' Drop a link or GTFO.
dydhaw@reddit
Oh a research paper you say? You used their code? Wow, incredible. You must be a super genius OpenAI insider. Strange that you didn't read the first sentence of the o1 release post.
FaultInteresting3856@reddit
You must be a stalker or extremely dense to be this semantic and unscientific about this topic. I am going to assume it is the former and move on. "If I have to research who you are I sue you. That's my new motto."
dydhaw@reddit
I'm sorry if I've been rude, but I promise, I'm not stalking you, merely replying to this thread. It seems you're going through a lot right now. I don't know what kind of support network you have but talking to a close friend or family member or even a mental health professional can help when you're having intense thoughts and feelings. It might be helpful to step away from the screen for a bit. Sometimes a little rest or a change of environment can give things some clarity.
Master-Meal-77@reddit
You're not making any sense frankly
FaultInteresting3856@reddit
OpenAI released a product called Swarm, then they released o1. You don't even know what swarm algorithms do or how they work. So now you want to yap at me about not making sense about this. What don't you understand now?
Master-Meal-77@reddit
You must be having a bad day. Sorry buddy
FaultInteresting3856@reddit
So what you're saying is, this is academically clearly out of your league and you cannot comment on the topic at hand, but you're American so you need to give your opinion so everyone can hear it. 10-4! Do you!
Master-Meal-77@reddit
Roger that
Ill_Yam_9994@reddit
Wombo?
FaultInteresting3856@reddit
Yes.
_meaty_ochre_@reddit
I hate nuSonnet. It sends me emojis.
Temsirolimus555@reddit
hobby coder here, new sonnet blows the old one out of the water for me. Then again, I am nowhere near doing this professionally so take that for what it's worth.
TheRealGentlefox@reddit
I like its personality a lot more, and I never see anyone else mentioning it.
It's funnier, more personable, and way less strict about the rules. I've told it about certain refusals by the old Sonnet and it made fun of it for them.
dhrumil-@reddit
Thank god someone brought this up, I realised this the first day but was scared of being bombarded by downvotes by hype bros in the beginning lol
ObnoxiouslyVivid@reddit (OP)
New Sonnet 3.5 is performing better on easy problems and worse on medium ones
MemoryEmptyAgain@reddit
All my problems are easy problems that I'm just too stupid (or don't have enough time) to solve, so I'm winning I guess...
OrangeESP32x99@reddit
I’ve got 99 problems, but Claude usage limits are definitely one.
zmanning@reddit
Yeah, we rolled back our upgrade after accuracy issues with the new version
Zaratsu_Daddy@reddit
Anecdotally I find that it’s often impressive in ways that old sonnet wasn’t, and also often stupid in ways old sonnet wasn’t.
Purely vibe-based