What leaderboard do you trust for ranking LLMs in coding tasks?
Posted by rageagainistjg@reddit | LocalLLaMA | 43 comments
I came across this one: https://aider.chat/docs/leaderboards/, but I have no idea how often it’s updated or how reliable it is.
Is there a "go-to" leaderboard that people trust for coding rankings? Or even something that also includes creativity, like image generation, alongside coding? I’m curious if there’s a gold standard that a lot of people on Reddit seem to agree on.
SomeOddCodeGuy@reddit
I keep a notepad of various LLM benchmarks that I'm constantly peeking at. Here are the ones I look at for coding:
nidhishs@reddit
Thanks for the shout-out for ProLLM! We aim to update the leaderboard within 48 hours of popular model releases. I'm curious to know which model we kept you waiting for 😄.
sahil1572@reddit
Dubesor's is useless
Ralph_mao@reddit
ProLLM is using GPT-4 as the judge. Will this give OpenAI models an advantage, since it may prefer models trained on similar data/preferences?
Affectionate-Hat-536@reddit
Can you keep it on GitHub? People could star and watch it, or contribute, etc.
DontBuyMeGoldGiveBTC@reddit
Hands up, this is a robbery, give me your text file
SomeOddCodeGuy@reddit
So, I tried to respond with it but unfortunately you-know-what ate the comment; likely too many links at once. If you go to my profile on the old reddit, you can see it.
DontBuyMeGoldGiveBTC@reddit
I see it on RIF. Thanks very much.
swapripper@reddit
Nice! Curious about other LLM benchmarks.
MusicTait@reddit
+1 for aider and livebench.
IMHO there is a big problem out there with leaderboards: it's hard to find trustworthy ones that use a good methodology.
knvn8@reddit
Aider is more consistent with my own experience, even more so than Arena.
SomeOddCodeGuy@reddit
I like getting a mix, because one issue with Aider is that it misses some of the contextual weaknesses of models. Models like Qwen 32B Coder are fantastic at solving pure code issues, but may struggle with talking through or resolving coding problems described in plain language. In case other leaderboards do capture that, I enjoy looking at several to get a better idea of what's going on.
thomash@reddit
I believe the Aider leaderboard is good because the person behind it is very active and involved in the open-source community. They usually post new benchmark results the day after a model is released.
kohlerm@reddit
I think this is a good point. It depends on what features your IDE needs. For example, Aider uses a special diff format for outputting code, while other tools might use a JSON format. Also, if your IDE supports adding documentation for libraries, missing knowledge might not be a big problem.
gopietz@reddit
I think this is a terrible point.
I love Aider, respect the dev and use the benchmark as one of my go-to rankings too, but trusting a benchmark because you like a dev and their work isn't a good argument.
I like the benchmark because the testing methodology is very close to my use cases and because based on the models I worked with, the results roughly reflect my experience too.
kohlerm@reddit
Not sure why this is a "terrible" point. You admit you like to use Aider, which is exactly my point: if it works well with Aider, then that is your benchmark. But there are other tools out there which work differently. I've looked at the source code of several tools, and I'm also prototyping my own. Aider's approach requires the LLM to understand one of its diff formats; other tools don't use the same diff format, or in a lot of cases don't use a diff format at all. The results might therefore vary quite a bit depending on the tool used.
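To make that concrete: the SEARCH/REPLACE block below matches what Aider documents as its "diff" edit format, while the JSON line is only a made-up illustration of the kind of structured edit some other tool might request instead. A model that reliably emits one can still stumble on the other, so the same model can rank differently depending on which format a benchmark demands.

```
# Aider-style "diff" edit: file path, then a SEARCH/REPLACE block
greeting.py
<<<<<<< SEARCH
print("hello")
=======
print("hello, world")
>>>>>>> REPLACE

# Hypothetical JSON edit format another tool might expect instead
{"file": "greeting.py", "search": "print(\"hello\")", "replace": "print(\"hello, world\")"}
```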
takuonline@reddit
Anthropic's CEO said he likes SWE-bench.
LoafyLemon@reddit
Is that an attempt at discrediting SWE? Because it's working... xD
MarceloTT@reddit
To get a better sense of coding ability, I maintain a proprietary base of code challenges: many complex problems that use language tricks, analogies, code with intentionally planted errors, etc. On my base, no LLM has scored more than 40%; the best performer was o1, at 37%. This happens, at least in my tests, as the need for context increases or the language becomes vaguer and more nebulous. But I found o1 impressive. With more fine-tuning and training, and costs at least 10 times lower, perhaps o1 or other language models that use RL at runtime can achieve even better performance. For now, and for my specific use cases, I can't get the results I need from LLMs because I'm still cheaper. But I am hopeful that by the end of 2025 I will be able to reach performance close to 80%, at costs 10 times lower. Then I can start using it in more situations. I waited 10 years; I can wait one more.
Jesus359@reddit
Dang! 10 years? Teach me! Lol.
MarceloTT@reddit
My master's thesis was on LSTMs; oddly enough, these already existed in incipient form back in 2014. But they didn't work for many things.
Jesus359@reddit
That is super interesting. That covers my question about how speech and video synthesis came to be. Can I read your thesis? If not, are there things you'd recommend to learn more?
Objective-Rub-9085@reddit
Hey, what's in your coding base? Can you tell us more?
MarceloTT@reddit
Yes, I program security and encryption systems for the payments industry and work on the development of distributed data-communication security systems. Additionally, I study LLMs for detecting configurations or attack patterns on decentralized trading systems. That's a summary of what I currently do.
DickMasterGeneral@reddit
Have you tried the new DeepSeek reasoning model yet? I'd love to know what kind of score it gets on private benchmarks.
MarceloTT@reddit
DeepSeek, with recursive parameter tuning in RL over processing time, across 1,000 steps, got me around 26%, but I think I can improve that with new techniques I'm perfecting; my honest guess is that I can reach about 29%. I know it's not what you asked, but in 2022, with the LLMs of the time, I couldn't get anything that made sense, so I consider what we've achieved so far an impressive leap forward. If it continues like this, in 2025 I can consider LLMs for professional use. Of course, I don't make simple apps; I create and maintain systems, and the problem is that the more steps you have in a logical chain of actions, the more degraded the accuracy of a coding agent becomes. A very complex system can therefore have its code completely ruined by an LLM. Let's see how the big labs will fix this.
Emotional-Metal4879@reddit
MMLU-PRO
bias_guy412@reddit
Why? Any specific reason?
DinoAmino@reddit
Mob rules?
SnooPeanuts1152@reddit
I feel everyone has their own needs, so leaderboards only tell one side of the story. For example, you can have a leaderboard of which LLM best solves coding challenges, but that isn't practical. Prompting is another factor: your results can differ a lot based on what you feed the model.
DinoAmino@reddit
💯 I don't trust benchmarks anymore. I need to test-drive a model myself on my own real-world problems. Like, everyone holds MMLU-PRO in high regard, and it is a good measure for general usage. But I don't ask biology or history questions in my daily use, so I don't care too much about it.
LoafyLemon@reddit
None. I perform my own tests.
sammcj@reddit
Aider
gabe_dos_santos@reddit
None. I test the models myself.
Whotea@reddit
Everyone knows the best tests have a sample size of n=3.
randombsname1@reddit
Livebench and aider are the best, imo.
3-4pm@reddit
My 2017 laptop
AcanthaceaeNo5503@reddit
I actually prefer LiveBench and SWE-bench. Aider is LeetCode-like: old, fixed problems that are easy to contaminate. SWE-bench is hard and agent-oriented, but it's the best for real-world scenarios.
LiveBench uses well-crafted, newly added exercises, and they make updates every month. High-quality data and no contamination => my go-to benchmark.
AcanthaceaeNo5503@reddit
If you use an LLM to solve coding-interview problems, the Aider benchmark can help. But if you research what it actually tests, I don't think it reflects the coding skills needed for day-to-day tasks.
magnetesk@reddit
I like to see whatever is top on OpenRouter.
AaronFeng47@reddit
https://livecodebench.github.io/leaderboard.html
Nervous_Video_6364@reddit
artificialanalysis is pretty solid
matfat55@reddit
Not really; it's just stats anyone can see, consolidated. It doesn't really measure LLM output, just a broad "quality" score, and it isn't coding-specific either.