What leaderboard do you trust for ranking LLMs in coding tasks?
Posted by rageagainistjg@reddit | LocalLLaMA | 43 comments
I came across this one: https://aider.chat/docs/leaderboards/, but I have no idea how often it’s updated or how reliable it is.
Is there a "go-to" leaderboard that people trust for coding rankings? Or even something that also includes creativity, like image generation, alongside coding? I’m curious if there’s a gold standard that a lot of people on Reddit seem to agree on.
SomeOddCodeGuy@reddit
I keep a notepad of various LLM benchmarks that I'm constantly peeking at. Here are the ones I look at for coding:
nidhishs@reddit
Thanks for the shout-out for ProLLM! We aim to update the leaderboard within 48 hours of popular model releases. I'm curious to know which model we kept you waiting for 😄.
sahil1572@reddit
Dubesor's is useless
Ralph_mao@reddit
ProLLM is using GPT-4 as the judge. Will this give OpenAI models an advantage, since it may prefer models trained on similar data/preferences?
Affectionate-Hat-536@reddit
Can you keep it on GitHub? People could star and watch it, or contribute, etc.
DontBuyMeGoldGiveBTC@reddit
Hands up, this is a robbery, give me your text file
SomeOddCodeGuy@reddit
So, I tried to respond with it but unfortunately you-know-what ate the comment; likely too many links at once. If you go to my profile on the old reddit, you can see it.
DontBuyMeGoldGiveBTC@reddit
I see it on RIF. Thanks very much.
swapripper@reddit
Nice! Curious about other LLM benchmarks.
MusicTait@reddit
+1 for aider and livebench.
IMHO there is a big problem out there with leaderboards: it's hard to find trustworthy ones that use a good methodology.
knvn8@reddit
Aider is more consistent with my own experience, even more so than Arena.
SomeOddCodeGuy@reddit
I like getting a mix, because one issue with Aider is that it misses some of the contextual weaknesses of models. Models like Qwen 32B Coder are fantastic at solving pure code issues, but may struggle with talking through or resolving coding problems described in plain language. In case other leaderboards do capture that, I enjoy looking at several to get a better idea of what's going on.
thomash@reddit
I believe the Aider leaderboard is good because the person behind it is very active and involved in the open-source community. They usually post new benchmark results the day after a model is released.
kohlerm@reddit
I think this is a good point. It depends on what features your IDE needs. For example, Aider uses a special diff format for outputting code, while other tools might use a JSON format. Also, if your IDE supports adding documentation for libraries, missing knowledge might not be a big problem.
gopietz@reddit
I think this is a terrible point.
I love Aider, respect the dev and use the benchmark as one of my go-to rankings too, but trusting a benchmark because you like a dev and their work isn't a good argument.
I like the benchmark because the testing methodology is very close to my use cases and because based on the models I worked with, the results roughly reflect my experience too.
kohlerm@reddit
Not sure why this is a "terrible" point. You admit you like to use Aider, which is exactly my point: if it works well with Aider, then that is your benchmark. But there are other tools out there which work differently. I've looked at the source code of several tools, and I'm also prototyping my own. Aider's approach requires the LLM to understand one of its diff formats; other tools don't use the same diff format, or in a lot of cases don't use a diff format at all. The results might therefore vary quite a bit depending on the tool used.
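To make that concrete: the SEARCH/REPLACE block below matches what Aider documents as its "diff" edit format, while the JSON line is only a made-up illustration of the kind of structured edit some other tool might request instead. A model that reliably emits one can still stumble on the other, so the same model can rank differently depending on which format a benchmark demands.

```
# Aider-style "diff" edit: file path, then a SEARCH/REPLACE block
greeting.py
<<<<<<< SEARCH
print("hello")
=======
print("hello, world")
>>>>>>> REPLACE

# Hypothetical JSON edit format another tool might expect instead
{"file": "greeting.py", "search": "print(\"hello\")", "replace": "print(\"hello, world\")"}
```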
takuonline@reddit
Anthropic's CEO said he likes SWE-bench.
LoafyLemon@reddit
Is that an attempt at discrediting SWE? Because it's working... xD
MarceloTT@reddit
To get a better sense of coding ability, I maintain a proprietary base of code challenges: many complex problems that use language tricks, analogies, code with intentionally planted errors, etc. On my base, no LLM has scored more than 40%; the best performer was o1, at 37%. This happens, at least in my tests, as the need for context increases or the language becomes vaguer and more nebulous. But I found o1 impressive. With more fine-tuning and training, and costs at least 10 times lower, perhaps o1 or other language models that use RL at runtime can achieve even better performance. For now, and for my specific use cases, I can't get the results I need from LLMs because I'm still cheaper. But I am hopeful that by the end of 2025 I will be able to reach performance close to 80%, at costs 10 times lower. Then I can start using it in more situations. I waited 10 years; I can wait one more.
Jesus359@reddit
Dang! 10 years? Teach me! Lol.
MarceloTT@reddit
My master's thesis was on LSTMs; oddly enough, these already existed in incipient form back in 2014. But they didn't work for many things.
Jesus359@reddit
That is super interesting. That covers my question about how speech and video synthesis came to be. Can I read your thesis? If not, are there things you'd recommend to learn more?
Objective-Rub-9085@reddit
Hey, what's in your coding base? Can you tell us more?
MarceloTT@reddit
Yes, I program security and encryption systems for the payments industry and work on the development of distributed data-communication security systems. Additionally, I study LLMs for detecting configurations or attack patterns on decentralized trading systems. That's a summary of what I currently do.
DickMasterGeneral@reddit
Have you tried the new DeepSeek reasoning model yet? I'd love to know what kind of score it gets on private benchmarks.
MarceloTT@reddit
DeepSeek, with recursive parameter tuning in RL over processing time, across 1,000 steps, got me around 26%, but I think I can improve that with new techniques I'm perfecting; my honest guess is that I can reach about 29%. I know it's not what you asked, but in 2022, with the LLMs of the time, I couldn't get anything that made sense, so I consider what we've achieved so far an impressive leap forward. If it continues like this, in 2025 I can consider LLMs for professional use. Of course, I don't make simple apps; I create and maintain systems, and the problem is that the more steps you have in a logical chain of actions, the more degraded the accuracy of a coding agent becomes. A very complex system can therefore have its code completely ruined by an LLM. Let's see how the big labs will fix this.
Emotional-Metal4879@reddit
MMLU-PRO
bias_guy412@reddit
Why? Any specific reason?
DinoAmino@reddit
Mob rules?
SnooPeanuts1152@reddit
I feel everyone has their own needs, so leaderboards only tell one side of the story. For example, you can have a leaderboard of which LLM best solves coding challenges, but that isn't practical. Prompting is another factor: your results can differ a lot based on what you feed the model.
DinoAmino@reddit
💯 I don't trust benchmarks anymore. I need to test-drive a model myself on my own real-world problems. Like, everyone holds MMLU-PRO in high regard, and it is a good measure for general usage. But I don't ask biology or history questions in my daily use, so I don't care too much about it.
LoafyLemon@reddit
None. I perform my own tests.
sammcj@reddit
Aider
gabe_dos_santos@reddit
None. I test the models myself.
Whotea@reddit
Everyone knows the best tests have a sample size of n=3.
randombsname1@reddit
Livebench and aider are the best, imo.
3-4pm@reddit
My 2017 laptop
AcanthaceaeNo5503@reddit
I actually prefer LiveBench and SWE-bench. Aider is LeetCode-like: old, fixed problems that are easy to contaminate. SWE-bench is hard and agent-oriented, but it's the best for real-world scenarios.
LiveBench uses well-crafted, newly added exercises, and they make updates every month. High-quality data and no contamination => my go-to benchmark.
AcanthaceaeNo5503@reddit
If you use an LLM to solve coding-interview problems, the Aider benchmark can help. But if you research what it actually tests, I don't think it reflects the coding skills needed for day-to-day tasks.
magnetesk@reddit
I like to see whatever is top on OpenRouter.
AaronFeng47@reddit
https://livecodebench.github.io/leaderboard.html
Nervous_Video_6364@reddit
artificialanalysis is pretty solid
matfat55@reddit
Not really; it's just stats anyone can see, consolidated. It doesn't really measure LLM output, just a broad "quality" score, and it isn't coding-specific either.