LiveBench team just dropped a leaderboard for coding agent tools
Posted by ihexx@reddit | LocalLLaMA | View on Reddit | 58 comments

Comfortable-Gate5693@reddit
https://liveswebench.ai/
Complex_Cockroach699@reddit
The advantage I see of Cursor is that you pay $20 USD and get access to 6 models; with the others you have to get an API key yourself.
But let's say I want to use SWE-agent or OpenHands. Considering how fast the leaderboards change, do I go for Claude? Do I go for GPT-4? Does anyone know what the best combo is these days?
g0pherman@reddit
Now I'm curious about testing swe agent
mp3m4k3r@reddit
I'm interested to see what model was used with it
MichalDobak@reddit
If GitHub Copilot is truly that awesome at coding, it can finally use its agent to release a version for the JetBrains IDE ;)
Dr_Karminski@reddit
I'm curious about the 'Agent Solved' test item, where three agents all received a score of 43.40%.
Does this indicate insufficient differentiation? For example, based on the 43.40% score, I speculate that there might be a total of 53 test items, with 23 completed. 23/53 = 0.4339, which is approximately 43.4%.
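That back-of-envelope guess can be brute-forced; a minimal sketch (the 53-task total is the commenter's speculation, not a published figure from the leaderboard):

```python
# Brute-force check: which solved/total counts (total < 100) produce a
# score that rounds to exactly 43.40%? Only one candidate turns up.
matches = [
    (solved, total)
    for total in range(1, 100)
    for solved in range(total + 1)
    if abs(solved / total * 100 - 43.40) < 0.005
]
print(matches)  # [(23, 53)]
```

So 23/53 is indeed the only solved/total pair under 100 tasks consistent with a displayed score of 43.40%, which supports the speculation.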
fiery_prometheus@reddit
Would be useful to have a column which shows total tokens generated, so we can see how long it needs to reason and how costly it is.
kohlerm@reddit
Speed is also important but might correlate well with tokens used
fiery_prometheus@reddit
For the same reason I didn't want the actual cost in dollars: speed is a wildly variable metric, highly dependent on the hardware you run everything on, and prices change for many reasons as well. I want to measure the agent tools themselves, so that the measurement varies only with different software versions of the tools; tokens generated can then be used to calculate cost at a point in time and give a ballpark figure for how much reasoning a model needs.
Also, I would want them to sample runs as well and give me some statistics, like how likely a model is to solve a problem within the first 10 percent of the overall tries, but the world ain't perfect. Wacky variations in this from one benchmark run to another would also clearly show if someone changed their model, or if the benchmark itself got updated or experienced a regression.
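The "solve within the first X% of tries" statistic the commenter wants is essentially the pass@k estimator used in code-generation evaluation; a sketch with made-up numbers (20 runs, 8 successes are hypothetical, not from this leaderboard):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n total runs, c of them
    successful, solves the task."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 20 sampled runs per task, 8 succeeded; "first 10% of
# tries" means k = 2 of the 20 runs.
print(f"{pass_at_k(20, 8, 2):.3f}")  # → 0.653
```

Publishing n, c, and a couple of pass@k points per tool would give exactly the run-to-run variation signal described above.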
vibjelo@reddit
Costly in terms of watts used? Sir, this is r/LocalLLaMA, we run LLMs on our own computers.
yur_mom@reddit
Unfortunately, the context window size and processing power required make running local impossible for 99% of us... the 8B LLMs just don't come close to something like Sonnet 3.7, or even, say, DeepSeek V3. I wish I had the hardware to run DeepSeek V3 locally. My company is considering investing, though, but on a personal level it isn't practical for most of us.
WhereIsYourMind@reddit
I don’t know what you consider practical, but I’ve been running deepseek-v3-0324 using unsloth’s UD-IQ2_XSS with 80k context on my 512GB Mac Studio which costs about $10k.
People have complained about prompt processing, but I get 37 t/s pp. The 80k context window lets me provide over 100 code snippets with 400 tok chunk size via RAG, which greatly improves code quality when using an open source library or making changes to a large project.
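The token budget in that setup checks out; a quick sanity sketch (numbers taken from the comment above, with the leftover split being an assumption):

```python
# Context-budget check for the described RAG setup: 100 snippets of
# 400 tokens each inside an 80k context window.
context_window = 80_000
snippets = 100
chunk_size = 400  # tokens per RAG chunk

rag_tokens = snippets * chunk_size       # 40,000 tokens of retrieved code
remaining = context_window - rag_tokens  # 40,000 left for prompt + output
print(rag_tokens, remaining)  # → 40000 40000
```

So "over 100 code snippets" fits with roughly half the window still free for the actual prompt and generation.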
yur_mom@reddit
Honestly, I was very close to buying the 512GB Mac Studio, but $10k is still a lot of money, and it would still be inferior to the API LLMs we can use remotely.
GTHell@reddit
You run 3.7 local on Github Copilot?
vibjelo@reddit
3.7 what? Not sure what you're talking about, obviously I don't, and if I did, I wouldn't bring it up here anyways.
yur_mom@reddit
The test results were done using Sonnet 3.7, so it seems like a good reason to bring it up in a thread about benchmarks that use it.
GTHell@reddit
smh 🤦‍♂️
taylorwilsdon@reddit
Tbh that’s an even more relevant benchmark figure when running locally because it’ll tell you how long it’ll take (even if you don’t care about energy usage)
vibjelo@reddit
Yeah, more numbers are always welcome, agree. The whole ecosystem would be better off from benchmark results being more extensive with statistics attached to them.
ahmetegesel@reddit
I came to say this as well. Cline and Cursor are token eaters. We need a ratio based on token usage as well.
lordpuddingcup@reddit
Same with Roo. It's not the models either; I think the apps themselves need to do better at rolling up context for things like failed code edits, so half the context isn't eaten by repeatedly failing to run a tool and then finally swapping to a different one. All those failures should get cleaned up in context on the backend.
MoffKalast@reddit
Cursor: "Ok ima fight!"
Cursor: "Damn, Openhands got hands."
frankh07@reddit
Honestly I feel Claude is better than Copilot. Is it because they used a pre-release version?
HNipps@reddit
Where’s Zencoder?
krileon@reddit
Those are some pretty terrible numbers if I'm being honest. So 50% of the time, or worse, it just... fails. It's gambling. Code gambling, lol.
xrailgun@reddit
You're assuming that randomly written code has a 50% chance of working as intended? Not all probabilities are accurately modelled by coin flips.
Cromzinc@reddit
Good point. I feel uncomfortable when agent code fails, though. With something I wrote, I typically know exactly what happened or where to go when I see the error code.
GTHell@reddit
Github Copilot above Aider lol
xrailgun@reddit
Aider's a weird one. Came out so strong, then for the past year has been running around in circles, refusing to improve or fix issues users care about. So many commits, so little done.
golden_monkey_and_oj@reddit
I have yet to use Aider.
My (apparently flawed) understanding is that it's a kind of framework that allows an LLM of your choosing to operate on and modify the files in your project structure.
How is it that Aider can perform so poorly on this benchmark, if the logic it is using comes from an external LLM?
GTHell@reddit
It used to be, in the last 2 months. Not anymore. I give it a try every time they update something, and I can say it's now very usable for vibe coding with Aider with R1 and V3 0324. Not to mention using /copy-context to paste the instructions into Gemini 2.5, then /paste the Gemini response back and let V3 finish it. It is way too powerful now. I expect that by the end of 2025 both the LLMs and Aider will be able to reach 90%+ on the leaderboard.
ctrl-brk@reddit
Check out https://github.com/qemqemqem/aider-advanced
hand___banana@reddit
They used Insiders w/ the pre-release, so it's the agentic version of Copilot, not the normal one most are accustomed to using.
davewolfs@reddit
The fact that they have QwQ as a top model on their board, when it's ranked below 30% on Aider's benchmark, makes me question their entire benchmark.
AriyaSavaka@reddit
Why is Aider so low? It's the best one in my experience with real-world gigantic enterprise monorepos.
cant-find-user-name@reddit
It's good that Copilot is close to Cursor here. Cursor needs a wake-up call, because their past few updates have been horrendous.
__Maximum__@reddit
Can someone explain to me why people use Cursor? Why not use VSCode with a plugin like Cline or Continue? Or why not Aider or OpenHands?
evia89@reddit
Cursor is just $20. Cline can be $10 in an hour.
evia89@reddit
Roo is the same. Claude Code is 1.5-2x more expensive. Augment Code has nice search but isn't so good for generating new code.
__Maximum__@reddit
Alright, cline has a long prompt. How about others?
cant-find-user-name@reddit
I have used cline / aider and still use them. Cursor's UX is far better.
estebansaa@reddit
Anyone that used both, Is Cursor really that much better than Claude Code?
Utoko@reddit
Good start, but Cline is a big one missing here. It would also be great to have it with different LLMs.
How does it change when running with Gemini 2.5 Pro?
ladz@reddit
This post is totally all about local LLMs.
mr_no_one3@reddit
Personally I am using Copilot, and it is amazing.
if47@reddit
It's not about which tool is superior, but which one isn't as terrible as the rest.
pier4r@reddit
Interesting that, despite the Reddit take that "having a wrapper around the same API gives no advantage", the wrapper actually needs to be good as well. Every agent had Claude 3.7 as its LLM and still the difference was remarkable.
What I find funny is that Claude Code is not at the top (one would imagine that the Claude authors would have the best moat)
DefNattyBoii@reddit
The repo this points to is missing from GitHub: https://github.com/Princeton-SysML/SWE-agent
Hopefully they'll add it later.
https://liveswebench.ai
frivolousfidget@reddit
Yeah, the OpenHands one is also broken.
ResidentPositive4122@reddit
Cool, seems like there are some readily available gains in agentic frameworks (39 vs 47) even when using the same model (all were tested w/ Claude 3.7).
Curious to see how the open models stack up, and whether the gains are the same just from the cradle used. OH recently released a fine-tuned 32B (from Q2.5-Coder) that apparently works great w/ their agentic framework.
frivolousfidget@reddit
Would be really nice to see an open-model agent competition.
I think OpenHands and Cline both have models trained for them. The others would probably be way behind.
In my experience the non-finetuned open models have a really hard time using the correct tools over a long run, which is why the finetune makes a huge difference (and why the open models aren't as bad in other contexts like chat).
nikzart@reddit
Where’s cline?
xAragon_@reddit
Seems like someone opened a GitHub issue:
https://github.com/LiveBench/liveswebench/issues/2
Effective_Degree2225@reddit
I tried Roo a couple of times and it was too weird even for a simple task like writing a unit test.
Kooky-Somewhere-2883@reddit
cline dude
xqoe@reddit
What
Reddit_Bot9999@reddit
Bro, what? SWE-agent beats them all?
ihexx@reddit (OP)
https://liveswebench.ai/