LiveBench team just dropped a leaderboard for coding agent tools
Posted by ihexx@reddit | LocalLLaMA | View on Reddit | 58 comments

Comfortable-Gate5693@reddit
https://liveswebench.ai/
Complex_Cockroach699@reddit
The advantage I see of Cursor is that you pay $20 USD and get access to 6 models; with the others you have to get an API key yourself.
But let's say I want to use SWE-agent or OpenHands. Considering how fast the leaderboards change, do I go for Claude? Do I go for GPT-4? Does anyone know what the best combo is these days?
g0pherman@reddit
Now I'm curious about testing swe agent
mp3m4k3r@reddit
I'm interested to see what model was used with it
MichalDobak@reddit
If GitHub Copilot is truly that awesome at coding, it can finally use its agent to release a version for the JetBrains IDE ;)
Dr_Karminski@reddit
I'm curious about the 'Agent Solved' test item, where three agents all received a score of 43.40%.
Does this indicate insufficient differentiation? For example, based on the 43.40% score, I speculate that there might be a total of 53 test items, with 23 completed. 23/53 = 0.4339, which is approximately 43.4%.
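That back-of-envelope guess can be brute-forced; a minimal sketch (the 53-task total is the commenter's speculation, not a published figure from the leaderboard):

```python
# Brute-force check: which solved/total counts (total < 100) produce a
# score that rounds to exactly 43.40%? Only one candidate turns up.
matches = [
    (solved, total)
    for total in range(1, 100)
    for solved in range(total + 1)
    if abs(solved / total * 100 - 43.40) < 0.005
]
print(matches)  # [(23, 53)]
```

So 23/53 is indeed the only solved/total pair under 100 tasks consistent with a displayed score of 43.40%, which supports the speculation.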
fiery_prometheus@reddit
Would be useful to have a column which shows total tokens generated, so we can see how long it needs to reason and how costly it is.
kohlerm@reddit
Speed is also important but might correlate well with tokens used
fiery_prometheus@reddit
For the same reason I didn't want the actual cost in dollars: speed is a wildly variable metric, highly dependent on the hardware you run everything on, and prices change for many reasons as well. I want to measure the agent tools themselves, so that the measurement varies only with different software versions of the tools; tokens generated can then be used to calculate cost at a point in time and give a ballpark figure for how much reasoning a model needs.
Also, I would want them to sample runs as well and give me some statistics, like how likely a model is to solve a problem within the first 10 percent of the overall tries, but the world ain't perfect. Wacky variations in this from one benchmark run to another would also clearly show if someone changed their model, or if the benchmark itself got updated or experienced a regression.
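The "solve within the first X% of tries" statistic the commenter wants is essentially the pass@k estimator used in code-generation evaluation; a sketch with made-up numbers (20 runs, 8 successes are hypothetical, not from this leaderboard):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n total runs, c of them
    successful, solves the task."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 20 sampled runs per task, 8 succeeded; "first 10% of
# tries" means k = 2 of the 20 runs.
print(f"{pass_at_k(20, 8, 2):.3f}")  # → 0.653
```

Publishing n, c, and a couple of pass@k points per tool would give exactly the run-to-run variation signal described above.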
vibjelo@reddit
Costly in terms of watts used? Sir, this is r/LocalLLaMA, we run LLMs on our own computers.
yur_mom@reddit
Unfortunately, the context window size and processing power required make running local impossible for 99% of us... the 8B LLMs just don't come close to something like Sonnet 3.7, or even, say, DeepSeek V3. I wish I had the hardware to run DeepSeek V3 locally. My company is considering investing, though, but on a personal level it isn't practical for most of us.
WhereIsYourMind@reddit
I don’t know what you consider practical, but I’ve been running deepseek-v3-0324 using unsloth’s UD-IQ2_XSS with 80k context on my 512GB Mac Studio which costs about $10k.
People have complained about prompt processing, but I get 37 t/s pp. The 80k context window lets me provide over 100 code snippets with 400 tok chunk size via RAG, which greatly improves code quality when using an open source library or making changes to a large project.
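The token budget in that setup checks out; a quick sanity sketch (numbers taken from the comment above, with the leftover split being an assumption):

```python
# Context-budget check for the described RAG setup: 100 snippets of
# 400 tokens each inside an 80k context window.
context_window = 80_000
snippets = 100
chunk_size = 400  # tokens per RAG chunk

rag_tokens = snippets * chunk_size       # 40,000 tokens of retrieved code
remaining = context_window - rag_tokens  # 40,000 left for prompt + output
print(rag_tokens, remaining)  # → 40000 40000
```

So "over 100 code snippets" fits with roughly half the window still free for the actual prompt and generation.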
yur_mom@reddit
Honestly, I was very close to buying the 512GB Mac Studio, but $10k is still a lot of money, and it would still be inferior to the API LLMs we can use remotely.
GTHell@reddit
You run 3.7 local on Github Copilot?
vibjelo@reddit
3.7 what? Not sure what you're talking about, obviously I don't, and if I did, I wouldn't bring it up here anyways.
yur_mom@reddit
The test results were done using Sonnet 3.7, so it seems like a good reason to bring it up in a thread about benchmarks that use it.
GTHell@reddit
smh 🤦‍♂️
taylorwilsdon@reddit
Tbh that’s an even more relevant benchmark figure when running locally because it’ll tell you how long it’ll take (even if you don’t care about energy usage)
vibjelo@reddit
Yeah, more numbers are always welcome, agree. The whole ecosystem would be better off from benchmark results being more extensive with statistics attached to them.
ahmetegesel@reddit
I came to say this as well. Cline and Cursor are token eaters. We need a ratio based on token usage as well.
lordpuddingcup@reddit
Same with Roo. It's not the models either; I think the apps themselves need to do better at rolling up context for things like failed code edits, so half the context isn't eaten by repeatedly failing to run a tool and then finally swapping to a different one. All those failures should get cleaned up in context on the backend.
MoffKalast@reddit
Cursor: "Ok ima fight!"
Cursor: "Damn, Openhands got hands."
frankh07@reddit
Honestly I feel Claude is better than Copilot. Is it because they used a pre-release version?
HNipps@reddit
Where’s Zencoder?
krileon@reddit
Those are some pretty terrible numbers if I'm being honest. So 50% of the time, or worse, it just... fails. It's gambling. Code gambling, lol.
xrailgun@reddit
You're assuming that randomly written code has a 50% chance of working as intended? Not all probabilities are accurately modelled by coin flips.
Cromzinc@reddit
Good point. I feel uncomfortable when agent code fails, though. With something I wrote, I typically know exactly what happened or where to go when I see the error code.
GTHell@reddit
Github Copilot above Aider lol
xrailgun@reddit
Aider's a weird one. Came out so strong, then for the past year has been running around in circles, refusing to improve or fix issues users care about. So many commits, so little done.
golden_monkey_and_oj@reddit
I have yet to use Aider.
My (apparently flawed) understanding is that it's a kind of framework that allows an LLM of your choosing to operate on and modify the files in your project structure.
How is it that Aider can perform so poorly on this benchmark, if the logic it is using comes from an external LLM?
GTHell@reddit
It used to be, in the last 2 months. Not anymore. I give it a try every time they update something, and I can say it's now very usable for vibe coding with Aider with R1 and V3 0324. Not to mention using /copy-context to paste the instructions into Gemini 2.5, then /paste the Gemini response back and let V3 finish it. It is way too powerful now. I expect that by the end of 2025 both the LLMs and Aider will be able to reach 90%+ on the leaderboard.
ctrl-brk@reddit
Check out https://github.com/qemqemqem/aider-advanced
hand___banana@reddit
They used Insiders w/ the pre-release, so it's the agentic version of Copilot, not the normal one most are accustomed to using.
davewolfs@reddit
The fact that they have QwQ as a top model on their board, when it's ranked below 30% on Aider's benchmark, makes me question their entire benchmark.
AriyaSavaka@reddit
Why is Aider so low? It's the best one in my experience with real-world gigantic enterprise monorepos.
cant-find-user-name@reddit
It's good that Copilot is close to Cursor here. Cursor needs a wake-up call, because their past few updates have been horrendous.
__Maximum__@reddit
Can someone explain to me why people use Cursor? Why not use VSCode with a plugin like Cline or Continue? Or why not Aider or OpenHands?
evia89@reddit
Cursor is just $20. Cline can be $10 in an hour.
evia89@reddit
Roo is the same. Claude Code is 1.5-2x more expensive. Augment Code has nice search but isn't so good for generating new code.
__Maximum__@reddit
Alright, cline has a long prompt. How about others?
cant-find-user-name@reddit
I have used cline / aider and still use them. Cursor's UX is far better.
estebansaa@reddit
Anyone that used both, Is Cursor really that much better than Claude Code?
Utoko@reddit
Good start, but Cline is a big one missing here. It would also be great to have it with different LLMs.
How does it change when running with Gemini 2.5 Pro?
ladz@reddit
This post is totally all about local LLMs.
mr_no_one3@reddit
Personally I am using Copilot, and it is amazing.
if47@reddit
It's not about which tool is superior, but which one isn't as terrible as the rest.
pier4r@reddit
Interesting that, despite the Reddit take that "having a wrapper around the same API gives no advantage", the wrapper actually needs to be good as well. Every agent had Claude 3.7 as its LLM and still the difference was remarkable.
What I find funny is that Claude Code is not at the top (one would imagine that the Claude authors would have the best moat)
DefNattyBoii@reddit
The repo this points to is missing from GitHub: https://github.com/Princeton-SysML/SWE-agent
Hopefully they'll add it later.
https://liveswebench.ai
frivolousfidget@reddit
Yeah, the OpenHands one is also broken.
ResidentPositive4122@reddit
Cool, seems like there are some readily available gains in agentic frameworks (39 vs 47) even when using the same model (all were tested w/ Claude 3.7).
Curious to see how the open models stack up, and whether the gains are the same just from the cradle used. OH recently released a fine-tuned 32B (from Q2.5-Coder) that apparently works great w/ their agentic framework.
frivolousfidget@reddit
Would be really nice to see an open-model agent competition.
I think OpenHands and Cline both have models trained for them. The others would probably be way behind.
In my experience the non-finetuned open models have a really hard time using the correct tools over a long run, which is why the finetune makes a huge difference (and why the open models aren't as bad in other contexts like chat).
nikzart@reddit
Where’s cline?
xAragon_@reddit
Seems like someone opened a GitHub issue:
https://github.com/LiveBench/liveswebench/issues/2
Effective_Degree2225@reddit
I tried Roo a couple of times and it was too weird even for a simple task like writing a unit test.
Kooky-Somewhere-2883@reddit
cline dude
xqoe@reddit
What
Reddit_Bot9999@reddit
Bro, what? SWE-agent beats them all?
ihexx@reddit (OP)
https://liveswebench.ai/