I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked)
Posted by hauhau901@reddit | LocalLLaMA | 87 comments
Hey everyone, been working on something for a while and figured it's time to share it.
I kept seeing new models drop every week with claims of being 10x better, benchmarks that don't translate to actual coding, and demos that look great but fall apart on real work. so I started building my own benchmark to figure out what actually works.
It's called APEX Testing. Every task is an actual codebase with real code, real dependencies, and a real problem to solve: fix this bug, add this feature, refactor this module, build this from scratch. It currently comprises 65 tasks across 8 categories, ranging from React components to race condition debugging to building CLI tools. Each model gets a fresh clone of the same repo with the exact same starting point and exact same conditions.
Grading is done by multiple SOTA models independently, and then I also personally review every single output to catch anything unfair like timeouts or infra hiccups. If a model got unlucky, I rerun it (which ended up burning a much bigger hole in my wallet haha). The whole thing is ranked with ELO, and you can filter by category to see where models actually shine vs where they struggle.
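For anyone curious about the Elo part: it's the same update rule as chess ratings, applied to head-to-head outcomes on each task. A minimal sketch (illustrative only; the K-factor and starting rating here are generic defaults, not necessarily my exact setup):

```python
# Standard Elo update, applied per head-to-head task outcome.
# Illustrative only: K=32 and a 1500 start are common defaults.
def expected(r_a: float, r_b: float) -> float:
    """Expected probability that A beats B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return new (r_a, r_b) after one pairwise comparison."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1500; model A wins a task head-to-head:
print(update(1500.0, 1500.0, True))  # -> (1516.0, 1484.0)
```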
A couple things that caught me off guard so far:
- GPT 5.1 Codex Mini beat GPT 5.2 Codex pretty convincingly despite being smaller and older; it came out way more consistent (but it also seemed to REALLY splurge on tokens)
- Some models look great on average but completely bomb certain task types
- The cost difference between models with similar scores is huge
It's a solo project, funded out of my own pocket (you can see total spend on the homepage lol). hope it helps you cut through the noise and pick the right model for your work.
Hope you all find it useful!
P.S. I will work on testing more quanted models, and I might add more tests in the future.

rpeck@reddit
Are you planning on benchmarking DeepSeek v4? I've seen conflicting reports of its performance...
XMRNeighbor@reddit
Is there a repo of those tasks? Can I run the benchmarks independently on my local models?
zerospatial@reddit
awesome - would be cool to see the model size (GB) for local models so users could know if they could run them locally - also maybe a cost efficiency score - so if something gets you 75% accuracy at a much lower cost than 80% accuracy, for example
Kaomet@reddit
I'm surprised there is such a low score difference between opus and sonnet: can you try to add harder tasks?
pol_phil@reddit
Hi, congrats on the great benchmark!
Perhaps I missed it somewhere, but what agentic scaffold do you use? SWE-Agent? OpenHands? Something else entirely?
hauhau901@reddit (OP)
"Homemade" tests entirely :) real repos and real agentic coding. You can see some info on the site.
pol_phil@reddit
I mean what type of agentic pipeline/harness/scaffold do you use to get these models to solve these tasks. In other words, what kind of system message/tools have they been given. Via Claude Code? OpenCode?
SWE-Agent and OpenHands are just "minimal" agentic frameworks commonly used in benchmarks.
hristothristov@reddit
Could you add a filter for open weights vs proprietary models and also filter by parameter count? Basically, I want to check which is the best model I could deploy and use locally with ollama.
sgmv@reddit
GPT 5.4 testing is pretty late, not yet online. Would add more interest to the project if things were reliably current, imho.
hauhau901@reddit (OP)
Yep, I'm a one-man show and finishing work on the qwen3.5 35b-a3b uncensoring now. Once that's done, I'll add gpt 5.4 to the leaderboard.
It performs above 5.3 codex :)
Uranday@reddit
I tried the uncensored 27b, slow but very cool. Hope you finish the 35B soon so I can use that.
hauhau901@reddit (OP)
Thanks, I'm looking at the 35B-A3B dashboard as we speak; it's mid-run. Looking promising and I hope to release it soon.
Uranday@reddit
<3
skymen75@reddit
Hey it would be really cool if it was possible to sort by size of the model and whether it's open source or not. I am often only interested in OSS models I can run on my machine (48GB Mac) and how they compare with other models including bigger/proprietary ones.
Great job though, thanks a lot for this.
clocksmith@reddit
hm would have thought Gemini 3.1 would have done better in at least one category lol
DeltaSqueezer@reddit
Can you add Qwen3.5 9B?
DeepWisdomGuy@reddit
WTF, comparing crippled quants against frontier?!? Dude, at least run Q8_0.
lemon07r@reddit
It actually scored worse than the older qwen coder model in my own evals. I don't think the new qwen models are very good for coding
SemaMod@reddit
This is great! Are you planning on adding gpt-5.3-codex? With the current results it seems like Opus 4.6 blows everyone else out of the water, but I've had generally good 5.3-codex experiences.
Virtamancer@reddit
Also, the leaderboard makes it unclear what reasoning level was used for any model. So it’s kind of pointless.
No-Mountain3817@reddit
If not specified, always assume the maximum. That way, you won’t go wrong in your estimation.
It may seem pointless to you, as you’re clearly missing the point, but the work put in by the OP as an independent benchmark can still be useful to filter out noise from other benchmarks and leaderboard ratings.
hauhau901@reddit (OP)
All reasoning models are used at their highest setting (i.e. xhigh for OpenAI), but you could work on your wording to be less rude.
hauhau901@reddit (OP)
Hi, currently only the $200 Codex sub offers it I think :) will add it once I can find it somewhere like OpenRouter
_yustaguy_@reddit
Actually, there is a promotion right now where even the free tier can use it with generous weekly limits.
hauhau901@reddit (OP)
That's weird, I can't see it, could you please link it? I'm not getting the model as available through the API
_yustaguy_@reddit
Check if you have limits here:
https://chatgpt.com/codex/settings/usage
Try updating your codex installation if you're still not seeing 5.3 in there.
hauhau901@reddit (OP)
Thanks for getting back to me! I found it now - limits are extremely easy to hit. I've started the benchmark process for Codex 5.3 but it'll take a while (it seems to hit limits every 2-3 benchmarks, and then it's stopped for several hours until the reset)
_yustaguy_@reddit
Oh, I guess they seem pretty high to me since I use AI sparingly haha
Howdareme9@reddit
It’s not easily accessible right now (no api)
sabotage3d@reddit
It's impressive that small models are performing that well. I am also unsure if the methodology is perfect. I got some strange results myself where Qwen Coder Next wrote a better 2D fluid simulation app than Kimi K2.5, and GLM 4.7 flash wasn't that far off.
-dysangel-@reddit
Different models can have different strengths and weaknesses.
hauhau901@reddit (OP)
Don't know of many things in life that are perfect 😂 methodology is made to reduce variance as much as possible but cannot fully eliminate it.
rorowhat@reddit
Can these tests be run locally?
Far-Application1714@reddit
glm 4.7 handled the React + CLI tasks pretty solidly imo, consistent enough for real work without going overboard on tokens.
Kuumikoo@reddit
Glm 5 worse than Glm 4.7 as a much bigger model? I wonder what could be the reason.
hauhau901@reddit (OP)
Bigger size doesn't equate to better quality (datasets are super important). I suspect the extra training was focused on 'general intelligence' rather than coding.
Kuumikoo@reddit
Interesting. Qwen 3.5 being so strong here is also surprising. From what I see Qwen is never rated that high in coding apart from small model competitions?
hauhau901@reddit (OP)
Keep in mind, qwen3.5 is almost 400b now AND they started using data from people's Qwen subscriptions. Similar to GLM and Minimax :)
Kuumikoo@reddit
Makes sense. Did they ever mention when they'll release Qwen 3.5 coder?
hauhau901@reddit (OP)
No, only 'smaller' general models are due to come out today/tomorrow
Kuumikoo@reddit
It looks like the best subscription plan for cheap is GLM now. But I am so sick of their unstable services.
About the benchmark, I wonder how coding languages play a role here. From what I know, China is quite one-dimensional with Spring Boot and Vue.
hauhau901@reddit (OP)
(For coding) Compare it to GLM 4.7 and it comes off as inferior.
GarbageOk5505@reddit
The GPT 5.1 Mini consistency finding is interesting; token spend as a proxy for effort is a pattern worth tracking across models. What categories see the biggest spread between average performers and bombers?
hauhau901@reddit (OP)
Great idea, I will add it as a public metric!
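Probably something simple like points per million output tokens; a rough sketch of what I have in mind (hypothetical definition, not the final one that will ship):

```python
# Rough sketch of a token-efficiency metric (hypothetical definition).
def points_per_mtok(score: float, output_tokens: int) -> float:
    """Benchmark points earned per million output tokens spent."""
    return score / (output_tokens / 1_000_000)

# e.g. a model scoring 75 that burned 40M output tokens across all tasks:
print(points_per_mtok(75.0, 40_000_000))  # ~1.88 points/Mtok
```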
mr_riptano@reddit
Love to see more benchmarks that aren't hopelessly contaminated, great work!
I gotta say tho I'm very very skeptical of having LLMs judge code vs actual test suites.
angelin1978@reddit
the real codebase angle is what makes this actually useful imo. the main thing i wonder about is how you handle the variance from non-deterministic model outputs, like does the same model score differently across runs? also curious what the average task complexity looks like, is it mostly single file edits or multi-file refactors
hauhau901@reddit (OP)
Hi,
In all fairness, most models have had their tasks retaken several times. Scoring has rarely varied more than +/- 5 points. You cannot fully remove variance though (that would only be possible at temperature 0), because it'd limit most models' capabilities as well, sadly.
angelin1978@reddit
+/- 5 is honestly pretty tight for this kind of benchmark. makes sense that temp 0 would hurt the creative problem solving side. solid methodology
hauhau901@reddit (OP)
Also, to reply on avg task complexity: 99% of all tasks are a real codebase, so multiple files and in some cases folders as well. Diffs can range from tens of lines of edits to as much as 3000-5000 lines.
angelin1978@reddit
that's a good range honestly. the multi-file stuff is where most benchmarks fall apart because they only test isolated single-file edits. 3000-5000 lines is gnarly though, curious how many models even attempt changes that large vs just giving up
odomobo@reddit
Very useful info. My only complaint is that score/$ is not very useful, because although cost is linear, score is not. Getting from 80 to 90 should be an enormous increase in capability, but it would barely make a dent in score/$.
hauhau901@reddit (OP)
That's true. ELO (and obviously, score) work exactly like that, but if you start reading the comments on this thread, you'll see a lot of people either don't care about it or don't see it the same way. There is no pleasing everyone.
odomobo@reddit
I understand people not caring, and I'm not asking you to placate me, but take sonnet 4.5 and sonnet 4.6. They have nearly identical costs and nearly identical score/$, yet 4.6 is over 150 elo higher than 4.5.
Of course, this isn't an objectively solvable problem since elo or score can't be turned into a quantitatively-meaningful linear value, but I think there are ways to get a somewhat meaningful heuristic out of it. A couple of formulas that make sense to me:
"Ability" doubles every 200 elo: 2^(elo/200)
Halving distance to a perfect 100 score doubles ability: 1 / (100-score)
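In code, with completely made-up Elo and cost numbers just to show the shape of the comparison:

```python
# The two heuristics above, with made-up numbers for two similarly
# priced models that sit ~150+ Elo apart.
def ability_from_elo(elo: float) -> float:
    """Ability doubles every 200 Elo: 2^(elo/200)."""
    return 2 ** (elo / 200)

def ability_from_score(score: float) -> float:
    """Halving the distance to a perfect 100 doubles ability."""
    return 1.0 / (100.0 - score)

# hypothetical (elo, $ per run); costs are nearly identical:
for name, elo, cost in [("model-a", 1450, 3.0), ("model-b", 1610, 3.2)]:
    print(f"{name}: {ability_from_elo(elo) / cost:.1f} ability/$")
# model-a: ~50.7 ability/$ vs model-b: ~82.8 ability/$
# raw score/$ would show them as near-equal; ability/$ does not.
```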
Those are just my thoughts anyhow. The data you present is already very helpful and informative, and a motivated viewer can perform their own analysis (of course).
Icy_Butterscotch6661@reddit
Keep the scores around and test if the older SOTA models indeed get dumber when a new model comes out
Lixa8@reddit
Damn this is a cool benchmark, I really like the different categories
So basically glm 4.7 is the open weight goat huh? I wonder how capabilities degrade with quantization, I am looking forward to future tests!
hauhau901@reddit (OP)
Currently adding q4kxl as well! Thanks for the kind words.
jmager@reddit
You made a very useful and beautiful website, thank you! Would you consider adding an additional column that shows the "score/$" metric? This to me is the most insightful part of the stats. If a model passed a given test 67% of the time but costs a hundredth of the one passing 100% of the time, running the agent 3 times in parallel is likely to have at least one agent succeed at 3% of the cost. That is simplified of course, there are other variables such as time and confounding factors, but it is interesting to think about.
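Back of the envelope (assuming independent runs, which is generous):

```python
# Back-of-envelope for the parallel-runs idea (independence assumed,
# illustrative numbers only).
def p_at_least_one(pass_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts passes."""
    return 1.0 - (1.0 - pass_rate) ** runs

# 67% pass rate, 3 parallel runs, each at 1/100th the cost:
print(p_at_least_one(0.67, 3))  # ~0.964 success for ~3% of the price
```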
hauhau901@reddit (OP)
Hi, cost/score is already included so I'm unsure what you mean exactly?
jmager@reddit
I see that metric on the detailed page for individual models, but not on the overall list of all models.
guiopen@reddit
The results seem to align very well with real-world usage
hauhau901@reddit (OP)
Because they are real world usage! 😊
guiopen@reddit
Yes, unfortunately that is an exception for benchmarks, I am very thankful for this one, thank you
Also loved the inclusion of quantized models
yeah-ok@reddit
Superb work. Very nice to have a new solid take on rankings! Looking forward to the next Kimi model is my take at the end of reviewing this..!
tarruda@reddit
Is this something we can run locally against llama-server? I'd love to test how much quantization impacts the results of some of those models.
hauhau901@reddit (OP)
No, I will be adding more quanted models for everyone soon.
tarruda@reddit
One interesting quant to try is Qwen 3.5 smol-IQ2_XS from ubergarm, here's my experience using it: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/2
Would be great if you could add that one, as it seems the best quant that can run on 128G macs!
hauhau901@reddit (OP)
Will do
rm-rf-rm@reddit
This is great! I think we desperately need something like this as the main benchmark rather than the bs gamed ones, LM arena etc.
Things I think will make this get widely adopted:
hauhau901@reddit (OP)
Hi, thank you for the kind words!
rm-rf-rm@reddit
Unfortunately, this is the trade-off. But it's a case of chicken and egg as well - you need to make the test available for others to run. Without that, no one has any reason to trust your scores. The other option is to get a bunch of money and market your test like Andon Labs, or have insider connections like LM Arena. But then we'd be back to square one with an unreliable test.
That's why SWE-Rebench continually updates their test and is probably the best available benchmark today
hauhau901@reddit (OP)
Yeah, my project is free for everyone, so it's a "take it or leave it" since I don't have funding coming in for this. We'll see how time progresses; it's not my intention to ask for money or whatever either, and I'd like it to never resort to that.
Yorn2@reddit
Can you make the leaderboard bigger than 5 models or at least extend it so I can see the top two or three open weights models? I mean, that's like 95% of the reason I look at benchmarks.
Err nm. I see how to look it up now. You should probably make the "View Full Leaderboard" a bigger option or just a full on button on the main page.
So, a question. Why did you say yesterday that the new Qwen was worse than MiniMax M2.5 and that you'd post the results showing this soon, and then today you released a leaderboard showing the exact opposite? Did you mean Kimi K2.5 instead?
Is your plan to run this once every month or so like SWE Rebench?
hauhau901@reddit (OP)
Hello, the tests were still ongoing when I wrote that, with more of them favoring Minimax at the time.
Ideally I will work on keeping it updated whenever new (worthwhile) models come up.
sabotage3d@reddit
Given the cost-to-performance ratio of Minimax 2.5, it's a no-brainer. Did you update the score on your website?
hauhau901@reddit (OP)
Everything on the website updates as soon as I finish it locally, so yes :)
philmarcracken@reddit
Like it so far, wouldn't mind a model size parameter. Throw us vram poor a bone ༼ つ ◕_◕ ༽つ
hauhau901@reddit (OP)
I will work on adding quanted models as well
tomleelive@reddit
The cost/performance analysis is really interesting here. For those of us running Claude Code daily, knowing that Sonnet 4.6 hits the sweet spot of 75+ score at 400-800 pts/$ confirms what I've been seeing in practice. Would love to see this benchmark include agentic coding tasks too — multi-file refactors, test generation across modules. That's where the real gap between models shows up.
hauhau901@reddit (OP)
All of these tasks are strictly agentic coding :)
rm-rf-rm@reddit
If true, Haiku 4.5 (regarded as significantly worse than Sonnet 4.5 by users) is better than Minimax 2.5, which was claiming near-SOTA performance
Zc5Gwu@reddit
Minimax is great but not quite sonnet level in my subjective experience.
debackerl@reddit
This is wonderful! So cool! Don't hesitate to set up a Patreon thing to get some sponsorship
notdba@reddit
Thank you so much ♥️
This is a great list and much more comprehensive than the one from u/mr_riptano, in both model selection and task diversity.
Very interesting to see that only a few open weight models do better than Haiku 4.5. This kinda explains why Claude Code can afford to farm out important tasks (e.g. Explore) to sub agents that use Haiku.
touristtam@reddit
website down?
FPham@reddit
If this is true, and the results kinda look true, this is a pretty interesting although expensive project.
I would say you should add some sort of Avg Score / Avg Cost metric. By messing with the data using Grok, it came up with:
Quick takeaways:
So basically a $20 Claude sub using only Sonnet looks like the winner for me, then a $20 Codex sub. Stay away from Opus as it eats all your money while being only marginally better than Sonnet.