I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked)
Posted by hauhau901@reddit | LocalLLaMA | 87 comments
Hey everyone, been working on something for a while and figured it's time to share it.
I kept seeing new models drop every week with claims of being 10x better, benchmarks that don't translate to actual coding, and demos that look great but fall apart on real work. so I started building my own benchmark to figure out what actually works.
It's called APEX Testing. Every task is an actual codebase with real code, real dependencies, and a real problem to solve: fix this bug, add this feature, refactor this module, build this from scratch. It currently comprises 65 tasks across 8 categories, ranging from React components to race condition debugging to building CLI tools. Each model gets a fresh clone of the same repo with the exact same starting point and exact same conditions.
Grading is done by multiple SOTA models independently, and then I also personally review every single output to catch anything unfair like timeouts or infra hiccups. If a model got unlucky, I rerun it (which ended up burning a much bigger hole in my wallet haha). The whole thing is ranked with ELO, and you can filter by category to see where models actually shine vs where they struggle.
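For anyone curious about the Elo part: it's the same update rule as chess ratings, applied to head-to-head outcomes on each task. A minimal sketch (illustrative only; the K-factor and starting rating here are generic defaults, not necessarily my exact setup):

```python
# Standard Elo update, applied per head-to-head task outcome.
# Illustrative only: K=32 and a 1500 start are common defaults.
def expected(r_a: float, r_b: float) -> float:
    """Expected probability that A beats B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return new (r_a, r_b) after one pairwise comparison."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1500; model A wins a task head-to-head:
print(update(1500.0, 1500.0, True))  # -> (1516.0, 1484.0)
```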
A couple things that caught me off guard so far:
- GPT 5.1 Codex Mini beat GPT 5.2 Codex pretty convincingly despite being smaller and older; it came out way more consistent (but it also seemed to REALLY splurge on tokens)
- Some models look great on average but completely bomb certain task types
- The cost difference between models with similar scores is huge
It's a solo project, funded out of my own pocket (you can see total spend on the homepage lol). hope it helps you cut through the noise and pick the right model for your work.
Hope you all find it useful!
P.S. I will work on testing more quanted models, and I might add more tests in the future.

rpeck@reddit
Are you planning on benchmarking DeepSeek v4? I've seen conflicting reports of its performance...
XMRNeighbor@reddit
Is there a repo of those tasks? Can I run the benchmarks independently on my local models?
zerospatial@reddit
awesome - would be cool to see the model size (GB) for local models so users could know if they could run them locally - also maybe a cost efficiency score - so if something gets you 75% accuracy at a much lower cost than 80% accuracy, for example
Kaomet@reddit
I'm surprised there is such a low score difference between opus and sonnet: can you try to add harder tasks?
pol_phil@reddit
Hi, congrats on the great benchmark!
Perhaps I missed it somewhere, but what agentic scaffold do you use? SWE-Agent? OpenHands? Something else entirely?
hauhau901@reddit (OP)
"Homemade" tests entirely :) real repos and real agentic coding. You can see some info on the site.
pol_phil@reddit
I mean what type of agentic pipeline/harness/scaffold do you use to get these models to solve these tasks. In other words, what kind of system message/tools have they been given. Via Claude Code? OpenCode?
SWE-Agent and OpenHands are just "minimal" agentic frameworks commonly used in benchmarks.
hristothristov@reddit
Could you add a filter for open weights vs proprietary models and also filter by parameter count? Basically, I want to check which is the best model I could deploy and use locally with ollama.
sgmv@reddit
GPT 5.4 testing is pretty late, not yet online. Would add more interest to the project if things were reliably current, imho.
hauhau901@reddit (OP)
Yep, I'm a one-man show and finishing work on the qwen3.5 35b-a3b uncensoring now. Once that's done, I'll add gpt 5.4 to the leaderboard.
It performs above 5.3 codex :)
Uranday@reddit
I tried the uncensored 27b, slow but very cool. Hope you finish the 35B soon so I can use that.
hauhau901@reddit (OP)
Thanks, I'm looking at the 35B-A3B dashboard as we speak; it's mid-run. Looking promising and I hope to release it soon.
Uranday@reddit
<3
skymen75@reddit
Hey it would be really cool if it was possible to sort by size of the model and whether it's open source or not. I am often only interested in OSS models I can run on my machine (48GB Mac) and how they compare with other models including bigger/proprietary ones.
Great job though, thanks a lot for this.
clocksmith@reddit
hm would have thought Gemini 3.1 would have done better in at least one category lol
DeltaSqueezer@reddit
Can you add Qwen3.5 9B?
DeepWisdomGuy@reddit
WTF, comparing crippled quants against frontier?!? Dude, at least run Q8_0.
lemon07r@reddit
It actually scored worse than the older qwen coder model in my own evals. I don't think the new qwen models are very good for coding
SemaMod@reddit
This is great! Are you planning on adding gpt-5.3-codex? With the current results it seems like Opus 4.6 blows everyone else out of the water, but I've had generally good 5.3-codex experiences.
Virtamancer@reddit
Also, the leaderboard makes it unclear what reasoning level was used for any model. So it’s kind of pointless.
No-Mountain3817@reddit
If not specified, always assume the maximum. That way, you won’t go wrong in your estimation.
It may seem pointless to you, as you’re clearly missing the point, but the work put in by the OP as an independent benchmark can still be useful to filter out noise from other benchmarks and leaderboard ratings.
hauhau901@reddit (OP)
All reasoning models are used at their highest setting (i.e. xhigh for OpenAI), but you could work on your wording to be less rude.
hauhau901@reddit (OP)
Hi, currently only the $200 Codex sub offers it I think :) will add it once I can find it somewhere like OpenRouter
_yustaguy_@reddit
Actually, there is a promotion right now where even the free tier can use it with generous weekly limits.
hauhau901@reddit (OP)
That's weird, I can't see it, could you please link it? I'm not getting the model as available through the API
_yustaguy_@reddit
Check if you have limits here:
https://chatgpt.com/codex/settings/usage
Try updating your codex installation if you're still not seeing 5.3 in there.
hauhau901@reddit (OP)
Thanks for getting back to me! I found it now - limits are extremely easy to hit. I've started the benchmark process for Codex 5.3 but it'll take a while (it seems to hit limits every 2-3 benchmarks, and then it's stopped for several hours until the reset)
_yustaguy_@reddit
Oh, I guess they seem pretty high to me since I use AI sparingly haha
Howdareme9@reddit
It’s not easily accessible right now (no api)
sabotage3d@reddit
It's impressive that small models are performing that well. I am also unsure if the methodology is perfect. I got some strange results myself where Qwen Coder Next wrote a better 2D fluid simulation app than Kimi K2.5, and GLM 4.7 flash wasn't that far off.
-dysangel-@reddit
Different models can have different strengths and weaknesses.
hauhau901@reddit (OP)
Don't know of many things in life that are perfect 😂 methodology is made to reduce variance as much as possible but cannot fully eliminate it.
rorowhat@reddit
Can these tests be run locally?
Far-Application1714@reddit
glm 4.7 handled the React + CLI tasks pretty solidly imo, consistent enough for real work without going overboard on tokens.
Kuumikoo@reddit
Glm 5 worse than Glm 4.7 as a much bigger model? I wonder what could be the reason.
hauhau901@reddit (OP)
Bigger size doesn't equate to better quality (datasets are super important). I suspect the extra training was focused on 'general intelligence' rather than coding.
Kuumikoo@reddit
Interesting. Qwen 3.5 being so strong here is also surprising. From what I see Qwen is never rated that high in coding apart from small model competitions?
hauhau901@reddit (OP)
Keep in mind, qwen3.5 is almost 400b now AND they started using data from people's Qwen subscriptions. Similar to GLM and Minimax :)
Kuumikoo@reddit
Makes sense. Did they ever mention when they'll release Qwen 3.5 coder?
hauhau901@reddit (OP)
No, only 'smaller' general models are due to come out today/tomorrow
Kuumikoo@reddit
It looks like the best subscription plan for cheap is GLM now. But I am so sick of their unstable services.
About the benchmark, I wonder how coding languages play a role here. From what I know, China is quite one-dimensional with Spring Boot and Vue.
hauhau901@reddit (OP)
(For coding) Compare it to GLM 4.7 and it comes off as inferior.
GarbageOk5505@reddit
The GPT 5.1 Mini consistency finding is interesting; token spend as a proxy for effort is a pattern worth tracking across models. What categories see the biggest spread between average performers and bombers?
hauhau901@reddit (OP)
Great idea, I will add it as a public metric!
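Probably something simple like points per million output tokens; a rough sketch of what I have in mind (hypothetical definition, not the final one that will ship):

```python
# Rough sketch of a token-efficiency metric (hypothetical definition).
def points_per_mtok(score: float, output_tokens: int) -> float:
    """Benchmark points earned per million output tokens spent."""
    return score / (output_tokens / 1_000_000)

# e.g. a model scoring 75 that burned 40M output tokens across all tasks:
print(points_per_mtok(75.0, 40_000_000))  # ~1.88 points/Mtok
```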
mr_riptano@reddit
Love to see more benchmarks that aren't hopelessly contaminated, great work!
I gotta say tho I'm very very skeptical of having LLMs judge code vs actual test suites.
angelin1978@reddit
the real codebase angle is what makes this actually useful imo. the main thing i wonder about is how you handle the variance from non-deterministic model outputs, like does the same model score differently across runs? also curious what the average task complexity looks like, is it mostly single file edits or multi-file refactors
hauhau901@reddit (OP)
Hi,
In all fairness, most models have had their tasks retaken several times. Scoring has rarely varied more than +/- 5 points. You cannot fully remove variance though (that would only be possible at temperature 0), because it'd limit most models' capabilities as well, sadly.
angelin1978@reddit
+/- 5 is honestly pretty tight for this kind of benchmark. makes sense that temp 0 would hurt the creative problem solving side. solid methodology
hauhau901@reddit (OP)
Also, to reply on avg task complexity: 99% of all tasks are a real codebase, so multiple files and in some cases folders as well. Diffs can range from tens of lines of edits to as much as 3000-5000 lines.
angelin1978@reddit
that's a good range honestly. the multi-file stuff is where most benchmarks fall apart because they only test isolated single-file edits. 3000-5000 lines is gnarly though, curious how many models even attempt changes that large vs just giving up
odomobo@reddit
Very useful info. My only complaint is that score/$ is not very useful, because although cost is linear, score is not. Getting from 80 to 90 should be an enormous increase in capability, but it would barely make a dent in score/$.
hauhau901@reddit (OP)
That's true. ELO (and obviously, score) work exactly like that, but if you start reading the comments on this thread, you'll see a lot of people either don't care about it or don't see it the same way. There is no pleasing everyone.
odomobo@reddit
I understand people not caring, and I'm not asking you to placate me, but take sonnet 4.5 and sonnet 4.6. They have nearly identical costs and nearly identical score/$, yet 4.6 is over 150 elo higher than 4.5.
Of course, this isn't an objectively solvable problem since elo or score can't be turned into a quantitatively-meaningful linear value, but I think there are ways to get a somewhat meaningful heuristic out of it. A couple of formulas that make sense to me:
"Ability" doubles every 200 elo: 2^(elo/200)
Halving distance to a perfect 100 score doubles ability: 1 / (100-score)
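In code, with completely made-up Elo and cost numbers just to show the shape of the comparison:

```python
# The two heuristics above, with made-up numbers for two similarly
# priced models that sit ~150+ Elo apart.
def ability_from_elo(elo: float) -> float:
    """Ability doubles every 200 Elo: 2^(elo/200)."""
    return 2 ** (elo / 200)

def ability_from_score(score: float) -> float:
    """Halving the distance to a perfect 100 doubles ability."""
    return 1.0 / (100.0 - score)

# hypothetical (elo, $ per run); costs are nearly identical:
for name, elo, cost in [("model-a", 1450, 3.0), ("model-b", 1610, 3.2)]:
    print(f"{name}: {ability_from_elo(elo) / cost:.1f} ability/$")
# model-a: ~50.7 ability/$ vs model-b: ~82.8 ability/$
# raw score/$ would show them as near-equal; ability/$ does not.
```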
Those are just my thoughts anyhow. The data you present is already very helpful and informative, and a motivated viewer can perform their own analysis (of course).
Icy_Butterscotch6661@reddit
Keep the scores around and test if the older SOTA models indeed get dumber when a new model comes out
Lixa8@reddit
Damn this is a cool benchmark, I really like the different categories
So basically glm 4.7 is the open weight goat huh? I wonder how capabilities degrade with quantization, I am looking forward to future tests!
hauhau901@reddit (OP)
Currently adding q4kxl as well! Thanks for the kind words.
jmager@reddit
You made a very useful and beautiful website, thank you! Would you consider adding an additional column that shows the "score/$" metric? This to me is the most insightful part of the stats. If a model passed a given test 67% of the time but costs a hundredth of the one passing 100% of the time, running the agent 3 times in parallel is likely to have at least one agent succeed at 3% of the cost. That is simplified of course, there are other variables such as time and confounding factors, but it is interesting to think about.
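Back of the envelope (assuming independent runs, which is generous):

```python
# Back-of-envelope for the parallel-runs idea (independence assumed,
# illustrative numbers only).
def p_at_least_one(pass_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts passes."""
    return 1.0 - (1.0 - pass_rate) ** runs

# 67% pass rate, 3 parallel runs, each at 1/100th the cost:
print(p_at_least_one(0.67, 3))  # ~0.964 success for ~3% of the price
```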
hauhau901@reddit (OP)
Hi, cost/score is already included so I'm unsure what you mean exactly?
jmager@reddit
I see that metric on the detailed page for individual models, but not on the overall list of all models.
guiopen@reddit
The results seem to align very well with real-world usage
hauhau901@reddit (OP)
Because they are real world usage! 😊
guiopen@reddit
Yes, unfortunately that is an exception for benchmarks, I am very thankful for this one, thank you
Also loved the inclusion of quantized models
yeah-ok@reddit
Superb work. Very nice to have a new solid take on rankings! Looking forward to the next Kimi model is my take at the end of reviewing this..!
tarruda@reddit
Is this something we can run locally against llama-server? I'd love to test how much quantization impacts the results of some of those models.
hauhau901@reddit (OP)
No, I will be adding more quanted models for everyone soon.
tarruda@reddit
One interesting quant to try is Qwen 3.5 smol-IQ2_XS from ubergarm, here's my experience using it: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/2
Would be great if you could add that one, as it seems the best quant that can run on 128G macs!
hauhau901@reddit (OP)
Will do
rm-rf-rm@reddit
This is great! I think we desperately need something like this as the main benchmark rather than the bs gamed ones, LM arena etc.
Things I think will make this get widely adopted:
hauhau901@reddit (OP)
Hi, thank you for the kind words!
rm-rf-rm@reddit
Unfortunately, this is the trade-off. But it's a case of chicken and egg as well - you need to make the test available for others to run. Without that, no one has any reason to trust your scores. The other option is to get a bunch of money and market your test like Andon Labs, or have insider connections like LM Arena. But then we'd be back to square one with an unreliable test.
That's why SWE-Rebench continually updates their test and is probably the best available benchmark today
hauhau901@reddit (OP)
Yeah, my project is free for everyone, so it's a "take it or leave it" since I don't have funding coming in for this. We'll see how time progresses; it's not my intention to ask for money or whatever either, and I'd like it to never resort to that.
Yorn2@reddit
Can you make the leaderboard bigger than 5 models or at least extend it so I can see the top two or three open weights models? I mean, that's like 95% of the reason I look at benchmarks.
Err nm. I see how to look it up now. You should probably make the "View Full Leaderboard" a bigger option or just a full on button on the main page.
So, a question. Why did you say yesterday that the new Qwen was worse than MiniMax M2.5 and that you'd post the results showing this soon, and then today you released a leaderboard showing the exact opposite? Did you mean Kimi K2.5 instead?
Is your plan to run this once every month or so like SWE Rebench?
hauhau901@reddit (OP)
Hello, the tests were still ongoing when I wrote that, with more of them favoring Minimax at the time.
Ideally I will work on keeping it updated whenever new (worthwhile) models come up.
sabotage3d@reddit
Given the cost-to-performance ratio of Minimax 2.5, it's a no-brainer. Did you update the score on your website?
hauhau901@reddit (OP)
Everything on the website updates as soon as I finish it locally, so yes :)
philmarcracken@reddit
Like it so far, wouldn't mind a model size parameter. Throw us vram poor a bone ༼ つ ◕_◕ ༽つ
hauhau901@reddit (OP)
I will work on adding quanted models as well
tomleelive@reddit
The cost/performance analysis is really interesting here. For those of us running Claude Code daily, knowing that Sonnet 4.6 hits the sweet spot of 75+ score at 400-800 pts/$ confirms what I've been seeing in practice. Would love to see this benchmark include agentic coding tasks too — multi-file refactors, test generation across modules. That's where the real gap between models shows up.
hauhau901@reddit (OP)
All of these tasks are strictly agentic coding :)
rm-rf-rm@reddit
If true, Haiku 4.5 (regarded as significantly worse than Sonnet 4.5 by users) is better than Minimax 2.5, which was claiming near-SOTA performance
Zc5Gwu@reddit
Minimax is great but not quite sonnet level in my subjective experience.
debackerl@reddit
This is wonderful! So cool! Don't hesitate to set up a Patreon thing to get some sponsorship
notdba@reddit
Thank you so much ♥️
This is a great list and much more comprehensive than the one from u/mr_riptano, in both model selection and task diversity.
Very interesting to see that only a few open weight models do better than Haiku 4.5. This kinda explains why Claude Code can afford to farm out important tasks (e.g. Explore) to sub agents that use Haiku.
touristtam@reddit
website down?
FPham@reddit
If this is true, and the results kinda look true, this is a pretty interesting although expensive project.
I would say you should add some sort of Avg Score / Avg Cost metric. By messing with the data using Grok, it came up with:
Quick takeaways:
So basically a $20 Claude sub using only Sonnet looks like the winner for me, then a $20 Codex sub. Stay away from Opus as it eats all your money while being only marginally better than Sonnet.