MiniMax-M2 cracks top 10 overall LLMs (production LLM performance gap shrinking: 7 points from GPT-5 in Artificial Analysis benchmark)
Posted by medi6@reddit | LocalLLaMA | View on Reddit | 28 comments
I've been analysing the Artificial Analysis benchmark set (94 production models, 329 API endpoints) and wanted to share some trends that seem notable.
Context
These are models with commercial API access, not the full experimental OS landscape: mostly models you'd actually deploy out of the box, rather than every research model out there
The gap between the best tracked OS model (MiniMax-M2, quality 61) and the best proprietary one (GPT-5, 68) is now 7 points. Last year it was around 18 points in the same dataset. Linear extrapolation suggests parity by Q2 2026 for production-ready models, though obviously that assumes the trend holds (and Chinese labs keep shipping OSS models)
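Rough math behind that extrapolation, if anyone wants to poke at it (just the two gap figures above plugged into a constant-rate assumption, nothing fancier):

```python
# Back-of-the-envelope linear extrapolation of the OS-vs-proprietary quality gap.
# Assumes a constant closure rate from the two data points above;
# actual progress is obviously lumpier than this.

gap_last_year = 18   # quality-index points, roughly a year ago
gap_now = 7          # MiniMax-M2 (61) vs GPT-5 (68)

closure_per_year = gap_last_year - gap_now      # ~11 points/year
years_to_parity = gap_now / closure_per_year    # ~0.64 years, i.e. roughly mid-2026

print(f"closure rate: {closure_per_year} pts/year")
print(f"time to parity: {years_to_parity:.2f} years")
```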
What's interesting is the tier distribution:
- Elite (60+): 1 OS, 11 proprietary
- High (50-59): 8 OS, 8 proprietary (we hit parity here)
- Below 50: OS dominates by volume
The economics are pretty stark.
OS average: $0.83/M tokens.
Proprietary: $6.03/M.
Value leaders like Qwen3-235B are hitting 228 quality per dollar vs ~10-20 for proprietary elite models (rough metric, but I tried playing with it: quality per dollar = Quality Index ÷ price per 1M tokens)
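If you want to sanity-check that metric, here's a minimal sketch of the exact formula; note I'm using the OS/proprietary average prices quoted above as stand-ins (an assumption), not each model's actual endpoint price:

```python
# Quality per dollar as defined above: Quality Index / blended price per 1M tokens.
# Prices below are the OS and proprietary averages from the post used as stand-ins,
# not the models' actual endpoint prices.

def quality_per_dollar(quality_index: float, price_per_m_tokens: float) -> float:
    return quality_index / price_per_m_tokens

print(quality_per_dollar(61, 0.83))  # MiniMax-M2 at the OS average price      -> ~73.5
print(quality_per_dollar(68, 6.03))  # GPT-5 at the proprietary average price  -> ~11.3 (in the ~10-20 band)
```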
Speed is also shifting. OS on optimised infra (Groq, Fireworks) peaks at 3,087 tok/sec vs 616 for proprietary. Not sure how sustainable that edge is as proprietary invests in inference optimisation.
Made an interactive comparison: whatllm.org
Full write-up: https://www.whatllm.org/blog/open-source-vs-proprietary-llms-2025
Two questions I'm chewing on:
- How representative is this benchmark set vs the wider OS ecosystem? AA focuses on API-ready production models, which excludes a lot of experimental work, fine-tuned models, etc.
- Is there a ceiling coming, or does this compression just continue? Chinese labs seem to be iterating faster than I expected.
Curious what others think about the trajectory here.
lemon07r@reddit
Price per million is not a very good indicator of cost. The total cost to run the Artificial Analysis Intelligence Index is a much better indicator of real costs. I would replace your price per million axis with that on your website.
medi6@reddit (OP)
fair point. Like TCO per model by running the underlying infra? Only issue is that it varies a lot from one provider to another
lemon07r@reddit
I would just use the total run cost Artificial Analysis already has posted on their website, per model. It won't be the best representation of actual cost, but still much, much better than blended cost per million.
balianone@reddit
benchmaxxed
Evening-Piglet-7471@reddit
💯
medi6@reddit (OP)
ahah fr fr, my vibe check isn't complete yet
Evening-Piglet-7471@reddit
Creepy compared to GLM, tested via Claude Code.
jacek2023@reddit
Again people masturbate over benchmarks and pricing on a sub dedicated to local AI
GreenGreasyGreasels@reddit
I think you simply miss the point of models being open weight/source even if you personally can't run it.
The very fact that they are open, even if few can run them, will reshape the community with knock-on effects. And price shapes local AI (if it's insanely cheap there's less pressure to run it locally except for privacy; if it's very high then local AI will find more use).
The world is not set up to cater singularly to your tastes and desires. So knock it off with the constant moaning in every thread, please. We get the picture already. Thank you.
pier4r@reddit
this. It should be a more common view on this sub.
medi6@reddit (OP)
Good clarification: open ≠ must-be-local. Openness helps research and tooling even when most can’t run it. And yes, cost shifts the local/hosted balance.
medi6@reddit (OP)
no need to be aggressive, I think it's an interesting conversation to have.
Also, this sub isn't all about Llama either, yet I don't see bashing on all the other non-Llama-related posts
llama-impersonator@reddit
what i'm chewing on is how anyone takes artificial analysis seriously
Zemanyak@reddit
What leaderboard do you recommend? I don't have the time or the money to check 347 models before I pick my go-to.
llama-impersonator@reddit
agent: terminal bench
roleplay: eqbench
context handling ability: fiction.live
overall task handling: i like reasonscape but it's only for small models unless someone helps kryptkpr out with compute
not sure if any of the code benches or aime mean much
Evermoving-@reddit
Where do you see a leaderboard on that website?
nullmove@reddit
By now you don't have your own personal benchmark? It doesn't even "take time", I uncover plenty of failure modes in day-to-day dealing with LLMs, and I simply turn those into questions.
medi6@reddit (OP)
what makes you think they are not? of course it's far from perfect, but wouldn't go that far
llama-impersonator@reddit
it's junk, there is very little correlation to actual performance in tasks. minimax does well in their score and it's just not a good model. GLM is pretty decent and has a poor score, same with kimi. gpt-oss has a mostly undeserved score. it's an alright model for some things, but it has problems with any real context length.
medi6@reddit (OP)
yeah vibe check is the best bench imo but hard to actually put numbers on that
Mbando@reddit
We are pursuing the frontier, while they are pursuing efficiency and diffusion.
shaman-warrior@reddit
Who is this ‘we’ ?
chisleu@reddit
Running this locally with vLLM! https://www.reddit.com/r/BlackwellPerformance/comments/1oii3gz/minimax_m2_fp8_vllm_nightly/
Really happy with the performance. Passed my first round of vibe checks:
Load up Cline, full normal 30k context, instruct it to read docs and code until it hits 100k context and then stop. Then had it help me set up and run my benchmark. I'm getting ~65 TPS out of it with ~2500 TPS of prompt processing. Woooohaaaaa!
Daemonix00@reddit
kilo code was acting up for me...
chisleu@reddit
Right on. I don't use Kilo at all anymore. I find the prompting in Cline to be ideal for my purposes. I use models that work well with Cline. What I said before isn't to imply this model is great at anything. This was just my result from a first attempt/vibe check against the model's capabilities. There were no failed tool calls and it was pretty snappy about doing it.
medi6@reddit (OP)
nicely done!
Brave-Hold-9389@reddit
This model is a beast at tool calling, but it is very bad at reasoning and maths. They actually trained it to be used in an agentic framework, and they achieved that.
ForsookComparison@reddit
Rectangles on a jpeg do not match the vibe check