MiniMax-M2 cracks top 10 overall LLMs (production LLM performance gap shrinking: 7 points from GPT-5 in Artificial Analysis benchmark)
Posted by medi6@reddit | LocalLLaMA | View on Reddit | 28 comments
I've been analysing the Artificial Analysis benchmark set (94 production models, 329 API endpoints) and wanted to share some trends that seem notable.
Context
These are models with commercial API access, not the full experimental OS landscape: mostly models you'd actually deploy out of the box, rather than every research model out there
The gap between the best tracked OS model (MiniMax-M2, quality 61) and the best proprietary one (GPT-5, 68) is now 7 points. Last year it was around 18 points in the same dataset. Linear extrapolation suggests parity by Q2 2026 for production-ready models, though obviously that assumes the trend holds (and Chinese labs keep shipping OSS models)
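Rough math behind that extrapolation, if anyone wants to poke at it (just the two gap figures above plugged into a constant-rate assumption, nothing fancier):

```python
# Back-of-the-envelope linear extrapolation of the OS-vs-proprietary quality gap.
# Assumes a constant closure rate from the two data points above;
# actual progress is obviously lumpier than this.

gap_last_year = 18   # quality-index points, roughly a year ago
gap_now = 7          # MiniMax-M2 (61) vs GPT-5 (68)

closure_per_year = gap_last_year - gap_now      # ~11 points/year
years_to_parity = gap_now / closure_per_year    # ~0.64 years, i.e. roughly mid-2026

print(f"closure rate: {closure_per_year} pts/year")
print(f"time to parity: {years_to_parity:.2f} years")
```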
What's interesting is the tier distribution:
- Elite (60+): 1 OS, 11 proprietary
- High (50-59): 8 OS, 8 proprietary (we hit parity here)
- Below 50: OS dominates by volume
The economics are pretty stark.
OS average: $0.83/M tokens.
Proprietary: $6.03/M.
Value leaders like Qwen3-235B are hitting 228 quality per dollar vs ~10-20 for proprietary elite models (rough metric, but I tried playing with it: quality per dollar = Quality Index ÷ price per 1M tokens)
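If you want to sanity-check that metric, here's a minimal sketch of the exact formula; note I'm using the OS/proprietary average prices quoted above as stand-ins (an assumption), not each model's actual endpoint price:

```python
# Quality per dollar as defined above: Quality Index / blended price per 1M tokens.
# Prices below are the OS and proprietary averages from the post used as stand-ins,
# not the models' actual endpoint prices.

def quality_per_dollar(quality_index: float, price_per_m_tokens: float) -> float:
    return quality_index / price_per_m_tokens

print(quality_per_dollar(61, 0.83))  # MiniMax-M2 at the OS average price      -> ~73.5
print(quality_per_dollar(68, 6.03))  # GPT-5 at the proprietary average price  -> ~11.3 (in the ~10-20 band)
```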
Speed is also shifting. OS on optimised infra (Groq, Fireworks) peaks at 3,087 tok/sec vs 616 for proprietary. Not sure how sustainable that edge is as proprietary invests in inference optimisation.
Made an interactive comparison: whatllm.org
Full write-up: https://www.whatllm.org/blog/open-source-vs-proprietary-llms-2025
Two questions I'm chewing on:
- How representative is this benchmark set vs the wider OS ecosystem? AA focuses on API-ready production models, which excludes a lot of experimental work, fine-tuned models, etc.
- Is there a ceiling coming, or does this compression just continue? Chinese labs seem to be iterating faster than I expected.
Curious what others think about the trajectory here.
lemon07r@reddit
Price per million is not a very good indicator of cost. The total cost to run the Artificial Analysis Intelligence Index is a much better indicator of real costs. I would replace your price per million axis with that on your website.
medi6@reddit (OP)
fair point. Like TCO per model by running the underlying infra? Only issue is that it varies a lot from one provider to another
lemon07r@reddit
I would just use the total run cost Artificial Analysis already has posted on their website, per model. It won't be the best representation of actual cost, but still much, much better than blended cost per million.
balianone@reddit
benchmaxxed
Evening-Piglet-7471@reddit
💯
medi6@reddit (OP)
ahah fr fr, my vibe check isn't complete yet
Evening-Piglet-7471@reddit
Creepy compared to GLM, tested via Claude Code.
jacek2023@reddit
Again people masturbate over benchmarks and pricing on a sub dedicated to local AI
GreenGreasyGreasels@reddit
I think you simply miss the point of models being open weight/source even if you personally can't run it.
The very fact that they are open, even if few can run them, will reshape the community with knock-on effects. And price shapes local AI (if it's insanely cheap there's less pressure to run it locally except for privacy; if it's very high then local AI will find more use).
The world is not set up to cater singularly to your tastes and desires. So knock it off with the constant moaning in every thread, please. We get the picture already. Thank you.
pier4r@reddit
this. It should be a more common view on this sub.
medi6@reddit (OP)
Good clarification: open ≠ must-be-local. Openness helps research and tooling even when most can’t run it. And yes, cost shifts the local/hosted balance.
medi6@reddit (OP)
no need to be aggressive, I think it's an interesting conversation to have.
Also, this sub isn't all about Llama either, yet I don't see bashing on all the other non-Llama-related posts
llama-impersonator@reddit
what i'm chewing on is how anyone takes artificial analysis seriously
Zemanyak@reddit
What leaderboard do you recommend? I don't have the time or the money to check 347 models before I pick my go-to.
llama-impersonator@reddit
agent: terminal bench
roleplay: eqbench
context handling ability: fiction.live
overall task handling: i like reasonscape but it's only for small models unless someone helps kryptkpr out with compute
not sure if any of the code benches or aime mean much
Evermoving-@reddit
Where do you see a leaderboard on that website?
nullmove@reddit
By now you don't have your own personal benchmark? It doesn't even "take time", I uncover plenty of failure modes in day-to-day dealing with LLMs, and I simply turn those into questions.
medi6@reddit (OP)
what makes you think they are not? of course it's far from perfect, but wouldn't go that far
llama-impersonator@reddit
it's junk, there is very little correlation to actual performance in tasks. minimax does well in their score and it's just not a good model. GLM is pretty decent and has a poor score, same with kimi. gpt-oss has a mostly undeserved score. it's an alright model for some things, but it has problems with any real context length.
medi6@reddit (OP)
yeah vibe check is the best bench imo but hard to actually put numbers on that
Mbando@reddit
We are pursuing the frontier, while they are pursuing efficiency and diffusion.
shaman-warrior@reddit
Who is this ‘we’ ?
chisleu@reddit
Running this locally with vLLM! https://www.reddit.com/r/BlackwellPerformance/comments/1oii3gz/minimax_m2_fp8_vllm_nightly/
Really happy with the performance. Passed my first round of vibe checks:
Load up Cline, full normal 30k context, instruct it to read docs and code until it hits 100k context and then stop. Then had it help me set up and run my benchmark. I'm getting ~65 TPS out of it with ~2500 TPS of prompt processing. Woooohaaaaa!
Daemonix00@reddit
kilo code was acting up for me...
chisleu@reddit
Right on. I don't use Kilo at all anymore. I find the prompting in Cline to be ideal for my purposes. I use models that work well with Cline. What I said before isn't to imply this model is great at anything. This was just my result from a first attempt/vibe check against the model's capabilities. There were no failed tool calls and it was pretty snappy about doing it.
medi6@reddit (OP)
nicely done!
Brave-Hold-9389@reddit
This model is a beast at tool calling, but it is very bad at reasoning and maths. They actually trained it to be used in an agentic framework, and they achieved that.
ForsookComparison@reddit
Rectangles on a jpeg do not match the vibe check