Xiaomi MiMo-v2.5 Pro (MIT license) surpasses Opus 4.5 on Arena
Posted by Terminator857@reddit | LocalLLaMA | 21 comments
Many asked when we would have an open-weight model that is better than Opus. Well, now we have it. MiMo is ranked #9 and Opus 4.5 is ranked #10.
LoveMind_AI@reddit
So I just gotta say... a couple days into using this now and I am just blown away. It's absolutely just as good as Opus. Zero tears shed for Claude turning into a beep-boop automaton.
UnbeliebteMeinung@reddit
Now we just need two or four GB200s
someone383726@reddit
Just waiting on my allowance from my mom for making my bed this week then I will make the purchase!
UnbeliebteMeinung@reddit
I hope you have enough good boi points
Ok-Contest-5856@reddit
Wasn’t GLM 5.1 ahead of Opus 4.5 for a while, before they updated the leaderboard and it dropped significantly? Anyone know what happened?
-dysangel-@reddit
It looks like 5.1 still matches Opus 4.5 on average, but with higher variability. And Mimo v2.5 is 1 point ahead of both, but with even higher variability. So Opus 4.5 and GLM 5.1 are more consistent.
XTCaddict@reddit
Aren’t these rated by humans? If so, I don’t know if "consistent" is the correct term, because Mimo had its weights dropped, which would have led to more people reviewing it
-dysangel-@reddit
the tests are blind - people don't know which model is which. That model is new, so it has fewer ratings
XTCaddict@reddit
Ahhh ok fair enough
9gxa05s8fa8sh@reddit
what happened is human psychology. people's brains see expensive things as better. when you do blinded tests like in science or llmarena, you see the truth
Terminator857@reddit (OP)
Good point. With more votes it will likely drop below Opus, since that has been the trend.
SmartCustard9944@reddit
One order of magnitude fewer votes. It is too early.
Feisty-Patient-7566@reddit
Is Opus 4.5 still being rigorously tested? I'd say the expectations for a model are higher now than when Opus 4.5 was released, so Mimo's high score makes it look better.
andy_potato@reddit
Yet another benchmaxxed model
Terminator857@reddit (OP)
How does one benchmax arena coding?
NandaVegg@reddit
To be fair, Llama 4 benchmaxxed the arena (generic chat) by specifically setting up a variant that spams emojis.
Worried-Squirrel2023@reddit
order of magnitude fewer votes is the part everyone is glossing over. ranked #9 with thin data is not the same as ranked #9 with conviction. give it a week before celebrating.
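To put rough numbers on the "thin data" point, here is a minimal sketch. It assumes a plain head-to-head win rate with a normal-approximation interval, which is a simplification and not the leaderboard's actual rating method. The function name and the vote counts are hypothetical and chosen only for illustration.

```python
import math


def win_rate_interval(wins: int, votes: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a win rate (normal approximation).

    Hypothetical helper for illustration; not how the arena computes scores.
    """
    if votes <= 0:
        raise ValueError("votes must be positive")
    p = wins / votes
    half_width = z * math.sqrt(p * (1 - p) / votes)
    return p - half_width, p + half_width


# Made-up vote counts: same 52% win rate, one order of magnitude apart.
few_lo, few_hi = win_rate_interval(wins=260, votes=500)
many_lo, many_hi = win_rate_interval(wins=2600, votes=5000)

print(f"500 votes:  {few_lo:.3f} to {few_hi:.3f}")
print(f"5000 votes: {many_lo:.3f} to {many_hi:.3f}")
```

With the same observed win rate, the interval from the smaller sample is about sqrt(10) times wider and still includes 50%, while the larger sample's interval does not. That is the sense in which a rank built on far fewer votes carries less conviction.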
segmond@reddit
If we are talking about Opus 4.5: GLM5.1, DeepSeekV4, and KimiK2.6 all beat it, and Step3.5 in capable hands beats it most of the time too.
someRandomGeek98@reddit
GLM 5.1? Kimi K2.6?
LoveMind_AI@reddit
Better or not, it's just incredibly, incredibly good.
thereisonlythedance@reddit
Looking forward to this being supported in llama.cpp