"Weights are coming". Xiaomi’s MiMo V2.5 Pro has landed at 54 in the Artificial Analysis Intelligence Index.
Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 70 comments
From:
- Xiaomi MiMo on 𝕏: https://x.com/XiaomiMiMo/status/2047840164777726076
- Artificial Analysis 𝕏: https://x.com/ArtificialAnlys/status/2047799218828665093
lendo93@reddit
MiMo v2.5 Pro is the strongest Chinese model we've tested in our little-known, comprehensive coding reasoning benchmark. I'm surprised it has gotten so little attention. In coding reasoning, agentic work, and decision making, it averages higher than Opus 4.6. Benchmarks at https://gertlabs.com
rm-rf-rm@reddit
better than Opus 4.6? But your benchmark is showing 4.7 significantly better than 4.6...
lendo93@reddit
Opus 4.7 is significantly smarter than Opus 4.6. So is GPT 5.4. But I still use Opus 4.6 for writing code because it's a better product.
rm-rf-rm@reddit
No one who has actually used both regularly will say this
lendo93@reddit
The reality is that usefulness != raw intelligence. And people are no longer smart enough to tell the difference with frontier models, which is why we need strong benchmarks.
Mushoz@reddit
How is deepseek V4 pro so much worse than deepseek V4 flash? That seems wrong.
lendo93@reddit
Deepseek Pro is extremely nerfed by provider 429s right now. We don't count those against the model's score, only the requests that went through, but we do factor time into intelligence measurements, and that may be skewed by their infra problems. It also makes it hard to get large sample counts for our longer agentic tests where the model needs to make ~100 successful calls. From the responses that did come through, TBH it seems like it's just not any better than Flash, and probably a little more benchmaxxed. Flash is awesome.
NandaVegg@reddit
Your agentic coding leaderboard reflects my own personal experience pretty well. The decision-making leaderboard is weird and sus though. Maybe it's too dependent on how each model reacts to the written instructions for each game, which are likely significantly undertrained and underbaked into the model compared to coding.
I played with DS V4 for a while and it is a peculiar model. It behaves nothing like the other recent OSS releases, which all behave similarly to each other (K2.6, GLM-5.1, MiMo 2.5 Pro), be it good or bad. It's not exactly good at long-context coherence and fuzzy logic, but it does much less of the mini-CoT-type "it is not X, it maybe Y" prose that stems from standard RL curriculums. Its reasoning chain is very unique and R1-like; clearly it was post-trained very differently from the others. I won't be surprised if DS V4 shows a large standard deviation on some (non-STEM-type) benchmark like that.
lendo93@reddit
DS V4 is interesting; we like Flash a lot, but there's a bias toward provider reliability because we are not self hosting models of that size.
The decision leaderboard has some anomalies. There are some interesting exceptions, like Grok 4 being pretty good at decision making/prose but not frontier at coding or tool use, and there are also some sample-size issues. For decision making, a model needs to be called for virtually every turn of every game, which is quite slow and expensive, so it takes a while for samples to roll in (a couple of weeks to get reasonable results after a model's API release, whereas code results are same-day). But we weight its contribution to each model's Combined score based on a minimum match count.
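A minimum-match-count weighting like the one described could be sketched roughly like this. The function name, the 50-match threshold, and the linear ramp are all assumptions for illustration, not gertlabs's actual formula.

```python
# Hypothetical sketch: ramp the decision-making category's weight in the
# Combined score up with sample count, so a model with only a handful of
# games played is judged mostly on its (fast-to-collect) coding results.
def combined_score(coding: float, decision: float,
                   matches: int, min_matches: int = 50) -> float:
    w = min(matches / min_matches, 1.0)  # decision counts fully only past the threshold
    return (coding + w * decision) / (1.0 + w)

# With few matches, the combined score leans almost entirely on coding:
print(round(combined_score(60.0, 40.0, matches=5), 2))    # 58.18
# Once enough matches are in, both categories count equally:
print(round(combined_score(60.0, 40.0, matches=200), 2))  # 50.0
```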
Strong-Strike2001@reddit
In your benchmark Kimi 4.6 averages higher than Mimo 2.5....
lendo93@reddit
The weights are continuously updated as games are played, but in agentic coding, MiMo V2.5 Pro has a pretty comfortable lead https://gertlabs.com/?mode=agentic_coding
juraj336@reddit
Out of all of these.... Qwen 3.6 35B is what impresses me the most, #13 for a 35B is insane. Can you share a bit more about what parameters / backend was used? On your website I only see context
lendo93@reddit
We always use highest available thinking/quantization via OpenRouter. Qwen 3.6 is a really strong model. It breaks down a bit under larger context agentic work, but that's true of all smaller models in my experience. Context is what trillion+ parameter counts are still buying the frontier labs.
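For reference, "highest available thinking via OpenRouter" typically means setting the `reasoning` field on an otherwise OpenAI-compatible chat-completion request. Here's a sketch of what that payload might look like; the model slug `xiaomi/mimo-v2.5-pro` is a guess (check OpenRouter's catalog), and this only builds the request body rather than sending it.

```python
# Sketch of an OpenRouter-style chat-completion payload with reasoning
# effort maxed out, roughly how a benchmark harness might configure it.
import json

def build_request(model: str, prompt: str) -> dict:
    return {
        "model": model,  # hypothetical slug; verify against OpenRouter's model list
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": {"effort": "high"},  # ask for the maximum thinking budget
    }

payload = build_request("xiaomi/mimo-v2.5-pro", "Write a binary search in Go.")
print(json.dumps(payload, indent=2))
```

Quantization, by contrast, is chosen by the upstream provider, so "highest available" there mostly comes down to which providers OpenRouter routes to.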
LoveMind_AI@reddit
Man, if this is open with a permissive license, I genuinely don't think there is a cooler LLM out there. Certainly just in terms of the command of language and writing ability - MiMo-V2.5-Pro is on top, and not just "on top for a Chinese model." I've been pushing it hard and it wins over K2.6 / Opus / Sonnet without much of a problem. I've been very impressed with K2.6 as an agent and haven't put MiMo through the paces there nearly as much, but in terms of writing and vibe, it's an absolute thunderbolt.
Chemical_Broccoli_62@reddit
I agree, it finishes the job effortlessly without overthinking like other SOTA models. Feels like the Gemini vibe, but better.
marutthemighty@reddit
What makes Chinese AI models so damn good? Is it the algorithm? Is it the sheer amount of data they train on? Is it the speed? Is it the cost? What are the factors?
UpAndDownArrows@reddit
It's not that they are so damn good, it's that their Western counterparts are not releasing anything worth anything to the public, i.e. Western AI companies are just too damn greedy and profit-minded.
Where is Anthropic's 100B+ sized model available to download, even a single one? OpenAI's ? Google, Amazon, Microsoft, Meta? All just greedily hoarding shit.
ZeusCorleone@reddit
The distillation of already ready US models lol
phein4242@reddit
Which have been built by scraping the whole internet, often causing downtime because of aggressive crawling.
It's just fair that public data stays public, distilled or not.
LoveMind_AI@reddit
I think it's their country's approach to education, frankly, along with the seriousness with which they take the race, and the freedom they've given themselves to experiment. It doesn't hurt that they're also absolutely distilling data from western frontier models, but I don't think it's purely that. Most of the heavy hitters don't have the same financial incentives or pressures as the western labs.
Persistent_Dry_Cough@reddit
I don't know if "freedom" is something I'd equate with 996 SWE.
marutthemighty@reddit
I see. No wonder they are so successful.
Monkey_1505@reddit
Good software engineers.
DependentBat5432@reddit
I’ve been testing MiMo v2.5 Pro for days, and my honest take is that it’s a very respectable release. For the price and speed of iteration, it’s impressive. Not switching from Claude yet, but it’s now my rec for people who want a strong domestic alternative.
IllllIIlIllIllllIIIl@reddit
It has really really good reasoning around language and it's the first model I've encountered (open or closed) that has actually impressed me with its ability to write.
I've been prototyping an agentic pipeline that takes someone's streaming music "liked songs" playlist, pulls metadata on them from MusicBrainz, does some summary statistics, researches artists and albums, scrapes lyrics, and then throws it all into an analysis stage to produce a speculative profile on that person. You know, kind of a "what your music says about you" type of thing. MiMo-V2.5-Pro is the only model I've tried thus far that can both make genuinely insightful inferences based on that data, and write up the final profile as a cohesive, flowing document.
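The summary-statistics stage of a pipeline like that could look something like the sketch below. All names are hypothetical and the MusicBrainz lookup is stubbed out with local data; it just shows the shape of aggregating per-track metadata before handing it to the analysis stage.

```python
# Rough sketch of the "summary statistics" stage: fold per-track metadata
# (genres, release years) into a compact profile for the analysis model.
from collections import Counter

def summarize(tracks: list[dict]) -> dict:
    """Aggregate liked-song metadata into inputs for the profiling stage."""
    genres = Counter(g for t in tracks for g in t.get("genres", []))
    decades = Counter((t["year"] // 10) * 10 for t in tracks if "year" in t)
    return {
        "track_count": len(tracks),
        "top_genres": [g for g, _ in genres.most_common(3)],
        "dominant_decade": decades.most_common(1)[0][0] if decades else None,
    }

# Stand-in for what a MusicBrainz metadata fetch might return:
liked = [
    {"title": "Song A", "year": 1994, "genres": ["shoegaze", "indie"]},
    {"title": "Song B", "year": 1991, "genres": ["shoegaze"]},
    {"title": "Song C", "year": 2012, "genres": ["electronic"]},
]
print(summarize(liked)["dominant_decade"])  # 1990
```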
LoveMind_AI@reddit
first of all: that's a very cool idea and I'd love to talk, because we're working on that kind of creative "getting to know you" task and I'd be so curious to pick your brain!
but yes - you've nailed what I love about MiMo-V2.5-Pro - it creates profiles that are intuitive, insightful, and actually fun to read. It is significantly better at this than virtually every model I've tried, and even marginally better than Anthropic's current crop of models, which is impressive.
IllllIIlIllIllllIIIl@reddit
Thanks! Feel free to shoot me a message or something, though I don't know how much I can offer. One thing I'll say is that in my testing, I've found it really hard to find a sweet spot between "safe and a bit boring" profiles versus "creepy and violating" profiles. Seriously, depending on the person's playlist, it can sometimes make accurate inferences that can feel genuinely upsetting to hear. I told a friend about it and she demanded I run it on her playlist (without looking at the result) and she came back and told me "Uh, yeah... nobody is going to want to hear those kinds of things about themselves, even if they're accurate... and it was very accurate."
WolfeheartGames@reddit
It's the best Opus distillation there is, but it's so heavily distilled it has some issues with difficult work.
Kodix@reddit
MiMo v2 was on an MIT license, so this very likely also will be. And that's *so* fucking cool.
Technical-Earth-3254@reddit
I don't know about writing, but V2 Pro and Omni were also really good at coding. Like, they were able to solve problems I usually throw at Codex 5.3 high. I have no doubt that V2.5 is at least that good as well. And it being oss and having vision makes it borderline perfect.
BriguePalhaco@reddit
DeepSeek effect?
LegacyRemaster@reddit
size?
Ok-Hotel-8551@reddit
🥰Qwen, Kimi, Mimo > Sonnet 🤮
OGMYT@reddit
Impressive jump into the top 100 on Artificial Analysis, especially for a mobile-optimized model. V2.5 Pro seems to balance performance and efficiency well. Curious how much of the gain comes from architecture tweaks vs. data quality—would love to see the ablation studies. Real-world inference speed on consumer devices will be key for local deployment. If they open-source weights, it could be a solid option for edge use cases. https://conduit.arewefriends.org/
hellomistershifty@reddit
ohhhh a score of 54, I thought the title meant 54th place lmao
True_Requirement_891@reddit
They are being very lazy with OS...
Persistent_Dry_Cough@reddit
Hopefully they vibecode a backdoor into the next bootloader so we can break it.
Chinmay101202@reddit
hype...
jacek2023@reddit
Unusable locally
NoFaithlessness951@reddit
Some of us live in a data center.
jacek2023@reddit
Some of us just hype benchmarks without running anything locally
Mickenfox@reddit
I just want third party providers.
seamonn@reddit
Some of us are waiting for the AI bubble to pop so eBay is flooded with cheap GPUs
dangered@reddit
Hardware isn’t going anywhere when the bubble pops. GPUs and the raw materials are bought out until 2028 or 2030, anything that hits the market is going straight toward some company with seed money (likely one with a household name).
sn2006gy@reddit
Yeah, I don't think people realize that we're at the beginning of the economic "compression" phase, where friction from how things used to work is still front and center. But in another 3 months or so we'll see-saw right over and shit will move fast, and I don't think compute will get cheaper unless there are massive breakthroughs in a new type of compute. AI could fail tomorrow and the demand for what it can do today would survive, and would only increase demand on the hardware to run it through the ashes. Not many people realize this, which is kinda funny, because "I want the AI bubble to pop so I can run local AI" is a hot take if you actually think about what you're saying. (Now, if you don't care about AI and just want affordable video gaming back, then it's a fine take.)
dangered@reddit
100% correct. Idk about affordable gaming though; hardware-wise it's still going to be the same. Software-wise, game devs need to actually start trying, they deserve this. We haven't had efficient games in well over a decade because they've been relying on hardware that doesn't exist yet instead of optimizing.
I can legitimately run doom on a potato wired to a lemon but I can’t run a modern game without overclocking a chip that is capable of outsmarting half of the current human population.
jacek2023@reddit
Yes some people never do things, they always wait/prepare to do things
ParthProLegend@reddit
Some of us exist in unstable diffusion sub.
ThePixelHunter@reddit
Good news. 1T is too large for most to run locally, but open-weights will bring more providers.
LegacyRemaster@reddit
Open? Flash or pro?
lemon07r@reddit
I hate how meaningless this benchmark has become though.. some of these "top" models genuinely suck. Cough Gemini, cough opus 4.7.
NairbHna@reddit
Different use cases. Not everyone is coding
winterscherries@reddit
Gemini specifically is fairly underrated. In my experience it does remarkably well in some less common tasks that require math intuition, better than GPT/Claude. As for other open models, many hallucinate (MiMo, Deepseek), think for the next 30 minutes (Kimi), or flatly disregard instructions (HY). GLM 5.1 does much better than the rest, which surprised me given that I always considered it a mostly-coding model.
snugglezone@reddit
Claude wasted half a day helping me find an actuator setup for a robotics build. Went to Gemini after hours of banging my head against Claude, and BOOM, it instantly found the perfect option. Everyone really does need a MAGI system with 3 LLMs competing to provide the best answer.
pier4r@reddit
So many are fixated on coding being the only use case, it's crazy. It's been like this since Claude 3.5.
DepressedDrift@reddit
I have been noticing the term "open weights" instead of "open source" lately for a lot of the new models.
Unusual_Guidance2095@reddit
I might be confusing this with something else, but didn't they promise to release the V2 Pro and Omni models open source months ago, and still haven't done so?
Middle_Bullfrog_6173@reddit
Indeed, it was "when they are stable", which could charitably be taken to mean they needed the .5. But we'll see.
arm2armreddit@reddit
gguf when?
Chinmay101202@reddit
Beautiful. But the pure hallucinations graph (where the model should say idk) is so scary, Opus 4.7 especially.
jzn21@reddit
I am really a big fan of the MiMo V2 Pro, but with V2.5, there is something wrong on OpenRouter since it performs worse than V2 and the V2.5 flash version. The flash version is extremely good. I am very thrilled with the open weights, can't wait
mr_Owner@reddit
Can I just say, I felt clickbaited after I read "1T params" in the comments. Not so local for 99% of ppl I guess.
Impressive_Chain6039@reddit
Mimo V2 was good on bench but bad on my setup. Too safe. We will see
segmond@reddit
yup, v2 flash was garbage. I gave it quite a try at Q8. It lasted 3 days before I had to delete it.
fin_r@reddit
I thought the V2 was still closed weights?
Technical-Earth-3254@reddit
Flash was open weights, Pro and Omni were closed. But I agree with the comment, V2 Flash really wasn't good. In coding, even Step 3.5 Flash (which is way smaller) was better imo.
z_3454_pfk@reddit
it's actually surprisingly really good. world knowledge is a bit worse than kimi but obviously way behind any closed frontier model (even gemini flash) since it's only 1t params and likely pre-training data isn't as STEM focused as the closed companies.
best thing is that it barely hallucinates... way less than any other frontier lab. idk how they did it.
FullOf_Bad_Ideas@reddit
How big is it?
rusty_fans@reddit
From the artificial analysis tweet:
Additional model details:
➤ Context window: 1M tokens
➤ Parameters: 1T total, 42B active
➤ License: Xiaomi has publicly announced that weights are to be released soon. The model will show on Artificial Analysis as ‘proprietary’ until the weights are released
➤ Release date: April 22, 2026