Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro?
Posted by True_Requirement_891@reddit | LocalLLaMA | 30 comments
Literally no API inference provider is hosting the MiMo-2.5 series models from Xiaomi. They seem to be really good.
High token efficiency and a very low hallucination rate compared to Kimi-K2.6, DeepSeek-V4, or GLM-5.1, and yet no provider, not even Chutes, is hosting it other than Xiaomi themselves.
I find it very strange.
Hodler-mane@reddit
also wondering this. opencode go is the only place I've seen it
natermer@reddit
It is my understanding that opencode (and others like openrouter) don't actually host the models. They just provide a proxy that passes your requests on to other hosts. The idea is that you can use one account to get access to a bunch of different models.
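A minimal sketch of that router pattern, using the OpenAI-compatible endpoint that routers like openrouter expose; the base URL is openrouter's, but the model id and key below are placeholders:

```python
# Sketch of the one-account/many-models proxy pattern described above.
# The router forwards each request to whichever upstream provider
# actually hosts the requested model.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # a router, not a host
    api_key="sk-...",                         # your router key
)

resp = client.chat.completions.create(
    model="xiaomi/mimo-v2.5",  # hypothetical model id on the router
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```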
Hodler-mane@reddit
you're probably right, which imo makes it better, since the original provider isn't serving quants.
Karyo_Ten@reddit
There is suspicion that Anthropic didn't get the memo and was A/B testing if they could quant Opus and see if people complained
Hodler-mane@reddit
yeah and it was REALLY noticeable when they did. people can talk a good game about Q4 all they want, but in agentic coding it fucks up so much that it's not worth doing.
Christosconst@reddit
Here you go https://opencode.ai/go
AdIllustrious436@reddit
What leads you to believe that it's not the official provider? Opencode only covers the API fees; they don't host any models afaik.
Enough_Big4191@reddit
could be less about quality and more about ops pain. providers care about stability, licensing clarity, and how well a model behaves under load, not just benchmarks. if it has quirks with tool use, memory, or inconsistent outputs, that shows up fast at scale even if single runs look great.
Digger412@reddit
It doesn't run correctly out of the box on plain transformers, vLLM, sglang, or llama.cpp.
While it is a good model, they've left it up to the OSS community to figure out how to support it. If you want to follow along, here are a couple of things to keep an eye on:
sglang: https://hub.docker.com/r/lukealonso/sglang-cuda13-b12x (Luke's been pivotal to moving OSS support of this model forward)
llama.cpp: https://github.com/ggml-org/llama.cpp/pull/22493 (my PR, still WIP but runs. I'll need to redo it later today to support the fused QKV)
Personally, supporting it in llama.cpp has been tricky because the HF transformers reference implementation doesn't run without dequanting the FP8 safetensors to BF16 first. MiMo has a weird tensor-parallel packed format for the weights, which took time to figure out because the ordering, padding, and other details are very nonstandard. I just got image support working in another branch last night; it's implemented strangely too.
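For anyone curious what that dequant step looks like, here's a minimal sketch; the `_scale_inv` tensor naming and per-tensor scales are assumptions about a typical FP8 checkpoint layout, not MiMo's actual packed format:

```python
# Sketch: dequantize an FP8 safetensors shard to BF16 so the reference
# transformers implementation can load it. Scale naming/granularity are
# assumed; real checkpoints may use block-wise scales that need expanding.
import torch
from safetensors.torch import load_file, save_file

def dequant_fp8_to_bf16(in_path: str, out_path: str) -> None:
    tensors = load_file(in_path)
    out = {}
    for name, t in tensors.items():
        if name.endswith("_scale_inv"):
            continue  # folded into its matching weight below
        if t.dtype == torch.float8_e4m3fn:
            # assumed: per-tensor inverse scale stored next to the weight
            scale = tensors.get(name + "_scale_inv")
            w = t.to(torch.bfloat16)
            if scale is not None:
                w = w * scale.to(torch.bfloat16)
            out[name] = w
        else:
            out[name] = t
    save_file(out, out_path)
```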
Overall it's just been a very rough launch for the model. We're working on it.
thereisonlythedance@reddit
Thank you for your hard work. I've been keeping tabs on your PR. Looking forward to it being merged (if I don't install it from the branch before then).
Digger412@reddit
Thanks! Hoping to get this merged in the next day or two, there's some flash attention work still needed to speed it up, and the vision PR will be afterwards. Hopefully it's all in by the end of the week!
segmond@reddit
off topic, but what kind of performance are you seeing with those 8x6000 for Kimi, MiMo, Qwen-397B, and MiniMax?
Digger412@reddit
I've done a sweep bench for K2.6 Q4_X and MiMo-V2.5; I need to redo it for MiniMax, and I still have the Qwen-397B too. I'll re-sweep and bench them later tonight and post some numbers.
segmond@reddit
... also why didn't you just go for the DGX Station? you were approaching the price point.
Digger412@reddit
Because that's only ~252GB of VRAM. The station has 768GB of "unified" memory, but the rest of it is 496GB of LPDDR5X. I do a fair bit of work outside of direct LLM usage, e.g. PyTorch and other weird research workloads, and I like having a homogeneous setup for those use cases.
segmond@reddit
I just want to mention that I have been running it since AesSedai pushed up his PR and weights, and it's a very solid model. The implementation is great; I'm getting 12.8 tk/sec at Q8. I'm just running my first pass on the vision model and it's looking great: I have it replicating a screenshot of a Japanese desktop application in HTML. I'm seeing HTML now, so I'll see what it looks like in about 5 minutes.
On another note, it seems the architectures of these models are getting more complicated and the OSS ecosystem is falling a bit behind in implementing them. I'm seeing more people vibe-coding support for these models and getting somewhat-working implementations, even though the code is a mess.
Digger412@reddit
Thanks, yeah, arches are always advancing, and having easier access to advanced LLMs is both a boon and a curse. I've used LLMs to help with the MiMo-V2.5 implementation, and I have to go through most lines of generated code with a fine-tooth comb: fix up style, undo some stupid decisions, and overall rewrite at least half of the code to get it into a shape that's worth being reviewed by a maintainer. People without as much dev experience aren't going to have that same knowledge, and it shows in the number of PRs that get closed outright because they're sloppy, unmaintainable code.
True_Requirement_891@reddit (OP)
Thanks for the hard work, man!
Digger412@reddit
<3
No_Conversation9561@reddit
https://huggingface.co/models?other=base_model:quantized:XiaomiMiMo/MiMo-V2.5
https://huggingface.co/models?other=base_model:quantized:XiaomiMiMo/MiMo-V2.5-Pro
Keep an eye on these pages.
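If you'd rather poll than refresh, here's a small sketch using huggingface_hub; the tag filter mirrors the URLs above and is an assumption about how the Hub tags quantized derivatives:

```python
# List quantized derivatives of MiMo-V2.5 on the Hub (assumed tag filter).
from huggingface_hub import list_models

for m in list_models(filter="base_model:quantized:XiaomiMiMo/MiMo-V2.5"):
    print(m.id)
```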
Kodix@reddit
No clue why, but I'll just second that there *is* demand for this. Using Opencode Go, Mimo 2.5 and 2.5 Pro are *by far* the most reliable go-to models for me, the ones I can actually be relatively certain will do a genuinely good job on their tasks.
eli_pizza@reddit
Opencode go just proxies to Xiaomi
look@reddit
I think the timing with the DeepSeek V4 release screwed it over.
Millions of deluded people are flocking to a profoundly “meh” DS V4 Pro because of the brand name, and it has sucked up all spare GPU capacity to enable its mediocre, hallucination-ridden token generation.
I just dropped my Ollama Cloud service to pay for the extra Mimo 2.5 Pro tokens I need.
My guess is that in a week or two, conventional wisdom will catch up, DS V4 Pro will go out of fashion, and everyone will be raving about how Xiaomi came out of nowhere with the amazing Mimo 2.5.
FullOf_Bad_Ideas@reddit
Deepseek has a 99.75% discount for prefix-cache hits on their API for V4 Pro. As long as others don't beat that, they'll honestly be king, because it's unmatched. Xiaomi only offers an 80% discount there, so in real agentic coding workflows the cost difference will be massive.
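Back-of-the-envelope on why that discount dominates, with a deliberately made-up base price and cache-hit ratio:

```python
# Illustrative cost of one agentic session; the price and cache ratio
# are assumptions, not the providers' real rates.
BASE = 1.00  # assumed $ per million input tokens

def session_cost(cached_m, fresh_m, cache_discount):
    """Cost when cached_m million input tokens hit the prefix cache."""
    return fresh_m * BASE + cached_m * BASE * (1 - cache_discount)

cached, fresh = 9.0, 1.0  # assume 90% of input tokens are cached prefix
print(session_cost(cached, fresh, 0.9975))  # ~$1.02 (99.75% discount)
print(session_cost(cached, fresh, 0.80))    # ~$2.80 (80% discount)
```

At the same base price, the deeper cache discount is worth roughly 2.7x on a cache-heavy session.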
Few_Painter_5588@reddit
Most providers barely have capacity to spare, and these trillion-parameter models are awkward to serve. One 8x H200 node has about 1.1 terabytes of VRAM, so you can either serve 1 instance of Mimo-V2.5-Pro across 2 nodes or 2 instances of GLM-5.1 on those same 2 nodes. For most providers, it's more economical to serve the latter.
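The rough math behind that, with illustrative parameter counts and overheads (not the models' real specs):

```python
# Why a ~1T-param model eats two nodes while a smaller one fits on one.
NODE_VRAM_GB = 8 * 141  # one 8x H200 node ~= 1128 GB

def nodes_needed(params_b, bytes_per_param=1.0, overhead=1.3):
    # FP8 weights (~1 byte/param) plus ~30% headroom for KV cache etc.
    need_gb = params_b * bytes_per_param * overhead
    return int(-(-need_gb // NODE_VRAM_GB))  # ceiling division

print(nodes_needed(1000))  # ~1T params   -> 2 nodes per instance
print(nodes_needed(400))   # ~400B params -> 1 node per instance
```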
po_stulate@reddit
But it's not like they need to charge the same price for all models. It makes sense to charge more for larger models.
pfn0@reddit
the model has been a complete pain in the ass to run.
Legal-Ad-3901@reddit
Finally got flash attention working today without tanking my speed. That local 500k ctx 🤌
Cool-Chemical-5629@reddit
Mimo 2.5 Pro IS very good. Very thorough, possibly even the best among open weights when it comes to solving complex tasks in one shot. This impression comes from my own testing on the arena. The prompt I gave it was VERY complex: I basically gave it a very detailed plan for creating a whole 3D game and asked it to build it. Naturally there were MANY features it had to come up with and stitch together, and the result was surprisingly good, if not the best out of the many results given by open-weight models. It was probably the most complete and complex result I'd ever seen to that day. It wasn't completely working out of the box, but it wasn't completely broken either: many features worked, done surprisingly well, with a complex UI and interaction with the 3D world, and it didn't need much fixing. For a single shot? That's probably the best you can get right now.
coder543@reddit
It’s only been a week. If Xiaomi didn’t partner with anyone else to give them access before launch, as they clearly didn’t, then it takes time. Mimo is also not a household name like DeepSeek, so I doubt any of the inference providers are pulling all-nighters to make this happen.