MiMo 2.5 requires at least 4 GPUs? Am I reading this right?
Posted by Pyrenaeda@reddit | LocalLLaMA | View on Reddit | 3 comments
Was trying to stand up a quant of MiMo 2.5 on a 2-node Spark cluster tonight. While reading through the SGLang cookbook for it, https://docs.sglang.io/cookbook/autoregressive/Xiaomi/MiMo-V2.5, I found this:
The checkpoint has a TP=4-interleaved fused qkv_proj; attention-TP per DP group must be 4. Use --dp = TP / 4; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare --tp 8 without --dp 2 will fail to load with "MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8".
... If I'm reading this right, it doesn't matter how much VRAM / compute you have available: you must have GPUs in multiples of 4 to run it. Anything less than 4 and it just won't run; the model is essentially hard-coded to require 4/8/12/etc. GPUs.
But surely I've missed something here. That can't be right... can it? ... can it?
If so, that's a real shame. A lot of people who otherwise have more than enough resources to run it at 4-bit will be locked out by the 4-GPU requirement.
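For reference, here's how I read the cookbook's constraint in terms of actual launch commands. This is just a sketch of my interpretation, not an official recipe: the --tp and --dp flags are straight from the quote above, --enable-dp-attention is my guess at how SGLang spells the DP-attention requirement, and the model path is a placeholder for whatever local copy/quant you're serving.

    # 4 GPUs: plain TP=4 matches the checkpoint's TP=4-interleaved qkv_proj
    python -m sglang.launch_server --model-path <local-MiMo-2.5-path> --tp 4

    # 8 GPUs: attention TP per DP group has to stay at 4, so dp = tp / 4
    python -m sglang.launch_server --model-path <local-MiMo-2.5-path> --tp 8 --dp 2 --enable-dp-attention

    # 8 GPUs with a bare --tp 8: refuses to load with
    # "MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8"
    python -m sglang.launch_server --model-path <local-MiMo-2.5-path> --tp 8

Which is exactly the problem on my setup: with 2 GPUs there's no way to land on a multiple of 4.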
AFruitShopOwner@reddit
Yes, it's baked into the release. I know lukealonso's nvfp4 quant fixed this problem; you can definitely run his version on 2 RTX Pro 6000s.
Also, to quote him: "also:
1) They're missing some weights, one of the vision layers is missing biases
2) The model index is garbage and points to nonexistent files
3) They organize things in a heavily EP-favored way
4) They publish full size attention projection tensors that are silently organized all wrong unless you assume a specific set of kernels and an exact TP arrangement, with no indication that this is the case
5) There's bizarre nonstandard padding on some of the tensors"
DinoAmino@reddit
The model itself doesn't require power-of-2 GPU counts; that constraint is how sglang and vllm work. If you need ultimate GPU/CPU flexibility, use a Q4 GGUF with llama.cpp.
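Something along these lines, roughly (a sketch, not something I've run against MiMo 2.5 specifically — the GGUF filename is a placeholder for whatever quant actually gets uploaded):

    # llama.cpp doesn't care how many GPUs you have; it just offloads layers.
    # -ngl sets how many layers go to GPU; 99 ~= all of them, lower it to keep some on CPU.
    ./llama-server -m MiMo-2.5-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080

    # With multiple mismatched GPUs you can also control the split by proportion, e.g.
    ./llama-server -m MiMo-2.5-Q4_K_M.gguf -ngl 99 --tensor-split 60,40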
Pyrenaeda@reddit (OP)
Good tip, I had not considered that it might be different for llama.cpp. Thanks, I will look into that.