MiMo 2.5 requires at least 4 GPUs? Am I reading this right?

Posted by Pyrenaeda@reddit | LocalLLaMA

I was trying to stand up a quant of MiMo 2.5 on a 2-node Spark cluster tonight. Reading through the SGLang cookbook for it (https://docs.sglang.io/cookbook/autoregressive/Xiaomi/MiMo-V2.5), I found this:

> The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group must be 4. Use `--dp = TP / 4`; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8`.
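For anyone else hitting this, here's roughly what a valid 8-GPU launch would look like per my reading of that quote. This is a sketch, not tested: the `--tp`/`--dp` flags come straight from the cookbook text, but the model path shown here and the exact DP-attention flag (`--enable-dp-attention` in SGLang's server args, as I understand it) are my assumptions.

```shell
# Sketch of an 8-GPU launch per the quoted cookbook constraint:
# tp = total GPUs = 8, dp = tp / 4 = 2, DP-attention required for tp > 4.
# Model path is a placeholder assumption, not from the cookbook.
python -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2.5 \
  --tp 8 \
  --dp 2 \
  --enable-dp-attention
```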

... If I'm reading this right, it doesn't matter how much VRAM / compute you might have available: you must have GPUs in multiples of 4 to run it. Anything fewer than 4 and it just won't load; the model is essentially hard-coded to require 4/8/12/etc. GPUs.
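To spell out my reading of the constraint, here's a tiny hypothetical helper (the function name and return shape are mine, not from SGLang) that maps a GPU count to the `--tp`/`--dp` values the cookbook seems to mandate:

```python
# Hypothetical helper encoding my reading of the cookbook: attention-TP per
# DP group must be 4, so total GPUs = tp and dp = tp / 4, with the GPU count
# required to be a multiple of 4.
def mimo_launch_args(num_gpus: int) -> dict:
    """Return the --tp/--dp settings implied by the docs, or raise if impossible."""
    if num_gpus <= 0 or num_gpus % 4 != 0:
        raise ValueError(
            f"{num_gpus} GPUs won't work: total GPUs must be a multiple of 4"
        )
    return {"tp": num_gpus, "dp": num_gpus // 4}

print(mimo_launch_args(8))   # a bare --tp 8 without --dp 2 fails per the docs
```

So 1, 2, 3, 5, 6, 7... GPUs all fail, no matter how big they are.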

But surely I've missed something here. That can't be right... can it? ... can it?

If so, a real shame. A lot of people who might otherwise have more than sufficient resources to run it at 4-bit will be locked out because of the 4-GPU requirement.