MiMo 2.5 requires at least 4 GPUs? Am I reading this right?
Posted by Pyrenaeda@reddit | LocalLLaMA | View on Reddit | 3 comments
Was trying to stand up a quant of MiMo 2.5 on a 2-node Spark cluster tonight. While reading through the SGLang cookbook for it, https://docs.sglang.io/cookbook/autoregressive/Xiaomi/MiMo-V2.5, I found this:
The checkpoint has a TP=4-interleaved fused qkv_proj; attention-TP per DP group must be 4. Use --dp = TP / 4; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare --tp 8 without --dp 2 will fail to load with "MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8".
... If I'm reading this right, it doesn't matter how much VRAM / compute you have available: you must have GPUs in multiples of 4 to run it. Anything less than 4 and it just won't run; the model is essentially hard-coded to require 4/8/12/etc. GPUs.
But surely I've missed something here. That can't be right... can it? ... can it?
If so, that's a real shame. A lot of people who otherwise have more than enough resources to run it at 4-bit will be locked out by the 4-GPU requirement.
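For reference, here's how I read the cookbook's constraint in terms of actual launch commands. This is just a sketch of my interpretation, not an official recipe: the --tp and --dp flags are straight from the quote above, --enable-dp-attention is my guess at how SGLang spells the DP-attention requirement, and the model path is a placeholder for whatever local copy/quant you're serving.

    # 4 GPUs: plain TP=4 matches the checkpoint's TP=4-interleaved qkv_proj
    python -m sglang.launch_server --model-path <local-MiMo-2.5-path> --tp 4

    # 8 GPUs: attention TP per DP group has to stay at 4, so dp = tp / 4
    python -m sglang.launch_server --model-path <local-MiMo-2.5-path> --tp 8 --dp 2 --enable-dp-attention

    # 8 GPUs with a bare --tp 8: refuses to load with
    # "MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8"
    python -m sglang.launch_server --model-path <local-MiMo-2.5-path> --tp 8

Which is exactly the problem on my setup: with 2 GPUs there's no way to land on a multiple of 4.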
AFruitShopOwner@reddit
Yes, it's baked into the release. I know lukealonso's nvfp4 quant fixed this problem; you can definitely run his version on 2 RTX Pro 6000s.
Also, to quote him: "also:
1) They're missing some weights, one of the vision layers is missing biases
2) The model index is garbage and points to nonexistent files
3) They organize things in a heavily EP-favored way
4) They publish full size attention projection tensors that are silently organized all wrong unless you assume a specific set of kernels and an exact TP arrangement, with no indication that this is the case
5) There's bizarre nonstandard padding on some of the tensors"
DinoAmino@reddit
The model itself doesn't require power-of-2 GPU counts; that constraint is how sglang and vllm work. If you need ultimate GPU/CPU flexibility, use a Q4 GGUF with llama.cpp.
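Something along these lines, roughly (a sketch, not something I've run against MiMo 2.5 specifically — the GGUF filename is a placeholder for whatever quant actually gets uploaded):

    # llama.cpp doesn't care how many GPUs you have; it just offloads layers.
    # -ngl sets how many layers go to GPU; 99 ~= all of them, lower it to keep some on CPU.
    ./llama-server -m MiMo-2.5-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080

    # With multiple mismatched GPUs you can also control the split by proportion, e.g.
    ./llama-server -m MiMo-2.5-Q4_K_M.gguf -ngl 99 --tensor-split 60,40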
Pyrenaeda@reddit (OP)
Good tip, I had not considered that it might be different for llama.cpp. Thanks, I will look into that.