Anyone running Mimo-v2.5 quants with multimodal and MTP?
Posted by Ambitious_Fold_2874@reddit | LocalLLaMA | 15 comments
Has anyone been able to run Q4 or Q5 of XiaomiMiMo/MiMo-V2.5 through llama.cpp with functioning multimodal capability as well as MTP? Only AesSedai's GGUF quants appear to include an mmproj, and it's unclear whether they preserve the MTP layers.
I have only 40GB of VRAM but 256GB of 4-channel DDR4 RAM, so I'm not expecting great inference speed, but I'm intrigued by the model's strength and multimodal capabilities and wanted to give it a go. It looks like MTP support in llama.cpp is still in a draft branch, so it seems I'll have to use that.
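For reference, this is roughly how I plan to launch it once I find a quant that has both (just a sketch; the filenames are placeholders and the -ot pattern assumes the usual MoE expert tensor naming):

```bash
# Hypothetical launch: everything on GPU except the MoE expert tensors,
# which get pinned to system RAM so 40GB of VRAM can hold the rest.
# Model/mmproj filenames are placeholders for whichever quant works.
./llama-server \
  -m MiMo-V2.5-Q4_K_M.gguf \
  --mmproj mmproj-MiMo-V2.5-f16.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 8192
```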
Karnemelk@reddit
A whopping 3.4 tokens/s with 128GB DDR4 on IQ3_S.
Then-Topic8766@reddit
I have a working IQ3_S version installed (works with vanilla llama.cpp). To give it a try, I ran it with the MTP fork I use for Qwen MTP. It throws an error:
MTP not supported for trunk architecture 'mimo2'
Regarding vision, I haven't downloaded the mmproj yet...
tnhnyc@reddit
Do you mind giving this a try?
https://github.com/tnhnyc/llama-mimo-mtp
Then-Topic8766@reddit
Just tried it and got this error:
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 508, got 505
tnhnyc@reddit
You need a GGUF that still has the MTP heads; they're usually stripped since they've been useless in llama.cpp. You can try this: https://huggingface.co/tnhnyzc/MiMO-V2.5-MTP-GGUF. It's quantized like AesSedai's IQ3_S GGUF, except the MTP heads are kept.
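By the way, if you want to check a file before committing to a long download, the gguf pip package ships a dump script that lists every tensor (the filename below is a placeholder; the 508/505 counts are from the loader error above):

```bash
pip install gguf
# Dump metadata plus the full tensor listing; compare the reported tensor
# count against what llama.cpp expects (508 with MTP heads, 505 stripped).
gguf-dump MiMo-V2.5-IQ3_S.gguf
```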
Then-Topic8766@reddit
Thanks. AesSedai says: "These quants include MTP tensors for when that gets added upstream eventually.", but it doesn't work. I guess I'll have to download again (on my slow ADSL...)
tnhnyc@reddit
Did you maybe download it before he updated them? I think his first version didn't include them but he later updated them to keep the mtp heads. Either way, if you're curious and don't mind the download, I'd appreciate you trying but otherwise it's also no problem.
Then-Topic8766@reddit
I downloaded it on 09.05.26, so after his post. I will try https://huggingface.co/tnhnyzc/MiMO-V2.5-MTP-GGUF but it will take time...
tnhnyc@reddit
So it works with his quant now, but it's not identical, and acceptance rates are low enough that MTP hurts performance rather than helping it. That shouldn't be the case with the one you're downloading, though.
Then-Topic8766@reddit
I tried the new quant. It doesn't work with master llama.cpp (llama_model_load: error loading model: missing tensor 'blk.48.layer_output_norm.weight'). It works with your fork, but with mixed success: without the MTP flag I got around 8 t/s generation; with the proper flag I got 12 t/s (50% faster). But I had that speed with AesSedai's quant without MTP (12 t/s).
intaketurbine@reddit
Yeah, something's up. I also just grabbed a fresh copy of Q5_K_M from AesSedai's repo, and HF says it's got 508 tensors, but the llama.cpp branch is still erroring out and reporting 505.
tnhnyc@reddit
Can you pull/rebuild and try it again?
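For anyone following along, that's the standard llama.cpp cmake flow applied to the fork (drop the CUDA flag for a CPU-only build):

```bash
# Rebuild the fork from the latest commit.
cd llama-mimo-mtp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```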
Organic_Scarcity_495@reddit
I had the same question last week. From what I can tell, the mmproj and MTP are separate things in llama.cpp: you can have both loaded, but the mmproj only activates when you pass an image. The MTP branch doesn't break it; just make sure you're on the mtp-pr build.
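A quick multimodal smoke test looks something like this (model and image paths are placeholders):

```bash
# One-shot image prompt with llama.cpp's mtmd CLI; the mmproj is only
# engaged because --image is passed, per the comment above.
./llama-mtmd-cli \
  -m MiMo-V2.5-Q4_K_M.gguf \
  --mmproj mmproj-MiMo-V2.5-f16.gguf \
  --image photo.jpg \
  -p "Describe this image."
```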
tnhnyc@reddit
Experimental, but give this a try if you like: https://huggingface.co/tnhnyzc/MiMO-V2.5-MTP-GGUF
OutrageousMinimum191@reddit
As far as I know, the MTP support in llama.cpp forks doesn't cover MiMo models yet.
As for multimodal, it works great. I tested Q4_K_M, Q6_K and Q8_0, and the GGUFs don't need reconverting; you just need a new enough version of llama.cpp. I converted and quantized the GGUFs myself, but I'd expect that to hold for any GGUFs on Hugging Face.
Q4_K_M runs on my system with 12-channel DDR5 (~350GB/s bandwidth) at 22-23 t/s. 4-channel DDR4 usually has ~65GB/s bandwidth, so it will probably be ~4 t/s.
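Back-of-envelope, assuming decode is purely memory-bandwidth-bound, you just scale the measured speed by the bandwidth ratio:

```bash
# 22.5 t/s measured at ~350GB/s, scaled down to ~65GB/s: prints 4.17
echo "scale=2; 22.5 * 65 / 350" | bc
```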