MiniMax M2 Llama.cpp support
Posted by ilintar@reddit | LocalLLaMA | 18 comments
By popular demand, here it is:
https://github.com/ggml-org/llama.cpp/pull/16831
I'll upload GGUFs to https://huggingface.co/ilintar/MiniMax-M2-GGUF. For now I'm uploading Q8_0 (no BF16/F16, since the original model was quantized to FP8) and generating an imatrix. I don't expect problems getting this PR accepted; as I said, the model is pretty typical :)
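For anyone curious what "uploading Q8_0 and generating an imatrix" involves, this is roughly the llama.cpp pipeline (a sketch: the model directory, calibration text, and output filenames are placeholders, not the exact commands used here):

```shell
# Convert the downloaded HF checkpoint straight to a Q8_0 GGUF.
python convert_hf_to_gguf.py ./MiniMax-M2 --outtype q8_0 --outfile minimax-m2-q8_0.gguf

# Compute an importance matrix from calibration text.
llama-imatrix -m minimax-m2-q8_0.gguf -f calibration.txt -o imatrix.dat

# Produce smaller imatrix-weighted quants from the Q8_0.
llama-quantize --imatrix imatrix.dat minimax-m2-q8_0.gguf minimax-m2-iq4_xs.gguf IQ4_XS
```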
Finanzamt_kommt@reddit
Even though I can't run it, you're a legend 🙏
ilintar@reddit (OP)
Me neither; Johannes Gaessler from the Llama.cpp team has kindly provided a server that can run/convert those beasts.
6969its_a_great_time@reddit
What kind of specs are on that thing?
ilintar@reddit (OP)
6 x 5090 and 512 GB RAM, I believe.
Muted-Celebration-47@reddit
I run Q2 on my single 3090 + 64 GB DDR5 and get 15-16 t/s. It's fast!
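For reference, that kind of speed on a single 24 GB GPU usually comes from keeping the small active set of an MoE model on the GPU while the expert tensors sit in system RAM. A sketch of one way to do that (flag names as in recent llama.cpp builds; the GGUF filename is a placeholder):

```shell
# Offload all layers to the GPU, then force the MoE expert tensors
# back onto the CPU, so only the active-parameter work stays in VRAM.
llama-server -m MiniMax-M2-Q2_K.gguf -ngl 99 --n-cpu-moe 999 -c 8192
```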
onil_gova@reddit
What are the VRAM requirements for MiniMax M2 at q4_0?
Tasty_Lynx2378@reddit
LM Studio reports:
- Cturan Q4_K GGUF: 138.34 GB
- MLX Community Q4 MLX: 128.69 GB
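As a quick sanity check on those numbers: a GGUF file's size is essentially parameter count times average bits per weight. Assuming roughly 230B total parameters for MiniMax M2 and about 4.8 effective bits/weight for a Q4_K mix (both ballpark assumptions):

```shell
# params * bits-per-weight / 8 bits-per-byte, in GB (10^9 bytes)
awk 'BEGIN { printf "%.0f GB\n", 230e9 * 4.8 / 8 / 1e9 }'
# prints: 138 GB
```

That lines up with the 138.34 GB figure above. Actual VRAM needs add KV cache and compute buffers on top, and with CPU offload only part of the model has to live in VRAM.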
AlbeHxT9@reddit
How fast do you want it?
bullerwins@reddit
Piotr's aren't up yet, but I have already uploaded quants here:
https://huggingface.co/bullerwins/MiniMax-M2-GGUF
Wait for his or bart's for the imatrix versions
spaceman_@reddit
Great! Am I correct in interpreting this PR as implementing the structure and architecture of Minimax M2, with all of the shaders and compute implementations reused from other existing models?
ilintar@reddit (OP)
Yeah, that's how Llama.cpp works: it's modular and operation-based, so when a model introduces no new operations, it reuses the existing optimized implementations.
spaceman_@reddit
Interesting! Thanks for taking the time to respond and explain.
The PR mentions that there is no chat template yet, as this model has interleaved think blocks. I'm guessing this also means that most tools won't be able to work with this model out of the box without changes on the client side?
ilintar@reddit (OP)
I guess so, but I might actually detach tool calling from reasoning support and just try to add tool calls if they don't work out of the box.
No_Conversation9561@reddit
https://buymeacoffee.com/ilintar
lumos675@reddit
I am one of his supporters, and I will support him again. Guys, please support such a genius; let him have as much money as he wants so he can focus on his work.
Leflakk@reddit
You are amazing, thank you
noctrex@reddit
Excellent work, as always!
AccordingRespect3599@reddit
Piotr is unstoppable.