MiniMax-M2 llama.cpp

Posted by butlan@reddit | LocalLLaMA

I tried to implement it; it's fully Cursor-generated AI slop code, sorry. The chat template is strange, and I'm 100% sure it's not correctly implemented, but at least it works with Roo Code (Q2 is bad, Q4 is fine). Anyone who wants to burn 100 GB of bandwidth can give it a try.

Test device and command: 2x RTX 4090 and a lot of RAM

./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 50000 --reasoning-format auto
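Once the server is up, a quick sanity check is to hit llama-server's OpenAI-compatible chat endpoint. A minimal sketch, assuming the default port 8080 (adjust if you pass --port):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write hello world in C."}],"max_tokens":256}'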

Code: here. GGUF: here.

Demo video: https://reddit.com/link/1oilwvm/video/ofpwt9vn4xxf1/player