Unsloth MiniMax M2.7 quants just finished uploading to HF
Posted by Zyj@reddit | LocalLLaMA | View on Reddit | 94 comments
They range from Q1 to BF16.
Grab them while they're still hot over at
https://huggingface.co/unsloth/MiniMax-M2.7-GGUF
Thanks to u/danielhanchen!
Here's a list:
| Quantisation (Label) | GB (1024³ bytes) |
|---|---|
| UD-IQ1_M | 56.53 GB |
| UD-IQ2_XXS | 60.89 GB |
| UD-IQ2_M | 65.32 GB |
| UD-Q2_K_XL | 70.11 GB |
| UD-IQ3_XXS | 74.60 GB |
| UD-IQ3_S | 77.87 GB |
| UD-Q3_K_S | 87.21 GB |
| UD-Q3_K_M | 94.29 GB |
| UD-Q3_K_XL | 94.94 GB |
| UD-IQ4_XS | 100.97 GB |
| UD-IQ4_NL | 103.15 GB |
| UD-Q4_K_S | 121.67 GB |
| MXFP4_MOE | 126.67 GB |
| UD-Q4_K_M | 130.40 GB |
| UD-Q4_K_XL | 130.96 GB |
| UD-Q5_K_S | 147.97 GB |
| UD-Q5_K_M | 157.23 GB |
| UD-Q5_K_XL | 157.81 GB |
| UD-Q6_K | 175.15 GB |
| UD-Q6_K_XL | 193.20 GB |
| Q8_0 | 226.44 GB |
| UD-Q8_K_XL | 229.64 GB |
| BF16 | 426.07 GB |
fdrch@reddit
First impression - prompt processing takes forever
Cybertrucker01@reddit
What's the general rule of thumb for choosing the best model for available memory?
Say you have a 128GB M5 Mac: get the 4-bit XS model or the 3-bit XL model?
Zyj@reddit (OP)
Getting around 15 tokens/s using UD-Q6_K_XL with 2x Strix Halo and llama.cpp + rpc-server.
Particular-Way7271@reddit
Getting 13 t/s using a 5060 Ti and a 9900X CPU plus 196GB DDR5 at 6000. Starts at 13 and drops to 4-5 t/s at 70k context. Using the smallest UD-Q4, btw.
masterlafontaine@reddit
What is your prompt processing?
bashdan@reddit
I'm intrigued by the 2x Strix Halo setups. I see Donato has a guide on setting up two nodes with RDMA, but I don't often see benchmarks of sharing across the cluster, and when I do, there are only a handful of data points supporting a specific conclusion (example site).
Would you be able to run something like llama-benchy and get pp2048,tg128 at context depths 32768, 65536, 98304, 131072 (of course ignore if they don't fit)? I'd appreciate it.
Trying to get an idea of how slow PP/TG is when presented with a reasonable agentic load.
Zyj@reddit (OP)
Sure thing. Note that this setup is not using tensor parallelism with vLLM, I'm using llama.cpp here. Donato also has some vLLM benchmarks over at https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/
Here is a llama-benchy run against llama-server (build_info: b8763-ff5ef8278):

llama-benchy --model 'unsloth/MiniMax-M2.7-GGUF:Q6_K_XL' --depth 2048 32768 65536 98304 131072 --latency-mode generation --tg 128 --tokenizer MiniMaxAI/MiniMax-M2.7

Got proxy errors for larger depths; I will try to find out how to fix them.
bashdan@reddit
Mad respect for the data! Thank you!
Understood on the deltas between the setups. At least this gives a ballpark on performance. Are you using 2.5G on all the NICs/wires in between?
And oof at 142 tokens/sec PP and 5.4 tokens/sec TG at only 32k. I can't imagine 100k+ would be better than 100 tokens/sec PP.
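Those rates translate into wall-clock time per agentic turn as roughly prompt_tokens/PP + generated_tokens/TG. A quick sketch (the 512-token reply length is an arbitrary assumption):

```python
# Estimate wall-clock seconds for one agentic turn from measured
# prompt-processing (pp) and token-generation (tg) rates.
def turn_seconds(prompt_tokens: int, gen_tokens: int,
                 pp_tps: float, tg_tps: float) -> float:
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# 32k-token prompt, 512-token reply at the rates quoted above:
print(round(turn_seconds(32768, 512, 142, 5.4)))  # ~326 s per turn
```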
truthputer@reddit
(reposting this comment, I had deleted it):
Tried it on my system - AMD EPYC 7C13, 512GB ram, single RX 7900 XTX 24GB., llama.cpp (Vulkan), configured to 128 threads, running the IQ4 XS Unsloth quant.
Llama-Bench gives pp512 = 27.2 t/s, tg128 = 5.04 t/s. Getting approximately 8 t/s in the web interface, but CPU is slammed at 100% and the system is drawing around 640 watts.
It's probably not something I'll be using often unless I will be away from the computer for a while and want to warm up the house.
truthputer@reddit
I get ~8 t/s with the IQ4_XS quant, slow but it’s running.
EPYC 7C13 64-core, 7900 XTX 24GB, 512GB RAM, llama.cpp built with Vulkan support.
But at 100% CPU and burning 650 watts, this probably isn’t going to be cost effective vs just using the cloud, lol.
masterlafontaine@reddit
What about prompt processing?
truthputer@reddit
Sorry, nuked my comment above and was going to repost, but Llamabench gives pp512: 27t/s, tg128: 5.04 t/s - although when I talk to it in the web interface I was getting around 8 t/s.
ixdx@reddit
unsloth/MiniMax-M2.7-UD-IQ3_XXS (74.6 GiB)
On a system with RTX 5070 Ti + RTX 5060 Ti + 96GB DDR4 3500 MHz I get the following performance:

llama-server args:
-ctk q4_0 -ctv q4_0 -dev CUDA0,CUDA1 --flash-attn on --jinja -c 32768 -fitc 32768 -fit on -fitt 384 -m /models/unsloth/MiniMax-M2.5-GGUF/MiniMax-M2.5-UD-Q2_K_XL-00001-of-00003.gguf --temp 1.0 --top-k 40 --top-p 0.95

SnooPaintings8639@reddit
Wouldn't it be better to go with the Q2 model and spend that RAM on a Q8 KV cache instead of Q4? I've read that KV quantization hurts quality hard, especially for MoE, IIRC. But I'm not sure, and I'd really like to see someone benchmark both configurations.
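For a rough sense of the trade-off: llama.cpp's q8_0 stores ~8.5 bits/element and q4_0 ~4.5 bits/element, so going q8→q4 nearly halves the KV cache. A generic estimator (the layer/head/dim numbers below are placeholders, not MiniMax's real config):

```python
# Rough KV-cache size estimator; the architecture numbers used below are
# illustrative placeholders, not the real MiniMax config.
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bits_per_elem: float) -> float:
    """GiB needed for the K and V caches across all layers."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # factor 2: K + V
    return elems * bits_per_elem / 8 / 1024**3

# f16 vs q8_0 vs q4_0 at 32k context on a placeholder 48-layer GQA model:
for label, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(label, round(kv_cache_gib(32768, 48, 8, 128, bits), 2), "GiB")
```

Whether the quality hit from q4 KV is worth those few GiB is exactly the benchmark question above.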
ixdx@reddit
Changed KV to q8_0. The results were similar:
yoracale@reddit
As a note of caution, please do NOT use CUDA 13.2 otherwise you'll get gibberish!!
StardockEngineer@reddit
You keep saying this but I am using it just fine and I compile my own.
thirteen-bit@reddit
There are just some specific kernels broken if 13.2 optimizing compilers are used.
If I understood correctly, these kernels are used to de-quantize IQ3/IQ4 and below (depending on the model).
More info:
https://github.com/unslothai/unsloth/issues/4849
https://github.com/ggml-org/llama.cpp/issues/21255
StardockEngineer@reddit
Thanks for the clarification!
DygusFufs@reddit
Is there an explanation why this happens?
thirteen-bit@reddit
https://github.com/ggml-org/llama.cpp/issues/21255
grumd@reddit
Bug in CUDA. Nvidia working on it
PiaRedDragon@reddit
I seem to be getting gibberish anyway, esp on the smaller models.
I ran quick MMLU against them, the scores are not great TBH.
yoracale@reddit
I wouldn’t trust a word that comes out of your mouth when you keep criticizing Unsloth on X and on here while constantly promoting your own BAAI quants. Given that track record, you’re not really in a position to give any feedback on any model.
Your entire feed is literally just bashing unsloth just to promote your own quants, it's quite sad.
SnooPaintings8639@reddit
Good catch. Even the post above looks bot-ish, i.e. they've already run a couple of quants of this huge model that barely dropped? That's deep conviction and some high-end hardware right there. They even ran MMLU on the different quants? Damn... respect /s
MaCl0wSt@reddit
bruh xd
DR4G0NH3ART@reddit
Bro got receipts. Kudos.
Long_comment_san@reddit
Can't trust nothing on the internet these days
Thomas-Lore@reddit
Never could.
Long_comment_san@reddit
always has been img
Porespellar@reddit
No thanks, i’ma wait for the Milla Jovovich quants.
GordoPepe@reddit
They are multi pass as in: "no thanks"^n
SnooPaintings8639@reddit
Locked and loaded. UD-IQ3_XXS works great in my llama.cpp! Although Claude Code had to make two fixes: one in llama.cpp for proper reasoning parsing (invalid token detection) and one in the model template for conditional thinking (to allow a `--reasoning off` flag).
Other than that, we've already had a couple of conversations! And yes, without a sys prompt it greets me as Claude from Anthropic ;) I'd consider it a sibling of Sonnet, distilled from the same parent, lol
rpkarma@reddit
M2.7 has really felt like Sonnet to me in practice so that’s not surprising haha
What hardware are you running?
SnooPaintings8639@reddit
64GB DDR5 + 2x RTX 3090. With Q2 I'm getting 25-30 tps gen, and with Q3 more like 20-25 tps. I load it with ctx_size 100k and q8 for both K and V caches. It slows down to around ~10 tps gen when the context is mostly full, but with long messages it manages to stay over 100 tps pp.
Thetitangaming@reddit
What cpu and ram speed is this at? I'm looking at eventually speccing a similar system
SnooPaintings8639@reddit
Kingston Fury Renegade RGB 64GB [2x32GB 6000MHz DDR5 CL32 DIMM]
Intel Core i7-13700KF
And if you're interested in MB too: ASRock Z790 Taichi
Btw, I regret getting a CPU without integrated graphics. I'd like my monitor connected to it, so I could pass the RTXes through to a VM.
Thetitangaming@reddit
Thank you!! That makes sense I never thought about not having an igpu
PraxisOG@reddit
The iq3xxs is smaller than in previous releases, you only need 96gb vram for full offload now
lolwutdo@reddit
Always the damn think tag not working properly with these models smfh.
Is it a model issue or quant issue?
SnooPaintings8639@reddit
llama.cpp issue; it's not parsing the output correctly due to a bug with trimming or something. I would raise a ticket there, but I'm too lazy. The fix was one line; ask some coding agent to investigate it if it bothers you (it bothered me a LOT), it was easy to find.
lolwutdo@reddit
Got a good prompt I can give? lol
SnooPaintings8639@reddit
haha, just in case: I am at commit d1f82e382, which is 5 days old, which might be important. Here is what my Claude said about the issue and fix:
```
Bug: llama.cpp doesn't extract reasoning/thinking content from MiniMax model responses. The <think> opening tag is missing and thinking text appears inline in content instead of reasoning_content.

Root cause: In common/chat-diff-analyzer.cpp line 309, compare_reasoning_presence() detects the reasoning start tag as <think>\n (with trailing newline) using trim_leading_whitespace(). But the generation prompt calculated by diff has <think> without the trailing \n (it gets stripped as a common suffix). This mismatch means the PEG parser's prefix() function can't find and strip the tag, so reasoning is never parsed.

Fix: In common/chat-diff-analyzer.cpp line 309, change trim_leading_whitespace to trim_whitespace so reasoning.start becomes <think> (no trailing whitespace), matching how the generation prompt represents it. Also update the MiniMax test in tests/test-chat.cpp to prepend </think> to the tool call test input (forced-open thinking requires closing the think block) and add a reasoning extraction test.
```
Regarding making `--reasoning off` work (good for RP, bad for coding): the fix was to always add both thinking tags at the start when it's set to off, so the model thinks the reasoning is already over.
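That workaround amounts to plain prompt surgery: seed the assistant turn with an already-closed think block. A minimal sketch (the `<think>...</think>` tag names are assumed here; check the model's actual chat template):

```python
# Sketch of the workaround above: when reasoning is off, append an
# already-closed think block so the model treats its reasoning as finished.
# The <think> tag name is an assumption; verify it against the chat template.
def seed_assistant_turn(prompt: str, reasoning: bool) -> str:
    return prompt if reasoning else prompt + "<think></think>"

print(seed_assistant_turn("...assistant header...", reasoning=False))
```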
Skyline34rGt@reddit
Even 1-bit, wow, but for my setup I need like 0.75-bit xD
Asleep-Land-3914@reddit
Maybe REAP will help us here...
Plasmx@reddit
What quant do I need to fit it into 16 GB VRAM? /s
Skyline34rGt@reddit
You could run it if you have also tons of RAM (>96GB) and offload most of the MoE layers to CPU.
(it will be slow tho).
Overall-Somewhere760@reddit
Q0.001
digamma6767@reddit
It's weird that the IQ4 quants are smaller than M2.5's IQ4 quants.
Not complaining. I'm thinking IQ4_NL might be a perfect match with the Strix Halo.
asfbrz96@reddit
I downloaded the IQ4_XS and it crashes my machine; the UD Q3_K_XL works fine tho.
digamma6767@reddit
I'm still trying to get my download to complete for the IQ4 quant.
I think Unsloth might've put a bad version up for it. Bartowski's IQ4_XS quant is 122GB compared to Unsloth's 108GB.
LatentSpacer@reddit
| Bits | Quantisation (Label) | Size |
|---|---|---|
| 0-bit | UD-IQ0_XXL | 00.0 GB |
colin_colout@reddit
finally... a Minimax I can run on my Matrox Millenium II!
itsdotscience@reddit
Hey, if that works, next we can target the ATI Mach32 VESA Local Bus card. It ran at FSB speed (back then called the external bus speed), which was great for the GPU but insanity for IO cards.
Jackw78@reddit
(-1)-bit UD-IQ(-1)_XXL (-60.7 GB) for those with negative mass GPU
FlamaVadim@reddit
1 billion t/s!
Affectionate-Hat-536@reddit
With unlimited context :)
Thomas-Lore@reddit
That is slow, i get infinite t/s with this setup.
anobfuscator@reddit
I get NaN t/s, what am I doing wrong
FastHotEmu@reddit
I'm testing UD-IQ4_XS and so far I'm not impressed... it severely underperformed in my programming tests compared even to Qwen3.5.
colin_colout@reddit
Wait a few days for some fixes before drawing a final conclusion. Unsloth is fast at fixing bugs... just like they're fast at getting their first versions out.
tarruda@reddit
IIRC MiniMax 2.x was never very resilient to quantization, so I wouldn't expect quants below Q4_K_M to be good.
FastHotEmu@reddit
unsloth recommend it for 128gb ram though
tarruda@reddit
I don't think they had time to evaluate it
FastHotEmu@reddit
good point
FastHotEmu@reddit
i have another computer with total 256gb, will test your hypothesis soon
Appropriate_Fly6399@reddit
Can't wait
tarruda@reddit
I recommend trying q8_0 if you can as it should give full precision performance
Due_Net_3342@reddit
some benchmarks(for accuracy and error rate) for the q3 and q4 quants would be great
tarruda@reddit
Minimax architecture is not very resilient to quantization.
See this chart for more details: https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions/3#69db491efdd60cd788a43362
joakim_ogren@reddit
Which is best at DGX Spark or other NVIDIA GB10 computers?
Call_Put@reddit
I have successfully run MiniMax-M2.7-GGUF/UD-Q2_K_XL on a single GB200 NVL10 (GB10) using llama.cpp with a 128K context. The inference speed is decent, though vision is not yet supported as the mmproj file is currently unavailable.
Benchmark: test
[Q&A] 256 tokens in 9.38s = 27.2 tok/s (prompt: 51)
[Code] 512 tokens in 18.20s = 28.1 tok/s (prompt: 58)
[JSON] 1024 tokens in 36.59s = 27.9 tok/s (prompt: 75)
[Math] 64 tokens in 2.44s = 26.2 tok/s (prompt: 53)
[LongCode] 2048 tokens in 74.25s = 27.5 tok/s (prompt: 65)
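Those figures are just tokens divided by seconds; reproducing the arithmetic (the source appears to truncate rather than round, hence the occasional ±0.1 difference):

```python
# Sanity-check the reported throughput: tok/s = tokens / seconds.
runs = {
    "Q&A": (256, 9.38),
    "Code": (512, 18.20),
    "JSON": (1024, 36.59),
    "Math": (64, 2.44),
    "LongCode": (2048, 74.25),
}
for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens / seconds:.1f} tok/s")
```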
insanemal@reddit
Single GB10 ain't running anything above 3bit.
You really want 4xGB10 cluster to run this
Zyj@reddit (OP)
If you have two you can run the Q6 variants, they're pretty good (I used M2.5 Q6_K a lot on 2xStrix Halo)
insanemal@reddit
Sure but I'd be worried about performance.
georgeApuiu@reddit
REAP
jzn21@reddit
Tnx! What is the benefit of the MXFP4 version compared to the other Q4 quants?
Sufficient_Prune3897@reddit
None. Isn't that fun?
rpkarma@reddit
That’s not inherently true. But it’s complicated. On the right Blackwell hardware it can be quite a bit faster, but in practice we all have consumer Blackwell and basically no kernels are made for that yet
I love my SM12.1 GB10 but gosh it frustrates me lol
DistanceSolar1449@reddit
It’s a lot faster on a 5090
getpodapp@reddit
It's a higher-quality, faster 4-bit quant, but you need hardware support for it.
Zyj@reddit (OP)
Are you sure about the quality?
Pentium95@reddit
Theoretically, they should be faster on NVIDIA 5000 series.
megadonkeyx@reddit
2 t/sec here i come!
FlamaVadim@reddit
🐢🐢🐢
fets-12345c@reddit
Anybody providing the MLX versions?
LegacyRemaster@reddit
I noticed that the Q4_K_XL is a good 10GB bigger than the 2.5. Interesting
danielhanchen@reddit
Oh, the old Q4_K_XL is different :) Our new method guarantees that _XL is always bigger than _M.
I would use Q4_K_M, which is also dynamic!
LegacyRemaster@reddit
thx Daniel. Always amazing
Zyj@reddit (OP)
How about UD-Q6_K vs UD-Q6_K_XL? Can you tell me where I can read about the difference?
danielhanchen@reddit
Thanks!
xlltt@reddit
thats a big boy
Geximus-therealone@reddit
AWQ please ! :D
jacek2023@reddit
I am wondering which one should I get this time:
UD-Q2_K_XL 70,11 GB
UD-IQ3_XXS 74,60 GB
UD-IQ3_S 77,87 GB
UD-Q3_K_S 87,21 GB