Unsloth MiniMax M2.7 quants just finished uploading to HF
Posted by Zyj@reddit | LocalLLaMA | View on Reddit | 94 comments
They range from Q1 to BF16.
Grab them while they're still hot over at
https://huggingface.co/unsloth/MiniMax-M2.7-GGUF
Thanks to u/danielhanchen!
Here's a list:
| Quantisation (Label) | GB (1024³ bytes) |
|---|---|
| UD-IQ1_M | 56.53 GB |
| UD-IQ2_XXS | 60.89 GB |
| UD-IQ2_M | 65.32 GB |
| UD-Q2_K_XL | 70.11 GB |
| UD-IQ3_XXS | 74.60 GB |
| UD-IQ3_S | 77.87 GB |
| UD-Q3_K_S | 87.21 GB |
| UD-Q3_K_M | 94.29 GB |
| UD-Q3_K_XL | 94.94 GB |
| UD-IQ4_XS | 100.97 GB |
| UD-IQ4_NL | 103.15 GB |
| UD-Q4_K_S | 121.67 GB |
| MXFP4_MOE | 126.67 GB |
| UD-Q4_K_M | 130.40 GB |
| UD-Q4_K_XL | 130.96 GB |
| UD-Q5_K_S | 147.97 GB |
| UD-Q5_K_M | 157.23 GB |
| UD-Q5_K_XL | 157.81 GB |
| UD-Q6_K | 175.15 GB |
| UD-Q6_K_XL | 193.20 GB |
| Q8_0 | 226.44 GB |
| UD-Q8_K_XL | 229.64 GB |
| BF16 | 426.07 GB |
fdrch@reddit
First impression - prompt processing takes forever
Cybertrucker01@reddit
What's the general rule of thumb for choosing the best model for available memory?
Say you have a 128GB M5 Mac: get the 4-bit XS model or the 3-bit XL model?
Zyj@reddit (OP)
Getting around 15 tokens/s using UD-Q6_K_XL with 2x Strix Halo and llama.cpp + rpc-server.
Particular-Way7271@reddit
Getting 13 t/s using a 5060 Ti and a 9900X CPU plus 196GB DDR5 at 6000. Starts at 13 and drops to 4-5 t/s at 70k context. Using the smallest UD-Q4, btw.
masterlafontaine@reddit
What is your prompt processing?
bashdan@reddit
I'm intrigued by the 2x Strix Halo setups. I see Donato has a guide on setting up two nodes with RDMA, but I don't often see benchmarks of sharing across the cluster, and when I do, there are only a handful of data points supporting a specific conclusion (example site).
Would you be able to run something like llama-benchy and get pp2048,tg128 at context depths 32768, 65536, 98304, 131072 (of course ignore if they don't fit)? I'd appreciate it.
Trying to get an idea of how slow PP/TG is when presented with a reasonable agentic load.
Zyj@reddit (OP)
Sure thing. Note that this setup is not using tensor parallelism with vLLM, I'm using llama.cpp here. Donato also has some vLLM benchmarks over at https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/
Here is a llama-benchy run against llama-server (build_info: b8763-ff5ef8278):

llama-benchy --model 'unsloth/MiniMax-M2.7-GGUF:Q6_K_XL' --depth 2048 32768 65536 98304 131072 --latency-mode generation --tg 128 --tokenizer MiniMaxAI/MiniMax-M2.7

Got proxy errors for larger depths; I will try to find out how to fix them.
bashdan@reddit
Mad respect for the data! Thank you!
Understood on the deltas between the setups. At least this gives a ballpark on performance. Are you using 2.5G on all the NICs/wires in between?
And oof at 142 tokens/sec PP and 5.4 tokens/sec TG at only 32k. I can't imagine 100k+ would be better than 100 tokens/sec PP.
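Those rates translate into wall-clock time per agentic turn as roughly prompt_tokens/PP + generated_tokens/TG. A quick sketch (the 512-token reply length is an arbitrary assumption):

```python
# Estimate wall-clock seconds for one agentic turn from measured
# prompt-processing (pp) and token-generation (tg) rates.
def turn_seconds(prompt_tokens: int, gen_tokens: int,
                 pp_tps: float, tg_tps: float) -> float:
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# 32k-token prompt, 512-token reply at the rates quoted above:
print(round(turn_seconds(32768, 512, 142, 5.4)))  # ~326 s per turn
```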
truthputer@reddit
(reposting this comment, I had deleted it):
Tried it on my system - AMD EPYC 7C13, 512GB ram, single RX 7900 XTX 24GB., llama.cpp (Vulkan), configured to 128 threads, running the IQ4 XS Unsloth quant.
Llama-Bench gives pp512 = 27.2 t/s, tg128 = 5.04 t/s. Getting approximately 8 t/s in the web interface, but CPU is slammed at 100% and the system is drawing around 640 watts.
It's probably not something I'll be using often unless I will be away from the computer for a while and want to warm up the house.
truthputer@reddit
I get ~8 t/s with the IQ4_XS quant, slow but it’s running.
EPYC 7C13 64-core, 7900 XTX 24GB, 512GB RAM, llama.cpp built with Vulkan support.
But at 100% CPU and burning 650 watts, this probably isn’t going to be cost effective vs just using the cloud, lol.
masterlafontaine@reddit
What about prompt processing?
truthputer@reddit
Sorry, nuked my comment above and was going to repost, but Llamabench gives pp512: 27t/s, tg128: 5.04 t/s - although when I talk to it in the web interface I was getting around 8 t/s.
ixdx@reddit
unsloth/MiniMax-M2.7-UD-IQ3_XXS (74.6 GiB)
On a system with RTX 5070 Ti + RTX 5060 Ti + 96GB DDR4 3500 MHz I get the following performance:

llama-server args:
-ctk q4_0 -ctv q4_0 -dev CUDA0,CUDA1 --flash-attn on --jinja -c 32768 -fitc 32768 -fit on -fitt 384 -m /models/unsloth/MiniMax-M2.5-GGUF/MiniMax-M2.5-UD-Q2_K_XL-00001-of-00003.gguf --temp 1.0 --top-k 40 --top-p 0.95

SnooPaintings8639@reddit
Wouldn't it be better to go with the Q2 model and spend that RAM on a Q8 KV cache instead of Q4? I've read that KV quantization hurts quality hard, especially for MoE, IIRC. But I'm not sure, and I'd really like to see someone benchmark both configurations.
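For a rough sense of the trade-off: llama.cpp's q8_0 stores ~8.5 bits/element and q4_0 ~4.5 bits/element, so going q8→q4 nearly halves the KV cache. A generic estimator (the layer/head/dim numbers below are placeholders, not MiniMax's real config):

```python
# Rough KV-cache size estimator; the architecture numbers used below are
# illustrative placeholders, not the real MiniMax config.
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bits_per_elem: float) -> float:
    """GiB needed for the K and V caches across all layers."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # factor 2: K + V
    return elems * bits_per_elem / 8 / 1024**3

# f16 vs q8_0 vs q4_0 at 32k context on a placeholder 48-layer GQA model:
for label, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(label, round(kv_cache_gib(32768, 48, 8, 128, bits), 2), "GiB")
```

Whether the quality hit from q4 KV is worth those few GiB is exactly the benchmark question above.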
ixdx@reddit
Changed KV to q8_0. The results were similar:
yoracale@reddit
As a note of caution, please do NOT use CUDA 13.2 otherwise you'll get gibberish!!
StardockEngineer@reddit
You keep saying this but I am using it just fine and I compile my own.
thirteen-bit@reddit
There are just some specific kernels broken if 13.2 optimizing compilers are used.
If I understood correctly, these kernels are used to de-quantize IQ3/IQ4 and below (depending on the model).
More info:
https://github.com/unslothai/unsloth/issues/4849
https://github.com/ggml-org/llama.cpp/issues/21255
StardockEngineer@reddit
Thanks for the clarification!
DygusFufs@reddit
Is there an explanation why this happens?
thirteen-bit@reddit
https://github.com/ggml-org/llama.cpp/issues/21255
grumd@reddit
Bug in CUDA. Nvidia working on it
PiaRedDragon@reddit
I seem to be getting gibberish anyway, esp on the smaller models.
I ran quick MMLU against them, the scores are not great TBH.
yoracale@reddit
I wouldn’t trust a word that comes out of your mouth when you keep criticizing Unsloth on X and on here while constantly promoting your own BAAI quants. Given that track record, you’re not really in a position to give any feedback on any model.
Your entire feed is literally just bashing unsloth just to promote your own quants, it's quite sad.
SnooPaintings8639@reddit
Good catch. Even the post above looks bot-ish, i.e. they've already run a couple of quants of this huge model that barely dropped? That's deep conviction and some high-end hardware right there. They even ran MMLU on the different quants? Damn... respect /s
MaCl0wSt@reddit
bruh xd
DR4G0NH3ART@reddit
Bro got receipts. Kudos.
Long_comment_san@reddit
Can't trust nothing on the internet these days
Thomas-Lore@reddit
Never could.
Long_comment_san@reddit
always has been img
Porespellar@reddit
No thanks, i’ma wait for the Milla Jovovich quants.
GordoPepe@reddit
They are multi pass as in: "no thanks"^n
SnooPaintings8639@reddit
Locked and loaded. UD-IQ3_XXS works great in my llama.cpp! Although Claude Code had to make two fixes: one in llama.cpp for proper reasoning parsing (invalid token detection) and one in the model template for conditional thinking (to allow a `--reasoning off` flag).
Other than that, we've already had a couple of conversations! And yes, without a sys prompt it greets me as Claude from Anthropic ;) I'd consider it a sibling of Sonnet, distilled from the same parent, lol
rpkarma@reddit
M2.7 has really felt like Sonnet to me in practice so that’s not surprising haha
What hardware are you running?
SnooPaintings8639@reddit
64GB DDR5 + 2x RTX 3090. With Q2 I'm getting 25-30 tps gen, and with Q3 more like 20-25 tps. I load it with ctx_size 100k and q8 for both K and V caches. It slows down to around ~10 tps gen when the context is mostly full, but with long messages it manages to stay over 100 tps pp.
Thetitangaming@reddit
What cpu and ram speed is this at? I'm looking at eventually speccing a similar system
SnooPaintings8639@reddit
Kingston Fury Renegade RGB 64GB [2x32GB 6000MHz DDR5 CL32 DIMM]
Intel Core i7-13700KF
And if you're interested in MB too: ASRock Z790 Taichi
Btw, I regret getting a CPU without integrated graphics. I'd like my monitor connected to it, so I could pass the RTXes through to a VM.
Thetitangaming@reddit
Thank you!! That makes sense I never thought about not having an igpu
PraxisOG@reddit
The iq3xxs is smaller than in previous releases, you only need 96gb vram for full offload now
lolwutdo@reddit
Always the damn think tag not working properly with these models smfh.
Is it a model issue or quant issue?
SnooPaintings8639@reddit
llama.cpp issue; it's not parsing the output correctly due to a bug with trimming or something. I would raise a ticket there, but I'm too lazy. The fix was one line; ask some coding agent to investigate it if it bothers you (it bothered me a LOT), it was easy to find.
lolwutdo@reddit
Got a good prompt I can give? lol
SnooPaintings8639@reddit
haha, just in case: I am at commit d1f82e382, which is 5 days old, which might be important. Here is what my Claude said about the issue and fix:
```
Bug: llama.cpp doesn't extract reasoning/thinking content from MiniMax model responses. The <think> opening tag is missing and thinking text appears inline in content instead of reasoning_content.

Root cause: In common/chat-diff-analyzer.cpp line 309, compare_reasoning_presence() detects the reasoning start tag as <think>\n (with trailing newline) using trim_leading_whitespace(). But the generation prompt calculated by diff has <think> without the trailing \n (it gets stripped as a common suffix). This mismatch means the PEG parser's prefix() function can't find and strip the tag, so reasoning is never parsed.

Fix: In common/chat-diff-analyzer.cpp line 309, change trim_leading_whitespace to trim_whitespace so reasoning.start becomes <think> (no trailing whitespace), matching how the generation prompt represents it. Also update the MiniMax test in tests/test-chat.cpp to prepend </think> to the tool call test input (forced-open thinking requires closing the think block) and add a reasoning extraction test.
```
Regarding making `--reasoning off` work (good for RP, bad for coding): the fix was to always add both thinking tags at the start when it's set to off, so the model thinks the reasoning is already over.
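That workaround amounts to plain prompt surgery: seed the assistant turn with an already-closed think block. A minimal sketch (the `<think>...</think>` tag names are assumed here; check the model's actual chat template):

```python
# Sketch of the workaround above: when reasoning is off, append an
# already-closed think block so the model treats its reasoning as finished.
# The <think> tag name is an assumption; verify it against the chat template.
def seed_assistant_turn(prompt: str, reasoning: bool) -> str:
    return prompt if reasoning else prompt + "<think></think>"

print(seed_assistant_turn("...assistant header...", reasoning=False))
```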
Skyline34rGt@reddit
Even 1-bit, wow, but for my setup I need like 0.75-bit xD
Asleep-Land-3914@reddit
Maybe REAP will help us here...
Plasmx@reddit
What quant do I need to fit it into 16 GB VRAM? /s
Skyline34rGt@reddit
You could run it if you have also tons of RAM (>96GB) and offload most of the MoE layers to CPU.
(it will be slow tho).
Overall-Somewhere760@reddit
Q0.001
digamma6767@reddit
It's weird that the IQ4 quants are smaller than M2.5's IQ4 quants.
Not complaining. I'm thinking IQ4_NL might be a perfect match with the Strix Halo.
asfbrz96@reddit
I downloaded the IQ4_XS and it crashes my machine; the UD Q3_K_XL works fine tho.
digamma6767@reddit
I'm still trying to get my download to complete for the IQ4 quant.
I think Unsloth might've put a bad version up for it. Bartowski's IQ4_XS quant is 122GB compared to Unsloth's 108GB.
LatentSpacer@reddit
| Bits | Quantisation (Label) | Size |
|---|---|---|
| 0-bit | UD-IQ0_XXL | 00.0 GB |
colin_colout@reddit
finally... a Minimax I can run on my Matrox Millenium II!
itsdotscience@reddit
Hey, if that works, next we can target the ATI Mach32 VESA Local Bus card. It ran at FSB speed (back then called the external bus speed), which was great for the GPU but insanity for IO cards.
Jackw78@reddit
(-1)-bit UD-IQ(-1)_XXL (-60.7 GB) for those with negative mass GPU
FlamaVadim@reddit
1 billion t/s!
Affectionate-Hat-536@reddit
With unlimited context :)
Thomas-Lore@reddit
That is slow, i get infinite t/s with this setup.
anobfuscator@reddit
I get NaN t/s, what am I doing wrong
FastHotEmu@reddit
I'm testing UD-IQ4_XS and so far I'm not impressed... it severely underperformed in my programming tests compared even to Qwen3.5.
colin_colout@reddit
Wait a few days for some fixes before drawing a final conclusion. Unsloth is fast at fixing bugs... just like they're fast at getting their first versions out.
tarruda@reddit
IIRC MiniMax 2.x was never very resilient to quantization, so I wouldn't expect quants below Q4_K_M to be good.
FastHotEmu@reddit
unsloth recommend it for 128gb ram though
tarruda@reddit
I don't think they had time to evaluate it
FastHotEmu@reddit
good point
FastHotEmu@reddit
i have another computer with total 256gb, will test your hypothesis soon
Appropriate_Fly6399@reddit
Can't wait
tarruda@reddit
I recommend trying q8_0 if you can as it should give full precision performance
Due_Net_3342@reddit
some benchmarks(for accuracy and error rate) for the q3 and q4 quants would be great
tarruda@reddit
Minimax architecture is not very resilient to quantization.
See this chart for more details: https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions/3#69db491efdd60cd788a43362
joakim_ogren@reddit
Which is best at DGX Spark or other NVIDIA GB10 computers?
Call_Put@reddit
I have successfully run MiniMax-M2.7-GGUF/UD-Q2_K_XL on a single GB200 NVL10 (GB10) using llama.cpp with a 128K context. The inference speed is decent, though vision is not yet supported as the mmproj file is currently unavailable.
Benchmark: test
[Q&A] 256 tokens in 9.38s = 27.2 tok/s (prompt: 51)
[Code] 512 tokens in 18.20s = 28.1 tok/s (prompt: 58)
[JSON] 1024 tokens in 36.59s = 27.9 tok/s (prompt: 75)
[Math] 64 tokens in 2.44s = 26.2 tok/s (prompt: 53)
[LongCode] 2048 tokens in 74.25s = 27.5 tok/s (prompt: 65)
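Those figures are just tokens divided by seconds; reproducing the arithmetic (the source appears to truncate rather than round, hence the occasional ±0.1 difference):

```python
# Sanity-check the reported throughput: tok/s = tokens / seconds.
runs = {
    "Q&A": (256, 9.38),
    "Code": (512, 18.20),
    "JSON": (1024, 36.59),
    "Math": (64, 2.44),
    "LongCode": (2048, 74.25),
}
for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens / seconds:.1f} tok/s")
```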
insanemal@reddit
Single GB10 ain't running anything above 3bit.
You really want 4xGB10 cluster to run this
Zyj@reddit (OP)
If you have two you can run the Q6 variants, they're pretty good (I used M2.5 Q6_K a lot on 2xStrix Halo)
insanemal@reddit
Sure but I'd be worried about performance.
georgeApuiu@reddit
REAP
jzn21@reddit
Tnx! What is the benefit of the MXFP4 version compared to the other Q4 quants?
Sufficient_Prune3897@reddit
None. Isn't that fun?
rpkarma@reddit
That’s not inherently true. But it’s complicated. On the right Blackwell hardware it can be quite a bit faster, but in practice we all have consumer Blackwell and basically no kernels are made for that yet
I love my SM12.1 GB10 but gosh it frustrates me lol
DistanceSolar1449@reddit
It’s a lot faster on a 5090
getpodapp@reddit
It's a higher-quality, faster 4-bit quant, but you need hardware support for it.
Zyj@reddit (OP)
Are you sure about the quality?
Pentium95@reddit
Theoretically, they should be faster on NVIDIA 5000 series.
megadonkeyx@reddit
2 t/sec here i come!
FlamaVadim@reddit
🐢🐢🐢
fets-12345c@reddit
Anybody providing the MLX versions?
LegacyRemaster@reddit
I noticed that the Q4_K_XL is a good 10GB bigger than the 2.5. Interesting
danielhanchen@reddit
Oh, the old Q4_K_XL is different :) Our new method guarantees that _XL is always bigger than _M.
I would use Q4_K_M, which is also dynamic!
LegacyRemaster@reddit
thx Daniel. Always amazing
Zyj@reddit (OP)
How about UD-Q6_K vs UD-Q6_K_XL? Can you tell me where I can read about the difference?
danielhanchen@reddit
Thanks!
xlltt@reddit
thats a big boy
Geximus-therealone@reddit
AWQ please ! :D
jacek2023@reddit
I am wondering which one should I get this time:
UD-Q2_K_XL 70,11 GB
UD-IQ3_XXS 74,60 GB
UD-IQ3_S 77,87 GB
UD-Q3_K_S 87,21 GB