MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks
Posted by danielhanchen@reddit | LocalLLaMA | View on Reddit | 50 comments
Hey r/LocalLLaMA, we did an investigation into MiniMax-M2.7 GGUF causing NaNs on perplexity. Our findings show the issue affects 21%-38% of all GGUFs on Hugging Face (not just ours).
- One popular community uploader had 38% (10/26) NaNs, another deleted theirs (1/4), and 22% of ours (5/23) had NaNs - now fixed.
- When running 99.9% KLD and other metrics, all are fine.
- We found overflow in llama.cpp to be the likely culprit.
- We did PPL, KLD 99.9% benchmarks as well - lower left is better.

- Perplexity NaNs during block 32 - this was also found by the community and other quant uploaders. We also found block 311 to cause issues.
- We found that `blk.61.ffn_down_exps` was the culprit - Q5_K and Q4_K versions of this tensor produce NaNs starting at chunk 32 during PPL evals. Interestingly, IQ4_XS, IQ3_XXS and smaller I-quant types do not NaN.
- This was quite confusing, since lower-bit quants (e.g. Q2_K_XL) did NOT NaN, but medium-sized quants (e.g. Q4_K_XL) did!
- We’ve now updated the M2.7 quants at https://huggingface.co/unsloth/MiniMax-M2.7-GGUF to alleviate the issue, though we still do not know the exact cause of the NaN perplexities - it could be a fluke, or most likely large multiplies causing overflows.
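The overflow hypothesis above can be illustrated with a minimal sketch (the numbers are purely illustrative, not taken from the actual model): a large intermediate multiply saturates float16 to inf, and inf then turns into NaN as soon as it passes through a subtraction, such as inside a softmax or normalization.

```python
import numpy as np

# Illustrative only: float16 overflows above ~65504, so a large
# activation * weight product becomes inf, and inf propagating
# through a later subtraction yields NaN, poisoning the PPL metric.
act = np.float16(1000.0)
w = np.float16(1000.0)
prod = act * w        # 1e6 exceeds float16 max -> inf
norm = prod - prod    # inf - inf -> nan
print(np.isinf(prod), np.isnan(norm))
```

This is consistent with the observation that the failure depends on which tensors get quantized to which formats, since different kernels accumulate in different precisions.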
Which quants did we test?
- 10/26 NaNs (38%) found at https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.7-GGUF: Chunk-32 failures (9): IQ3_XXS, IQ3_XS, IQ3_M, Q3_K_M, Q3_K_L, Q3_K_XL, Q4_K_S, Q4_1, Q5_K_S. Late failure (1): IQ1_S (crashed at chunk 311)
- 5/23 NaNs (21%) ours had NaNs - all fixed now at https://huggingface.co/unsloth/MiniMax-M2.7-GGUF: UD-Q4_K_S, UD-Q4_K_M, UD-Q4_K_XL, UD-Q5_K_S, MXFP4_MOE. All block 32.
- 1/4 NaN Q4_K_M at https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF was deleted due to NaNs. Block 32 as well.
Also, CUDA 13.2 is still definitely an issue. This causes some low bit quants on all models to get gibberish. Some people have dismissed it as not being an issue, but from what we’ve seen, more than 50 people have now confirmed that using CUDA 13.1 and lower fixes it. You can also see some of the public comments in our Hugging Face discussions, Reddit posts etc. NVIDIA has acknowledged that they are investigating the issue - see Unsloth Issue 4849, llama.cpp issue 21255, issue 21371
If you have any questions please do ask and thank you again for all the support as always. Appreciate it and hope you have a lovely week.
noneabove1182@reddit
Looking into it, there's something different that's wrong: if I run perplexity on my CPU I don't get any NaN values, but when I switch to my GPU they come back.
So there must be something about the CUDA path for `Q4_K` and `Q5_K` that's blowing up the activations.
relmny@reddit
Is your Q4_K_M (with llama.cpp offloading some layers to CPU) affected?
I ran the llama-perplexity with:
--check-tensors -c 4096 -f llama.cpp/scripts/wikitext-2-raw/wiki.test.raw
and it didn't find anything wrong.
noneabove1182@reddit
no, Q4_K_M is not affected because the final ffn_down layer is set to Q6_K
relmny@reddit
thanks!
danielhanchen@reddit (OP)
Q4_K_M is fine.
IQ3_XXS, IQ3_XS, IQ3_M, Q3_K_M, Q3_K_L, Q3_K_XL, Q4_K_S, Q4_1, Q5_K_S are broken
relmny@reddit
thanks!
danielhanchen@reddit (OP)
I can also help investigate this further, but yes, most likely the general path in llama.cpp is overflowing somewhere.
One-Macaron6752@reddit
u/danielhanchen Good news then! Happy for the community. As for your request to amend my post, I'm afraid it might not work. What I tested, following the issues discovered with your published quant, was:
Neither of those two quants was affected (by the way, the PPL test results I published, in contrast to yours, are for AesSedai/MiniMax-M2.7-GGUF Q5_K_M - you can see it in the screenshot).
As for the other quants and their respective owners you quoted finding them with faults... I'd rather not comment, avoiding a flame here.
MelodicRecognition7@reddit
thanks for your work!
Could you explain why you "upgrade" FP8 models to 16-bit? For example, MiniMax was originally published (mostly) in 8 bits, so the whole model is just 230 gigabytes: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/tree/main
But you have 247 and 457 gigabyte variants - why? https://huggingface.co/unsloth/MiniMax-M2.7-GGUF
danielhanchen@reddit (OP)
Oh it's GGUFs - there is no "FP8" in GGUF land, so the only option is BF16, then we can quantize it.
We can't use Q8_0 since that's a scaled 8-bit integer format, so it's not the same as FP8.
ZealousidealBunch220@reddit
Sadly this analysis is not complete without ubergarm's ik_llama quants
No-Judgment9726@reddit
Wait so is this a quantization-level thing or a conversion pipeline thing? Because if 21-38% of GGUFs on HF are affected regardless of who uploaded them, that sounds like the tooling itself is broken, not just individual uploads.
Also wondering if anyone's tried running perplexity checks as part of their upload CI — seems like that should just be a standard step at this point.
LegacyRemaster@reddit
amazing job!
danielhanchen@reddit (OP)
Thanks!
FoxiPanda@reddit
Thanks as always for these types of graphs. I've been running on your IQ4_XS variant, but I think this graph will make me switch to the Q5_K_M to see how much slower it is on my hardware (mac studio).
It's also mildly interesting that Q5_K_XL+ gains almost no KLD advantage over Q5_K_M - which is sort of counter-intuitive to a lot of the posts that scream "MiniMax doesn't quantize very well"...which may be true, but only to a point.
danielhanchen@reddit (OP)
Thank you for the support! We'll definitely try to provide the community with more of these analyses!
Sometimes quantizations have quirks - KLD and PPL are only proxy metrics - for example, Benjamin Marie shows benchmarks for MiniMax 2.5 on LiveCodeBench, MMLU, etc.:
PhilippeEiffel@reddit
Updated graph for MiniMax M2.7 would be great.
You provide a graph based on KLD, which is valuable data.
The model provider gives benchmark values using the biggest size. Unfortunately, most of us are not able to run such models in BF16. That's the purpose of all your work providing quants - but it means we won't be able to replicate the official benchmark performance.
Users have to make compromises involving model size, context size, and speed. It would be very helpful to have a graph similar to KLD vs. size, but with LCBv6 vs. size - for each model, LCBv6 with Q8 and FP16 (BF16?) context quants.
I have really no idea of the time required to run LCBv6. Can anyone give some rough idea of how hard it is to run this benchmark?
-dysangel-@reddit
This matches my experience so far - the IQ2_XXS reaches incredible performance for 65GB of RAM
DOAMOD@reddit
I'm using 2.7 and I'm surprised at how well it's working. Looking at that graph, I also agree - it seems surprisingly good relative to what one would expect.
danielhanchen@reddit (OP)
Yep, MiniMax definitely is a great model for its size
danielhanchen@reddit (OP)
That's great IQ2_XXS works nicely :)
david_0_0@reddit
Did you find that the NaN issues were consistent across different hardware setups, or did you test on a single machine? Curious if the problems scale across consumer cards vs server GPUs
danielhanchen@reddit (OP)
I tried H100s, B200s and different llama.cpp builds - the issue was replicated across all unfortunately
ReactionaryPlatypus@reddit
Thank you for all your great work with the Unsloth team. I noticed that the first BF16 file (unquantized) also has a new upload. Does that mean the block 32 issue is in the source as well? Will this issue also have an effect on your existing imatrix file as that was generated from the old BF16 files which had issues?
danielhanchen@reddit (OP)
Oh only metadata changes - no need to re-download!
WolvenSunder@reddit
I use the Q8_0 as a pseudo FP8. Should I redownload?
danielhanchen@reddit (OP)
Q8_0 is fine!
Ok-Measurement-1575@reddit
Is MiniMaxAI_MiniMax-M2.7-IQ4_XS-00001-of-00004.gguf in your list from Bartowski?
danielhanchen@reddit (OP)
That should be fine
Few_Water_1457@reddit
can't find "`nan` so I'll remove the model for now and try to get a working quant up tomorrow" (3 days ago).
Digger412@reddit
Hi, AesSedai here - I pulled that quant because of the `nan` issue and should be able to re-make it later tonight with the layer 61 change unsloth mentioned.
danielhanchen@reddit (OP)
Hey! Yep hope the suggested fix works!
VoidAlchemy@reddit
D👏S👏A👏! <3
yoracale@reddit
It's in the readme of the model card. First sentence: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF
FrostyDwarf24@reddit
splendid work!
danielhanchen@reddit (OP)
Thanks!
Educational_Rent1059@reddit
Awesome thanks for your amazing work
danielhanchen@reddit (OP)
Thank you!
Goldkoron@reddit
Layer 61 down exps was the second most sensitive tensor in the model from my KLD scan results. First sensitive being the layer 0 down exps. It's a strange issue and I wonder if it's related at all to the abnormally large KLD values in quants for this model.
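For context, the metric behind this kind of per-tensor sensitivity scan can be sketched generically (this is a plain KL-divergence computation over logits, not Goldkoron's actual tooling): quantize one tensor at a time, then compare the quantized model's token distributions against the full-precision reference.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    # KL(P_ref || P_quant), averaged over token positions.
    # Higher values mean the quantized tensor distorts the output more.
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())
```

A tensor whose quantization produces an outsized KLD (as reported here for `blk.61.ffn_down_exps` and the layer-0 down exps) is a natural candidate for keeping at a higher bit width.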
danielhanchen@reddit (OP)
Yes it definitely is related - there's most likely huge activations which causes overflow
Zc5Gwu@reddit
It doesn't seem to have been affected by the NaN issue but I've been running unsloth's IQ3_XXS with good results.
The only thing I have noticed is a little bit of early stopping on occasion but that could be due to the low quant. It tends to happen before a tool call. Here's an example:
Note that it says that it's going to run a tool call but doesn't. This could also be a llama.cpp parsing bug because there was a similar issue going on previously with some models (this issue).
danielhanchen@reddit (OP)
Oh yes, this does happen sometimes with models - we saw it a lot in Qwen3.5, for example. In Unsloth Studio, for instance, we append a prompt telling the model to continue - most likely an EOS was generated, and that's why it stops
Zc5Gwu@reddit
How do you tell that it's not a legitimate EOS though? The response has a `content` block, which could very well have been what it meant to do (just say something and stop).
danielhanchen@reddit (OP)
Yes great question! We have to do some nifty heuristics and tricks - we tried over 1000 prompts from real world datasets and the false positive rate was around 5%, but accuracy overall increased by 30 to 50%!!
I would rather have 1 misplaced extra prompt than a weird terminated conversation.
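The kind of heuristic described above could be sketched like this (all names and markers here are hypothetical illustrations, not Unsloth's actual code): if the assistant's text announces a tool call but the message carries no tool calls, treat the EOS as premature and append a continuation prompt.

```python
# Hypothetical sketch of a "premature EOS before tool call" heuristic.
# INTENT_MARKERS and the message shape are illustrative assumptions.
INTENT_MARKERS = ("i'll run", "let me call", "i will use the tool", "calling the")

def looks_truncated(message: dict) -> bool:
    # True when the text announces a tool call but none was emitted.
    text = message.get("content", "").lower()
    announced = any(marker in text for marker in INTENT_MARKERS)
    emitted = bool(message.get("tool_calls"))
    return announced and not emitted

def maybe_continue(messages: list) -> list:
    # Append a nudge instead of ending the turn on a suspicious EOS.
    if looks_truncated(messages[-1]):
        messages.append({"role": "user",
                         "content": "Please continue and make the tool call you described."})
    return messages
```

As the thread notes, string heuristics like this will produce some false positives, which is the trade-off the 5% figure above refers to.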
No-Judgment9726@reddit
The finding that 21-38% of GGUFs on HF have NaN issues is honestly alarming — thanks for doing the systematic investigation instead of just reporting "it doesn't work."
One thing that stood out: is the NaN issue primarily showing up in specific quantization levels (like lower bit widths), or is it more about the conversion pipeline itself? If it's the latter, that suggests the problem isn't llama.cpp but the upstream GGUF conversion tooling, which would be a much bigger ecosystem issue to address.
Look_0ver_There@reddit
Thank you for the update. Is this a common issue with llama.cpp itself? If so, was an issue filed for it? I'm fairly certain the llama.cpp team would want to fix this ASAP.
mr_zerolith@reddit
Thank you for this info!
danielhanchen@reddit (OP)
Thank you!
dinerburgeryum@reddit
Sorry, I know I've been critical in the past, but thank you so much for all the work you and the team do for the local LLM community. Stuff like this is just killer work.
danielhanchen@reddit (OP)
No worries and appreciate the support as usual!