unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage

[-]

jacek2023@reddit

I always say that people should try different models, different quants, and different GGUF sources. But people are too busy to do anything except hyping the benchmarks and watch YouTube, so here we are.

[-]

yoracale@reddit

I absolutely agree with you but the way OP presented the post made it sound like an Unsloth only problem, but they did not conduct a thorough research.

Bartowski’s quants had NaNs in 10 out of 26 cases. Ours had NaNs in 5 out of 23 cases.

So this was a broader issue, not something unique to our releases. We fixed all of ours and here is our post: https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

Total_Activity_7550@reddit

Not using Unsloth since Qwen3.5 release. Their quants (although they published an article and uploaded plenty of checkpoints to prove how good they are) just didn't work well with long context agentic tasks. Bartowski's worked well, I guess others work too.

[-]

yoracale@reddit

FYI, OP didn't research thoroughly before choosing such a dramatic title and description.

Bartowski’s quants also contained NaNs: 10/26 (38%). Ours had 5/23 (22%) with NaNs.

So this was a broader issue, not something that affected only us.

We fixed all of ours and also published a larger investigation here: https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

Then-Topic8766@reddit

As strange as it sounds - I was hoping a post like this would appear. When the MiniMax 2.7 quants appeared I happily rushed to download the MiniMax-M2.7-UD-Q4_K_M from Unsloth. On my slow ADSL this means 12-15 hours. Since I don't have much space on the SSD - I deleted MimiMax 2.5 - one of my favorite models, convinced that the new version is even better. This morning, with my first coffee, I set out to try out a new model. What a disappointment! Errors, loops, thinking endlessly... I deleted again and am now downloading Q4 from another author. I hope that the problem is only in the quant, that it is not a regression of the model.

As for the Unsloth guys some of the best quants I've used are their 'UD'. I am convinced that they are doing their best and that they are overwhelmed with work. I've also downloaded Gemma-4 a few times - I don't regret it as the models turned out fantastic in the end. Thanks to everyone in the community for the great work and experience they provided me.

[-]

yoracale@reddit

Appreciate the support. They didn’t properly test all of the quants, and other uploaders, like bartowski, ran into the same issue as well 10 out of 26 of their GGUFs had the NaN problem, just like ours. The OP’s title and description were sensationalist. We also made a post on LocalLLaMA about our investigation.

[-]

ThePrimeClock@reddit

Zero-day support and all we get is wah-wah.

It blows my mind how pathetically intolerant people have become with open source developers valuable time. Think of this as a free driver update, and be grateful rather than having a sook.

The unsloth lads have had every chance to take multi-million dollar jobs at frontier labs and instead they support us, and yet still have yo put up with this lazy whinging about free downloads.

Pull your head in.

[-]

danielhanchen@reddit

Thanks for the support and appreciate it! We did a larger investigation at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ as per community request

[-]

ThePrimeClock@reddit

Thanks Mate. You and your team's contributions to this community are invaluable.

[-]

yoracale@reddit

Thank appreciate it! OP''s sensationalist title and description has already done the damage especially when their research wasn't conducted properly but folks like you really cheer us up! So thanks once again! 🙏

[-]

segmond@reddit

Exactly, their fast release often helps flush out implementation / template bugs in llama.cpp, quite often their release beats commercial API offerings!

[-]

segmond@reddit

I'm running unsloth Q5 and Q8 and both works great for me with no issue. No one is forcing you to use them.

[-]

yoracale@reddit

Appreciate the support! They didn't test all the quants properly, other uploaders like bartowski had the same issue where 10/26 of their GGUFs had the NaN issue just like us. OP's title and description was sensationalist and the damage has already been done unfortunately. We made a post about our investigations also on localllama!!

[-]

Aggressive-Permit317@reddit

Appreciate the heads up. I was literally about to download that exact quant. Saved me a ton of wasted time. Anyone find a working quant for MiniMax M2.7 yet or are we sticking to the official ones for now until Unsloth fixes their pipeline?

[-]

yoracale@reddit

OP did not check every other GGUFs by other uploaders. All uploaders and not just unsloth also experienced the issue. The specific quants you tested for bartowski were fine but 10/26 of their other uploads had the same NaN issue.

Also we updated it with benchmarks, fixes and finding here: https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

yoracale@reddit

Hey OP u/one-macaron6572

Would be amazing if you could update your original post claiming that our quants only had the issues when all uploaders also experienced the issue. The specific quants you tested for bartowski were fine but 10/26 of their other uploads had the same NaN issue.

Also we updated it with benchmarks, fixes and finding here: https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

Thanks so much!

[-]

dampflokfreund@reddit

That's pretty bad, I thought they would verify the quants before uploading, but that would explain why they are always so fast. Bartowski takes longer, probably because he verifies them.

[-]

danielhanchen@reddit

Hello I'm currently investigating:

AesSedai mentioned: 04-12-2026: The Q4_K_M I uploaded seems to have some issues, the PPL / KLD was throwing nan so I'll remove the model for now and try to get a working quant up tomorrow.
Ubergarm: OBSERVATION: Interestingly, the PPL does not look great on this one, but the KLD looks fine. The previous M2.5 also had some "poorly behaved" perplexity results as well with 4ish BPW quants showing "better" than baseline PPL.

99.9% KLD and Median KLD are fine, but yes OP is correct - UD-Q4_K_S, MXFP4_MOE, UD-Q4_K_M, UD-Q4_K_XL, UD-Q5_K_S do have NaN KLD.

Interestingly, smaller quants do NOT have NaNs - it's a Q4_K / Q5_K issue in ffn_down_exps

Quant	Size_GB	99.9%_KLD	Mean_KLD	Median_KLD
UD-IQ1_M	56.53	8.992761	0.548485	0.221076
UD-IQ2_XXS	60.89	8.489742	0.491275	0.188415
UD-IQ2_M	65.32	8.163422	0.447142	0.16768
UD-Q2_K_XL	70.11	7.635336	0.381748	0.13571
UD-IQ3_XXS	74.6	6.750962	0.309478	0.104197
UD-IQ3_S	77.87	6.56956	0.292742	0.097377
UD-Q3_K_S	87.21	5.160509	0.199284	0.062797
UD-Q3_K_M	94.29	4.797431	0.169051	0.052831
UD-Q3_K_XL	94.94	4.498796	0.16629	0.051984
UD-IQ4_XS	100.97	3.9325	0.134375	0.041285
UD-IQ4_NL	103.15	4.033768	0.134576	0.041012
UD-Q4_K_S	121.68	2.76419		0.02117
MXFP4_MOE	126.67	3.106201		0.028357
UD-Q4_K_M	130.39	2.444118		0.017647
UD-Q4_K_XL	130.96	2.446586		0.017101
UD-Q5_K_S	147.97	2.048012		0.012206
UD-Q5_K_M	157.23	2.000825	0.039498	0.010628
UD-Q5_K_XL	157.81	2.002362	0.039439	0.010561
UD-Q6_K	175.15	1.851206	0.036952	0.009782
UD-Q6_K_XL	193.2	1.845684	0.03237	0.008271
Q8_0	226.44	1.765187	0.03076	0.007753
UD-Q8_K_XL	229.64	1.767597	0.030709	0.007732

[-]

VoidAlchemy@reddit

Heya u/danielhanchen , I'm not sure why you're quoting me talking about my own quants here? I assume your LLM is not good with context and that you're a busy guy. Regardless, thanks for clearly listing your broken quants!

I believe the thread you want to reference is this one where some folks discovered your quant throwing nans: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/discussions/1#69dbf6b7b3209da89bfa050f

Also thanks u/One-Macaron6752 for letting folks know and so Daniel and Michael can fix their stuff.

Cheers and glad the whole llama community is working together to improve - a rising tide lifts all boats!

[-]

danielhanchen@reddit

u/VoidAlchemy I found the issue and fixed it at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

Also it doesn't just affect us - 10/26 (38%) of Bartowski's quants also NaN whilst 5/23 (22%) of ours NaN. So it's a widespread issue.

[-]

VoidAlchemy@reddit

thanks for the heads up, i'll check out your post over there. too bad no root cause identified so far, guessing y'all mixed up some of the trouble tensors a bit to get around it.

appreciate you sharing your findings and nice job turning around some clean quants! cheers!

[-]

danielhanchen@reddit

Thanks! Yes sadly the only "trick" was use Q6_K for the last ffn_down tensor and the NaN disappeared - my guess is it's overflowing somewhere

[-]

tarruda@reddit

Heya u/danielhanchen , I'm not sure why you're quoting me talking about my own quants here?

What is funny is that in a hf thread I had mentioned your "smol-IQ2_XS" quant of Qwen 3.5 397b, they deleted my comment as "off-topic". All I did was say: "Hey, check out this ubergarm quant that works super well on 128g!"

[-]

One-Macaron6752@reddit (OP)

u/danielhanchen I appreciate taking the time to reply to my argumentation.

Perplexity check (without KLD) on a model sized as MiniMax takes roughly 5 minutes per qunat. I can imagine you could batched such tests at least for "pure" and/or UD quants so that accidents won't happen again.
Also, even if not published the first day you push the quants, it would still be of good help / assert trust from the community if you publish PPL / KLD at a later time in the model card. It doesn't have to reference any other fellow quanter similar PPL/KLD figures (to avoid useless competition!) but this could also serve the baseline sanity checks for most of the interesting, meaningful quants for the community.

[-]

danielhanchen@reddit

u/One-Macaron6752 I did a full investigation at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

10/26 NaNs (38%) found at https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.7-GGUF: Chunk-32 failures (9): IQ3_XXS, IQ3_XS, IQ3_M, Q3_K_M, Q3_K_L, Q3_K_XL, Q4_K_S, Q4_1, Q5_K_S. Late failure (1): IQ1_S (crashed at chunk 311)
5/23 NaNs (21%) ours had NaNs - all fixed now at https://huggingface.co/unsloth/MiniMax-M2.7-GGUF: UD-Q4_K_S, UD-Q4_K_M, UD-Q4_K_XL, UD-Q5_K_S, MXFP4_MOE. All block 32.
1/4 NaN Q4_K_M at https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF was deleted due to NaNs. Block 32 as well.

[-]

1ncehost@reddit

Just wanted to report Ive used UD-IQ3_S with opencode and it appears to be quite good and as I'd expect for that quant level.

[-]

ambient_temp_xeno@reddit

It is about Unsloth. I don't even know what I'm supposed to say. They just release broken quants all the time and people should not get them.

[-]

danielhanchen@reddit

Actually this is false - I did an investigation at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

10/26 (38%) of Bartowski's quants also NaN
5/23 (22%) of ours NaN.
So it affects everyone.

[-]

ambient_temp_xeno@reddit

I mean I get that it's usually llama.cpp breaking them, but my problem is with them being uploaded without testing, and then being updated with the same filenames.

[-]

danielhanchen@reddit

We do test them, but sometimes these are one offs - KLD 99.9% was perfectly fine (that's what we run) - but it's not just our problem - we upload 20+ quants - other people do the same and have even more broken quants.

And hence why the large analysis at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

ambient_temp_xeno@reddit

Well, thanks for testing these anyway. I at least won't have to re-download Bartowsku's q8.

Another issue with the same filenames for new revisions of the quants is sometimes people have say 5 of 7 downloaded and then the last two splits are from a newer revision.

[-]

danielhanchen@reddit

Oh there's 2: 1. Split metadata to a small 10MB file for shard 1 2. HF_XET will skip chunks which are the same, so it won't actually redownload (doesn't always work though)

[-]

relmny@reddit

"They just release broken quants all the time..."

Do you have any proof of that?
Not even all their Minimax-m2.7 quants are "broken"...

Is "bashing on Unsloth" the new thing?

I get it that when there's an issue, for us to voice it, but bashing on a team that gives so many things for free (quants, documentation, tools, etc)... ?

[-]

ambient_temp_xeno@reddit

They remove the broken versions when they update them. Not all of the updates are their fault though.

[-]

relmny@reddit

Based on the hf page, they updated 5 quants, the rest are the same from day one.

[-]

ambient_temp_xeno@reddit

I'm not talking about the m2.7 one. It's been an ongoing thing for a long time. Some of the things could be avoided if there wasn't a 'rush to be first to upload'.

[-]

relmny@reddit

Like what?

What about all models that have issues from day one like gemma-4, qwen3/3.5 and many others or chat-templates and so on.

I'm glad they "rush" because if their quants have no issues, they can still be tested with the inference engines and patches/updates for them (llama.cpp and so) can come earlier.

And, they do it for free, they have their own timings. AFAIK nobody pays them to make quants.

But yet people still complain for many things that we get for free.

I really don't understand the hate. Don't like it? don't use it! you're not paying for it and nobody is forcing you to use it!

One thing are posts like OP, where are constructive. Another thing is just "I hate them because they make things free and they make mistakes"

And I'm done with this

[-]

ambient_temp_xeno@reddit

My whole point was I don't use and neither should other people. My time isn't free.

[-]

yoracale@reddit

Hey do you have examples or errors where a quant of ours was broken? We're more than happy to address the issue which we usually do if someone posts about it in the Discussion page otherwise we might not be aware of it.

As for MiniMax-M2.7, when we ran perplexity and KLD benchmarks on every MiniMax-M2.7 4-bit quant for Q4_K_XL, MXFP4MOE, IQ4_XSS no matter what etc, all of them did in fact show unusually high PPL compared with the other bit sizes. AesSedai and ubergam reported seeing similar issues as well and AesSedai for example deleted their one.

So it seems that the model may very well be sensitive to quantization especially at 4-bit. That said, we initially kept it up because Benjamin Marie’s benchmarks on M2.5 (which uses the same arch as M2.7) suggested that Q4_K_XL performed the best overall, so we did not remove it at the time. In fact this time, our Q4_K_XL had even more layers upcasted than M2.5

In our own internal testing, Q4_K_XL also performed very well, which led us to believe the elevated PPL might have been a fluke, since that does happen from time to time.

But, as a precaution, we’ll remove the Q4_K_XL quant for now in case there are any further issues, and we’ll pay closer attention to PPL in future evaluations.

u/danielhanchen is still doing more investigation on the matter on what could be the cause and how we can alleviate the issue.

[-]

ambient_temp_xeno@reddit

One problem that could be fixed (applies to other people too) is that when quants are revised, there's no way to tell which revision someone is using apart from file hashes. This means people going around complaining about a model using a broken quant.

[-]

Individual_Spread132@reddit

Hi there! I'm not sure if it's relevant anyhow, but the last time I checked - I've noticed your M2.7 Q4K_XL was larger in size compared to a similar quant of M2.5 (I found it weird, since I expected to have it running on my hardware exactly like the older model did, be. I thought, maybe there's some clever reason for this, but now - given the commotion - sharing these concerns seemed like a good idea to me.

[-]

-dysangel-@reddit

You have a good point, but you're presenting it in a really inflammatory/unhelpful way.

[-]

terablast@reddit

unhelpful

They literally gave examples on how unsloth can test it themselves to fix it lol, how could he be more helpful?

[-]

danielhanchen@reddit

I did a full investigation at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ - this issue affects all quant uploaders so not just us. I also found a fix.

[-]

-dysangel-@reddit

by not starting with the "I'm sick of this shit" attitude

[-]

One-Macaron6752@reddit (OP)

Let me take this one from a personal performance: for me downloading and proofing models is quite a time / resources consuming, thus - inflammatory or not - I need to address it. I have already written to other rushed in quants publishers to avoid such "rush in for visibility and ego pleasing" and watch out for similar catastrophic approaches (no immatrix used for MoE quantization!).
With unsloth it has become a norm: they've go somekind of agreements with the model owners and get sometimes early access to their models and the nanosecond the modle publisher is onlinw with their new model so is Unsoloth with some quants, of disputable quality (see GEMMA episode also).

[-]

Embarrassed_Soup_279@reddit

im not a unsloth hater but i feel like maybe they are also inflating download metric by rapidly updating quants, especially with gemma 4, yes there were a bunch of bug fixes in llamacpp but if you can wait a few days before updating when you know there are still known bugs left.. maybe im just being picky.

[-]

danielhanchen@reddit

Re our 3 re-uploads - these are due to llama.cpp fixing bugs - this was out of our control (we're llama.cpp contributors, but not the main devs) - we could have waited, but it's best to update when multiple (10-20) bugs are fixed.

The 4th is Gemma themselves fixing the chat template for tool calling.

https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.

[-]

-dysangel-@reddit

It's time consuming and frustrating for me too - but if you talk to people like that, they are less likely to listen even though you are making a good point.

[-]

Diecron@reddit

are we reading the same post? OP brought the receipts and respectfully asked for additional safety checks in future, seems reasonable to me?

[-]

tm604@reddit

Have a look at the words used by the OP:

never caring to prove
like most of the other quants creators
disastrous faults
useless rush
utterly broken

Inflammatory, unhelpful, not very respectful - and it'd take less effort just to skip the clickbait title and complaints and go straight to the important information. "Thanks for the quants but there are some issues, these show up when running --validate-quants so we would appreciate it if you could run that step in future as part of your release pipeline".

[-]

One-Macaron6752@reddit (OP)

I have amended the text not to be so offending... the essence is still there: the model has catastrophic failure in regards to what "catastrophic" word mean to MoE quantization.

[-]

Asleep-Ingenuity-481@reddit

Finally someone said it. I don't think I have used a single Unsloth quantization (save for a mistral model) that actually worked without issue (save for it being a mistral model in the big 2026)

And it makes sense when you realize that their quants are being pushed usually less than an hour after the base model is released. They push it out so they can be the first despite the fact that there is really no competition in this space.

[-]

danielhanchen@reddit

As requested by the communtiy, I did a large investigation and fixed it.
10/26 (38%) of Bartowski's quants also NaN
5/23 (22%) of ours NaN. So it affects everyone. See https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

Dany0@reddit

Unsloth, we love you, but please, take this seriously

[-]

danielhanchen@reddit

Yes I'm investigating - both AesSedai and ubergarm did mention some issues with MiniMax - AesSedai for eg: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF

04-12-2026: The Q4_K_M I uploaded seems to have some issues, the PPL / KLD was throwing nan so I'll remove the model for now and try to get a working quant up tomorrow.

[-]

MerePotato@reddit

Hey, appreciate the work you guys do in actively updating your quants, the Qwen 3.5 release was a bit troubled but you've done a stellar job with Gemma 4. I don't know why people are acting critical to such an extreme degree all of a sudden when day one releases are always going to be less stable and you guys at least put in the work to keep merging in fixes

[-]

Karyo_Ten@reddit

Because they claim Unsloth Dynamic is high quality while it consistently has lower PPL/KLD than other quants for the same bitrate.

And "chat template" fixes for so many of the past releases while the BF16 checkpoint actually had none.

[-]

MerePotato@reddit

You do realiser lower KLD is good right?

[-]

Karyo_Ten@reddit

whoops lapsus yes I do.

[-]

MerePotato@reddit

Ah gotcha, you never know with comments on here. As I understand it isn't the variance between quant providers generally within the range of statistical noise for the most part?

I have noticed Unsloths quants fall slightly short on some KLD measures after looking into it, but not to any degree I'd normally consider significant

[-]

Karyo_Ten@reddit

As I understand it isn't the variance between quant providers generally within the range of statistical noise for the most part?

Not really, for those I tested you have detailed PPL and/or KLD on AesSedai and Ubergarm quants for GLM-4.6, GLM4.7, Kimi-K2.5, MiniMax-M2.5 and Unsloth quants are always worse for same bpw.

The thing is, AesSedai and Ubergarm's recipe is very simple, keep attention in Q8, then pick the base quality depending on target size and if there are dense ffn (GLM 4.x) keep them +1 quality, and if you have extra size, keep down projection +1 quality.

All that dynamic quant seems like marketing fluff when it's always beat by that simple formula.

[-]

KURD_1_STAN@reddit

Well it isnt fun to redownload 100s of GBs, i haven't downloaded a single gemma yet. We all def appreciate what all those GGUFers do and cant Blame unsloth cause i know how inpatient people are

[-]

danielhanchen@reddit

Appreciate it - we'll always strive to do better!

[-]

Marksta@reddit

It seems a little silly to fault them for being first, but not perfect. If they just held back for the usual 3-7 days it takes of new model boo-boos to be figured out, sure they could maybe post a perfect one the first time it goes out.

But then somebody else will need to deliver broken quants, await llama.cpp updates, release again, test again... It's really the real service they're providing here spending their time reading reports and re-doing quants over and over for the first week.

[-]

ahjorth@reddit

I think everyone appreciates that there is a balance between being fast and being perfect. But I don't think it's fair to say that posting this is silly. OP is clear about what the issues are, clear on what the solution is, and even has measures for how long (or rather, how little) it would take to do this properly per model.

These issues are causing petabytes of unncessary data transfers, and dozens or hundreds (or thousands for the highly anticipated models) of person hours going to waste. I think it's in everybody's interest to prevent that, and this is a small, concretechange to the release procedure.

[-]

yoracale@reddit

Hey do you have examples or errors where a quant of ours didn't work? We're more than happy to address the issue which we usually do if someone posts about it in the Discussion page.

As for MiniMax-M2.7, when we ran perplexity and KLD benchmarks on every MiniMax-M2.7 4-bit quant for Q4_K_XL, MXFP4MOE, IQ4_XSS no matter what etc, all of them did in fact show unusually high PPL compared with the other bit sizes. AesSedai and ubergam reported seeing similar issues as well and AesSedai for example deleted their one.

So it seems that the model may very well be sensitive to quantization especially at 4-bit. That said, we initially kept it up because Benjamin Marie’s benchmarks on M2.5 (which uses the same arch as M2.7) suggested that Q4_K_XL performed the best overall, so we did not remove it at the time. In fact this time, our Q4_K_XL had even more layers upcasted than M2.5

In our own internal testing, Q4_K_XL also performed very well, which led us to believe the elevated PPL might have been a fluke, since that does happen from time to time.

But, as a precaution, we’ll remove the Q4_K_XL quant for now in case there are any further issues, and we’ll pay closer attention to PPL in future evaluations.

u/danielhanchen is still doing more investigation on the matter on what could be the cause and how we can alleviate the issue.

[-]

VoidAlchemy@reddit

Yes we've all observed unusual perplexity values with MiniMax-2.x in that 4ish BPW range, that is not the relevant piece here.

The issue is that some of your quants were not finishing perplexity runs and giving nans as daniel points out above.

The thread you are looking for is likely this one where a guy runs PPL and finds nans on your quant: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/discussions/1#69dbf03bf75df2eec05fa642

Cheers and thanks for fixing things up! I'd love to hear if you guys discover why this is happening on mainline llama-quantize and how you try to work-around it!

[-]

Paradigmind@reddit

I'm not sure if this is relevant, but people discuss about possibly broken quants in this post.

[-]

One-Macaron6752@reddit (OP)

Thank you u/yoracale for your reply.

Perplexity check (without KLD) on a model sized as MiniMax takes roughly 5 minutes per qunat. I can imagine you could batched such tests at least for "pure" and/or UD quants so that accidents won't happen again.
Also, even if not published the first day you push the quants, it would still be of good help / assert trust from the community if you publish PPL / KLD at a later time in the model card. It doesn't have to reference any other fellow quanter similar PPL/KLD figures (to avoid useless competition!) but this could also serve the baseline sanity checks for most of the interesting, meaningful quants for the community.

[-]

audioen@reddit

The last time I saw any NaN's, they were a result of numeric overflow inside llama.cpp inference engine due to limited numeric range on fp16 (which was used on Vulkan) and gpt-oss-120b. The problem appears as the model suddenly getting stuck and only repeated G from there on. Sampler saw nothing usable due to the NaN being generated, all tokens had the same probability which was something like 0, and it chose one.

This is probably something similar: a value near the extreme range in some floating point accumulator that occurs during inference and happens to get triggered in Q4 quantization, but happens to not trigger on higher bit quantization. They fixed the issue on Vulkan by post-processing the results and replacing infinity values with the maximum representable real values. Downside of this is that it costs some performance.

As to perplexity, I think this is not a good measurement of quality for models with a chat template, as the perplexity is influenced by the missing chat template in the perplexity evaluation context. A large value like 8 is not reasonable, even 1B models probably have lower perplexity than this. I think we should see some value around 2 if the testing is done correctly. I have no every reason to expect that modern good large models like MiniMax and Qwen3.5 would get fairly comparable numbers appropriate for their parameter count if only we used the correct chat templates during the test.

I'm not sure how much this affects K-L divergence measurements, but I expect it's probably harming them as well. As long as the text being given to model is unnatural, i.e. not following its chat template, it is in some quasi-trained state, and measuring its performance in this condition and making quantization decisions could be a fool's errand.

[-]

danielhanchen@reddit

Yes it most likely is an overflow issue - I did a larger investigation at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

notdba@reddit

Your point of view about PPL / KLD seems a bit .. over dramatic? These instruct models can still perform extremely well when using the /v1/completions text completion endpoint. They don't lose that capability.

Anyway, we are slowly moving from PPL to KLD and Top1 when evaluating quants, so I think that's already a very good progress.

[-]

durden111111@reddit

I just use bartowski quants from now on. Ole reliable

[-]

danielhanchen@reddit

10/26 (38%) of Bartowski's quants also NaN
5/23 (22%) of ours NaN.
So it affects everyone. I fixed them all and did a larger investigation at https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/

[-]

fallingdowndizzyvr@reddit

Didn't people learn from the recent Gemma experience to wait a few days?

[-]

MelodicRecognition7@reddit

*weeks

[-]

UnreasonableEconomy@reddit

A downvote or star feature on HF might be useful

[-]

vulcan4d@reddit

It is also my understanding that quanta for Minimax are real bad including the go to Q4.

[-]

MrMisterShin@reddit

This is why I only use standardised quants for GGUF regardless of provider. Q4_K_M, Q6 and Q8.

All these IQ, UD etc etc always have problems one way or another regardless of provider. I’m tired of it.

[-]

MerePotato@reddit

UD-Q8_K_XL and Bartowskis Q8_K_L at least shouldn't give you any issues, its just Q8 with a few more layers left at full precision

[-]

MrMisterShin@reddit

I’m sure there is a difference going for quants above the generic Q8 like UD-Q8_K_XL and Q8_K_L, but I couldn’t notice them with my coding prompts and tool calls.

What I did notice was a slight impact on prompt processing and token generation on my prompts. This is to be expected, due to slight increase is data size.

[-]

MerePotato@reddit

Yeah at worst I'd expect them to be indistinguishable but I don't see a world where UD-Q8_K_XL is bad

[-]

segmond@reddit

jokes on you, I got sold on UD_K_XL when my local DeepSeekv3.1-Q3-UD_K_XL was crushing all the APIs that were seving Deepseek on Openrouter.

[-]

Safe-Thanks-4242@reddit

[-]

New_Zucchini_3843@reddit

I don’t mean to criticize any individuals or organizations, but

I’ve been using the Q6-K_L version of gemma-4-31-it created by bartowski since its initial release, and I haven’t encountered any of the issues that were reported on Reddit and elsewhere. While some of those issues likely stem from user configuration errors, I’m not sure if there’s a connection between the problem the OP is raising and this one, but I’m very interested in it.

[-]

Sicarius_The_First@reddit

this is why im slow to release stuff...

[-]

Sicarius_The_First@reddit

and the one time i wasn't, initial pepe 70b was broken :(
(now fine though)

[-]

pepe256@reddit

Initial me?

[-]

rm-rf-rm@reddit

no, pepe70b. Different guy

[-]

Diecron@reddit

thanks for making that UBW dataset public btw. I was playing with it recent with synthetic reasoning traces (they're not good, but it's just for my education).

[-]

yoracale@reddit

When we ran perplexity and KLD benchmarks on MiniMax-M2.7 Q4_K_XL, it did in fact show unusually high PPL compared with the other quants. AesSedai and ubergam reported seeing similar issues as well.

That said, we initially kept it up because Benjamin Marie’s benchmarks on M2.5 (which uses the same arch as M2.7) suggested that Q4_K_XL performed the best overall, so we did not remove it at the time. In fact this time, our Q4_K_XL had even more layers upcasted than M2.5

In our own internal testing, Q4_K_XL also performed very well, which led us to believe the elevated PPL might have been a fluke, since that does happen from time to time.

But, as a precaution, we’ll remove the Q4_K_XL quant for now in case there are any further issues, and we’ll pay closer attention to PPL in future evaluations.

u/danielhanchen is still doing more investigation on the matter on what could be the cause and how we can alleviate the issue.

[-]