MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q
Posted by HealthyCommunicat@reddit | LocalLLaMA | 36 comments
Absolutely amazing. The M5 Max should be something like 50 tokens/s and 400 pp; we’re getting closer to “Sonnet 4.5 at home” levels.
63gb: https://huggingface.co/JANGQ-AI/MiniMax-M2.7-JANG_2L
89gb: https://huggingface.co/JANGQ-AI/MiniMax-M2.7-JANG_3L
sammcj@reddit
M5 Max 128GB here - I get around 60 tk/s on a 3-bit quant on oMLX. It doesn't seem as reliable with tool calling as Qwen 3.5 122-A10B; it hallucinated a fair bit over the half hour or so I was trying it out. (temp 1.0, top_p 0.95, top_k 64)
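For reference, those sampling settings map onto the stock mlx_lm Python API roughly like this (a sketch, not oMLX itself; the model path is a placeholder and the exact make_sampler keywords depend on the mlx_lm version):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Placeholder model path; any local MLX quant directory works the same way.
model, tokenizer = load("mlx-community/MiniMax-M2.7-3bit")

# temp 1.0 / top_p 0.95 as above; top_k support varies by mlx_lm version, so it's omitted here.
sampler = make_sampler(temp=1.0, top_p=0.95)

print(generate(model, tokenizer,
               prompt="List the files in /tmp using a single shell command.",
               sampler=sampler,
               max_tokens=128))
```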
dinglehead@reddit
When you talk about tool calling... are you talking about straight through the model or through a harness like Claude Code? I've found that tool calling when running models locally can be sketchy right up until I use Claude Code at which point it becomes almost perfect
sammcj@reddit
I was testing through OpenCode in this case but can certainly try through CC and report back!
dinglehead@reddit
Yea, the only issue is Claude Code eats a ton of context with its system prompt, but I've found almost every model nails tool calls when using it. It's kinda wild.
sammcj@reddit
Tried it with Claude Code and it took 4-5 minutes just to process the prompt (~40k tokens), which was weird - that was the case with both oMLX with the 3-bit mlx-community quant and vMLX with their 3.1-bit JANG quant.
Memory for both grew to around 108GB, so it's really too large for 128GB IMO.
dinglehead@reddit
Yea, you're probably right - I've been running it on 256GB. Not enough room for context on 128GB.
Mochila-Mochila@reddit
Could it be due to the quant being 3 bit ? I've read this model is very sensitive to quantisation.
Kuane@reddit
How did you get it to load? It is not loading the model for me.
onil_gova@reddit
One potential cause of a failed load is that the model download didn't finish completely.
polawiaczperel@reddit
I know why people are going that far with quants, but isn't there too much degradation going below 5-bit?
HealthyCommunicat@reddit (OP)
You would think so, but I have a 4-bit and a 6-bit too - they're still at 95%. My JANGQ method specifically keeps the most important layers at 8-bit. That's what gives this kind of performance for MoE models: the attention and other critical weights make up less than 1% of the parameters, so even at 8-bit they barely add to the size, and the total stays small as long as we keep everything else even smaller.
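A rough sketch of that idea in Python (not the actual JANGQ code; the layer-name patterns and bit widths here are assumptions): keep attention, embedding and router weights at 8-bit and push the expert FFN weights lower, since the high-precision portion is a tiny fraction of the parameters.

```python
# Sketch of mixed-precision bit assignment for an MoE checkpoint.
# Not the actual JANGQ code; the layer-name patterns and bit widths are assumptions.

HIGH_PRECISION_PATTERNS = ("embed_tokens", "attn", "norm", "lm_head", "gate", "router")

def bits_for(param_name: str, low_bits: int = 3, high_bits: int = 8) -> int:
    """Attention/embedding/router weights stay at 8-bit, expert FFN weights go lower."""
    return high_bits if any(p in param_name for p in HIGH_PRECISION_PATTERNS) else low_bits

def estimated_size_gb(param_shapes: dict, low_bits: int = 3) -> float:
    """Rough quantised size, ignoring the overhead of scales and zero points."""
    total_bits = 0
    for name, shape in param_shapes.items():
        numel = 1
        for dim in shape:
            numel *= dim
        total_bits += numel * bits_for(name, low_bits)
    return total_bits / 8 / 1024**3

# Toy example: one attention matrix and one (much larger) expert FFN matrix.
print(estimated_size_gb({
    "model.layers.0.self_attn.q_proj.weight": (4096, 4096),
    "model.layers.0.experts.0.ffn.weight": (65536, 4096),
}))
```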
DeepOrangeSky@reddit
Glad you mentioned this, since I was pretty curious how the q3 compared to the higher quants. Did it miss the exact same questions, or different ones? Were those last couple questions way harder than the other 18-19 questions on each test?
Sounds pretty sick, in any case. At or near full strength at q3. I knew for big models it was supposed to be plausible for certain use cases, but never saw a benchmark test that was actual tasks rather than just perplexity or KL divergence or something like that, so always wondered.
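One way to check that (a sketch assuming a hypothetical per-question results file, not OP's actual harness) is to diff the sets of missed question IDs between two quants:

```python
import json

# Hypothetical per-question results format: [{"question_id": ..., "correct": true/false}, ...]
def missed(path: str) -> set:
    with open(path) as f:
        return {r["question_id"] for r in json.load(f) if not r["correct"]}

q3_miss = missed("results_3bit.json")   # placeholder filenames
q6_miss = missed("results_6bit.json")

print("missed by both quants:", sorted(q3_miss & q6_miss))
print("missed only at 3-bit: ", sorted(q3_miss - q6_miss))
print("missed only at 6-bit: ", sorted(q6_miss - q3_miss))
```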
Miserable-Dare5090@reddit
Would you say MMLU is a fair estimator of quality loss in all departments? I know for general chat it may be, but wondering if the 4bit / 6 bit quants fare better overall.
HealthyCommunicat@reddit (OP)
Tbh all benchmarks are fine if you're just comparing "how different is this from that" - in this case, if you benchmarked a standard MLX quant of the same size in GB against the JANG quant, it would show how much the JANG quant beats it by (it beats it by a lot).
-dysangel-@reddit
Why 5 bit in particular?
Larger models handle quantisation a lot better than smaller ones. I've had fine coding results with Deepseek R1-0528, GLM 5/5.1 and now Minimax 2.7 at IQS_XXS.
Creepy-Bell-4527@reddit
Can we get a REAP-ed 3L that will fit nicely in 96GB?
Kuane@reddit
Thx for your fast work on these quants. I am trying to download the 2-bit model, but it seems the files are incomplete / still uploading?
The 3-bit gave me this error on oMLX:
Expected shape (200064, 288) but received shape (200064, 384) for parameter model.embed_tokens.weight
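A mismatch like that often means the loader expects a different quantisation packing (bits / group size) than what the file actually contains, or that a shard is truncated. A quick framework-agnostic check (a sketch; the shard path is a placeholder) is to read the safetensors header directly and see what shape is really stored:

```python
import json, struct, sys

def tensor_shapes(path: str) -> dict:
    """Read only the safetensors JSON header (8-byte length prefix + JSON)."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {name: meta["shape"] for name, meta in header.items() if name != "__metadata__"}

# Placeholder shard path; pick the shard that contains embed_tokens.
shapes = tensor_shapes(sys.argv[1])
print(shapes.get("model.embed_tokens.weight"))
```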
HealthyCommunicat@reddit (OP)
Can you let oMLX know? I’m sure their users like you would wanna use this
Ok_Technology_5962@reddit
vMLX is amazing. For some reason I just tend to use it for image gen. The UI is a bit hard to navigate back and forth when servers start, or when you try to go back to them, but it's not a big deal. The ease of use is amazing, especially with the JANG quants.
HealthyCommunicat@reddit (OP)
You're totally right about this. I made it with pure server use in mind; I keep saying I'll do a redesign to make the flow / usage easier for beginners, but I keep running out of time.
Kuane@reddit
Yes reported the issue.
HealthyCommunicat@reddit (OP)
As far as the 2-bit goes - HF is weird with fresh uploads :/
Sydorovich@reddit
"At home" is 3090 GPU level at most.
unbannedfornothing@reddit
why?! where is your B200 cluster? mine is in a garage
emprahsFury@reddit
You're still on the B200s. I just had 2 B300s set up last week, thinking of getting 2 more if llama.cpp fixes layer parallelism.
thrownawaymane@reddit
You guys still live at home? I live in my data center now. In space.
Plasmx@reddit
How is the view? Did the Artemis guys come by on their swing?
CalligrapherFar7833@reddit
Dude, B300? Are you a povert? I moved to R300 already.
InterstellarReddit@reddit
You’re right, ima head down and pick up some milk and cereal, and I’ll get the B200 on the way out.
MrBIMC@reddit
I did set up IQ4 quants on a single 3090 + 128GB RAM.
It gives around 13 tokens per second with ngl auto and 53 layers offloaded.
Idk whether that's decent or not. At this stage I'd rather use dense Qwen at >40 tps.
But I do plan on running some tests overnight. Previously, afaik, MiniMax was getting very stupid after quantisation.
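That setup maps roughly onto the following llama-cpp-python call (a sketch; the GGUF filename is a placeholder and n_gpu_layers=53 just mirrors the figure above):

```python
from llama_cpp import Llama

# Placeholder GGUF path; n_gpu_layers controls how many layers stay on the 3090,
# with the remainder kept in system RAM.
llm = Llama(
    model_path="MiniMax-M2.7-IQ4.gguf",
    n_gpu_layers=53,
    n_ctx=16384,  # context window; raise if RAM allows
)

out = llm("Summarise what MoE expert offloading does in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```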
cheechw@reddit
Actually for the majority of the world at home level is Intel embedded graphics.
i_am_exception@reddit
What’s the context size you are working with? I'd imagine the pp figure doesn't mean much until the context gets big enough.
MrHaxx1@reddit
Although a 128GB Mac is still twice what I'm willing to spend on an LLM machine, it looks like the future is bright for local LLMs.
thesmithchris@reddit
I needed a 64GB MacBook for work (M4); adding 64GB was $1.1k. I did not go for it, as I did not expect home models to get so close to SOTA models. I still do not regret this, but my next MacBook (M7/M8) will likely have max RAM with local LLMs in mind.
misha1350@reddit
I think having a REAP version would be even better, for those who only have a 64GB machine.
Budget-Juggernaut-68@reddit
I'd like to see the options shuffled and the results re-run, to make sure the answers aren't just memorized.
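A simple way to do that (a sketch assuming a generic MMLU-style question dict, not any particular harness's format) is to permute the choices and remap the gold answer before prompting:

```python
import random

def shuffle_options(question: dict, seed: int = None) -> dict:
    """Return a copy of an MMLU-style item with choices permuted and the answer index remapped."""
    rng = random.Random(seed)
    order = list(range(len(question["choices"])))
    rng.shuffle(order)
    shuffled = [question["choices"][i] for i in order]
    new_answer = order.index(question["answer"])  # answer stored as an index into choices
    return {**question, "choices": shuffled, "answer": new_answer}

# Toy example: the correct choice ("4") keeps pointing at the right text after shuffling.
item = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
print(shuffle_options(item, seed=0))
```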