MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q
Posted by HealthyCommunicat@reddit | LocalLLaMA | 36 comments
Absolutely amazing. The M5 Max should be something like 50 tokens/s and 400 pp; we’re getting closer to “Sonnet 4.5 at home” levels.
63gb: https://huggingface.co/JANGQ-AI/MiniMax-M2.7-JANG_2L
89gb: https://huggingface.co/JANGQ-AI/MiniMax-M2.7-JANG_3L
sammcj@reddit
M5 Max 128GB here - I get around 60 tk/s on a 3-bit quant on oMLX. It doesn't seem as reliable with tool calling as Qwen 3.5 122-A10B; it hallucinated a fair bit over the half hour or so I was trying it out. (temp 1.0, top_p 0.95, top_k 64)
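For reference, those sampling settings map onto the stock mlx_lm Python API roughly like this (a sketch, not oMLX itself; the model path is a placeholder and the exact make_sampler keywords depend on the mlx_lm version):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Placeholder model path; any local MLX quant directory works the same way.
model, tokenizer = load("mlx-community/MiniMax-M2.7-3bit")

# temp 1.0 / top_p 0.95 as above; top_k support varies by mlx_lm version, so it's omitted here.
sampler = make_sampler(temp=1.0, top_p=0.95)

print(generate(model, tokenizer,
               prompt="List the files in /tmp using a single shell command.",
               sampler=sampler,
               max_tokens=128))
```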
dinglehead@reddit
When you talk about tool calling... are you talking about straight through the model or through a harness like Claude Code? I've found that tool calling when running models locally can be sketchy right up until I use Claude Code at which point it becomes almost perfect
sammcj@reddit
I was testing through OpenCode in this case but can certainly try through CC and report back!
dinglehead@reddit
Yea, the only issue is Claude Code eats a ton of context with its system prompt, but I've found almost every model nails tool calls when using it. It's kinda wild.
sammcj@reddit
Tried it with Claude Code and it took 4-5 minutes just to process the prompt (~40k tokens), which was weird - that was the case with both oMLX with the 3-bit mlx-community quant and vMLX with their 3.1-bit JANG quant.
Memory for both grew to around 108GB, so it's really too large for 128GB IMO.
dinglehead@reddit
Yea, you're probably right - I've been running it on 256GB. Not enough room for context on 128GB.
Mochila-Mochila@reddit
Could it be due to the quant being 3 bit ? I've read this model is very sensitive to quantisation.
Kuane@reddit
How did you get it to load? It is not loading the model for me.
onil_gova@reddit
One potential cause of a failed load is that the model download didn't finish completely.
polawiaczperel@reddit
I know why people are going that far with quants, but isn't there too much degradation going below 5-bit?
HealthyCommunicat@reddit (OP)
You would think so, but I have a 4-bit and a 6-bit too - they're still at 95%. My JANGQ method specifically keeps the most important layers at 8-bit. That's what gives this kind of performance for MoE models: the attention and other critical weights make up less than 1% of the parameters, so even at 8-bit they barely add to the size, and the total stays small as long as we keep everything else even smaller.
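A rough sketch of that idea in Python (not the actual JANGQ code; the layer-name patterns and bit widths here are assumptions): keep attention, embedding and router weights at 8-bit and push the expert FFN weights lower, since the high-precision portion is a tiny fraction of the parameters.

```python
# Sketch of mixed-precision bit assignment for an MoE checkpoint.
# Not the actual JANGQ code; the layer-name patterns and bit widths are assumptions.

HIGH_PRECISION_PATTERNS = ("embed_tokens", "attn", "norm", "lm_head", "gate", "router")

def bits_for(param_name: str, low_bits: int = 3, high_bits: int = 8) -> int:
    """Attention/embedding/router weights stay at 8-bit, expert FFN weights go lower."""
    return high_bits if any(p in param_name for p in HIGH_PRECISION_PATTERNS) else low_bits

def estimated_size_gb(param_shapes: dict, low_bits: int = 3) -> float:
    """Rough quantised size, ignoring the overhead of scales and zero points."""
    total_bits = 0
    for name, shape in param_shapes.items():
        numel = 1
        for dim in shape:
            numel *= dim
        total_bits += numel * bits_for(name, low_bits)
    return total_bits / 8 / 1024**3

# Toy example: one attention matrix and one (much larger) expert FFN matrix.
print(estimated_size_gb({
    "model.layers.0.self_attn.q_proj.weight": (4096, 4096),
    "model.layers.0.experts.0.ffn.weight": (65536, 4096),
}))
```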
DeepOrangeSky@reddit
Glad you mentioned this, since I was pretty curious how the q3 compared to the higher quants. Did it miss the exact same questions, or different ones? Were those last couple questions way harder than the other 18-19 questions on each test?
Sounds pretty sick, in any case. At or near full strength at q3. I knew for big models it was supposed to be plausible for certain use cases, but never saw a benchmark test that was actual tasks rather than just perplexity or KL divergence or something like that, so always wondered.
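One way to check that (a sketch assuming a hypothetical per-question results file, not OP's actual harness) is to diff the sets of missed question IDs between two quants:

```python
import json

# Hypothetical per-question results format: [{"question_id": ..., "correct": true/false}, ...]
def missed(path: str) -> set:
    with open(path) as f:
        return {r["question_id"] for r in json.load(f) if not r["correct"]}

q3_miss = missed("results_3bit.json")   # placeholder filenames
q6_miss = missed("results_6bit.json")

print("missed by both quants:", sorted(q3_miss & q6_miss))
print("missed only at 3-bit: ", sorted(q3_miss - q6_miss))
print("missed only at 6-bit: ", sorted(q6_miss - q3_miss))
```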
Miserable-Dare5090@reddit
Would you say MMLU is a fair estimator of quality loss in all departments? I know for general chat it may be, but wondering if the 4bit / 6 bit quants fare better overall.
HealthyCommunicat@reddit (OP)
Tbh all benchmarks are fine if you're just comparing "how different is this from that" - in this case, if you benchmarked a standard MLX quant of the same size in GB against the JANG quant, it would show how much the JANG quant beats it by (it beats it by a lot).
-dysangel-@reddit
Why 5 bit in particular?
Larger models handle quantisation a lot better than smaller ones. I've had fine coding results with Deepseek R1-0528, GLM 5/5.1 and now Minimax 2.7 at IQS_XXS.
Creepy-Bell-4527@reddit
Can we get a REAP-ed 3L that will fit nicely in 96GB?
Kuane@reddit
Thx for your fast work on these quants. I am trying to download the 2-bit model, but it seems the files are incomplete / still uploading?
The 3-bit gave me this error on oMLX:
Expected shape (200064, 288) but received shape (200064, 384) for parameter model.embed_tokens.weight
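A mismatch like that often means the loader expects a different quantisation packing (bits / group size) than what the file actually contains, or that a shard is truncated. A quick framework-agnostic check (a sketch; the shard path is a placeholder) is to read the safetensors header directly and see what shape is really stored:

```python
import json, struct, sys

def tensor_shapes(path: str) -> dict:
    """Read only the safetensors JSON header (8-byte length prefix + JSON)."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {name: meta["shape"] for name, meta in header.items() if name != "__metadata__"}

# Placeholder shard path; pick the shard that contains embed_tokens.
shapes = tensor_shapes(sys.argv[1])
print(shapes.get("model.embed_tokens.weight"))
```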
HealthyCommunicat@reddit (OP)
Can you let oMLX know? I’m sure their users like you would wanna use this
Ok_Technology_5962@reddit
vMLX is amazing. For some reason I just tend to use it for image gen. The UI is a bit hard to navigate back and forth when servers start, or when you try to go back to them, but it's not a big deal. The ease of use is amazing, especially with the JANG quants.
HealthyCommunicat@reddit (OP)
You're totally right about this. I made it with pure server use in mind; I keep saying I'll do a redesign to make the flow / usage easier for beginners, but I keep running out of time.
Kuane@reddit
Yes reported the issue.
HealthyCommunicat@reddit (OP)
As far as the 2-bit goes - HF is weird with fresh uploads :/
Sydorovich@reddit
"At home" is 3090 GPU level at most.
unbannedfornothing@reddit
why?! where is your B200 cluster? mine is in a garage
emprahsFury@reddit
You're still on the B200s. I just had 2 B300s set up last week, thinking of getting 2 more if llama.cpp fixes layer parallelism.
thrownawaymane@reddit
You guys still live at home? I live in my data center now. In space.
Plasmx@reddit
How is the view? Did the Artemis guys come by on their swing?
CalligrapherFar7833@reddit
Dude, B300? Are you a povert? I moved to R300 already.
InterstellarReddit@reddit
You’re right, ima head down and pick up some milk and cereal, and I’ll get the B200 on the way out.
MrBIMC@reddit
I did set up IQ4 quants on a single 3090 + 128GB RAM.
It gives around 13 tokens per second with ngl auto and 53 layers offloaded.
Idk whether that's decent or not. At this stage I'd rather use dense Qwen at >40 tps.
But I do plan on running some tests overnight. Previously, afaik, MiniMax was getting very stupid after quantisation.
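That setup maps roughly onto the following llama-cpp-python call (a sketch; the GGUF filename is a placeholder and n_gpu_layers=53 just mirrors the figure above):

```python
from llama_cpp import Llama

# Placeholder GGUF path; n_gpu_layers controls how many layers stay on the 3090,
# with the remainder kept in system RAM.
llm = Llama(
    model_path="MiniMax-M2.7-IQ4.gguf",
    n_gpu_layers=53,
    n_ctx=16384,  # context window; raise if RAM allows
)

out = llm("Summarise what MoE expert offloading does in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```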
cheechw@reddit
Actually for the majority of the world at home level is Intel embedded graphics.
i_am_exception@reddit
What’s the context size you are working with? I'd imagine the pp figure doesn't mean much until the context gets big enough.
MrHaxx1@reddit
Although a 128GB Mac is still twice what I'm willing to spend on an LLM machine, it looks like the future is bright for local LLMs.
thesmithchris@reddit
I needed a 64GB MacBook for work (M4); adding 64GB was $1.1k. I did not go for it, as I did not expect home models to get so close to SOTA models. I still do not regret this, but my next MacBook (M7/M8) will likely have max RAM with local LLMs in mind.
misha1350@reddit
I think having a REAP version would be even better, for those who only have a 64GB machine.
Budget-Juggernaut-68@reddit
I'd like to see the options shuffled and the results re-run, to make sure the answers aren't just memorized.
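A simple way to do that (a sketch assuming a generic MMLU-style question dict, not any particular harness's format) is to permute the choices and remap the gold answer before prompting:

```python
import random

def shuffle_options(question: dict, seed: int = None) -> dict:
    """Return a copy of an MMLU-style item with choices permuted and the answer index remapped."""
    rng = random.Random(seed)
    order = list(range(len(question["choices"])))
    rng.shuffle(order)
    shuffled = [question["choices"][i] for i in order]
    new_answer = order.index(question["answer"])  # answer stored as an index into choices
    return {**question, "choices": shuffled, "answer": new_answer}

# Toy example: the correct choice ("4") keeps pointing at the right text after shuffling.
item = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
print(shuffle_options(item, seed=0))
```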