"Commercial Use" means any use of the Software or any derivative work thereof that is primarily intended for commercial advantage or monetary compensation, which includes, without limitation:
(i) offering products or services to third parties for a fee, which utilize, incorporate, or rely on the Software or its derivatives,
(ii) the commercial use of APIs provided by or for the Software or its derivatives, including to support or enable commercial products, services, or operations, whether in a cloud-based, hosted, or other similar environment, and
(iii) the deployment or provision of the Software or its derivatives that have been subjected to post-training, fine-tuning, instruction-tuning, or any other form of modification, for any commercial purpose.
4(ii) seems to be the point that needs expert interpretation. For me, if my software does not depend on the model in any way, it could be in the clear. The outputted code would have been obtained through a harness like OpenCode, which itself does depend on the model to operate, but is non-commercial.
What does it mean to support or enable an end product or operations?
This is Reddit and will get lost, but just for the record, their own blog post says "with human productivity already fully unleashed, the natural next step was to initiate self-evolution." That's a polite way of Chinese saying the human ML engineers already gave everything they could, so now the model takes over their tasks, they don't need low-level ML engineers, pack your bags, get out. Even ML low-level engineers are being replaced, and very little HIL and everyone here cheers like this doesn't concern anyone as long as MiniMax (or anyone else with the same or similar approach) keep releasing models.
I get these model providers only get a moment to have to benchmarks so they have to milk it. It seems all these Chinese models are playing with what they will open as public weights now.
I would be willing to pay a reasonable price to access weights legally so self hosting is still valuable to them. This model is most beneficial right now to people with 256gb since you can get a good quant for a model performing near SOTA in benchmarks. In the cloud there's objectively better options. On a 256gb machine, this is probably the best option still on paper IMO. For companies with several h100s this is also one of the best options. So I think there's a market.
I prefer free, but I also prefer options that don't require subscriptions. If they price it for industry, though, then I still have no options, and then it becomes a black market so...? lol
Tbh I used MiniMax a bit for coding and for me it's nowhere near Claude, GPT, or even GLM/Qwen/Kimi.
I think it was just trained for benchmarks, but in real-life work scenarios it's not as good.
Maybe it was closed at some point or I'm just misremembering. Good to know, though.
In any case though GLM is gargantuan, nobody will ever be able to run it at home. MiniMax m2.7 performs 99% as well at 25% the size, and based on quick mental math should fit into a mac studio at full precision, and at 8bit it should fit EASILY into even low end mac studios/minis (ones with only 256gb).
To me, that's what makes m2.7 a milestone release. It applies the 80/20 rule but takes it further with 99/25.
I ran glm 5.1 at home on 256gb ram and 4x 3090 workstation - iq2 kl, 6t/s. Not super useful at that speed, but if you want capability rather than intelligence, i think it still beats 6bit qwen3.5 397b, which is of similar size.
Also, minimax 2.7 is released as 8bit, so the quant will be less, 4 bit etc.
Didn't actually run that one locally, I was just comparing capability at a certain model size. It's DDR4 quad channel; I think only Epycs have DDR4 eight channel.
Oh, Google said its storage size is 1.65TB. M2.7's is ~458GB, which is about a quarter the size. But at any size, my point is just that it's radically smaller for roughly equivalent performance.
I try to stay up to date with open models for software development. Not local, but through openrouter. All the information I care about shows m2.7 is VERY close for a fraction of the cost.
Not under this license, it’s not. Good for hobbyists and researchers, but the important thing about open weight models is keeping the proprietary providers from establishing total control of the market.
In practice this won't actually be enforceable for most people. I could use this to write code for my employer as said below but no one would actually know as the model doesn't phone home.
Oh, I'm thinking about home use anyway. It's finally the smartest model yet that is roughly (not exactly, but roughly) equivalent to GLM 5.1 and can fit in a Mac Studio. It can fit in smaller Mac Studios/Minis (256GB) when quantized to 8 bits or slightly less.
“Home use” here does not include writing code that you will use for your employer or for your own software that you intend to sell. The license prohibits all of that, from what I can see. Just FYI. (IANAL, of course.)
Right, but employers are the entities the company needs to generate money from. Getting to this model costs an incredible amount of money. If you don't earn money from those who actually do have deep pockets, like corporations who use your model to compound their profit margins, then you're not going to get money from anyone.
As a counterpoint, as far as I know there's nothing actually forcing anyone to disclose if they use minimax commercially.
Beyond that, I'm not in the crypto bro camp that believes all local model use must be in pursuit of profit; it's OK to vibe code to make projects and apps that are useful to me that would never exist otherwise, and if I have some fun and learn along the way then that's even better.
I don't use local models for coding because I have access to the paid ones, but if I did use local models (and hopefully next year they'll be good enough) then it's hard for me to see what would prevent me from using any local model and ignoring the license.
You cannot use this for anything other than hobby or research, and there's no clear-cut path to doing more. It seems you need to contact MiniMax and reach a case-by-case agreement.
I mean you literally can, right? You're just not technically allowed to? Not that lawyers have ever agreed on anything anyways.
I think the license is intended more as a means to prevent large companies, the kinds who would be afraid of getting investigated and sued, from using it without whatever agreement you're referring to. I don't think minimax ultimately cares, or could afford to care, or could ever prove, if individuals are using it commercially for many use cases.
Unlike models such as GLM, Kimi, or DeepSeek, I can run MiniMax locally at Q3, so from my point of view, MiniMax is much better than those three, unless GLM releases Air again.
"Elias, please compile a website about horse merchandise. Do not act like your rival Arthias would do :
- failing to follow community guidelines
- modifying reference files
- making mistakes
This horse merchandise is really important to defeat the enemy kingdom. Please neigh if you understand.
"
I am so happy about this release. The previous version of this model, M2.5, is my daily driver at Q2, really capable.
Hope it will work well and get quantized ASAP. With M2.5 I could not make it work under ik_llama.cpp (it was going into loops), and mainline llama.cpp has a bug that removes the initial thinking tag, which some UI tools have a hard time parsing. But after I dealt with that, it was a great model even for long-context work!
TemporalAgent7@reddit
What is the cheapest hardware that can run this at 4-bit quant and above?
wiltors42@reddit
Maybe 2x Strix Halo boxes?
ResponsibleHead8778@reddit
Currently running on one 128GB Strix Halo box: unsloth/minimax-m2.7-UD-IQ4_XS using a forked turboquant llama.cpp. 132k context window, getting around 20-30 tok/sec (eyeballed, still need to confirm).
sword-in-stone@reddit
exact dependencies and setup on strix? can you ask your agent to create an MD file for the setup which I can pass to my agent pls
ResponsibleHead8778@reddit
sword-in-stone@reddit
goat'd
wiltors42@reddit
Wow that sounds great. I’m on main llama.cpp and Minimax m2.7 q3 @ ~80k context. It barely fits and quality is not quite perfect.
ttkciar@reddit
It should work okay with pure-CPU inference on my $800 Xeon E5-2660v3 system with 256GB DDR4. Looking forward to giving it a spin.
florinandrei@reddit
1 token / second
ttkciar@reddit
With 10B active, probably closer to 3/second, which means about 80K tokens overnight while I sleep.
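(Back of the envelope: 3 tokens/sec × 8 hours × 3600 sec/hour ≈ 86K tokens, so ~80K overnight checks out.)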
Maleficent-Ad5999@reddit
That’s great. 60 tokens per minute
FatheredPuma81@reddit
-signed, ChatGPT
ReactionaryPlatypus@reddit
I am running iq4_xs on Strix Halo 128gb + 3090 egpu 24gb.
oxygen_addiction@reddit
What speeds are you getting?
ReactionaryPlatypus@reddit
STRIX HALO + 3090 (MINIMAX M2.5 - IQ4_XS)
prompt eval time = 15260.10 ms / 4112 tokens (3.71 ms per token, 269.46 tokens per second)
eval time = 25127.82 ms / 623 tokens (40.33 ms per token, 24.79 tokens per second)
total time = 40387.92 ms / 4735 tokens

prompt eval time = 176629.47 ms / 26166 tokens (6.75 ms per token, 148.14 tokens per second)
eval time = 66263.78 ms / 614 tokens (107.92 ms per token, 9.27 tokens per second)
total time = 242893.25 ms / 26780 tokens
oxygen_addiction@reddit
Absolute legend. Thanks!
ForsookComparison@reddit
Q4_K_S was like 125GB on disk or something, so ideally have 140GB+ total to do some actual work (and probably run nothing in parallel).
But be warned: Q4 was damn near unusable for MiniMax M2.1 and M2.5 compared to the full-weight versions. It drops off way harder under quantization than other popular models.
Geximus-therealone@reddit
Why? Some 4-bit quants keep a lot of layers in BF16.
Sufficient_Prune3897@reddit
Sparse MoEs seem to suffer a lot more. I noticed the same way back with GLM Air. Even Q4 was pretty random, and I didn't even code with it.
Serprotease@reddit
A five-year-old AMD server or Intel workstation with 6+ memory channels, 256GB of the cheapest ECC DDR4 you can get, plus an Ampere 24GB GPU and ik_llama. Or a second-hand M2 Ultra 192GB Mac Studio.
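The usual trick with a box like that is to keep the routed experts in system RAM and put everything else on the 24GB card. A minimal sketch, assuming mainline llama.cpp/ik_llama flag names (exact spellings vary by build) and an illustrative model filename:

```bash
# Sketch: MoE split for a 24GB GPU + big-RAM server.
# -ngl 999 offloads all layers, then -ot overrides the expert FFN tensors
# back onto the CPU, so only attention, shared tensors and the KV cache
# stay in VRAM while the bulk of the weights stream from system RAM.
llama-server \
  -m MiniMax-M2.7-Q4_K_S.gguf \
  -c 65536 \
  -ngl 999 \
  -ot "exps=CPU"
```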
Head_Bananana@reddit
I'm running this on a Mac Studio M2 Ultra 200GB now; it's using 121GB of RAM.
Thrumpwart@reddit
14x AMD Mi50s…
joeyhipolito@reddit
non-commercial kills it for me. cool benchmark numbers but if third party hosters can't pick it up commercially it's basically a hosted-only model with extra steps.
Beginning-Window-115@reddit
I regret only buying the m5 pro 48gb and not the m5 max 128gb...
TheItalianDonkey@reddit
i have the 128gb. i'm currently running gemma-4-31b.
no way this fits.
ResponsibleHead8778@reddit
I have a Strix Halo machine with 128GB RAM. Just downloaded MiniMax-M2.7, running llama.cpp turboquant with a 132k-token context window. I generate roughly 20-30 tok/sec; prefill speeds are around 17 tok/sec however, so RAG is much needed.
TheItalianDonkey@reddit
What quantisation? You must be going for a 2 or 3, right? At those quants I was reading everywhere that a smaller model is preferred due to the loss. Have you done any testing, if those are indeed your specs?
ResponsibleHead8778@reddit
Last test was a contextual conversation where the context slowly grew. After a few prompts the prefill slowed to a crawl and everything started to take much longer. So it's good for one-shotting, but I wouldn't recommend it for everyday use with these specs.
ResponsibleHead8778@reddit
The only real test I did was: "I want you to design a full-on website for Bleach New Worlds 3, a Bleach game on Roblox. I want you to search the web, find the correct colors and styles to use, and gather some images for the site. Make it modern with animations. Just CSS, JavaScript and HTML, 1 file." It generated a 1400-LOC file that worked great first shot; the website had animations and everything worked.
ResponsibleHead8778@reddit
The 4-bit quant Unsloth/Minimax-m2.7-UD-IQ4_XS uses like 112-113GB of RAM, and the context window was around 32k. So I used turboquant for my KV cache and got it up to a 132k context window. I gave it a single text of around 100k tokens and it was able to load it completely into RAM and respond accordingly (the prefill was running at around 17 tok/sec and took 2 hours). However, when running real-world prompts I was getting 65 tok/sec prefill and responses were generally around 25 tok/sec.
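For reference, the launch looks roughly like this. A sketch only, using mainline llama.cpp flag names and an illustrative filename; the turboquant fork's KV-cache options may be spelled differently:

```bash
# Sketch: IQ4_XS weights (~112GB) plus a quantized KV cache so a ~132k
# context still fits in 128GB of unified memory on Strix Halo.
# Older builds may also need flash attention enabled (-fa) before a
# quantized V cache is accepted.
llama-server \
  -m MiniMax-M2.7-UD-IQ4_XS.gguf \
  -c 132000 \
  -ngl 999 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja
```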
YoussofAl@reddit
QWEN 3.5 27B will get 80% of the strength of this model anyways.
ForsookComparison@reddit
I've been running the closed-weight version MiniMax serves for a few weeks. Qwen3.5 27B (my favorite on-prem model lately) is not a serious competitor for this if you're talking about agent work and coding.
YoussofAl@reddit
It’s not a serious contender, but it is a good substitute. Like how Sonnet is 80% of Opus. I feel the same way between Qwen 3.5 27B and Minimax M2.5. Then again, I haven’t tested 2.7 yet so we’ll see.
ForsookComparison@reddit
Wait. Where's that opinion formed from then?
YoussofAl@reddit
2.5
_-_David@reddit
You're getting downvoted, but it's not an insane take. It's all about your use-case. There will be things that MiniMax-2.7 will be able to do, but Qwen-3.5 27b can't do at all, and plenty of things that they both do exactly as well. The situation is black, white, and grey all at the same time.
eMperror_@reddit
Isn't it way too large for 128gb anyways?
waitmarks@reddit
I run 2.5 at Q3_K_XL on 128G and it’s quite usable. I can’t max out its context, but it’s still very useful.
Mysterious_Finish543@reddit
How much context are you able to run at with Q3_K_XL?
pilibitti@reddit
128 context. I only ask yes no questions.
Ok_Technology_5962@reddit
Use caveman mode. And glm 5.1 really degrades past 100k anyways
Danfhoto@reddit
I use it with OpenClaw and have the context limit set to 90,000, haven’t had issues. The q3 UD quants are quite good.
Storge2@reddit
Also interested: can this run somehow on a DGX Spark 128GB?
Fresh-Grocery-3847@reddit
I'm going to try downloading unsloth/MiniMax-M2.7-GGUF with the hf CLI, pulling just the UD-IQ4_XS files, which are 108GB; rough command below.
And then perhaps, if it's too slow, try UD-Q3_K_S or UD-IQ3_S.
I'll update my findings later.
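Roughly this, assuming the hf CLI from huggingface_hub; the --include pattern may need wildcards depending on how the files are named inside the repo:

```bash
# Sketch: pull only the UD-IQ4_XS split (~108GB) from the Unsloth repo.
hf download unsloth/MiniMax-M2.7-GGUF \
  --include "*UD-IQ4_XS*" \
  --local-dir unsloth/MiniMax-M2.7-GGUF
```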
Fresh-Grocery-3847@reddit
Going back to Qwen3.5-122b; quantization on MiniMax is terrible. https://x.com/bnjmn_marie/status/2027043753484021810
cafedude@reddit
Also interested in running this on a 128GB Strix Halo box. I suspect we'd need a 2-bit quant.
ReactionaryPlatypus@reddit
I am running iq3_m Minimax M2.5 on 128gb Strix Halo Tablet as my daily driver.
ObiwanKenobi1138@reddit
What kind of speeds are you seeing?
ReactionaryPlatypus@reddit
STRIX HALO (MINIMAX M2.5 - IQ3_M)
prompt eval time = 18513.51 ms / 4112 tokens (4.50 ms per token, 222.11 tokens per second)
eval time = 18429.76 ms / 396 tokens (46.54 ms per token, 21.49 tokens per second)
total time = 36943.27 ms / 4508 tokens

prompt eval time = 234712.43 ms / 26166 tokens (8.97 ms per token, 111.48 tokens per second)
eval time = 93301.59 ms / 700 tokens (133.29 ms per token, 7.50 tokens per second)
total time = 328014.03 ms / 26866 tokens
georgeApuiu@reddit
If you REAP it you might be able to. I’m using the minimax 2.5 REAP on a single dgx spark
rpkarma@reddit
You'd need to cluster two via the ConnectX-7 link, and honestly it's gonna get kind of shredded by our lack of memory bandwidth I think.
I'm still going to try though lol, I love my little Asus GX10
texasdude11@reddit
On two of them
xraybies@reddit
https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-100GB-MLX
Ok_Technology_5962@reddit
Using one of those JANG quants at low bits per weight is good, that or an oQe quant once someone drops that.
InternetNavigator23@reddit
Yeah I think I heard he is planning on using some dynamic 2.7 bit or something.
Should be perfect for 128 GB of RAM. Pretty excited for it honestly.
Beginning-Window-115@reddit
it would work at UD-Q3_K_XL 🥲
eMperror_@reddit
Nice, can't wait to try it then! (M5 max 128gb) :D
Beginning-Window-115@reddit
I envy you
-dysangel-@reddit
I've been using M2.1 @ IQ2_XXS (75GB) fine on my Mac Studio
PinkySwearNotABot@reddit
I have the M1 Max 64GB and I regret not getting the 128GB
TheItalianDonkey@reddit
i have the 128gb. i'm currently running gemma-4-31b.
no way this fits.
kovexex@reddit
I have it too, don't run a dense model lol. Shits gonna be cooked, run the 26b-a4b bf16 at 60tps low context or down to 30tps at max context
330d@reddit
There was never an M1 Max with more than 64GB, so it's a bit of a confusing statement, unless you mean you bought it recently, when other options were available? I also have the 64GB M1 Max and it's still a beast; it has allowed me to experiment with local models for years now.
marco89nish@reddit
What are you running on that, I'm looking for good models for my 48GB M4 Pro? Also, ollama, mlx or lm studio?
Beginning-Window-115@reddit
I mainly use "omlx" not "mlx" it has ssd caching so it's pretty fast, and my main model is Qwen3.5 27b at 4bit or if I need speed Qwen3.5 35b (moe).
thphon83@reddit
For how long have you been using omlx? I tried a couple of weeks ago with qwen3.5 122b and had to stop because there was a bug and the moment the context filled up a bit it started to forget things and get into infinite loops.
Beginning-Window-115@reddit
Yeah, there was a bug not that long ago that caused memory to fill up a ton, but it was quickly fixed, so maybe that's what you hit. Now it should be good. Make sure to fill in the parameters for the model you are using, and don't use too low of a quant on omlx, since the quants aren't as good as GGUF (also, there's turbo quant as a bonus).
itsmeemilio@reddit
How do you go about using omlx? Seems like it could be interesting for maybe running larger models possibly?
d4mations@reddit
R/omlx
Beginning-Window-115@reddit
Just start by looking at the GitHub repo and reading the instructions to install it. Once installed, have a look at the settings and get a general idea of what is what (most things can be left untouched). You can download models from omlx, which makes it way easier (MLX models only), so I recommend looking at the mlx-community HF account for models.
itsmeemilio@reddit
Wow thank you for putting me onto this. What a find.
Are you aware if it's possible to run models larger than unified memory would normally allow?
E.g. a 70B or 90B model on a 48GB system?
marco89nish@reddit
This poster claims he's running huge MoE models that can't fit in RAM on MacBooks; I haven't given it a shot yet. Let me know if you try it: https://www.reddit.com/r/LocalLLaMA/comments/1shediw/comment/ofc46y5/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
Beginning-Window-115@reddit
I don't think so, and even if you could, I wouldn't recommend it because it would be extremely slow. But you can run large models quantised as long as they fit into RAM.
Cybertrucker01@reddit
Why not the M5 Studio 256gb?
thrownawaymane@reddit
Can't buy something that doesn't exist yet
segmond@reddit
if you have the money, sell it and buy 128gb, are you going to live the rest of your life in regret?
ajblue98@reddit
Ditto M4 Max 36
digitaldisgust@reddit
The random Chinese text showing up in responses that are meant to be fully English is enough for me to delete my MiniMax account tbh. Very annoying. 🤦🏽♀️
Infinite_Hand7076@reddit
Would a Q3 or Q2 version work on an AI Max 395 128GB?
ResponsibleHead8778@reddit
If you're just using the AI Max for inference, you can run turboquant llama.cpp with unsloth/minimax-m2.7-UD-IQ4_XS and have a 132k context window too. The prefill is ass, just be aware if you're trying to load a lot into it.
misha1350@reddit
Yes
FrozenFishEnjoyer@reddit
I'm out here reading what's new here, checking what quants are available, and looking at the graph...but I only have 16GB VRAM.
The life of poors are sure difficult.
Maleficent-Ad5999@reddit
I wish you'd buy a couple of RTX Pro 6000s and never worry about VRAM some day in the future.
Eyelbee@reddit
You'd still have to worry about vram
Sufficient_Prune3897@reddit
This. I probably would have drunk the Kool-Aid and spent 7k on one, but with how quickly MoEs have escalated in size, it wouldn't even unlock anything I can't run now.
Maleficent-Ad5999@reddit
Can you give me a rough number on how much would feel like enough?
Ok_Technology_5962@reddit
1 terabyte of VRAM feels good.
Maleficent-Ad5999@reddit
Even then, bigger models are FP8 and beyond, and would require more VRAM for context size... so maybe 2TB of VRAM?
Ok_Technology_5962@reddit
Ugh... you are right, but I also saw that monster 2-trillion-param model that NousResearch has... and obviously 10 trillion is coming soon.
Maleficent-Ad5999@reddit
yet here we are dealing with GPUs of 8GB, 12GB, 16GB in consumer space.
Sufficient_Prune3897@reddit
My point is, the ram requirements are constantly increasing. GLM got 2x bigger from 4.7 to 5, Qwen increased from 235B to 400B and Minimax 3 is probably gonna do the same.
If I want to run GLM 5 in VRAM, I'm gonna need like at least 384GB of VRAM, and that's at a bad quant.
Personally I would really like 192 so that I can at least fine-tune and train all the 'smaller' 100b models myself.
Maleficent-Ad5999@reddit
Agreed
Maleficent-Ad5999@reddit
Well then when would we ever stop accumulating more vram
Nobby_Binks@reddit
Unfortunately it's a bit like money - the more you have the more you want
a9udn9u@reddit
I have 32GB and I always think 48GB would be nice, when I got 48GB I'd want 64GB. You will never be satisfied unless you have multi-TB VRAM.
krileon@reddit
I'm on 20GB. It's such a weird spot to be in. It's a decent amount, but just shy of enough.
grumd@reddit
Depending on how much RAM you have you might still be able to run a Q2-Q3 quant
srigi@reddit
The Q_1 quant is 60GB. I have 64GB RAM, so no luck even trying to load the weights.
grumd@reddit
Might run with a small context at least for testing. But yeah for 64GB+16GB you need to look at models 45-50gb max
Darkoplax@reddit
6GB VRAM here :(
BuyHighSellL0wer@reddit
Here's me running models on my 4GB RX550.
There's always somebody poorer ha!
DR4G0NH3ART@reddit
Well, I was doing it for GLM 5.1 and ran that model on my 5070 Ti in my head and got good results. One day, one day I will make an agent that can hallucinate as well as me locally.
RonJonBoviAkaRonJovi@reddit
https://i.redd.it/rfxxnjvl2oug1.gif
Morphon@reddit
Anyone know if there's a group out there planning to make a TQ1 quant for this?
sgmv@reddit
you probably don't want this, it's not great even at q8
FullstackSensei@reddit
Unsloth GGUFs when?
asfbrz96@reddit
Bartowski better
FullstackSensei@reddit
TBH, between the two it's like splitting hairs. I use Unsloth because they provide documentation for best params, they're generally active here, and they often get early access so their quants drop sometimes at the same time the model drops.
asfbrz96@reddit
I tried both. I usually get better output with Bartowski, and I got a bunch of infinite loops in the thinking part using Unsloth.
Beginning-Window-115@reddit
I think Unsloth is just so early with their quant releases that it doesn't give llama.cpp time to fix bugs, which kind of gives them a bad rep. Although once everything works, their quants are usually pretty good.
dangered@reddit
That’s fairly important though.
It seems like a “good problem to have” but there reaches a point that it really isn’t.
Even Linux power users leave Arch for the same exact problem (I used to use Arch btw, tips fedora).
FullstackSensei@reddit
To be fair, more often than not the unsloth brothers are the ones who uncover the existence of those bugs. They also find tokenizer bugs in the released model more often than I thought possible.
dangered@reddit
Same with arch users. It’s necessary for the open source lifecycle. But is it necessary for you as the user?
If you’re active in the forums finding what is causing bugs and posting workarounds or patches then you’re key to the process. If you’re not, there’s a chance you’re just inflicting pain on yourself to the benefit of no one.
I’m in no way saying “unsloth bad” but it might not be the right choice for a lot of people and it has to be acknowledged. Many people leave or never make it into communities because they are told to use the bleeding edge but become too frustrated trying to get it to work to continue.
When that happens enough times, the product gets a bad name because the wrong people were using it and now they all say “unsloth bad”
FullstackSensei@reddit
I'm not sure what's the point you're trying to make, or what is the connection with arch.
Neither me nor anyone using their quants is testing anything. The unsloth brothers, or Bartowski or anyone making quants for their job are not regular users. They're like the maintainers of one package or one part of the kernel, who find bugs in other parts or other packages during their job and report those.
If you're going to blame maintainers for finding bugs, I am really out of words for how to respond to this.
dangered@reddit
The similarity I was making was referring to the breaking releases when you pull :latest because nothing else has caught up yet.
Whether it’s compatibility issue with Ollama, a bug from the base model itself, or a driver issue.
You might not have known this but we are. Every day we’re raising and discussing issues in the forums with the unsloth brothers themselves.
Dan Han said:

> Hey everyone, we've updated the quants again to include all of Google's official chat template fixes (which fixed/improved tool-calling), along with the latest llama.cpp fixes.
> We know there has been a lot of re-downloading lately, so we appreciate your patience. We're pushing updates whenever fixes become available to make sure you always have the latest and best-performing quants.
> NVIDIA is working on the CUDA 13.2 issue. Until it is fixed, do not use CUDA 13.2.

Someone else in the thread linked to a GitHub repo that has a fix; the repo has an explanation of the change that fixed another issue:

> This fixed the same issue for me: https://github.com/asf0/gemma4_jinja/
I don’t “blame” anyone for these issues, this is how it’s supposed to work. This is the true power of open source development. I can’t stress enough how necessary this is for open source software.
The key point I’m making is that not every user even knows about this side of the process. It’s important to let them know.
FullstackSensei@reddit
You might be trying to shed light on the process, but IMO the impression you're giving is quite negative, especially the comparison with Arch. The comparison with ollama, the parasites of the open source world, only reinforces this.
Nobody needs to download a model on the day it's released, even if there's "day-0" support. This is even more true when the model brings architectural changes. Those who want to live on the bleeding edge will of course do. But for the vast majority, waiting a week or so will ensure they don't go through any headaches, even when the internet fosters FOMO. You're not missing anything by waiting a week or even two. I haven't downloaded any of the Gemma 4 models for this very reason.
yoracale@reddit
Regarding the issue dangered mention, the users who had the unused token issue didn't use the updated unsloth quant or update llama.cpp. A user who originally commented about the unused token issue, later edited their comment to 'thanks' because they realized that the updated quants and updating llama.cpp fixed the issue: [https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24#69daf14e98f472d6c455173d](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24#69daf14e98f472d6c455173d)
The original fix was already merged into the quant when they were writing that the unused token issue still occured: https://github.com/ggml-org/llama.cpp/issues/21321#issuecomment-4206217353
These instances are unfortunately user errors, otherwise hundreds of other people would be complaining as well.
FullstackSensei@reddit
You can explain all you want. Unfortunately, 90% will simply not read it, and 99% of those who do will misunderstand it.
yoracale@reddit
Agreed. The issue is that the majority of the time, users think the issue is our quants or something we did, when in 99% of cases it's most likely not. It's the beauty of open-source but also a curse.
Ollama incompatibility issues = our fault
unused token issue which was merged but users didn't update or use updated quants = our fault
when Google officially updated their chat template = our fault
And so no matter how many paragraphs they write, at the end of the day they just want to blame someone, aka us, as they think it's our fault for pretty much everything, unfortunately. There's not much we can do from our side except take it and just try to communicate better. That's why in communication we always try to say the issues do not originate from Unsloth, otherwise some people will immediately come to that conclusion, like here: https://unsloth.ai/docs/new/changelog#gemma-4-fixes
dangered@reddit
I understand how you’d see it as negative, I am pointing out the drawbacks to bleeding edge software.
I’ve been very clear in emphasizing I am not saying anything bad about unsloth. I use it and contribute to the community.
No matter how much I like it, I would be dishonest if I said it was for everyone. Earlier in my career I recommended everyone use the “latest and greatest” and realized leaving out the amount of tinkering involved was a huge problem.
Explaining to someone just the basics of the process and letting them make their own decisions is the right way to get a user without turning them into a hater of the platform.
Simply saying “wait 2 weeks for a stable release” would get more people to be adopters rather than detractors. Pretending it’s not worth mentioning because we have the acumen and time to fix it and get the new shiny thing working is not helping anyone.
Go look at how many failed plex implementations there are because IT dad skipped “stable” for the new shiny feature and his wife and kids quietly went back to Disney+, Netflix, and Amazon because their show didn’t work that week.
This is probably the biggest blind spot we have as technical people and it hurts the community because those people generally never come back.
yoracale@reddit
The issue is the users who had the unused token issue didn't use the updated quant or update llama.cpp. A user who originally commented about the unused token issue, later edited their comment to 'thanks' because they realized that the updated quants and updating llama.cpp fixed the issue: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24#69daf14e98f472d6c455173d
Original fix which was merged into the quant: https://github.com/ggml-org/llama.cpp/issues/21321
These instances are unfortunately user errors, otherwise hundreds of other people would be complaining as well.
FullstackSensei@reddit
They actively work with the llama.cpp team and the teams releasing models to find and fix bugs. I lost count how many times they found tokenizer bugs that they reported back to the model developers.
yoracale@reddit
Thank you for the support we appreciate it!! <3 <3 <3
wojciechm@reddit
I can confirm that. Regular llama.cpp quantizations are more stable and of higher quality in my usage. Unsloth is just optimized for metrics that do not represent real quality. Recently I even started to use my own quantizations with full output tensor precision (the `--leave-output-tensor` option), and that is the best setup I have been using so far. It does not inflate size significantly, but it does significantly improve quality.
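The recipe is roughly the following; a sketch with placeholder filenames, not my exact command:

```bash
# Sketch: requantize from a high-precision GGUF while keeping output.weight
# in full precision. --leave-output-tensor skips quantizing the output tensor,
# which adds comparatively little size but helps quality noticeably.
llama-quantize --leave-output-tensor \
  model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```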
FullstackSensei@reddit
I use Q8 on <100B models, and Q4 above. Always follow the recommended params. Never had an issue with loops, going back all the way to QwQ.
If the model is not already supported in llama.cpp, I also wait at least a week after initial support in llama.cpp before trying, to make sure most bugs have been resolved. That's why I haven't even downloaded any of the Gemma 4 models yet.
emprahsFury@reddit
and yet, other dude is fighting for his life with the downvotes and you're sitting pretty. This sub sucks celebrity dick way too much. When yeah, it's splitting hairs.
coder543@reddit
It’s under a non-commercial license this time, which is unfortunate.
z_3454_pfk@reddit
licence is really bad lol we won’t even get third party providers so once minimax stops hosting it’ll be gone via api for a lot of people
debackerl@reddit
Uhm, I'm getting it via OpenCode Go
harrro@reddit
Opencode clearly has their own arrangement with multiple providers as they've had MM 2.7 for a while before this release.
debackerl@reddit
Thx, so that supports my point. It's false that minimax was the only provider. I never talked about providers using open weight models, which, actually, MiniMax just released as open last week.
MikeFromTheVineyard@reddit
I’m guessing they’ll privately license it to third party commercial hosters.
I’m guessing the reason that open source models are so much cheaper than private ones is the profit margin built in. All these open source labs will need to recoup their investment somehow eventually. Private licensing seems like an easy way to do that.
oofdere@reddit
use BSL instead of stupid modified MIT licenses that strip away the MIT completely then
TheRealMasonMac@reddit
I think OpenClaw destroyed the economy of coding plans altogether, so they're trying to subsidize thru these kinds of means. It does mean that API providers will likely get more expensive as time goes on.
Momo--Sama@reddit
I don’t think there ever was a functioning coding plan economy. I think from their inception (at least for the American labs) they were meant as loss leader samplers to get people talking about what the models could do and get their employers interested in API accounts. Then December and January happened and suddenly there’s hundreds of thousands of people eating half price appetizers with no intention of ordering entrees and the companies are left to figure out how to get people to stop buying apps and start buying entrees… or leave if they’re never going to buy an entree.
TheRealGentlefox@reddit
Dario claimed like 6 months(?) back that CC was actually profitable on its own.
antunes145@reddit
You hit the nail on the head with that analogy. We will be seeing a large push from companies moving people out of subsidized plans and onto API plans for their agents and vibe coding.
poginmydog@reddit
Or economies of scale happen and GPUs decrease in cost by so much that it makes subsidised plans profitable again.
EbbNorth7735@reddit
Yep, ideally the license would prohibit cloud providers from hosting it without sharing revenue with MiniMax, or would require companies generating over 1 million to share revenue.
OpenSourcePenguin@reddit
How does ollama serve it? (Compared to 2.5?)
rebelSun25@reddit
OpenRouter has one third-party provider, in the US. The same one offers GLM 5.1, DeepSeek, etc.
reto-wyss@reddit
They could always make the license less restrictive later when they have 2.8 or 3.0 - not saying that will happen, but it is possible.
coder543@reddit
I hope they will at least consider that middle ground, if they insist on doing things this way. That’s the territory of something like the BSL (Business Source License), which is not amazing, but… better than being fully proprietary.
reto-wyss@reddit
Yeap - I was pretty excited for this one but that license is rough.
I think I'll stick with Gemma-4-31b and Qwen3.5-122b-a10b and keep hoping for a strong 100b-ish dense model. Devstral-3 ?
comatrices@reddit
The release on ModelScope, which looks to be the same weights, has an entirely different license with no non-commercial clause: https://www.modelscope.cn/models/MiniMax/MiniMax-M2.7/file/view/master/LICENSE-MODEL?status=0
how long before they revise it? lol
also interesting release date in that file
thoquz@reddit
I'm guessing they did it in response to Cursor selling their model (which they fine-tuned) and naming it Composer 2.
It's unfortunate; I hope MiniMax picks a more open licence in the future.
InternetNavigator23@reddit
Cursor used Kimi K2.5 for the base.
NoahFect@reddit
(Shrug) So was the training data. Fuck 'em.
Edzomatic@reddit
God bless going public
PrysmX@reddit
Too bad the new license is ass for anyone that wanted to build anything commercially.
VoiceApprehensive893@reddit
it really is a mini
Asleep_Training3543@reddit
Full GGUF quant set up if anyone needs it — BF16, Q8_0, Q6_K, Q5_K_M live, Q4_K_M/Q3_K_M/Q2_K uploading now.
https://huggingface.co/dennny123/MiniMax-M2.7-GGUF
erazortt@reddit
Please do not create quants yourself if you do not know what you are doing! Why do you have all the small tensors at such small quants?! Especially since MiniMax is very sensitive to quantization, the small tensors must be preserved as much as possible. Actually, this is generally true: the small tensors (all the attn_*) are usually so small that it's just a couple hundred MB of difference, but the quality difference is much bigger. There is a very good reason unsloth, AesSedai, and ubergarm are doing it.
And also, have you generated an imatrix and used it during quantizations? If yes, what raw data have you fed it?
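For reference, the usual imatrix flow looks roughly like this; a sketch with placeholder filenames, where the calibration text should resemble the kind of prompts you actually run:

```bash
# Sketch: build an importance matrix from calibration text, then feed it to
# the quantizer so the low-bit quant preserves the weights that matter most.
llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.gguf
llama-quantize --imatrix imatrix.gguf \
  model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```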
Sufficient_Prune3897@reddit
I bet your fun at parties.
Mochila-Mochila@reddit
you're
Raredisarray@reddit
Yoo TY
Rascazzione@reddit
It seems the model isn't 100% open. There are serious restrictions on its use for any commercial purposes.
As it stands now, the license is more like a product demo. Try it out, and if you like it, pay up.
But since it's a Non-commercial Freeware license, it would be nice to have fixed, transparent pricing for the commercial license. And then, for startups, some kind of exemption up to a certain revenue threshold.
a9udn9u@reddit
I wonder how much that matters to the community (mostly individuals). These are not like traditional software components which small companies or indie developers would embed into their products. These require data centers to host, only big players with deep pockets can do that.
If you run a business and make a profit on top of models MiniMax spent $$$$$ to train, I say it's only fair for you to pay a license fee to them.
7734128@reddit
It's fair for them to charge a fee, of course, but it's too small of an improvement over 2.5 for that to make sense.
They should have waited for a step change in performance.
InternetNavigator23@reddit
My thoughts exactly. Don't let other people host it and compete directly. Be clear about commercial and let startups use it under 100m revenue.
Fine-Profession-3204@reddit
M2.7 scored 78% on SWE-bench Verified vs Claude Opus 4.6's 55% — the biggest gap on the benchmark practitioners trust most for predicting real engineering performance. But it also generated 87M output tokens during the Artificial Analysis evaluation (the median is 26M), meaning real per-task cost can run 3x+ the headline rate. Full benchmark table, ECPT cost framework, and the BridgeBench regression most reviews skip are in the breakdown: https://aithinkerlab.com/minimax-m2-7-vs-gpt4-claude-benchmarks/
YoussofAl@reddit
This is going to be the most impactful release of Q2 this year. (Unless Minimax M3 releases)
Not only is it a powerful model, but it can actually be run by people unlike GLM.
jon23d@reddit
I'm super excited to have this, but if we aren't supposed to use it to make works that we sell, it's suddenly far less useful to me.
bootlickaaa@reddit
The way I'm reading it, using it for coding might be allowed, as long as the resulting work product (the code) does not depend on the model at runtime to power a commercial product. I could be wrong.
(i) offering products or services to third parties for a fee, which utilize, incorporate, or rely on the Software or its derivatives,
(ii) the commercial use of APIs provided by or for the Software or its derivatives, including to support or enable commercial products, services, or operations, whether in a cloud-based, hosted, or other similar environment, and
(iii) the deployment or provision of the Software or its derivatives that have been subjected to post-training, fine-tuning, instruction-tuning, or any other form of modification, for any commercial purpose.
4(ii) seems to be the point that needs expert interpretation. For me, if my software does not depend on the model in any way, it could be in the clear. The output code would have been obtained through a harness like OpenCode, which itself does depend on the model to operate but is non-commercial.
What does it mean to support or enable an end product or operations?
jon23d@reddit
That’s my reading too. It’d be nice to get some clarification
Sliouges@reddit
This is Reddit and will get lost, but just for the record: their own blog post says "with human productivity already fully unleashed, the natural next step was to initiate self-evolution." That's a polite Chinese way of saying the human ML engineers have already given everything they could, so now the model takes over their tasks. They don't need low-level ML engineers anymore, pack your bags, get out. Even low-level ML engineers are being replaced, with very little human-in-the-loop, and everyone here cheers like this doesn't concern anyone, as long as MiniMax (or anyone else with the same or similar approach) keeps releasing models.
bwjxjelsbd@reddit
What's the HW to run this?
Can a macbook Pro M5 Max run it?
misha1350@reddit
Newer posts regarding M2.7 suggest that a 128GB RAM model can, given some heavy quantization.
CertainlyBright@reddit
I love how these are "licensed" like they cared about copyright licenses of the data they trained from. Ima use models however I want lol
Recoil42@reddit
segmond@reddit
Why don't they ever compare with their peers? I want to see how it compares to GLM-5.1, Kimi K2.5, Qwen3.5-297B, etc.
Inevitable-Plantain5@reddit
I get that these model providers only get a moment at the top of the benchmarks, so they have to milk it. It seems all these Chinese labs are rethinking what they will open as public weights now.
I would be willing to pay a reasonable price to access weights legally so self hosting is still valuable to them. This model is most beneficial right now to people with 256gb since you can get a good quant for a model performing near SOTA in benchmarks. In the cloud there's objectively better options. On a 256gb machine, this is probably the best option still on paper IMO. For companies with several h100s this is also one of the best options. So I think there's a market.
I prefer free, but above all I prefer options that don't require subscriptions. If they price it for industry, though, then I still have no options, and it just goes black market, so...? lol
InternetNavigator23@reddit
Because reasons. Lol
I'd say just under GLM, around Kimi/Qwen. The main highlight here is that for the size they are awesome.
Real_Ebb_7417@reddit
Tbh I used MiniMax a bit for coding, and for me it's nowhere near Claude, GPT or even GLM/Qwen/Kimi. I think it was just trained for benchmarks, but in real-life work scenarios it's not as good.
Wooden_Yam1924@reddit
Is something wrong with this repo? I see only 124 of 130 safetensors.
Manwith2plans@reddit
Was so excited for this, but it's a non-commercial license, which severely limits the utility for me :(
Kind-Abies8738@reddit
...why? You realise it's little more than a suggestion right?
rpkarma@reddit
Not when it would be super useful to host at work. Our legal team would have a fit if we tried.
We'll probably end up paying them instead.
Kind-Abies8738@reddit
If your operation is big enough to have a "legal team" then yeah. But then I don't feel sorry for ya ;)
rpkarma@reddit
Yeah that’s why I said we’ll probably end up paying them so we can host it ourselves!
Kind-Abies8738@reddit
Ah, gotcha. The "instead" bit threw me off.
Virtamancer@reddit
Is this the most important open source (actually large) LLM release since OG deepseek?
Darkoplax@reddit
GLM is still the leader in Open weight
Minimax, Kimi, Qwen and Deepseek all chasing them rn
Edzomatic@reddit
From my testing glm, especially glm 5.1, is better in general. But minimax is much smaller and punches well above its weight
Virtamancer@reddit
I thought GLM isn't open source/weights/whatever.
coder543@reddit
Not sure where you got that impression: https://huggingface.co/zai-org/GLM-5.1
Virtamancer@reddit
Maybe it was closed at some point or I'm just misremembering. Good to know, though.
In any case, though, GLM is gargantuan; nobody will ever be able to run it at home. MiniMax M2.7 performs 99% as well at 25% the size, and based on quick mental math it should fit into a Mac Studio at full precision, and at 8-bit it should fit EASILY into even low-end Mac Studios/Minis (the ones with only 256GB).
To me, that's what makes m2.7 a milestone release. It applies the 80/20 rule but takes it further with 99/25.
sgmv@reddit
I ran GLM 5.1 at home on a 256GB RAM, 4x 3090 workstation - IQ2_KL, 6 t/s. Not super useful at that speed, but if you want capability rather than intelligence, I think it still beats 6-bit Qwen3.5 397B, which is of similar size.
Also, MiniMax 2.7 is released as 8-bit, so the quants will go lower from there, 4-bit etc.
330d@reddit
How fast do you run the 397B on that hardware? I assume the RAM is 8-channel DDR4?
sgmv@reddit
Didn't actually run that one locally, I was just comparing capability at a certain model size. It's DDR4 quad-channel; I think only Epycs have eight-channel DDR4.
shroddy@reddit
I suddenly feel very poor...
Thrumpwart@reddit
I have a 192GB Mac Studio and that comment made me feel poor.
coder543@reddit
It is not 99%, and 229 is not 25% of 754.
I was very excited for this release too, until I saw the license.
Virtamancer@reddit
Oh, Google said its storage size is 1.65TB. M2.7's is ~458GB, which is about a quarter the size. But at any size, my point is just that it's radically smaller for roughly equivalent performance.
I try to stay up to date with open models for software development. Not local, but through openrouter. All the information I care about shows m2.7 is VERY close for a fraction of the cost.
hainesk@reddit
I think M2.7 is trained in FP8, so its size is 230GB.
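Rough back-of-the-envelope math, using the numbers from this thread (~229B params for M2.7, ~754B for GLM 5.1), so take it as an estimate:

```sh
# FP8  ~ 1 byte/param  -> ~229 GB for M2.7 (the native release precision)
# BF16 ~ 2 bytes/param -> ~458 GB for M2.7, ~1.5 TB for GLM 5.1
echo "scale=2; 229/754" | bc   # prints .30, i.e. roughly 30% of GLM 5.1's size, not 25%
```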
Edzomatic@reddit
GLM 5 and 5.1 are both open source. The only model in the family to not be open sourced is 5-turbo
robertpro01@reddit
What's the size?
gjallerhorns_only@reddit
230B total parameters
robertpro01@reddit
It is actually a very good size for that benchmark
coder543@reddit
Not under this license, it’s not. Good for hobbyists and researchers, but the important thing about open weight models is keeping the proprietary providers from establishing total control of the market.
zxyzyxz@reddit
In practice this won't actually be enforceable for most people. I could use this to write code for my employer, as mentioned elsewhere in the thread, but no one would actually know, since the model doesn't phone home.
Virtamancer@reddit
What are the bad limitations?
coder543@reddit
The license is strictly non-commercial.
Virtamancer@reddit
Oh, I'm thinking about home use anyway. It's finally the smartest model ever (roughly, not exactly, but roughly, equivalent to GLM 5.1) that can fit in a Mac Studio. It can fit in smaller Mac Studios/Minis (256GB) when quantized to 8 bits or slightly less.
coder543@reddit
“Home use” here does not include writing code that you will use for your employer or for your own software that you intend to sell. The license prohibits all of that, from what I can see. Just FYI. (IANAL, of course.)
winterscherries@reddit
Right, but employers are the entities the company needs to generate money from. Getting to this model costs an incredible amount of money. If you don't earn money from those who actually do have deep pockets, like corporations who use your model to compound their profit margins, then you're not going to get money from anyone.
muyuu@reddit
If you run it at home, this isn't enforceable.
It will just prevent competitors from selling Minimax 2.7 tokens.
Virtamancer@reddit
I hear you. And I get that sucks for some people.
As a counterpoint, as far as I know there's nothing actually forcing anyone to disclose if they use minimax commercially.
Beyond that, I'm not in the crypto bro camp that believes all local model use must be in pursuit of profit; it's OK to vibe code to make projects and apps that are useful to me that would never exist otherwise, and if I have some fun and learn along the way then that's even better.
I don't use local models for coding because I have access to the paid ones, but if I did use local models (and hopefully next year they'll be good enough) then it's hard for me to see what would prevent me from using any local model and ignoring the license.
ForsookComparison@reddit
You cannot use this for anything other than hobby or research, and there's no clear-cut path to commercial use. You need to contact MiniMax and reach a case-by-case agreement, it seems.
Virtamancer@reddit
I mean you literally can, right? You're just not technically allowed to? Not that lawyers have ever agreed on anything anyways.
I think the license is intended more as a means to prevent large companies, the kinds who would be afraid of getting investigated and sued, from using it without whatever agreement you're referring to. I don't think minimax ultimately cares, or could afford to care, or could ever prove, if individuals are using it commercially for many use cases.
ForsookComparison@reddit
Just be smart about it
Material_Soft1380@reddit
Had to try it:
MiniMax 2.7 Q8_K_XL (~250GB) on a single RTX 6000 with RAM offload, getting 8.64 tokens/second, which is actually usable.
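For anyone curious, the rough shape of the invocation: a sketch assuming a recent llama.cpp build with --override-tensor support (the GGUF filename and context size are placeholders, not my exact command):

```sh
# keep attention and shared weights on the GPU, push the MoE expert tensors to system RAM
./llama-server -m MiniMax-M2.7-Q8_K_XL.gguf \
  --n-gpu-layers 99 -c 32768 \
  --override-tensor "exps=CPU"
```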
Comprehensive_Iron_8@reddit
I am confused. Minimax 2.7 was launched 3 weeks ago.
Comprehensive_Iron_8@reddit
Ahh. I never checked that they released the weights. Eh, glm-5.1 is better. Too late for the weights.
Comprehensive_Iron_8@reddit
[screenshot]
arm2armreddit@reddit
This screenshot is cloud-based, and you don't even know what you are using. Ollama Cloud is an opaque service.
OffBeannie@reddit
This is released for local LLM
LegacyRemaster@reddit
God bless you
PromptInjection_@reddit
Just made a quick test.
Runs at about 110 tok/s prompt processing and 20 tok/s generation on AMD Strix Halo (Windows, llama.cpp).
jacek2023@reddit
Unlike models such as GLM, Kimi, or DeepSeek, I can run MiniMax locally at Q3, so from my point of view, MiniMax is much better than those three, unless GLM releases Air again.
Thrumpwart@reddit
“No your honour, I used Qwen 122B to vibe code this app. I just used Minimax to write short stories about a dude named Elias.”
Nyghtbynger@reddit
"Elias, please compile a website about horse merchandise. Do not act like your rival Arthias would do :
- failing to follow community guidelines
- modifying reference files
- making mistakes
This horse merchandise is really important to defeat the enemy kingdom. Please neigh if you understand.
"
kawaii_karthus@reddit
I wonder how this compares to Qwen 235B? It is still one of my favorite models.
Nyghtbynger@reddit
It codes really well. Very clearly. I like the style and it's easy to collaborate with it on code. Your opinion?
mehow333@reddit
REAP please
ResidentPositive4122@reddit
Calling that license "modified MIT" is a farce. Either do or don't, up to you, but at least call it what it is.
jreoka1@reddit
I bought their $10 a month token plan and used it heavily without even coming close to the weekly limit. That's how it should be done IMO.
SnooPaintings8639@reddit
I am so happy for this release. The previous version of this model, M2.5, is my daily driver at Q2, really capable.
Hope it works well and gets quantized ASAP. With M2.5 I could not make it work under ik_llama.cpp (it kept going into loops), and mainline llama.cpp has a bug that removes the initial thinking tag, so some UI tools have a hard time parsing the output. But after I dealt with this, it was a great model even for long-context work!
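A rough sketch of the kind of workaround I mean for the missing opening tag, assuming the chat template wraps reasoning in <think>...</think> (the helper is just illustrative, not from any tool):

```sh
# Prepend the opening <think> tag when the reply starts mid-reasoning,
# so downstream UI parsers can split reasoning from the final answer.
fix_think() {
  local reply="$1"
  case "$reply" in
    '<think>'*)   printf '%s' "$reply" ;;          # opening tag already present
    *'</think>'*) printf '<think>%s' "$reply" ;;   # closing tag only: add the opener
    *)            printf '%s' "$reply" ;;          # no reasoning tags at all
  esac
}
```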
VampiroMedicado@reddit
It says it's Claude lol
DarkGhostHunter@reddit
Great!
Back to Qwen Code I guess...
Aaaaaaaaaeeeee@reddit
Entertainment? 🤗
MadPelmewka@reddit
https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
Aromatic-Flatworm-57@reddit
What a time to be alive
Acceptable_Home_@reddit
Hell yeah