"Commercial Use" means any use of the Software or any derivative work thereof that is primarily intended for commercial advantage or monetary compensation, which includes, without limitation:
(i) offering products or services to third parties for a fee, which utilize, incorporate, or rely on the Software or its derivatives,
(ii) the commercial use of APIs provided by or for the Software or its derivatives, including to support or enable commercial products, services, or operations, whether in a cloud-based, hosted, or other similar environment, and
(iii) the deployment or provision of the Software or its derivatives that have been subjected to post-training, fine-tuning, instruction-tuning, or any other form of modification, for any commercial purpose.
4(ii) seems to be the point that needs expert interpretation. For me, if my software does not depend on the model in any way, it could be in the clear. The outputted code would have been obtained through a harness like OpenCode, which itself does depend on the model to operate, but is non-commercial.
What does it mean to support or enable an end product or operations?
This is Reddit and will get lost, but just for the record, their own blog post says "with human productivity already fully unleashed, the natural next step was to initiate self-evolution." That's a polite way of Chinese saying the human ML engineers already gave everything they could, so now the model takes over their tasks, they don't need low-level ML engineers, pack your bags, get out. Even ML low-level engineers are being replaced, and very little HIL and everyone here cheers like this doesn't concern anyone as long as MiniMax (or anyone else with the same or similar approach) keep releasing models.
I get these model providers only get a moment to have to benchmarks so they have to milk it. It seems all these Chinese models are playing with what they will open as public weights now.
I would be willing to pay a reasonable price to access weights legally so self hosting is still valuable to them. This model is most beneficial right now to people with 256gb since you can get a good quant for a model performing near SOTA in benchmarks. In the cloud there's objectively better options. On a 256gb machine, this is probably the best option still on paper IMO. For companies with several h100s this is also one of the best options. So I think there's a market.
I prefer free, but I also prefer options that don't require subscriptions. If they price it for industry, though, then I still have no options, and then it becomes a black market so...? lol
Tbh I used MiniMax a bit for coding and for me it's nowhere near Claude, GPT, or even GLM/Qwen/Kimi.
I think it was just trained for benchmarks, but in real-life work scenarios it's not as good.
Maybe it was closed at some point or I'm just misremembering. Good to know, though.
In any case though GLM is gargantuan, nobody will ever be able to run it at home. MiniMax m2.7 performs 99% as well at 25% the size, and based on quick mental math should fit into a mac studio at full precision, and at 8bit it should fit EASILY into even low end mac studios/minis (ones with only 256gb).
To me, that's what makes m2.7 a milestone release. It applies the 80/20 rule but takes it further with 99/25.
I ran glm 5.1 at home on 256gb ram and 4x 3090 workstation - iq2 kl, 6t/s. Not super useful at that speed, but if you want capability rather than intelligence, i think it still beats 6bit qwen3.5 397b, which is of similar size.
Also, minimax 2.7 is released as 8bit, so the quant will be less, 4 bit etc.
Didn't actually run that one locally, I was just comparing capability at a certain model size. It's DDR4 quad channel; I think only Epycs have DDR4 eight channel.
Oh, Google said its storage size is 1.65TB. M2.7's is ~458GB, which is about a quarter the size. But at any size, my point is just that it's radically smaller for roughly equivalent performance.
I try to stay up to date with open models for software development. Not local, but through openrouter. All the information I care about shows m2.7 is VERY close for a fraction of the cost.
Not under this license, it’s not. Good for hobbyists and researchers, but the important thing about open weight models is keeping the proprietary providers from establishing total control of the market.
In practice this won't actually be enforceable for most people. I could use this to write code for my employer as said below but no one would actually know as the model doesn't phone home.
Oh, I'm thinking about home use anyway. It's finally the smartest model yet that is roughly (not exactly, but roughly) equivalent to GLM 5.1 and can fit in a Mac Studio. It can fit in smaller Mac Studios/Minis (256GB) when quantized to 8 bits or slightly less.
“Home use” here does not include writing code that you will use for your employer or for your own software that you intend to sell. The license prohibits all of that, from what I can see. Just FYI. (IANAL, of course.)
Right, but employers are the entities the company needs to generate money from. Getting to this model costs an incredible amount of money. If you don't earn money from those who actually do have deep pockets, like corporations who use your model to compound their profit margins, then you're not going to get money from anyone.
As a counterpoint, as far as I know there's nothing actually forcing anyone to disclose if they use minimax commercially.
Beyond that, I'm not in the crypto bro camp that believes all local model use must be in pursuit of profit; it's OK to vibe code to make projects and apps that are useful to me that would never exist otherwise, and if I have some fun and learn along the way then that's even better.
I don't use local models for coding because I have access to the paid ones, but if I did use local models (and hopefully next year they'll be good enough) then it's hard for me to see what would prevent me from using any local model and ignoring the license.
You cannot use this for anything other than hobby or research, and there's no clear-cut path to doing more. It seems you need to contact MiniMax and reach a case-by-case agreement.
I mean you literally can, right? You're just not technically allowed to? Not that lawyers have ever agreed on anything anyways.
I think the license is intended more as a means to prevent large companies, the kinds who would be afraid of getting investigated and sued, from using it without whatever agreement you're referring to. I don't think minimax ultimately cares, or could afford to care, or could ever prove, if individuals are using it commercially for many use cases.
Unlike models such as GLM, Kimi, or DeepSeek, I can run MiniMax locally at Q3, so from my point of view, MiniMax is much better than those three, unless GLM releases Air again.
"Elias, please compile a website about horse merchandise. Do not act like your rival Arthias would do :
- failing to follow community guidelines
- modifying reference files
- making mistakes
This horse merchandise is really important to defeat the enemy kingdom. Please neigh if you understand.
"
I am so happy about this release. The previous version of this model, M2.5, is my daily driver at Q2, really capable.
Hope it will work well and get quantized ASAP. With M2.5 I could not make it work under ik_llama.cpp (it was going into loops), and mainline llama.cpp has a bug that removes the initial thinking tag, which some UI tools have a hard time parsing. But after I dealt with that, it was a great model even for long-context work!
TemporalAgent7@reddit
What is the cheapest hardware that can run this at 4-bit quant and above?
wiltors42@reddit
Maybe 2x Strix Halo boxes?
ResponsibleHead8778@reddit
Currently running on one 128GB Strix Halo box: unsloth/minimax-m2.7-UD-IQ4_XS using a forked turboquant llama.cpp. 132k context window, getting around 20-30 tok/sec (eyeballed, still need to confirm).
sword-in-stone@reddit
exact dependencies and setup on strix? can you ask your agent to create an MD file for the setup which I can pass to my agent pls
ResponsibleHead8778@reddit
sword-in-stone@reddit
goat'd
wiltors42@reddit
Wow that sounds great. I’m on main llama.cpp and Minimax m2.7 q3 @ ~80k context. It barely fits and quality is not quite perfect.
ttkciar@reddit
It should work okay with pure-CPU inference on my $800 Xeon E5-2660v3 system with 256GB DDR4. Looking forward to giving it a spin.
florinandrei@reddit
1 token / second
ttkciar@reddit
With 10B active, probably closer to 3/second, which means about 80K tokens overnight while I sleep.
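(Back of the envelope: 3 tokens/sec × 8 hours × 3600 sec/hour ≈ 86K tokens, so ~80K overnight checks out.)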
Maleficent-Ad5999@reddit
That’s great. 60 tokens per minute
FatheredPuma81@reddit
-signed, ChatGPT
ReactionaryPlatypus@reddit
I am running iq4_xs on Strix Halo 128gb + 3090 egpu 24gb.
oxygen_addiction@reddit
What speeds are you getting?
ReactionaryPlatypus@reddit
STRIX HALO + 3090 (MINIMAX M2.5 - IQ4_XS)
prompt eval time = 15260.10 ms / 4112 tokens (3.71 ms per token, 269.46 tokens per second)
eval time = 25127.82 ms / 623 tokens (40.33 ms per token, 24.79 tokens per second)
total time = 40387.92 ms / 4735 tokens

prompt eval time = 176629.47 ms / 26166 tokens (6.75 ms per token, 148.14 tokens per second)
eval time = 66263.78 ms / 614 tokens (107.92 ms per token, 9.27 tokens per second)
total time = 242893.25 ms / 26780 tokens
oxygen_addiction@reddit
Absolute legend. Thanks!
ForsookComparison@reddit
Q4_K_S was like 125GB on disk or something, so ideally have 140GB+ total to do some actual work (and probably run nothing in parallel).
But be warned: Q4 was damn near unusable for MiniMax M2.1 and M2.5 compared to the full-weight versions. It drops off way harder under quantization than other popular models.
Geximus-therealone@reddit
Why? Some 4-bit quants keep a lot of layers in BF16.
Sufficient_Prune3897@reddit
Sparse MoEs seem to suffer a lot more. I noticed the same way back with GLM Air. Even Q4 was pretty random, and I didn't even code with it.
Serprotease@reddit
A five-year-old AMD server or Intel workstation with 6+ memory channels, 256GB of the cheapest ECC DDR4 you can get, plus an Ampere 24GB GPU and ik_llama. Or a second-hand M2 Ultra 192GB Mac Studio.
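The usual trick with a box like that is to keep the routed experts in system RAM and put everything else on the 24GB card. A minimal sketch, assuming mainline llama.cpp/ik_llama flag names (exact spellings vary by build) and an illustrative model filename:

```bash
# Sketch: MoE split for a 24GB GPU + big-RAM server.
# -ngl 999 offloads all layers, then -ot overrides the expert FFN tensors
# back onto the CPU, so only attention, shared tensors and the KV cache
# stay in VRAM while the bulk of the weights stream from system RAM.
llama-server \
  -m MiniMax-M2.7-Q4_K_S.gguf \
  -c 65536 \
  -ngl 999 \
  -ot "exps=CPU"
```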
Head_Bananana@reddit
I'm running this on a Mac Studio M2 Ultra 200GB now; it's using 121GB of RAM.
Thrumpwart@reddit
14x AMD Mi50s…
joeyhipolito@reddit
non-commercial kills it for me. cool benchmark numbers but if third party hosters can't pick it up commercially it's basically a hosted-only model with extra steps.
Beginning-Window-115@reddit
I regret only buying the m5 pro 48gb and not the m5 max 128gb...
TheItalianDonkey@reddit
i have the 128gb. i'm currently running gemma-4-31b.
no way this fits.
ResponsibleHead8778@reddit
I have a Strix Halo machine with 128GB RAM. Just downloaded MiniMax-M2.7, running llama.cpp turboquant with a 132k-token context window. I generate roughly 20-30 tok/sec; prefill speeds are around 17 tok/sec however, so RAG is much needed.
TheItalianDonkey@reddit
What quantisation? You must be going for a 2 or 3, right? At those quants I was reading everywhere that a smaller model is preferred due to the loss. Have you done any testing, if those are indeed your specs?
ResponsibleHead8778@reddit
Last test was a contextual conversation where the context slowly grew. After a few prompts the prefill slowed to a crawl and everything started to take much longer. So it's good for one-shotting, but I wouldn't recommend it for everyday use with these specs.
ResponsibleHead8778@reddit
The only real test I did was: "I want you to design a full-on website for Bleach New Worlds 3, a Bleach game on Roblox. I want you to search the web, find the correct colors and styles to use, and gather some images for the site. Make it modern with animations. Just CSS, JavaScript and HTML, 1 file." It generated a 1400-LOC file that worked great first shot; the website had animations and everything worked.
ResponsibleHead8778@reddit
The 4-bit quant Unsloth/Minimax-m2.7-UD-IQ4_XS uses like 112-113GB of RAM, and the context window was around 32k. So I used turboquant for my KV cache and got it up to a 132k context window. I gave it a single text of around 100k tokens and it was able to load it completely into RAM and respond accordingly (the prefill was running at around 17 tok/sec and took 2 hours). However, when running real-world prompts I was getting 65 tok/sec prefill and responses were generally around 25 tok/sec.
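For reference, the launch looks roughly like this. A sketch only, using mainline llama.cpp flag names and an illustrative filename; the turboquant fork's KV-cache options may be spelled differently:

```bash
# Sketch: IQ4_XS weights (~112GB) plus a quantized KV cache so a ~132k
# context still fits in 128GB of unified memory on Strix Halo.
# Older builds may also need flash attention enabled (-fa) before a
# quantized V cache is accepted.
llama-server \
  -m MiniMax-M2.7-UD-IQ4_XS.gguf \
  -c 132000 \
  -ngl 999 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja
```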
YoussofAl@reddit
QWEN 3.5 27B will get 80% of the strength of this model anyways.
ForsookComparison@reddit
I've been running the closed-weight version MiniMax serves for a few weeks. Qwen3.5 27B (my favorite on-prem model lately) is not a serious competitor for this if you're talking about agent work and coding.
YoussofAl@reddit
It’s not a serious contender, but it is a good substitute. Like how Sonnet is 80% of Opus. I feel the same way between Qwen 3.5 27B and Minimax M2.5. Then again, I haven’t tested 2.7 yet so we’ll see.
ForsookComparison@reddit
Wait. Where's that opinion formed from then?
YoussofAl@reddit
2.5
_-_David@reddit
You're getting downvoted, but it's not an insane take. It's all about your use-case. There will be things that MiniMax-2.7 will be able to do, but Qwen-3.5 27b can't do at all, and plenty of things that they both do exactly as well. The situation is black, white, and grey all at the same time.
eMperror_@reddit
Isn't it way too large for 128gb anyways?
waitmarks@reddit
I run 2.5 at Q3_K_XL on 128G and it’s quite usable. I can’t max out its context, but it’s still very useful.
Mysterious_Finish543@reddit
How much context are you able to run at with Q3_K_XL?
pilibitti@reddit
128 context. I only ask yes no questions.
Ok_Technology_5962@reddit
Use caveman mode. And glm 5.1 really degrades past 100k anyways
Danfhoto@reddit
I use it with OpenClaw and have the context limit set to 90,000, haven’t had issues. The q3 UD quants are quite good.
Storge2@reddit
Also interested: can this run somehow on a DGX Spark 128GB?
Fresh-Grocery-3847@reddit
I'm going to try downloading unsloth/MiniMax-M2.7-GGUF with the hf CLI, pulling just the UD-IQ4_XS files, which are 108GB; rough command below.
And then perhaps, if it's too slow, try UD-Q3_K_S or UD-IQ3_S.
I'll update my findings later.
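Roughly this, assuming the hf CLI from huggingface_hub; the --include pattern may need wildcards depending on how the files are named inside the repo:

```bash
# Sketch: pull only the UD-IQ4_XS split (~108GB) from the Unsloth repo.
hf download unsloth/MiniMax-M2.7-GGUF \
  --include "*UD-IQ4_XS*" \
  --local-dir unsloth/MiniMax-M2.7-GGUF
```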
Fresh-Grocery-3847@reddit
Going back to Qwen3.5-122b; quantization on MiniMax is terrible. https://x.com/bnjmn_marie/status/2027043753484021810
cafedude@reddit
Also interested in running this on a 128GB Strix Halo box. I suspect we'd need a 2-bit quant.
ReactionaryPlatypus@reddit
I am running iq3_m Minimax M2.5 on 128gb Strix Halo Tablet as my daily driver.
ObiwanKenobi1138@reddit
What kind of speeds are you seeing?
ReactionaryPlatypus@reddit
STRIX HALO (MINIMAX M2.5 - IQ3_M)
prompt eval time = 18513.51 ms / 4112 tokens (4.50 ms per token, 222.11 tokens per second)
eval time = 18429.76 ms / 396 tokens (46.54 ms per token, 21.49 tokens per second)
total time = 36943.27 ms / 4508 tokens

prompt eval time = 234712.43 ms / 26166 tokens (8.97 ms per token, 111.48 tokens per second)
eval time = 93301.59 ms / 700 tokens (133.29 ms per token, 7.50 tokens per second)
total time = 328014.03 ms / 26866 tokens
georgeApuiu@reddit
If you REAP it you might be able to. I’m using the minimax 2.5 REAP on a single dgx spark
rpkarma@reddit
You'd need to cluster two via the ConnectX-7 link, and honestly it's gonna get kind of shredded by our lack of memory bandwidth I think.
I'm still going to try though lol, I love my little Asus GX10
texasdude11@reddit
On two of them
xraybies@reddit
https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-100GB-MLX
Ok_Technology_5962@reddit
Using one of those JANG quants at low bits per weight is good, that or an oQe quant once someone drops that.
InternetNavigator23@reddit
Yeah I think I heard he is planning on using some dynamic 2.7 bit or something.
Should be perfect for 128 GB of RAM. Pretty excited for it honestly.
Beginning-Window-115@reddit
it would work at UD-Q3_K_XL 🥲
eMperror_@reddit
Nice, can't wait to try it then! (M5 max 128gb) :D
Beginning-Window-115@reddit
I envy you
-dysangel-@reddit
I've been using M2.1 @ IQ2_XXS (75GB) fine on my Mac Studio
PinkySwearNotABot@reddit
I have the M1 Max 64GB and I regret not getting the 128GB
TheItalianDonkey@reddit
i have the 128gb. i'm currently running gemma-4-31b.
no way this fits.
kovexex@reddit
I have it too, don't run a dense model lol. Shits gonna be cooked, run the 26b-a4b bf16 at 60tps low context or down to 30tps at max context
330d@reddit
There was never an M1 Max with more than 64GB, so it's a bit of a confusing statement, unless you mean you bought it recently, when other options were available? I also have the 64GB M1 Max and it's still a beast; it has allowed me to experiment with local models for years now.
marco89nish@reddit
What are you running on that, I'm looking for good models for my 48GB M4 Pro? Also, ollama, mlx or lm studio?
Beginning-Window-115@reddit
I mainly use "omlx" not "mlx" it has ssd caching so it's pretty fast, and my main model is Qwen3.5 27b at 4bit or if I need speed Qwen3.5 35b (moe).
thphon83@reddit
For how long have you been using omlx? I tried a couple of weeks ago with qwen3.5 122b and had to stop because there was a bug and the moment the context filled up a bit it started to forget things and get into infinite loops.
Beginning-Window-115@reddit
Yeah, there was a bug not that long ago that caused memory to fill up a ton, but it was quickly fixed, so maybe that's what you hit. Now it should be good. Make sure to fill in the parameters for the model you are using, and don't use too low of a quant on omlx, since the quants aren't as good as GGUF (also, there's turbo quant as a bonus).
itsmeemilio@reddit
How do you go about using omlx? Seems like it could be interesting for maybe running larger models possibly?
d4mations@reddit
R/omlx
Beginning-Window-115@reddit
Just start by looking at the GitHub repo and reading the instructions to install it. Once installed, have a look at the settings and get a general idea of what is what (most things can be left untouched). You can download models from omlx, which makes it way easier (MLX models only), so I recommend looking at the mlx-community HF account for models.
itsmeemilio@reddit
Wow thank you for putting me onto this. What a find.
Are you aware if it's possible to run models larger than unified memory would normally allow?
E.g. a 70B or 90B model on a 48GB system?
marco89nish@reddit
This poster claims he's running huge MoE models that can't fit in RAM on MacBooks; I haven't given it a shot yet. Let me know if you try it: https://www.reddit.com/r/LocalLLaMA/comments/1shediw/comment/ofc46y5/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
Beginning-Window-115@reddit
I don't think so, and even if you could, I wouldn't recommend it because it would be extremely slow. But you can run large models quantised as long as they fit into RAM.
Cybertrucker01@reddit
Why not the M5 Studio 256gb?
thrownawaymane@reddit
Can't buy something that doesn't exist yet
segmond@reddit
if you have the money, sell it and buy 128gb, are you going to live the rest of your life in regret?
ajblue98@reddit
Ditto M4 Max 36
digitaldisgust@reddit
The random Chinese text showing up in responses that are meant to be fully English is enough for me to delete my MiniMax account tbh. Very annoying. 🤦🏽♀️
Infinite_Hand7076@reddit
Would a Q3 or Q2 version work on an AI Max 395 128GB?
ResponsibleHead8778@reddit
If you're just using the AI Max for inference, you can run turboquant llama.cpp with unsloth/minimax-m2.7-UD-IQ4_XS and have a 132k context window too. The prefill is ass, just be aware if you're trying to load a lot into it.
misha1350@reddit
Yes
FrozenFishEnjoyer@reddit
I'm out here reading what's new here, checking what quants are available, and looking at the graph...but I only have 16GB VRAM.
The life of poors are sure difficult.
Maleficent-Ad5999@reddit
I wish you'd buy a couple of RTX Pro 6000s and never worry about VRAM some day in the future.
Eyelbee@reddit
You'd still have to worry about vram
Sufficient_Prune3897@reddit
This. I probably would have drunk the Kool-Aid and spent 7k on one, but with how quickly MoEs have escalated in size, it wouldn't even unlock anything I can't run now.
Maleficent-Ad5999@reddit
Can you give me a rough number on how much would feel like enough?
Ok_Technology_5962@reddit
1 terabyte of VRAM feels good.
Maleficent-Ad5999@reddit
Even then, bigger models are FP8 and beyond, and would require more VRAM for context size... so maybe 2TB of VRAM?
Ok_Technology_5962@reddit
Ugh... you are right, but I also saw that monster 2-trillion-param model that NousResearch has... and obviously 10 trillion is coming soon.
Maleficent-Ad5999@reddit
yet here we are dealing with GPUs of 8GB, 12GB, 16GB in consumer space.
Sufficient_Prune3897@reddit
My point is, the ram requirements are constantly increasing. GLM got 2x bigger from 4.7 to 5, Qwen increased from 235B to 400B and Minimax 3 is probably gonna do the same.
If I want to run GLM 5 in VRAM, I'm gonna need like at least 384GB of VRAM, and that's at a bad quant.
Personally I would really like 192 so that I can at least fine-tune and train all the 'smaller' 100b models myself.
Maleficent-Ad5999@reddit
Agreed
Maleficent-Ad5999@reddit
Well then when would we ever stop accumulating more vram
Nobby_Binks@reddit
Unfortunately it's a bit like money - the more you have the more you want
a9udn9u@reddit
I have 32GB and I always think 48GB would be nice, when I got 48GB I'd want 64GB. You will never be satisfied unless you have multi-TB VRAM.
krileon@reddit
I'm on 20GB. It's such a weird spot to be in. It's a decent amount, but just shy of enough.
grumd@reddit
Depending on how much RAM you have you might still be able to run a Q2-Q3 quant
srigi@reddit
The Q_1 quant is 60GB. I have 64GB RAM, so no luck even trying to load the weights.
grumd@reddit
Might run with a small context at least for testing. But yeah for 64GB+16GB you need to look at models 45-50gb max
Darkoplax@reddit
6GB VRAM here :(
BuyHighSellL0wer@reddit
Here's me running models on my 4GB RX550.
There's always somebody poorer ha!
DR4G0NH3ART@reddit
Well, I was doing it for GLM 5.1 and ran that model on my 5070 Ti in my head and got good results. One day, one day I will make an agent that can hallucinate as well as me locally.
RonJonBoviAkaRonJovi@reddit
https://i.redd.it/rfxxnjvl2oug1.gif
Morphon@reddit
Anyone know if there's a group out there planning to make a TQ1 quant for this?
sgmv@reddit
you probably don't want this, it's not great even at q8
FullstackSensei@reddit
Unsloth GGUFs when?
asfbrz96@reddit
Bartowski better
FullstackSensei@reddit
TBH, between the two it's like splitting hairs. I use Unsloth because they provide documentation for best params, they're generally active here, and they often get early access so their quants drop sometimes at the same time the model drops.
asfbrz96@reddit
I tried both. I usually get better output with Bartowski, and I got a bunch of infinite loops in the thinking part using Unsloth.
Beginning-Window-115@reddit
I think Unsloth is just so early with their quant releases that it doesn't give llama.cpp time to fix bugs, which kind of gives them a bad rep. Although once everything works, their quants are usually pretty good.
dangered@reddit
That’s fairly important though.
It seems like a “good problem to have” but there reaches a point that it really isn’t.
Even Linux power users leave Arch for the same exact problem (I used to use Arch btw, tips fedora).
FullstackSensei@reddit
To be fair, more often than not the unsloth brothers are the ones who uncover the existence of those bugs. They also find tokenizer bugs in the released model more often than I thought possible.
dangered@reddit
Same with arch users. It’s necessary for the open source lifecycle. But is it necessary for you as the user?
If you’re active in the forums finding what is causing bugs and posting workarounds or patches then you’re key to the process. If you’re not, there’s a chance you’re just inflicting pain on yourself to the benefit of no one.
I’m in no way saying “unsloth bad” but it might not be the right choice for a lot of people and it has to be acknowledged. Many people leave or never make it into communities because they are told to use the bleeding edge but become too frustrated trying to get it to work to continue.
When that happens enough times, the product gets a bad name because the wrong people were using it and now they all say “unsloth bad”
FullstackSensei@reddit
I'm not sure what's the point you're trying to make, or what is the connection with arch.
Neither me nor anyone using their quants is testing anything. The unsloth brothers, or Bartowski or anyone making quants for their job are not regular users. They're like the maintainers of one package or one part of the kernel, who find bugs in other parts or other packages during their job and report those.
If you're going to blame maintainers for finding bugs, I am really out of words for how to respond to this.
dangered@reddit
The similarity I was making was referring to the breaking releases when you pull :latest because nothing else has caught up yet.
Whether it’s compatibility issue with Ollama, a bug from the base model itself, or a driver issue.
You might not have known this but we are. Every day we’re raising and discussing issues in the forums with the unsloth brothers themselves.
Dan Han said:

> Hey everyone, we've updated the quants again to include all of Google's official chat template fixes (which fixed/improved tool-calling), along with the latest llama.cpp fixes.
> We know there has been a lot of re-downloading lately, so we appreciate your patience. We're pushing updates whenever fixes become available to make sure you always have the latest and best-performing quants.
> NVIDIA is working on the CUDA 13.2 issue. Until it is fixed, do not use CUDA 13.2.

Someone else in the thread linked to a GitHub repo that has a fix; the repo has an explanation of the change that fixed another issue:

> This fixed the same issue for me: https://github.com/asf0/gemma4_jinja/
I don’t “blame” anyone for these issues, this is how it’s supposed to work. This is the true power of open source development. I can’t stress enough how necessary this is for open source software.
The key point I’m making is that not every user even knows about this side of the process. It’s important to let them know.
FullstackSensei@reddit
You might be trying to shed light on the process, but IMO the impression you're giving is quite negative, especially the comparison with Arch. The comparison with ollama, the parasites of the open source world, only reinforces this.
Nobody needs to download a model on the day it's released, even if there's "day-0" support. This is even more true when the model brings architectural changes. Those who want to live on the bleeding edge will of course do. But for the vast majority, waiting a week or so will ensure they don't go through any headaches, even when the internet fosters FOMO. You're not missing anything by waiting a week or even two. I haven't downloaded any of the Gemma 4 models for this very reason.
yoracale@reddit
Regarding the issue dangered mention, the users who had the unused token issue didn't use the updated unsloth quant or update llama.cpp. A user who originally commented about the unused token issue, later edited their comment to 'thanks' because they realized that the updated quants and updating llama.cpp fixed the issue: [https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24#69daf14e98f472d6c455173d](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24#69daf14e98f472d6c455173d)
The original fix was already merged into the quant when they were writing that the unused token issue still occured: https://github.com/ggml-org/llama.cpp/issues/21321#issuecomment-4206217353
These instances are unfortunately user errors, otherwise hundreds of other people would be complaining as well.
FullstackSensei@reddit
You can explain all you want. Unfortunately, 90% will simply not read it, and 99% of those who do will misunderstand it.
yoracale@reddit
Agreed. The issue is that the majority of the time, users think the issue is our quants or something we did, when in 99% of cases it's most likely not. It's the beauty of open-source but also a curse.
Ollama incompatibility issues = our fault
unused token issue which was merged but users didn't update or use updated quants = our fault
when Google officially updated their chat template = our fault
And so no matter how many paragraphs they write, at the end of the day they just want to blame someone, aka us, as they think it's our fault for pretty much everything, unfortunately. There's not much we can do from our side except take it and just try to communicate better. That's why in communication we always try to say the issues do not originate from Unsloth, otherwise some people will immediately come to that conclusion, like here: https://unsloth.ai/docs/new/changelog#gemma-4-fixes
dangered@reddit
I understand how you’d see it as negative, I am pointing out the drawbacks to bleeding edge software.
I’ve been very clear in emphasizing I am not saying anything bad about unsloth. I use it and contribute to the community.
No matter how much I like it, I would be dishonest if I said it was for everyone. Earlier in my career I recommended everyone use the “latest and greatest” and realized leaving out the amount of tinkering involved was a huge problem.
Explaining to someone just the basics of the process and letting them make their own decisions is the right way to get a user without turning them into a hater of the platform.
Simply saying “wait 2 weeks for a stable release” would get more people to be adopters rather than detractors. Pretending it’s not worth mentioning because we have the acumen and time to fix it and get the new shiny thing working is not helping anyone.
Go look at how many failed plex implementations there are because IT dad skipped “stable” for the new shiny feature and his wife and kids quietly went back to Disney+, Netflix, and Amazon because their show didn’t work that week.
This is probably the biggest blind spot we have as technical people and it hurts the community because those people generally never come back.
yoracale@reddit
The issue is the users who had the unused token issue didn't use the updated quant or update llama.cpp. A user who originally commented about the unused token issue, later edited their comment to 'thanks' because they realized that the updated quants and updating llama.cpp fixed the issue: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24#69daf14e98f472d6c455173d
Original fix which was merged into the quant: https://github.com/ggml-org/llama.cpp/issues/21321
These instances are unfortunately user errors, otherwise hundreds of other people would be complaining as well.
FullstackSensei@reddit
They actively work with the llama.cpp team and the teams releasing models to find and fix bugs. I lost count how many times they found tokenizer bugs that they reported back to the model developers.
yoracale@reddit
Thank you for the support we appreciate it!! <3 <3 <3
wojciechm@reddit
I can confirm that. Regular llama.cpp quantizations are more stable and of higher quality in my usage. Unsloth is just optimized for metrics that do not represent real quality. Recently I even started to use my own quantizations with full output tensor precision (the `--leave-output-tensor` option), and that is the best setup I have been using so far. It does not inflate size significantly, but it does significantly improve quality.
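The recipe is roughly the following; a sketch with placeholder filenames, not my exact command:

```bash
# Sketch: requantize from a high-precision GGUF while keeping output.weight
# in full precision. --leave-output-tensor skips quantizing the output tensor,
# which adds comparatively little size but helps quality noticeably.
llama-quantize --leave-output-tensor \
  model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```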
FullstackSensei@reddit
I use Q8 on <100B models, and Q4 above. Always follow the recommended params. Never had an issue with loops, going back all the way to QwQ.
If the model is not already supported in llama.cpp, I also wait at least a week after initial support in llama.cpp before trying, to make sure most bugs have been resolved. That's why I haven't even downloaded any of the Gemma 4 models yet.
emprahsFury@reddit
and yet, other dude is fighting for his life with the downvotes and you're sitting pretty. This sub sucks celebrity dick way too much. When yeah, it's splitting hairs.
coder543@reddit
It’s under a non-commercial license this time, which is unfortunate.
z_3454_pfk@reddit
licence is really bad lol we won’t even get third party providers so once minimax stops hosting it’ll be gone via api for a lot of people
debackerl@reddit
Uhm, I'm getting it via OpenCode Go
harrro@reddit
Opencode clearly has their own arrangement with multiple providers as they've had MM 2.7 for a while before this release.
debackerl@reddit
Thx, so that supports my point. It's false that minimax was the only provider. I never talked about providers using open weight models, which, actually, MiniMax just released as open last week.
MikeFromTheVineyard@reddit
I’m guessing they’ll privately license it to third party commercial hosters.
I’m guessing the reason that open source models are so much cheaper than private ones is the profit margin built in. All these open source labs will need to recoup their investment somehow eventually. Private licensing seems like an easy way to do that.
oofdere@reddit
use BSL instead of stupid modified MIT licenses that strip away the MIT completely then
TheRealMasonMac@reddit
I think OpenClaw destroyed the economy of coding plans altogether, so they're trying to subsidize thru these kinds of means. It does mean that API providers will likely get more expensive as time goes on.
Momo--Sama@reddit
I don’t think there ever was a functioning coding plan economy. I think from their inception (at least for the American labs) they were meant as loss leader samplers to get people talking about what the models could do and get their employers interested in API accounts. Then December and January happened and suddenly there’s hundreds of thousands of people eating half price appetizers with no intention of ordering entrees and the companies are left to figure out how to get people to stop buying apps and start buying entrees… or leave if they’re never going to buy an entree.
TheRealGentlefox@reddit
Dario claimed like 6 months(?) back that CC was actually profitable on its own.
antunes145@reddit
You hit the nail on the head with that analogy. We will be seeing a large push from companies moving people out of subsidized plans and onto API plans for their agents and vibe coding.
poginmydog@reddit
Or economies of scale happen and GPUs decrease in cost by so much that it makes subsidised plans profitable again.
EbbNorth7735@reddit
Yep, ideally the license would prohibit cloud providers from hosting it without sharing revenue with MiniMax, or would require companies generating over 1 million to share revenue.
OpenSourcePenguin@reddit
How does ollama serve it? (Compared to 2.5?)
rebelSun25@reddit
OpenRouter has one third-party provider, in the US. The same one offers GLM 5.1, DeepSeek, etc.
reto-wyss@reddit
They could always make the license less restrictive later when they have 2.8 or 3.0 - not saying that will happen, but it is possible.
coder543@reddit
I hope they will at least consider that middle ground, if they insist on doing things this way. That’s the territory of something like the BSL (Business Source License), which is not amazing, but… better than being fully proprietary.
reto-wyss@reddit
Yeap - I was pretty excited for this one but that license is rough.
I think I'll stick with Gemma-4-31b and Qwen3.5-122b-a10b and keep hoping for a strong 100b-ish dense model. Devstral-3 ?
comatrices@reddit
The release on ModelScope, which looks to be the same weights, has an entirely different license with no non-commercial clause: https://www.modelscope.cn/models/MiniMax/MiniMax-M2.7/file/view/master/LICENSE-MODEL?status=0
how long before they revise it? lol
also interesting release date in that file
thoquz@reddit
I'm guessing they did it in response to Cursor selling their model (which they fine-tuned) and naming it Composer 2.
It's unfortunate; I hope MiniMax picks a more open licence in the future.
InternetNavigator23@reddit
Cursor used Kimi K2.5 for the base.
NoahFect@reddit
(Shrug) So was the training data. Fuck 'em.
Edzomatic@reddit
God bless going public
PrysmX@reddit
Too bad the new license is ass for anyone that wanted to build anything commercially.
VoiceApprehensive893@reddit
it really is a mini
Asleep_Training3543@reddit
Full GGUF quant set up if anyone needs it — BF16, Q8_0, Q6_K, Q5_K_M live, Q4_K_M/Q3_K_M/Q2_K uploading now.
https://huggingface.co/dennny123/MiniMax-M2.7-GGUF
erazortt@reddit
Please do not create quants yourself if you do not know what you are doing! Why do you have all the small tensors at such small quants?! Especially since MiniMax is very sensitive to quantization, the small tensors must be preserved as much as possible. Actually, this is generally true: the small tensors (all the attn_*) are usually so small that it's just a couple hundred MB of difference, but the quality difference is much bigger. There is a very good reason unsloth, AesSedai, and ubergarm are doing it.
And also, have you generated an imatrix and used it during quantizations? If yes, what raw data have you fed it?
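For reference, the usual imatrix flow looks roughly like this; a sketch with placeholder filenames, where the calibration text should resemble the kind of prompts you actually run:

```bash
# Sketch: build an importance matrix from calibration text, then feed it to
# the quantizer so the low-bit quant preserves the weights that matter most.
llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.gguf
llama-quantize --imatrix imatrix.gguf \
  model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```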
Sufficient_Prune3897@reddit
I bet your fun at parties.
Mochila-Mochila@reddit
you're
Raredisarray@reddit
Yoo TY
Rascazzione@reddit
It seems the model isn't 100% open. There are serious restrictions on its use for any commercial purposes.
As it stands now, the license is more like a product demo. Try it out, and if you like it, pay up.
But since it's a Non-commercial Freeware license, it would be nice to have fixed, transparent pricing for the commercial license. And then, for startups, some kind of exemption up to a certain revenue threshold.
a9udn9u@reddit
I wonder how much that matters to the community (mostly individuals). These are not like traditional software components which small companies or indie developers would embed into their products. These require data centers to host, only big players with deep pockets can do that.
If you run a business and make a profit on top of models MiniMax spent $$$$$ to train, I say it's only fair for you to pay a license fee to them.
7734128@reddit
It's fair for them to charge a fee, of course, but it's too small of an improvement over 2.5 for that to make sense.
They should have waited for a step change in performance.
InternetNavigator23@reddit
My thoughts exactly. Don't let other people host it and compete directly. Be clear about commercial and let startups use it under 100m revenue.
Fine-Profession-3204@reddit
M2.7 scored 78% on SWE-bench Verified vs Claude Opus 4.6's 55% — the biggest gap on the benchmark practitioners trust most for predicting real engineering performance. But it also generated 87M output tokens during the Artificial Analysis evaluation (the median is 26M), meaning real per-task cost can run 3x+ the headline rate. Full benchmark table, ECPT cost framework, and the BridgeBench regression most reviews skip are in the breakdown: https://aithinkerlab.com/minimax-m2-7-vs-gpt4-claude-benchmarks/
YoussofAl@reddit
This is going to be the most impactful release of Q2 this year. (Unless Minimax M3 releases)
Not only is it a powerful model, but it can actually be run by people unlike GLM.
jon23d@reddit
I'm super excited to have this, but if we aren't supposed to use it to make works that we sell, it's suddenly far less useful to me.
bootlickaaa@reddit
The way I'm reading it, using it for coding might be allowed, as long as the resulting work product (the code) does not depend on the model at runtime to power a commercial product. I could be wrong.
(i) offering products or services to third parties for a fee, which utilize, incorporate, or rely on the Software or its derivatives,
(ii) the commercial use of APIs provided by or for the Software or its derivatives, including to support or enable commercial products, services, or operations, whether in a cloud-based, hosted, or other similar environment, and
(iii) the deployment or provision of the Software or its derivatives that have been subjected to post-training, fine-tuning, instruction-tuning, or any other form of modification, for any commercial purpose.
4(ii) seems to be the point that needs expert interpretation. For me, if my software does not depend on the model in any way, it could be in the clear. The output code would have been obtained through a harness like OpenCode, which itself does depend on the model to operate but is non-commercial.
What does it mean to support or enable an end product or operations?
jon23d@reddit
That’s my reading too. It’d be nice to get some clarification
Sliouges@reddit
This is Reddit and will get lost, but just for the record: their own blog post says "with human productivity already fully unleashed, the natural next step was to initiate self-evolution." That's a polite Chinese way of saying the human ML engineers have already given everything they could, so now the model takes over their tasks. They don't need low-level ML engineers anymore, pack your bags, get out. Even low-level ML engineers are being replaced, with very little human-in-the-loop, and everyone here cheers like this doesn't concern anyone, as long as MiniMax (or anyone else with the same or similar approach) keeps releasing models.
bwjxjelsbd@reddit
What's the HW to run this?
Can a macbook Pro M5 Max run it?
misha1350@reddit
Newer posts regarding M2.7 suggest that a 128GB RAM model can, given some heavy quantization.
CertainlyBright@reddit
I love how these are "licensed" like they cared about copyright licenses of the data they trained from. Ima use models however I want lol
Recoil42@reddit
segmond@reddit
Why don't they ever compare with their peers? I want to see how it compares to GLM-5.1, Kimi K2.5, Qwen3.5-297B, etc.
Inevitable-Plantain5@reddit
I get that these model providers only get a moment at the top of the benchmarks, so they have to milk it. It seems all these Chinese labs are rethinking what they will open as public weights now.
I would be willing to pay a reasonable price to access weights legally so self hosting is still valuable to them. This model is most beneficial right now to people with 256gb since you can get a good quant for a model performing near SOTA in benchmarks. In the cloud there's objectively better options. On a 256gb machine, this is probably the best option still on paper IMO. For companies with several h100s this is also one of the best options. So I think there's a market.
I prefer free, but above all I prefer options that don't require subscriptions. If they price it for industry, though, then I still have no options, and it just goes black market, so...? lol
InternetNavigator23@reddit
Because reasons. Lol
I'd say just under GLM, around Kimi/Qwen. The main highlight here is that for the size they are awesome.
Real_Ebb_7417@reddit
Tbh I used MiniMax a bit for coding, and for me it's nowhere near Claude, GPT or even GLM/Qwen/Kimi. I think it was just trained for benchmarks, but in real-life work scenarios it's not as good.
Wooden_Yam1924@reddit
Is something wrong with this repo? I see only 124 of 130 safetensors.
Manwith2plans@reddit
Was so excited for this, but it's a non-commercial license, which severely limits the utility for me :(
Kind-Abies8738@reddit
...why? You realise it's little more than a suggestion right?
rpkarma@reddit
Not when it would be super useful to host at work. Our legal team would have a fit if we tried.
We'll probably end up paying them instead.
Kind-Abies8738@reddit
If your operation is big enough to have a "legal team" then yeah. But then I don't feel sorry for ya ;)
rpkarma@reddit
Yeah that’s why I said we’ll probably end up paying them so we can host it ourselves!
Kind-Abies8738@reddit
Ah, gotcha. The "instead" bit threw me off.
Virtamancer@reddit
Is this the most important open source (actually large) LLM release since OG deepseek?
Darkoplax@reddit
GLM is still the leader in Open weight
Minimax, Kimi, Qwen and Deepseek all chasing them rn
Edzomatic@reddit
From my testing glm, especially glm 5.1, is better in general. But minimax is much smaller and punches well above its weight
Virtamancer@reddit
I thought GLM isn't open source/weights/whatever.
coder543@reddit
Not sure where you got that impression: https://huggingface.co/zai-org/GLM-5.1
Virtamancer@reddit
Maybe it was closed at some point or I'm just misremembering. Good to know, though.
In any case, though, GLM is gargantuan; nobody will ever be able to run it at home. MiniMax M2.7 performs 99% as well at 25% the size, and based on quick mental math it should fit into a Mac Studio at full precision, and at 8-bit it should fit EASILY into even low-end Mac Studios/Minis (the ones with only 256GB).
To me, that's what makes m2.7 a milestone release. It applies the 80/20 rule but takes it further with 99/25.
sgmv@reddit
I ran GLM 5.1 at home on a 256GB RAM, 4x 3090 workstation - IQ2_KL, 6 t/s. Not super useful at that speed, but if you want capability rather than intelligence, I think it still beats 6-bit Qwen3.5 397B, which is of similar size.
Also, MiniMax 2.7 is released as 8-bit, so the quants will go lower from there, 4-bit etc.
330d@reddit
How fast do you run the 397B on that hardware? I assume the RAM is 8-channel DDR4?
sgmv@reddit
Didn't actually run that one locally, I was just comparing capability at a certain model size. It's DDR4 quad-channel; I think only Epycs have eight-channel DDR4.
shroddy@reddit
I suddenly feel very poor...
Thrumpwart@reddit
I have a 192GB Mac Studio and that comment made me feel poor.
coder543@reddit
It is not 99%, and 229 is not 25% of 754.
I was very excited for this release too, until I saw the license.
Virtamancer@reddit
Oh, Google said its storage size is 1.65TB. M2.7's is ~458GB, which is about a quarter the size. But at any size, my point is just that it's radically smaller for roughly equivalent performance.
I try to stay up to date with open models for software development. Not local, but through openrouter. All the information I care about shows m2.7 is VERY close for a fraction of the cost.
hainesk@reddit
I think M2.7 is trained in FP8, so its size is 230GB.
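Rough back-of-the-envelope math, using the numbers from this thread (~229B params for M2.7, ~754B for GLM 5.1), so take it as an estimate:

```sh
# FP8  ~ 1 byte/param  -> ~229 GB for M2.7 (the native release precision)
# BF16 ~ 2 bytes/param -> ~458 GB for M2.7, ~1.5 TB for GLM 5.1
echo "scale=2; 229/754" | bc   # prints .30, i.e. roughly 30% of GLM 5.1's size, not 25%
```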
Edzomatic@reddit
GLM 5 and 5.1 are both open source. The only model in the family to not be open sourced is 5-turbo
robertpro01@reddit
What's the size?
gjallerhorns_only@reddit
230B total parameters
robertpro01@reddit
It is actually a very good size for that benchmark
coder543@reddit
Not under this license, it’s not. Good for hobbyists and researchers, but the important thing about open weight models is keeping the proprietary providers from establishing total control of the market.
zxyzyxz@reddit
In practice this won't actually be enforceable for most people. I could use this to write code for my employer, as mentioned elsewhere in the thread, but no one would actually know, since the model doesn't phone home.
Virtamancer@reddit
What are the bad limitations?
coder543@reddit
The license is strictly non-commercial.
Virtamancer@reddit
Oh, I'm thinking about home use anyway. It's finally the smartest model ever (roughly, not exactly, but roughly, equivalent to GLM 5.1) that can fit in a Mac Studio. It can fit in smaller Mac Studios/Minis (256GB) when quantized to 8 bits or slightly less.
coder543@reddit
“Home use” here does not include writing code that you will use for your employer or for your own software that you intend to sell. The license prohibits all of that, from what I can see. Just FYI. (IANAL, of course.)
winterscherries@reddit
Right, but employers are the entities the company needs to generate money from. Getting to this model costs an incredible amount of money. If you don't earn money from those who actually do have deep pockets, like corporations who use your model to compound their profit margins, then you're not going to get money from anyone.
muyuu@reddit
If you run it at home, this isn't enforceable.
It will just prevent competitors from selling Minimax 2.7 tokens.
Virtamancer@reddit
I hear you. And I get that sucks for some people.
As a counterpoint, as far as I know there's nothing actually forcing anyone to disclose if they use minimax commercially.
Beyond that, I'm not in the crypto bro camp that believes all local model use must be in pursuit of profit; it's OK to vibe code to make projects and apps that are useful to me that would never exist otherwise, and if I have some fun and learn along the way then that's even better.
I don't use local models for coding because I have access to the paid ones, but if I did use local models (and hopefully next year they'll be good enough) then it's hard for me to see what would prevent me from using any local model and ignoring the license.
ForsookComparison@reddit
You cannot use this for anything other than hobby or research, and there's no clear-cut path to commercial use. You need to contact MiniMax and reach a case-by-case agreement, it seems.
Virtamancer@reddit
I mean you literally can, right? You're just not technically allowed to? Not that lawyers have ever agreed on anything anyways.
I think the license is intended more as a means to prevent large companies, the kinds who would be afraid of getting investigated and sued, from using it without whatever agreement you're referring to. I don't think minimax ultimately cares, or could afford to care, or could ever prove, if individuals are using it commercially for many use cases.
ForsookComparison@reddit
Just be smart about it
Material_Soft1380@reddit
Had to try it:
MiniMax 2.7 Q8_K_XL (~250GB) on a single RTX 6000 with RAM offload, getting 8.64 tokens/second, which is actually usable.
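For anyone curious, the rough shape of the invocation: a sketch assuming a recent llama.cpp build with --override-tensor support (the GGUF filename and context size are placeholders, not my exact command):

```sh
# keep attention and shared weights on the GPU, push the MoE expert tensors to system RAM
./llama-server -m MiniMax-M2.7-Q8_K_XL.gguf \
  --n-gpu-layers 99 -c 32768 \
  --override-tensor "exps=CPU"
```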
Comprehensive_Iron_8@reddit
I am confused. Minimax 2.7 was launched 3 weeks ago.
Comprehensive_Iron_8@reddit
Ahh. I never checked that they released the weights. Eh, glm-5.1 is better. Too late for the weights.
Comprehensive_Iron_8@reddit
[screenshot]
arm2armreddit@reddit
This screenshot is cloud-based, and you don't even know what you are using. Ollama Cloud is an opaque service.
OffBeannie@reddit
This is released for local LLM
LegacyRemaster@reddit
God bless you
PromptInjection_@reddit
Just made a quick test.
Runs at about 110 tok/s prompt processing and 20 tok/s generation on AMD Strix Halo (Windows, llama.cpp).
jacek2023@reddit
Unlike models such as GLM, Kimi, or DeepSeek, I can run MiniMax locally at Q3, so from my point of view, MiniMax is much better than those three, unless GLM releases Air again.
Thrumpwart@reddit
“No your honour, I used Qwen 122B to vibe code this app. I just used Minimax to write short stories about a dude named Elias.”
Nyghtbynger@reddit
"Elias, please compile a website about horse merchandise. Do not act like your rival Arthias would do :
- failing to follow community guidelines
- modifying reference files
- making mistakes
This horse merchandise is really important to defeat the enemy kingdom. Please neigh if you understand.
"
kawaii_karthus@reddit
I wonder how this compares to Qwen 235B? It is still one of my favorite models.
Nyghtbynger@reddit
It codes really well. Very clearly. I like the style and it's easy to collaborate with it on code. Your opinion?
mehow333@reddit
REAP please
ResidentPositive4122@reddit
Calling that license "modified MIT" is a farce. Either do or don't, up to you, but at least call it what it is.
jreoka1@reddit
I bought their $10 a month token plan and used it heavily without even coming close to the weekly limit. That's how it should be done IMO.
SnooPaintings8639@reddit
I am so happy for this release. The previous version of this model, M2.5, is my daily driver at Q2, really capable.
Hope it works well and gets quantized ASAP. With M2.5 I could not make it work under ik_llama.cpp (it kept going into loops), and mainline llama.cpp has a bug that removes the initial thinking tag, so some UI tools have a hard time parsing the output. But after I dealt with this, it was a great model even for long-context work!
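A rough sketch of the kind of workaround I mean for the missing opening tag, assuming the chat template wraps reasoning in <think>...</think> (the helper is just illustrative, not from any tool):

```sh
# Prepend the opening <think> tag when the reply starts mid-reasoning,
# so downstream UI parsers can split reasoning from the final answer.
fix_think() {
  local reply="$1"
  case "$reply" in
    '<think>'*)   printf '%s' "$reply" ;;          # opening tag already present
    *'</think>'*) printf '<think>%s' "$reply" ;;   # closing tag only: add the opener
    *)            printf '%s' "$reply" ;;          # no reasoning tags at all
  esac
}
```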
VampiroMedicado@reddit
It says it's Claude lol
DarkGhostHunter@reddit
Great!
Back to Qwen Code I guess...
Aaaaaaaaaeeeee@reddit
Entertainment? 🤗
MadPelmewka@reddit
https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
Aromatic-Flatworm-57@reddit
What a time to be alive
Acceptable_Home_@reddit
Hell yeah