Minimax 2.7: Today marks 14 days since the post on X and 12 since the open-weight page on Hugging Face
Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 80 comments
I think it would make a nice Easter egg to release today!
LegacyRemaster@reddit (OP)
good news:
[image]
Awwtifishal@reddit
It's not just a weekend; there are holidays in China from the 1st until today.
lionellee77@reddit
There was no holiday in China on April 1-3. There will be a holiday on Monday (Qingming Festival), April 6 this year.
Awwtifishal@reddit
It falls on the 5th this year, not the 6th. As for April 1 to 3, it seems you're right about official holidays, but I mentioned those days because at least one Chinese company I know is taking those days off, in connection with this holiday.
lionellee77@reddit
Monday is an official day off to observe the holiday. So don't expect too much. I'm sure they work at least 996 and need to take a break.
Awwtifishal@reddit
That's what I find odd. The Chinese company I was talking about just processed the shipment today, Monday.
sammoga123@reddit
Nobody launches things on weekends, and whoever did last year ended up having the worst model of 2025 (Llama 4 was presented on a Saturday).
lolwutdo@reddit
Anyone know or have experience with whether M2.7 is better than Qwen 3.5 397B?
o0genesis0o@reddit
I'm using the M2.7 coding plan with Claude Code. It feels more or less the same as the Qwen cloud model in Qwen Code (likely a beefed-up version of the 397B?). Maybe a little bit better at longer context than the Qwen cloud model. It doesn't feel super intelligent, but it gets the job done with some guidance. If I could run this at the same speed at home, I'd be very happy.
I'm on the cheapest plan, btw. Surprisingly generous quota.
suicidaleggroll@reddit
For coding, I’ve found MiniMax M2.5 is already better than Qwen3.5-397B, so M2.7 should be no competition. That’s not to say Q397 is bad, it’s still good, but MiniMax is better in all of my tests.
Vicar_of_Wibbly@reddit
Yeah the speed and concurrency of Qwen3.5 397B is hard to match. I can run 2x concurrency @ 200k tokens @ 105 tok/sec with MiniMax-M2.5 FP8 or I can run 30x concurrency @ 256k tokens @ 160 tok/sec with Qwen3.5 397B A17B NVFP4.
Qwen gets more use.
lolwutdo@reddit
Yeah, that's one of the harder things to stomach going back to MiniMax; the KV cache is so much more efficient on Qwen that I can run it at max context. I'm at like 50k context using MiniMax with fp16 KV.
laterbreh@reddit
I've found zero difference between an fp16 cache and fp8.
Skyline34rGt@reddit
Depends what for.
At the arena people blind-pick models; here's the leaderboard with different categories (check what's important to you): https://arena.ai/leaderboard/text
Or better yet, compare the 2 models you want directly there with your own prompts: New chat -> Battle mode -> side by side, and pick the 2 models you want to try.
nuclearbananana@reddit
I assume on Monday?
This recent trend of open labs delaying release is concerning.
LegacyRemaster@reddit (OP)
Yes. GLM will be... MiniMax will be... Qwen will be... it will happen, but the music has changed.
pmttyji@reddit
Possibly due to DeepseekV4? Totally uncertain
power97992@reddit
I hope it's better than Opus 4.6 and almost as good as Mythos, but I doubt it will be almost as good as Mythos... maybe slightly worse than Opus 4.6.
5dtriangles201376@reddit
I've never heard of mythos, sounds interesting
pmttyji@reddit
Rumor is Mythos is actually a multi-trillion parameter model, up to 10T.
power97992@reddit
It will be 5-10x more expensive than Opus 4.6.
pier4r@reddit
They don't owe us anything. Whether they do it in one month, one year, or a decade, it's still fine.
RedParaglider@reddit
It's fine. Let them work out bugs and get the early adopter rush. They need money to survive.
Django_McFly@reddit
We're talking less than a month. Some of the shortest tech delays in history. Worries seem overblown.
nuclearbananana@reddit
Given the rate at which AI models drop that's like a year
-dysangel-@reddit
As long as they're using that time to get their llama.cpp/mlx/whatever branches in order then it's a good use of time I think. It's a bit silly how many times I've downloaded the same Gemma 4 models recently.
Oh wait.. maybe this is a strat to increase their download stats? Hmm
Zc5Gwu@reddit
It's there way of recouping some of their investment. I don't blame them, since training these things is damn expensive and difficult. They're giving us something awesome, for free.
deejeycris@reddit
I think they should just say you can host it commercially but you have to pay a fee; then they can self-host and have an edge. That'd be the easy, open, evil-free way.
docbaily@reddit
"their way"
LegacyRemaster@reddit (OP)
Absolutely. We're not blaming anyone; on the contrary. This type of post is meant to show how much we appreciate and patiently await each release.
dzhopa@reddit
That's hitting me as peak entitlement at the moment, not sure why. These are fellow humans that frankly don't owe any of us anything because we don't pay their salaries (and even if we did, idk). It's best effort, and we should be thankful for that at all times.
Sorry rant over.
No-Refrigerator-1672@reddit
Personally, I acknowledge that they have no obligations to release the model; but I don't like it when companies keep people in suspense. I'd much prefer the approach of "Here's model X, available via API today, weights will be released on date Y three months later" or "Here's a new model, it'll be closed forever", rather than "Maybe some day we'll release it, maybe we won't" like Qwen did with their last 2 models (3.5 Omni and 3.6).
Keep-Darwin-Going@reddit
Well, it's the weekend now, can everyone just chill? It's like going to an open-source project and screaming about SLAs and how they're not sticking to a release schedule.
Vicar_of_Wibbly@reddit
But the two are not mutually exclusive. We can be concerned _and_ thankful without any contradiction.
dzhopa@reddit
Fair!
I end up letting myself down with concern often enough that I just shut it down off tops. Appreciate a different perspective.
relmny@reddit
Why is it concerning?
I hope they release it when they think it's ready to be released, and not because people "demand" it.
jacek2023@reddit
People can't run Gemma 31B on their setups because it's "too slow", but they want to use MiniMax "locally".
LegacyRemaster@reddit (OP)
sorry
Monad_Maya@reddit
What's the prefill and decode speed like with M2.5 completely in VRAM?
LegacyRemaster@reddit (OP)
load_tensors: offloaded 63/63 layers to GPU
G:\gpt\unsloth\MiniMax-M2.5-GGUF\MiniMax-M2.5-UD-Q4_K_XL-00001-of-00004.gguf
load_tensors: Vulkan0 model buffer size = 84252.02 MiB
load_tensors: Vulkan1 model buffer size = 40648.67 MiB
load_tensors: Vulkan_Host model buffer size = 329.70 MiB
Only 2 video cards needed for Q4_K_XL. Prefill: realtime.
Monad_Maya@reddit
Thanks, I wish the R9700 had higher memory bandwidth and capacity. It's not the successor to the W7800 (ignoring the price difference for now).
4x of those would've been quite nice.
LegacyRemaster@reddit (OP)
So I paid €1420 + VAT for 1 W7800 48GB. Good price, trust me. Example: on Gemma 31B, an RTX 6000 does 35 t/sec; the W7800 on Vulkan does 30 t/sec. A lot cheaper. I'm using Blackwell for training. Inference only? Better 4x W7800.
Monad_Maya@reddit
Good deal, I think I remember your post from back in the day when you acquired them.
Have fun!
suicidaleggroll@reddit
Dual RTX Pro 6000 here. I run M2.5 in Q5 with 128k context, so it can’t run fully in VRAM, but it’s close, just a couple layers offloaded to the CPU. It hits 1100/75 pp/tg
LegacyRemaster@reddit (OP)
To be honest I capped the context to ~70k. Reason? With VSCode + Kilocode it's better to limit the context to avoid errors. MiniMax 2.5 isn't so good at long context. So yeah... coding --> condense --> go ahead.
m0nsky@reddit
I'm jealous of your setup! I bought the first card, built my server, and set money aside for the second one, but then prices increased like crazy.
I've noticed that `-ub 4096 -b 4096` really improved my prompt processing speeds. This is what I'm getting on a single RTX Pro 6000 (Q4 with 128k context):
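For reference, a minimal llama-server invocation with those flags might look something like this (model path and context size here are just placeholder values, not my exact setup):

llama-server -m MiniMax-M2.5-UD-Q4_K_XL-00001-of-00004.gguf -c 131072 -ngl 99 -ub 4096 -b 4096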
Have you also tried this "frankenquant" with vLLM? It's probably the first thing I'm going to try when I get the second card. From the model card:
suicidaleggroll@reddit
I haven't, I'm just using the vanilla llama.cpp. I did try ik_llama.cpp, and it improved prompt processing speeds significantly, but had a lot of trouble with tool calling. Not a single model I tried in ik was able to write a JSON file in opencode, for example, while all work fine in the normal llama.cpp.
I tried vLLM as well, and while the speeds were great, the initialization time was a dealbreaker for me. It's just me using this system, and I only use one model at a time, so I tend to use the biggest model that will fit. This means when I change tasks, the system needs to unload and reload a new model. With llama.cpp, that swap takes seconds, with vLLM it's 5+ minutes, which just doesn't work for me unfortunately. If the vLLM devs can get the startup time down to something more reasonable, I'd be happy to switch over, but for now it's llama.cpp for me.
Monad_Maya@reddit
That's pretty great, thanks for sharing!
a_beautiful_rhind@reddit
For Mac users I can see this being the case. They have lots of fast RAM and no compute.
For folks with "regular" RAM, the viability of such models in terms of reasoning and agentic coding is a bit suspect.
Just like the turboquant stuff though, no sense in fighting it. If it's popular it means it's true, even when you can prove it wrong.
lolwutdo@reddit
If you have the RAM, it's literally faster than Gemma 31B or Qwen 27B; even Qwen 397B is faster.
Snoo_64233@reddit
I don't believe it
suicidaleggroll@reddit
Then you don’t understand how MoE models work
KURD_1_STAN@reddit
LLMs need to process all of the active parameters for every token. Dense models like Qwen 27B and Gemma 31B are all active parameters, but MoE models are things like 80B-A3B or 26B-A4B... meaning only 3B or 4B need to be processed for each token. This one is around 250B (A10B), so it will be 2-2.5 times faster than a 31B dense model if you have enough RAM to fit the other 250B.
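Rough back-of-the-envelope version of that, assuming decode speed is purely memory-bandwidth-bound (the bandwidth and bytes-per-param numbers below are illustrative assumptions, not measurements):

# Toy estimate: tokens/sec ~ memory bandwidth / bytes of active weights read per token
def est_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_param: float = 0.6) -> float:
    # ~0.6 bytes/param roughly corresponds to a Q4_K-style quant
    return bandwidth_gbs / (active_params_b * bytes_per_param)

print(est_tps(80, 31))  # dense 31B on ~80 GB/s RAM: ~4.3 t/s
print(est_tps(80, 10))  # 250B (A10B) on the same RAM: ~13.3 t/s, ~3x in theory

Attention, KV cache and expert-routing overhead pull the real ratio down toward the 2-2.5x figure above.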
jacek2023@reddit
Yes, it's better to pretend people run MiniMax in RAM than to admit that they just want cloud access.
llama-impersonator@reddit
Why do you always bark up this tree when there are like 8 of us 192GB RAM users!?
jacek2023@reddit
I use Minimax locally. Most of them can't
Spectrum1523@reddit
It's like there is more than one person posting here.
Monad_Maya@reddit
I do run MiniMax M2.5 locally (128GB DDR4 + 20GB VRAM), IQ4_XS quant from AesSedAI.
It's a pretty good model. 5-7 tokens per second on my hardware, which is usable for me.
If I had more RAM, or better yet VRAM, I would've gone for Qwen 397B.
GroundbreakingMall54@reddit
the minimax hype cycle is getting kinda ridiculous at this point. announce on twitter, post weights on hf "coming soon", then radio silence for 2 weeks. at least when meta drops something they just drop it
ExactSeaworthiness34@reddit
When's the last time Meta dropped something LLM related that's relevant?
Pink_da_Web@reddit
Llama 3.1?
TechnoByte_@reddit
Llama 3.3 70B
Pink_da_Web@reddit
Too
huffalump1@reddit
Ehhhh yeah, good point. Llama 4 was already overshadowed by Chinese models on release, and absolutely left in the dust after...
Meta Labs continues to pump out excellent work but not in the "Llama mainline open LLM" world, it seems.
MMAgeezer@reddit
Laughs in Llama 4 Behemoth (which never actually shipped).
Living_Director_1454@reddit
Trash is meant to be dropped.
Designer_Reaction551@reddit
14 days is a long time in this space. The open-weight release delay pattern is frustrating but understandable - labs need some commercial runway before full weights drop. The real question is whether the API version performance matches the hype. Anyone actually benchmarked Minimax 2.7 against the current local leaders?
LegacyRemaster@reddit (OP)
More than anything: after 2 weeks the model is already "old".
electroncarl123@reddit
Why bother? Until it's open weights it doesn't exist to me... We already have closed weights stuff from OpenAI and Anthropic
relmny@reddit
Thanks for the "open weight" in the title, as opposed to the "open source" that has, wrongly, taken over the actual meaning.
I guess it's a lost battle, and they're far from being the only ones (Qwen does it too, sometimes)... on the other hand, most people have no idea what "open source" means, let alone "open weight", so maybe "open source" is not that bad (if most people start to get the idea)... I still prefer "open weight" though.
-dysangel-@reddit
I agree, but you learn to accept after a few years that as a society people just tend to get things wrong, and the language adapts around it. Things like "I could care less" still piss me off, though actually now that I say it, I haven't seen that one misused for a while now. Maybe sometimes we can turn the ship around.
IrisColt@reddit
Two more weeks.
ridablellama@reddit
lol it’s a weekend. it’s sunday. this post is unhinged. touch grass
Sovchen@reddit
Shut the fuck up you're all a bunch of miserable atheists in this space don't fucking talk about Easter you degenerate
Altruistic_Heat_9531@reddit
And tomorrow, and maybe overmorrow in my timezone (UTC+7), when GLM releases its 5.1.
twack3r@reddit
I'm stealing "overmorrow", love it!
-p-e-w-@reddit
It’s actually not a made-up word but an archaic English term that you can commonly find in 18th-century literature and still occasionally in the 19th century. It’s extremely rare today though.
twack3r@reddit
I know because I’m German and that’s the root. In German, the day after tomorrow is Übermorgen, so literally overmorrow. I have just never seen or heard it used in modern English.
RickyRickC137@reddit
I think Glm 4.6 air will be released in 2 weeks.