Minimax 2.7: Today marks 14 days since the post on X and 12 since the open-weight page on Hugging Face
Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 80 comments
I think it would make a nice Easter egg to release today!
LegacyRemaster@reddit (OP)
good news:
[image]
Awwtifishal@reddit
It's not just a weekend; there are holidays in China from the 1st until today.
lionellee77@reddit
There was no holiday in China on April 1-3. There will be a holiday on Monday (Qingming Festival), April 6 this year.
Awwtifishal@reddit
It falls on the 5th this year, not the 6th. As for April 1 to 3, it seems you're right about official holidays, but I mentioned those days because at least one Chinese company I know is taking those days off, in connection with this holiday.
lionellee77@reddit
Monday is an official day off to observe the holiday. So don't expect too much. I'm sure they work at least 996 and need to take a break.
Awwtifishal@reddit
That's what I find odd. The Chinese company I was talking about just processed the shipment today, Monday.
sammoga123@reddit
Nobody launches things on weekends, and whoever did last year ended up having the worst model of 2025 (Llama 4 was presented on a Saturday).
lolwutdo@reddit
Anyone know or have experience with whether M2.7 is better than Qwen 3.5 397B?
o0genesis0o@reddit
I'm using the M2.7 coding plan with Claude Code. It feels more or less the same as the Qwen cloud model in Qwen Code (likely a beefed-up version of the 397B?). Maybe a little bit better at longer context than the Qwen cloud model. It doesn't feel super intelligent, but it gets the job done with some guidance. If I could run this at the same speed at home, I'd be very happy.
I'm on the cheapest plan, btw. Surprisingly generous quota.
suicidaleggroll@reddit
For coding, I’ve found MiniMax M2.5 is already better than Qwen3.5-397B, so M2.7 should be no competition. That’s not to say Q397 is bad, it’s still good, but MiniMax is better in all of my tests.
Vicar_of_Wibbly@reddit
Yeah the speed and concurrency of Qwen3.5 397B is hard to match. I can run 2x concurrency @ 200k tokens @ 105 tok/sec with MiniMax-M2.5 FP8 or I can run 30x concurrency @ 256k tokens @ 160 tok/sec with Qwen3.5 397B A17B NVFP4.
Qwen gets more use.
lolwutdo@reddit
Yeah, that's one of the harder things to stomach going back to MiniMax; the KV cache is so much more efficient on Qwen that I can run it at max context. I'm at like 50k context using MiniMax with fp16 KV.
laterbreh@reddit
I've found zero difference between an fp16 cache and fp8.
Skyline34rGt@reddit
Depends what for.
At the arena people blind-pick models; here's the leaderboard with different categories (check what's important to you): https://arena.ai/leaderboard/text
Or better yet, compare the 2 models you want directly there with your own prompts: New chat -> Battle mode -> side by side, and pick the 2 models you want to try.
nuclearbananana@reddit
I assume on Monday?
This recent trend of open labs delaying release is concerning.
LegacyRemaster@reddit (OP)
Yes. GLM will be... MiniMax will be... Qwen will be... it will happen, but the music has changed.
pmttyji@reddit
Possibly due to DeepseekV4? Totally uncertain
power97992@reddit
I hope it's better than Opus 4.6 and almost as good as Mythos, but I doubt it will be almost as good as Mythos... maybe slightly worse than Opus 4.6.
5dtriangles201376@reddit
I've never heard of mythos, sounds interesting
pmttyji@reddit
Rumor is Mythos is actually a multi-trillion parameter model, up to 10T.
power97992@reddit
It will be 5-10x more expensive than Opus 4.6.
pier4r@reddit
They don't owe us anything. Whether they do it in one month, one year, or a decade, it's still fine.
RedParaglider@reddit
It's fine. Let them work out bugs and get the early adopter rush. They need money to survive.
Django_McFly@reddit
We're talking less than a month. Some of the shortest tech delays in history. Worries seem overblown.
nuclearbananana@reddit
Given the rate at which AI models drop that's like a year
-dysangel-@reddit
As long as they're using that time to get their llama.cpp/mlx/whatever branches in order then it's a good use of time I think. It's a bit silly how many times I've downloaded the same Gemma 4 models recently.
Oh wait.. maybe this is a strat to increase their download stats? Hmm
Zc5Gwu@reddit
It's there way of recouping some of their investment. I don't blame them, since training these things is damn expensive and difficult. They're giving us something awesome, for free.
deejeycris@reddit
I think they should just say you can host it commercially but you have to pay a fee; then they can self-host and have an edge. That'd be the easy, open, evil-free way.
docbaily@reddit
"their way"
LegacyRemaster@reddit (OP)
Absolutely. We're not blaming anyone; on the contrary. This type of post is meant to show how much we appreciate and patiently await each release.
dzhopa@reddit
That's hitting me as peak entitlement at the moment, not sure why. These are fellow humans that frankly don't owe any of us anything because we don't pay their salaries (and even if we did, idk). It's best effort, and we should be thankful for that at all times.
Sorry rant over.
No-Refrigerator-1672@reddit
Personally, I acknowledge that they have no obligations to release the model; but I don't like it when companies keep people in suspense. I'd much prefer the approach of "Here's model X, available via API today, weights will be released on date Y three months later" or "Here's a new model, it'll be closed forever", rather than "Maybe some day we'll release it, maybe we won't" like Qwen did with their last 2 models (3.5 Omni and 3.6).
Keep-Darwin-Going@reddit
Well, it's the weekend now, can everyone just chill? It's like going to an open-source project and screaming about SLAs and how they're not sticking to a release schedule.
Vicar_of_Wibbly@reddit
But the two are not mutually exclusive. We can be concerned _and_ thankful without any contradiction.
dzhopa@reddit
Fair!
I end up letting myself down with concern often enough that I just shut it down off tops. Appreciate a different perspective.
relmny@reddit
Why is it concerning?
I hope they release it when they think it's ready to be released, and not because people "demand" it.
jacek2023@reddit
People can't run Gemma 31B on their setups because it's "too slow", but they want to use MiniMax "locally".
LegacyRemaster@reddit (OP)
sorry
Monad_Maya@reddit
What's the prefill and decode speed like with M2.5 completely in VRAM?
LegacyRemaster@reddit (OP)
load_tensors: offloaded 63/63 layers to GPU
G:\gpt\unsloth\MiniMax-M2.5-GGUF\MiniMax-M2.5-UD-Q4_K_XL-00001-of-00004.gguf
load_tensors: Vulkan0 model buffer size = 84252.02 MiB
load_tensors: Vulkan1 model buffer size = 40648.67 MiB
load_tensors: Vulkan_Host model buffer size = 329.70 MiB
Only 2 video cards needed for Q4_K_XL. Prefill: realtime.
Monad_Maya@reddit
Thanks, I wish the R9700 had higher memory bandwidth and capacity. It's not the successor to the W7800 (ignoring the price difference for now).
4x of those would've been quite nice.
LegacyRemaster@reddit (OP)
So I paid €1420 + VAT for 1 W7800 48GB. Good price, trust me. Example: on Gemma 31B, an RTX 6000 does 35 t/sec; the W7800 on Vulkan does 30 t/sec. A lot cheaper. I'm using Blackwell for training. Inference only? Better 4x W7800.
Monad_Maya@reddit
Good deal, I think I remember your post from back in the day when you acquired them.
Have fun!
suicidaleggroll@reddit
Dual RTX Pro 6000 here. I run M2.5 in Q5 with 128k context, so it can’t run fully in VRAM, but it’s close, just a couple layers offloaded to the CPU. It hits 1100/75 pp/tg
LegacyRemaster@reddit (OP)
To be honest I capped the context to ~70k. Reason? With VSCode + Kilocode it's better to limit the context to avoid errors. MiniMax 2.5 isn't so good at long context. So yeah... coding --> condense --> go ahead.
m0nsky@reddit
I'm jealous of your setup! I bought the first card, built my server, and set money aside for the second one, but then prices increased like crazy.
I've noticed that `-ub 4096 -b 4096` really improved my prompt processing speeds. This is what I'm getting on a single RTX Pro 6000 (Q4 with 128k context):
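For reference, a minimal llama-server invocation with those flags might look something like this (model path and context size here are just placeholder values, not my exact setup):

llama-server -m MiniMax-M2.5-UD-Q4_K_XL-00001-of-00004.gguf -c 131072 -ngl 99 -ub 4096 -b 4096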
Have you also tried this "frankenquant" with vLLM? It's probably the first thing I'm going to try when I get the second card. From the model card:
suicidaleggroll@reddit
I haven't, I'm just using the vanilla llama.cpp. I did try ik_llama.cpp, and it improved prompt processing speeds significantly, but had a lot of trouble with tool calling. Not a single model I tried in ik was able to write a JSON file in opencode, for example, while all work fine in the normal llama.cpp.
I tried vLLM as well, and while the speeds were great, the initialization time was a dealbreaker for me. It's just me using this system, and I only use one model at a time, so I tend to use the biggest model that will fit. This means when I change tasks, the system needs to unload and reload a new model. With llama.cpp, that swap takes seconds, with vLLM it's 5+ minutes, which just doesn't work for me unfortunately. If the vLLM devs can get the startup time down to something more reasonable, I'd be happy to switch over, but for now it's llama.cpp for me.
Monad_Maya@reddit
That's pretty great, thanks for sharing!
a_beautiful_rhind@reddit
For Mac users I can see this being the case. They have lots of fast RAM and no compute.
For folks with "regular" RAM, the viability of such models in terms of reasoning and agentic coding is a bit suspect.
Just like the turboquant stuff though, no sense in fighting it. If it's popular it means it's true, even when you can prove it wrong.
lolwutdo@reddit
If you have the RAM, it's literally faster than Gemma 31B or Qwen 27B; even Qwen 397B is faster.
Snoo_64233@reddit
I don't believe it
suicidaleggroll@reddit
Then you don’t understand how MoE models work
KURD_1_STAN@reddit
LLMs need to process all of the active parameters for every token. Dense models like Qwen 27B and Gemma 31B are all active parameters, but MoE models are things like 80B-A3B or 26B-A4B... meaning only 3B or 4B need to be processed for each token. This one is around 250B (A10B), so it will be 2-2.5 times faster than a 31B dense model if you have enough RAM to fit the other 250B.
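Rough back-of-the-envelope version of that, assuming decode speed is purely memory-bandwidth-bound (the bandwidth and bytes-per-param numbers below are illustrative assumptions, not measurements):

# Toy estimate: tokens/sec ~ memory bandwidth / bytes of active weights read per token
def est_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_param: float = 0.6) -> float:
    # ~0.6 bytes/param roughly corresponds to a Q4_K-style quant
    return bandwidth_gbs / (active_params_b * bytes_per_param)

print(est_tps(80, 31))  # dense 31B on ~80 GB/s RAM: ~4.3 t/s
print(est_tps(80, 10))  # 250B (A10B) on the same RAM: ~13.3 t/s, ~3x in theory

Attention, KV cache and expert-routing overhead pull the real ratio down toward the 2-2.5x figure above.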
jacek2023@reddit
Yes, it's better to pretend people run MiniMax in RAM than to admit that they just want cloud access.
llama-impersonator@reddit
Why do you always bark up this tree when there are like 8 of us 192GB RAM users!?
jacek2023@reddit
I use Minimax locally. Most of them can't
Spectrum1523@reddit
It's like there is more than one person posting here.
Monad_Maya@reddit
I do run MiniMax M2.5 locally (128GB DDR4 + 20GB VRAM), IQ4_XS quant from AesSedAI.
It's a pretty good model. 5-7 tokens per second on my hardware, which is usable for me.
If I had more RAM, or better yet VRAM, I would've gone for Qwen 397B.
GroundbreakingMall54@reddit
the minimax hype cycle is getting kinda ridiculous at this point. announce on twitter, post weights on hf "coming soon", then radio silence for 2 weeks. at least when meta drops something they just drop it
ExactSeaworthiness34@reddit
When's the last time Meta dropped something LLM related that's relevant?
Pink_da_Web@reddit
Llama 3.1?
TechnoByte_@reddit
Llama 3.3 70B
Pink_da_Web@reddit
Too
huffalump1@reddit
Ehhhh yeah, good point. Llama 4 was already overshadowed by Chinese models on release, and absolutely left in the dust after...
Meta Labs continues to pump out excellent work but not in the "Llama mainline open LLM" world, it seems.
MMAgeezer@reddit
Laughs in Llama 4 Behemoth (which never actually shipped).
Living_Director_1454@reddit
Trash is meant to be dropped.
Designer_Reaction551@reddit
14 days is a long time in this space. The open-weight release delay pattern is frustrating but understandable - labs need some commercial runway before full weights drop. The real question is whether the API version performance matches the hype. Anyone actually benchmarked Minimax 2.7 against the current local leaders?
LegacyRemaster@reddit (OP)
More than anything: after 2 weeks the model is already "old".
electroncarl123@reddit
Why bother? Until it's open weights it doesn't exist to me... We already have closed weights stuff from OpenAI and Anthropic
relmny@reddit
Thanks for the "open weight" in the title, as opposed to the "open source" that has, wrongly, taken over the actual meaning.
I guess it's a lost battle, and they're far from being the only ones (Qwen does it too, sometimes)... on the other hand, most people have no idea what "open source" means, let alone "open weight", so maybe "open source" is not that bad (if most people start to get the idea)... I still prefer "open weight" though.
-dysangel-@reddit
I agree, but you learn to accept after a few years that as a society people just tend to get things wrong, and the language adapts around it. Things like "I could care less" still piss me off, though actually now that I say it, I haven't seen that one misused for a while now. Maybe sometimes we can turn the ship around.
IrisColt@reddit
Two more weeks.
ridablellama@reddit
lol it’s a weekend. it’s sunday. this post is unhinged. touch grass
Sovchen@reddit
Shut the fuck up you're all a bunch of miserable atheists in this space don't fucking talk about Easter you degenerate
Altruistic_Heat_9531@reddit
And tomorrow, and maybe overmorrow in my timezone (UTC+7), when GLM releases its 5.1.
twack3r@reddit
I'm stealing "overmorrow", love it!
-p-e-w-@reddit
It’s actually not a made-up word but an archaic English term that you can commonly find in 18th-century literature and still occasionally in the 19th century. It’s extremely rare today though.
twack3r@reddit
I know because I’m German and that’s the root. In German, the day after tomorrow is Übermorgen, so literally overmorrow. I have just never seen or heard it used in modern English.
RickyRickC137@reddit
I think Glm 4.6 air will be released in 2 weeks.