Those of you running minimax 2.7 locally, how are you feeling about it?
Posted by laterbreh@reddit | LocalLLaMA | View on Reddit | 120 comments
I'm running the raw version straight from the MiniMax release on Hugging Face (https://huggingface.co/MiniMaxAI/MiniMax-M2.7) on 3 RTX Pro 6000's on vLLM. So no quantization. And I'm not going to lie, something feels off about it.
Same workloads in our coding environment, including our re-usable evals on problem solving in our codebase, and it's very inconsistent. Our humans are scoring its output lower than 2.5 on some tasks.
It's also not uncommon for it to make a spelling error or miss putting a space between tokens: for example, `const variable = something` will instead come out as `constvariable =something`, and it then has to go back and fix it.
Anyone else experiencing any weirdness with the model? I've redownloaded straight from the HF repo twice and it's the same results.
Sampling params:
--override-generation-config '{
"temperature": 1.0,
"top_p": 0.95,
"top_k": 40,
"repetition_penalty": 1.15,
"max_tokens": 16384
}'
funding__secured@reddit
I'm running it at full precision (FP8) on 4x RTX 6000 Pros on top of vllm. It's amazing. No complaints from my side.
somerussianbear@reddit
That’s 50 grand in hardware. Respect.
AfterShock@reddit
I want to believe him too
funding__secured@reddit
a_beautiful_rhind@reddit
They idle better than 3090s, even with a model loaded.
DeltaSqueezer@reddit
a_beautiful_rhind@reddit
Yea, I got 2 like that and 2 that don't go below 13w then jump to 20-30w as soon as I touch them. Add in p2p with rebar and it's over.
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.45.04              Driver Version: 595.45.04      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:1A:00.0 Off |                  N/A |
|  0%   31C    P8             15W /  275W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:1B:00.0 Off |                  N/A |
|  0%   30C    P8              6W /  275W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:3D:00.0 Off |                  N/A |
|  0%   25C    P8             11W /  275W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:3E:00.0 Off |                  N/A |
|  0%   27C    P8             13W /  275W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:B2:00.0 Off |                  N/A |
| 22%   26C    P8              2W /  250W |       1MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
DeltaSqueezer@reddit
I found a trick. There's a specific point during vLLM startup where the power jump happens. I first load a small model with llama.cpp to occupy a small amount of VRAM, and this sometimes prevents the jump; once vLLM crosses that boundary successfully, I terminate the small app so vLLM can continue to load. It's a bit tricky, but once it starts up with low idle power it stays that way. I've had it like this for months now. Depending on which version of vLLM you have, the timing can be trickier or not 100% repeatable, but a couple of attempts and it works, and then I leave it. For me, it is worth the slight hassle for lower power.
W2D2020@reddit
Oh we are allowed to post porn now?
No_Mango7658@reddit
That’s hot
funding__secured@reddit
It really warms up the office.
No_Mango7658@reddit
LatentSpacer@reddit
“LLM, generate the output of nvidia-smi on 4x RTX 6000 Pro”
Just kidding, lucky you! As a GPU lower middle class, I’m just envious of the GPU rich.
funding__secured@reddit
You mean financially irresponsible 😄
LegacyRemaster@reddit
brutal
JoshuaLandy@reddit
r/usernamechecksout
ortegaalfredo@reddit
I want to live that dream too, Mr. Pool
GreenGreasyGreasels@reddit
Why won't you believe him ? He has funding secured for so many vram'ses.
funding__secured@reddit
I haven't even mentioned my GH200 or my 8 Sparks 😄
funding__secured@reddit
```$ nvidia-smi
Fri Apr 17 00:12:31 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... On | 00000000:01:00.0 Off | Off |
| 31% 57C P8 18W / 250W | 90772MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX PRO 6000 Blac... On | 00000000:02:00.0 Off | Off |
| 30% 39C P8 4W / 250W | 90772MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX PRO 6000 Blac... On | 00000000:C1:00.0 Off | Off |
| 30% 54C P8 10W / 250W | 90772MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX PRO 6000 Blac... On | 00000000:E1:00.0 Off | Off |
| 30% 50C P8 14W / 250W | 90772MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 715097 C VLLM::Worker_TP0_EP0 90762MiB |
| 1 N/A N/A 715098 C VLLM::Worker_TP1_EP1 90762MiB |
| 2 N/A N/A 715099 C VLLM::Worker_TP2_EP2 90762MiB |
| 3 N/A N/A 715100 C VLLM::Worker_TP3_EP3 90762MiB |
+-----------------------------------------------------------------------------------------+```
funding__secured@reddit
I could run it on the GH200, but it's running GLM-5.1 right now - IQ4_K for now.
Ok_Mammoth589@reddit
Maybe share some of your settings that are different from OP's settings to help him out...
funding__secured@reddit
Fair enough - I use docker.
CKtalon@reddit
Are you using any harness like Codex, OpenCode, Claude Code? They work for me for a few requests before I get some connection error and I can’t figure out what’s wrong. I suspect it’s some context length issue even though I’m at around 30+k tokens and the GPUs can handle 190+k tokens.
funding__secured@reddit
I’m using Opencode. I never had an issue like that one. Are you checking the logs on vllm? It might have crashed for some reason. It’s really reliable for me and I REALLY push this thing to the max.
CKtalon@reddit
If I start a new session with zero context, the harness will continue, then error out again some time later, with vLLM/sglang (I've tried both) still running and capable of providing responses via curl. That's why I suspect some context window issue, but the logs show it's around 30-40k tokens at the point the error happens.
funding__secured@reddit
That’s very weird. Are you using kv cache quantization?
CKtalon@reddit
Bf16
funding__secured@reddit
Same here.
fanhed@reddit
I also ran the FP8 M2.7 with the same configuration, but in my local test cases many strange problems occurred, such as infinite loops and sudden stops in output. The test cases did not pass, while M2.5 passed smoothly. Interestingly, when I switched to the AWQ quantized version, it worked normally.
funding__secured@reddit
That’s very weird. Maybe something got corrupted during model download? Well, glad it’s working with awq for you at least.
JumpyAbies@reddit
How many tokens/s can you achieve in FP8?
funding__secured@reddit
Varies a bit:
Individual_Spread132@reddit
Tried Q3KM, IQ4XS, Q4KM - varied GGUFs from different uploaders on HF (unsloth, bartowski, aessedai). All recently updated. Fresh llamacpp build. And guess what? I'm getting random Chinese words/characters popping up sometimes, usually no more than once in one message. It doesn't happen too often, but it's hard to ignore and it's super annoying.
As for the missing spaces and spelling errors - yes, I've noticed it too, but it's not as jarring...
SnooPaintings8639@reddit
I have minimax 2.7 plugged into my OpenClaw... from their official API service, and I'm also getting random Chinese characters. I think this is a quirk of this model.
Tim_tango_PT@reddit
I'm using minimax 2.7 hosted by minimax as my daily driver for Openclaw and I also get the random Chinese characters sometimes.
LegacyRemaster@reddit
same. random Chinese characters
Individual_Spread132@reddit
Oh, well... That clears it up quite a bit. Thank you.
I hope they'll address it in the forthcoming versions of MiniMax.
yoracale@reddit
What CUDA version are you using? That might be the problem
Individual_Spread132@reddit
Honestly, it's a bit of a mess in that regard.
So, I don't have CUDA toolkit installed - it's just game-ready drivers and .cmd command returns Version 13.2.
However, I tend to use LMStudio or KoboldCPP (which, unless I'm deeply mistaken, come with all the necessary CUDA stuff? At least I see "CUDA 12 llamacpp" in LMStudio). I should mention that no other models have issues. Gemma 4 - absolutely fine. Older models - fine too.
Maybe you're right. What should I even do at this point?
Firestorm1820@reddit
13.2 has known issues (gibberish, Chinese characters etc). Rebuild with a 12.9.x CUDA to resolve.
tnhnyc@reddit
I've also gotten random Chinese popping up sometimes with some of those quants. Another thing: with Aessedai's IQ4_XS, the reasoning seemed strange when I compared the other quants with the same prompts; it'd be a lot shorter than Unsloth's IQ3_XS or Nvidia NIM. I had GLM 5.1 look at the quant and this seemed to have "fixed" the reasoning.
Individual_Spread132@reddit
The real question is whether it's a quantization issue or the model itself... I mean, are there any quants that don't have that problem?
SeriousExplorer7479@reddit
What precision KV cache are you using?
laterbreh@reddit (OP)
FP8, but we have tried 16; similar results.
We are currently working on sampling changes and that seems to have made a bigger difference. Looks like this model is particular about its sampling parameters for the tasks we are assigning it.
tidoo420@reddit
what kind of insane hardware is that, wtf do you guys do lmao, pablo expensivebar
1ncehost@reddit
31 tok/s on my Ryzen 395 with 128k context. Using it with OpenCode, it feels like the best models from last summer. It's my favorite local model rn.
valtor2@reddit
really? so like, opus 4-ish?
muyuu@reddit
I was considering a setup like this. How much RAM is it taking?
1ncehost@reddit
92ish GB
Anarchaotic@reddit
I didn't even consider this as a viable model on my strix halo. Is the quantization not that massive of a quality hit?
1ncehost@reddit
It works in opencode pretty well as far as I'm concerned.
ReactionaryPlatypus@reddit
I am running it on Strix Halo and am quite happy with it but I am also having those same issues (I thought it was caused by quantization).
Does anyone have a System prompt to reduce random Chinese characters appearing?
BacklashLaRue@reddit
Can I ask, what is your load configuration on the strix halo. I have the 128 GB version.
ReactionaryPlatypus@reddit
I'll post a mini guide this week.
Zyj@reddit
Same here at Q6. But also with Qwen. The Chinese characters aren't random; they are fitting words for the occasion.
fanhed@reddit
My hardware is 4x RTX Pro 6000, deploying M2.7 FP8 with vLLM, used in OpenCode, and I'm encountering issues similar to yours: infinite loops, failing test cases, very long thinking, and it will even stop by itself.
fanhed@reddit
There are even stranger things: after I switched to the AWQ quantized version, the situation actually improved.
laterbreh@reddit (OP)
That's what I was considering trying, a quantized version... however, moving to a quant instead of the full version to resolve the problem doesn't sit well with me.
fanhed@reddit
I suspect it's a vllm compatibility issue, you can try switching to sglang instead.
Remper1997@reddit
If you are using the official one on Mac, you can now track your API usage with this simple app: https://github.com/Remper1997/MiniMaxUsage
Exciting-Engine882@reddit
I tried the Q8 XL quant with llama.cpp and also saw random typos in the coding tests I have done. Went back to Qwen 3.5, the big one. Too bad that it is having these issues. I will try a lower temperature; I've been using the default, which I think is 0.8.
cmndr_spanky@reddit
Temp at 1.0?!? lol
Also, max tokens at 16k is loco. You're not going to get any useful coding done with those settings.
What coding agent are you using ?
laterbreh@reddit (OP)
Opencode harnesses. Max tokens is a runaway safety; setting them lower doesn't impact code quality output for us.
Prior with M2.5, temp was 1.0 as well; this is what's recommended by MiniMax, so that's what we stuck with.
And to the contrary, we get plenty of useful coding done with those exact settings on 2.5 using autonomous agents with our Opencode harness.
Care to elaborate as to why/why not those settings?
jon23d@reddit
I use 1.0 too, though I need at least 30k tokens just to get started.
laterbreh@reddit (OP)
Our context window is 196,000 tokens.
cmndr_spanky@reddit
Are you running it from vLLM?
Just try temp of 0.6 repetition_penalty 1.0 max_tokens 32k and let me know, I'm very curious if it improves tool calling.
You're right, max_tokens isn't as important as context window size. However, the LLM may sometimes (for efficiency reasons) decide to trigger multiple stacked tool calls in a single response; tool schemas/calls can be token heavy, so although it's not likely, there's a chance it could run up against a 16k single-turn generation limit.
All that said... I think repetition_penalty and/or temperature could have the biggest impact.
I know it's not the same model, but I found repetition_penalty 1.0 caused Qwen 3.5 to fail at tool calling miserably, meanwhile 1.1 was great. (My advice is start at 1.0 and then try 1.1 next.)
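For intuition on why a high repetition_penalty can hurt code generation, here is a toy sketch of the standard HF/vLLM-style penalty transform (the function name and logit values are illustrative, not MiniMax's or vLLM's actual code). Code legitimately repeats tokens like `=`, spaces, and identifiers, and each repeat gets suppressed further:

```python
# Toy sketch of the HF/vLLM-style repetition_penalty transform (illustrative,
# not the libraries' actual code). Tokens that already appeared get their
# logit divided by the penalty if positive, multiplied if negative, so
# frequently repeated code tokens (spaces, '=', identifiers) get suppressed.
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    out = list(logits)
    for t in seen_token_ids:
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.0, 1.0, -0.5]  # toy 3-token vocab
# token 0 was already generated, so its logit shrinks from 2.0 to 2.0/1.15
penalized = apply_repetition_penalty(logits, seen_token_ids={0}, penalty=1.15)
```

With penalty=1.0 the transform is a no-op, which is why 1.0 is the safe baseline for code.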
samthepotatoeman@reddit
I run it on 2 RTX 6000s, and personally it's the first time a local model truly felt like it could let me break my over-reliance on cloud models. It reminds me a lot of Sonnet and I love it. Right now I use ChatGPT as the planner and have it write detailed prompts for MiniMax 2.7. I'm still figuring out the best settings, but I love the speed and abilities of this model.
Zyj@reddit
Did you use M2.5 previously?
samthepotatoeman@reddit
I apologize, I am still new to this. I did mess around with 2.5. It was good, but I could still just tell I was using a local model that isn't as smart. It was close, but I could tell it was going to hurt more things than help. 2.7 at the moment feels like a SOTA-ish open-weight model I can run well. Because I did like 2.5, I was on pins and needles for 2.7. As soon as it released I got it up and running and was very impressed. Still working on my first project with it, but very impressed so far.
Using llama.cpp as the backend and the Q4_K_M quant. Not sure if those are the best settings, but it was the best I could cobble together when it released.
a_beautiful_rhind@reddit
Did you try different sampling? .8-.9 temp, no topk/p no rep penalty? Use top-n-sigma and min_P instead. Rep penalty is kinda destructive.
Run one repeatable test, maybe with same seed and see how the outputs change. Try to isolate what's causing your issue by removing variables.
laterbreh@reddit (OP)
Appreciate the tip-- never looked at top-n-sigma or min_p you mind sharing your settings? Thankfully we have repeatable test cases for evaluations. Will likely run a script to tweak and tune a step on each sample and let it run over night and then parse the data in the morning!
a_beautiful_rhind@reddit
I literally did. Top-n I do 1.0 for coherence. Min_P usually 0.01-0.03 to cut off the tail and junk tokens. Might have to patch vLLM for top-nsigma sampling; I don't remember if it ever got merged, but the authors posted code to do just that. I have to hybrid the model with only 96g of ram, so I use IK_llama.
laterbreh@reddit (OP)
Yea sorry, I meant about the top-n-sigma; couldn't find it in the recent docs of vLLM, so I'm currently playing with min_p and lower temperatures in my evals.
a_beautiful_rhind@reddit
I think you can just edit site packages: https://github.com/Tomorrowdawn/top_nsigma
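For anyone unfamiliar with min_p, here is a rough sketch of the idea (illustrative code, not vLLM's internal implementation): only tokens whose probability is at least min_p times the top token's probability survive, which trims the low-probability tail where junk tokens and stray characters live.

```python
import math

# Rough sketch of min_p filtering (illustrative, not vLLM's internal code):
# keep only tokens whose probability is >= min_p * P(top token), trimming
# the low-probability tail where junk tokens tend to live.
def min_p_filter(logits, min_p):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    # return indices of tokens that survive the cutoff
    return [i for i, p in enumerate(probs) if p >= threshold]

# token 2 has ~0.5% probability, below 3% of the top token's ~73%, so it is cut
kept = min_p_filter([5.0, 4.0, 0.0], min_p=0.03)
```

Unlike top_k, the number of surviving tokens adapts to how confident the model is, which is why it pairs well with moderate temperatures.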
JayPSec@reddit
I'm running Luke Alonso's NVFP4 on two RTX 6000 Max-Qs. My main complaint with the model is the urge to go beyond what's asked of it. I find that a tight system prompt works pretty well; I'm just running stock OpenCode OpenAgents with some coding standards. But the model feels very vibe-oriented: it wants to do everything, and it had better do it now. And it feels a bit confused by some non-standard plugins like snip. I do think it's better for brainstorming than 2.5, but more unpredictable. As for the 'Chinese' characters I've seen others pointing out, I've never seen them.
ambient_temp_xeno@reddit
Rep penalty should be off: 1.0
laterbreh@reddit (OP)
Ran an eval set without repetition penalty; it has made an improvement.
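An overnight sampling sweep like the one OP described upthread could be sketched as a simple parameter grid (the value ranges here are examples, and `run_eval` is a placeholder for your own eval harness):

```python
import itertools

# Illustrative sampling-parameter grid for an overnight eval sweep.
# The value ranges are examples; run_eval() stands in for your own harness.
grid = {
    "temperature": [0.6, 0.8, 1.0],
    "min_p": [0.0, 0.01, 0.03],
    "repetition_penalty": [1.0, 1.1],
}
# Cartesian product: one config dict per combination (3 * 3 * 2 = 18 runs)
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

for cfg in configs:
    pass  # run_eval(cfg) -> score, logged for morning review
```

Fixing the seed per run, as suggested above, makes the overnight scores comparable across configs.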
antunes145@reddit
I'm running it on a Mac M5 Max with 128 GB. Of course it's the quantized version, but it takes up a whopping 100 GB of RAM. I'm running it on VLMX. But out of all the local models I have ever tried that fit on my system, it is the only one that was able to get my coding requests spot on. It's a very, very capable model, so far my favorite one for local coding. But I am now testing Qwen 3.6 26B, and honestly, for its small size it's actually surprising me with its code quality. I might start using it for smaller coding requests.
tomz17@reddit
Running in fp8 on a 9684x w 12-channel DDR5 + 2x3090. Seeing roughly 30t/s tg and 300 t/s prefill.
Very useable.
LegacyRemaster@reddit
For many tasks, I still use 2.1. For example, I'm generating a dataset now, and it's much faster and more consistent than the latest versions.
relmny@reddit
Although I don't do code, I'm using it more and more. It's actually becoming my main model.
I run the Q4_K_M quant and I'm really liking it. I used to run qwen3.5-27b (and 122b, etc.) as I really like Qwen, but this one feels really good.
No issues so far.
I use:
--temp 1.0 --top-k 40 --top-p 0.95 -c 24576
(can't fit more context, but I will try turboquant (TheTom), as a test to see if I can fit more)
someone383726@reddit
I’m running on 2 6000 pros and it’s been great for me. Generated lots of scripts without error. It feels like it is sonnet level
Zyj@reddit
Some Q4?
Cosmicdev_058@reddit
Definitely seeing some odd behavior with 2.7, especially around code formatting and consistency. It might be worth trying different inference engines like TGI or checking if a different quantization level helps.
For tracking these inconsistencies and comparing against 2.5, tools like LangSmith, Langfuse, or OrqAI with its eval and routing features could give you more insight or let you swap models easily.
fallingdowndizzyvr@reddit
I haven't seen a single Chinese character or spelling errors. I'm using Bartowski's Q4_K_S on llama.cpp B8816.
greentheonly@reddit
I am with you on this one. With Claude Code in particular it feels much worse than 2.5. I get random Chinese characters, but also some weird "why are you typing empty lines" moments, and it interprets some tool call results as user input and gets confused over that.
Only on 2x Pro 6000, so I have to use a quant (tried lukealonso's and then a couple of AWQs, all with the same results). vLLM 0.19.0 + the transformers update to get the model to actually load.
It really is visible when you run something exploratory in a largeish codebase, like "what is this code doing and how do I use it?" -> "give me a comprehensive architectural review" -> "let's dive into sub-component X really deeply". 2.5 performs well; 2.7 breaks on step one or two. (This is my standard test for new models before I consider switching.)
...
jeffwadsworth@reddit
It is fine. Running the Q8 and coding is okay. My main model is GLM 5.1 Q4 though, so I am spoiled.
PraxisOG@reddit
I’ve had really good luck running it at iq3xxs on 96gb of vram across 3 32gb AMD V620s. I haven’t really put it through its paces yet, but I like how capable it is with knowing when to call tools and without thinking so much.
john0201@reddit
What kind of tps are you getting over pcie? I have 2x5090 and get crushed when I use both for the same model.
SillyLilBear@reddit
I get 130t/sec with two 6000 Pros on m2.7
No_Mango7658@reddit
He’s running fp8, you might be at a higher quantization
SillyLilBear@reddit
I am at FP4, but just saying PCIe isn't a problem with inference.
laterbreh@reddit (OP)
60 tps
l9o-dot-dev@reddit
I'm running MLX 8bit on a Mac Studio. I see lots of those issues with missing spaces and occasional spelling errors too, but the worst has been tool calling. It often stops using the correct format (XML-based Minimax) and starts using Qwen braces, which creates problems with vllm-mlx. For now, I've patched vllm-mlx locally so it supports the Qwen format as a fallback and it's working ok. But I haven't quite seen that big of an improvement in any of my tasks, so I might rollback to M2.5.
ortegaalfredo@reddit
I'm using Int4 autoround with thousands of tool calls and I only saw it fail twice.
somerussianbear@reddit
Since you’re here, how many TPS are you getting?
laterbreh@reddit (OP)
Something in my gut says they may have impaired the weights... It genuinely feels sloppy compared to 2.5.
ortegaalfredo@reddit
I'm using it at Q4 and it's similar to qwen3.5-397, slightly lacking in some jobs, but it's so much faster; it feels like a small model.
rm-rf-rm@reddit
Please keep us posted! I'm doing my customary 2-week wait before moving to a new model. But besides this post, every other post I've seen about 2.7 has been positive. So it would be interesting to see if you are able to resolve the issues.
llama-impersonator@reddit
it scores lower in world knowledge so it might have gotten enough of an RLobotomy to affect some tasks.
takoulseum@reddit
Too many tool-call issues, so I do not use it anymore (used with llama.cpp).
tiffanytrashcan@reddit
2.7?? You must have had a bad quant or some template majorly messed up, it's honestly one of the best for proper tool calls. (The syntax fails less often than GLM5 for me)
takoulseum@reddit
UD_Q4K_S with no specific flag for template. If anybody uses this quant and has no issues at all, I would be happy to know the server command used.
tiffanytrashcan@reddit
Was this after they updated it? The original Unsloth release (and many others) had a few broken quants, I know a 4S of some type was definitely on that list.
takoulseum@reddit
I think that it was, but I will need to redownload to be sure.
tiffanytrashcan@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/
I just double checked, yes 4K_S was listed under "ours had NaNs". If you tried it before this update, it's worth another look.
Ok-Measurement-1575@reddit
Update vllm.
laterbreh@reddit (OP)
We run vllm nightly, we update once a week. Anything specific hit regarding minimax?
Qwen30bEnjoyer@reddit
Specs: 7700x, 96gb SODIMM DDR5 using an adapter, running at 4800 MT/s with a 16gb 6800xt.
150 tps PP, 7 tps TG on IQ3-XXS. Been busy with school, so I can't test it as much as I would like.
catplusplusok@reddit
1.0 is a very high temperature for coding (you're asking for, and getting, variety :-) ) and max tokens is very low, so you don't get much context.
laterbreh@reddit (OP)
At 1.0 with M2.5, our evaluation sets could predictably reach the same conclusion even with that high a temperature. We are using the maximum context window; I think you are confusing max tokens with max context. Max tokens is a runaway protection on its output.
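For reference on the temperature debate in this subthread, here is a toy softmax with temperature (illustrative logits, nothing to do with the model's actual vocabulary): lowering T sharpens the distribution toward the top token, while T=1.0 samples from the model's raw distribution.

```python
import math

# Toy softmax with temperature (illustrative): lower T concentrates
# probability on the top token; T=1.0 leaves the raw distribution intact.
def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

p_hot = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
p_cool = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.6)
# the top token's probability is strictly higher at the lower temperature
```

Whether that extra variety at 1.0 helps or hurts a given coding task is exactly what OP's evals are measuring.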
Lissanro@reddit
I tested minimax-m2.7-230b-a10b-ud_q5_k_xl (GGUF with llama.cpp) and it does not make obvious typos for me. In your settings, repetition_penalty may be causing it. Generally it is bad, especially for programming, and even for creative writing it may increase the possibility of typos too. That said, it is still not as smart as GLM 5.1, and not as coherent with long context as Kimi K2.5, which I still run the most on my rig, but MiniMax M2.7 seems to be a good high-speed model for simple to moderately complex tasks.
takoulseum@reddit
How is it compared to Qwen3.5 397B?
Lissanro@reddit
Qwen3.5 397B is a bit more capable: it supports vision and is fast for its size. But if you do not need vision, the difference is not that big. They also tend to pick different approaches, so if Qwen3.5 397B or MiniMax M2.7 gets stuck on something, it may help to pick the other one.
Specific-Rub-7250@reddit
I am using Q8_0 (temp=1.0, top_p=0.95, min_p=0.01, top_k=40) and it performs absolutely amazing work. It’s really SOTA level.
val_in_tech@reddit
Feels like an update. A bit less schizophrenic than 2.5. I tried even smaller quants like q2 q3 (ik ones) and they looked fine to me. You don't need full precision for minimax IMO.
Crafty-Celery-2466@reddit
Ive been using it for some synthetic data gen and it’s very nice compared to others. My task is not v complex but very nuanced instruction following is needed.
tiffanytrashcan@reddit
I've noticed on the OpenCode API (which they usually pass through directly from the provider) that agentic use is better, but the thought output is garbage vs 2.5... It can do amazing things by default, but seems more likely to do the wrong thing.