Minimax 2.7 running sub-agents locally
Posted by -dysangel-@reddit | LocalLLaMA | 52 comments
I just tried hooking up local Minimax 2.7 to Opencode on my M3 Ultra. I'm pretty impressed that it can run so many agents churning through work in parallel so quickly! Batching like this feels like it's really making the most of the hardware.
FullstackSensei@reddit
What's IQ2_XXS? And why do you give so much memory to the KV cache while neutering the model with XXS quants? Sounds a bit backwards to me.
-dysangel-@reddit (OP)
IQ2_XXS quants of large models like Deepseek R1 and GLM 5.1 have been working great for me. Halving the amount of data that needs to be pushed around speeds things up, and the quality is still much higher than small models can manage. I allocated so much to KV cache in the hope that sub-agents don't bust the main agent's cache, but I might need a more custom setup if it doesn't work that way (i.e. if it can only store one history at a time rather than keeping branches off the system prompt around).
lolwutdo@reddit
I use IQ2 with a 397b and it works perfectly. I was afraid to get that same quant for MiniMax though, thinking it would affect it more since it's a way smaller model in comparison.
I guess I could try it to get more context; currently I can only squeeze 68k context into 16gb vram with Q3_K_S and Q3_K_XL quants.
nikhilprasanth@reddit
How much system ram for the offload?
lolwutdo@reddit
128gb
-dysangel-@reddit (OP)
I think it's worth a try. As the fundamentals of reasoning get better baked into models, they will be more resilient to quantisation. I'm going to try it at Q4 too, but if it works at Q2 then why not?
FatheredPuma81@reddit
Uhh... 196608/7 ≈ 28086, and OpenCode's system prompt takes up like 13,000, so yeah, you're already almost out of context on the main agent.
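To spell out that arithmetic (using the slot count and the rough 13k system-prompt figure from this thread; llama.cpp splits the total context evenly across parallel slots):

```python
# Context in llama.cpp is split evenly across parallel request slots.
total_ctx = 196608      # -c / --ctx-size
n_parallel = 7          # number of parallel slots
system_prompt = 13_000  # approximate OpenCode system prompt, per the thread

per_slot = total_ctx // n_parallel
remaining = per_slot - system_prompt
print(per_slot, remaining)  # 28086 tokens per slot, 15086 left for actual work
```

So each slot starts with barely half its window free before any files or tool output arrive.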
FullstackSensei@reddit
Did you actually check if your hope makes sense?
My experience with batching in llama.cpp and MoE has been that it slows things down considerably. E.g. I get ~30t/s with minimax at Q4 fully in VRAM on Mi50s, but that drops to ~10t/s per request when running two in parallel (total 20t/s). It's been the same with smaller MoE models: performance drops.
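For anyone wanting to reproduce this kind of comparison, a representative llama-server launch (the model filename and sizes here are placeholders, not this setup's actual flags):

```shell
# Hypothetical llama-server launch for testing parallel slots.
# --parallel 2 creates two request slots; the total -c context is split
# evenly between them, so per-request context halves as well.
llama-server -m ./minimax-2.7-Q4_K_M.gguf -c 65536 --parallel 2 --port 8080
```

With two slots you can fire two requests at the `/v1/chat/completions` endpoint simultaneously and compare per-request vs aggregate t/s.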
As for quant, I find a smaller model (down to 80B) at much higher quant beats larger models at quants under 4 bits. Small quants neuter the model too much for anything but simple tasks.
-dysangel-@reddit (OP)
I haven't benchmarked anything yet - I wasn't intentionally trying to test out batching. I just booted up opencode and it immediately spun up a bunch of agents on my first request, and they were all chewing through actions in parallel much quicker than I'd have expected, especially given that M2.7 doesn't have any linear attention layers. I posted here in excitement, but I get that benchmarks would be more valuable.
I'm really interested in seeing what actually performs better on long contexts and batching - GLM 5.1 or Minimax. Minimax is 1/3 of the size, but I expect there will be a crossover point where GLM's DSA starts to make it more performant for longer contexts.
FullstackSensei@reddit
Which performs better will heavily depend on which quants you're comparing. IMO, a smaller model at a considerably larger quant performs much better than a neutered larger model. For coding tasks, it's all about nuance.
Another thing that I think most people get wrong (personal opinion): focusing on t/s output vs how much the model can actually get done one-shot and without babysitting.
I'd much rather run a much larger model at half the t/s or even less and give it larger tasks, with the confidence that I can leave it to do its thing for half an hour to an hour without babysitting, than run a model with faster t/s but have to constantly correct it every couple of minutes. The latter is very stressful for me, while the former lets me do other things without worry.
-dysangel-@reddit (OP)
That's not been my experience, but it really depends on the quant. For example Deepseek R1-0528 at IQ2_XXS UD was consistently high quality, but V3-0324 lost a lot of consistency/quality in its outputs despite having the exact same architecture.
I fully agree on the t/s vs quality. When I tried Qwen 27B agentically it would often just halt from failed tool calls etc, while GLM 5.1 just keeps working solidly until the job is done.
SourceCodeplz@reddit
Caching will be the 2026 optimization hack. If someone gets it right, we get amazing continuity and chain of thought.
-dysangel-@reddit (OP)
Yeah I built my own caching system when GLM 4.5 Air came out. It's definitely a game changer, especially if you cold cache different system prompts from different agents/sub-agents.
oMLX has really simple and effective hot/cold caching options if you're on Mac, I'm a fan.
No_Accident8684@reddit
how do you run the model? ollama? lmstudio?
-dysangel-@reddit (OP)
llama.cpp
popecostea@reddit
I see you keep making this mistake. Your `--cache-ram` parameter is for the prompt cache, it's not related to the KV cache itself. In the words of GG himself, the prompt cache acts as "extra slot" space.
-dysangel-@reddit (OP)
I'm not really sure what you mean. The processed prompt cache is a KV cache, surely? I'm not sure what else you would be caching. "The" KV cache is all processed tokens so far including prompts and responses.
popecostea@reddit
While processing the prompt you make use of *the* KV cache, as for each token you reuse many calculations - on that count you are correct. For different contexts (i.e. prompts) the KV cache needs to be recomputed/evicted. When you cycle between a number of prompts it would be inefficient to recalculate each time. llama.cpp essentially "caches" the KV caches of previous prompts to avoid recomputing things it has already seen, and the size of that cache is what you control with that parameter. llama.cpp also does this with some checkpoints - if, for example, the agentic harness drops a suffix of the context between rounds.
*The* KV cache itself is something that depends solely on the model architecture, the quantization of the K/Vs, and the context size you use.
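To make "depends on architecture, K/V quantization, and context size" concrete, a back-of-the-envelope estimator (the layer/head numbers below are illustrative placeholders, not any real model's config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """Rough per-sequence KV cache size: a K and a V tensor for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

# Illustrative numbers only: 60 layers, 8 KV heads of dim 128,
# 128k context, fp16 (2-byte) cache entries.
size = kv_cache_bytes(60, 8, 128, 131072)
print(f"{size / 2**30:.1f} GiB")  # 30.0 GiB
```

Quantizing the cache to q8_0 or q4_0 (llama.cpp's `--cache-type-k/v` options) shrinks that roughly proportionally to the bits per element.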
walden42@reddit
TIL, thanks! How do you optimize the --cache-ram param?
popecostea@reddit
Well, you basically put in as much as you're willing to, based on how much system memory you know you'll have free - that's the easiest way.
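A concrete sketch of that sizing (numbers are illustrative; my understanding is that the `--cache-ram` value is given in MiB):

```shell
# Example only: after reserving room for the model weights and the OS,
# hand the leftover budget to the prompt cache - here ~30 GB.
# --cache-ram is specified in MiB (assumption: -1 means no limit).
llama-server -m ./model.gguf -c 131072 --cache-ram 30000
```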
-dysangel-@reddit (OP)
Well, that sounds like exactly what I asked Claude for!
-dysangel-@reddit (OP)
ah thanks, my bad. I had asked Claude to allocate up to 300GB for context, but I guess 'context' in llama.cpp is more related to max_tokens in mlx
MrB0janglez@reddit
The batching throughput on M3 Ultra is impressive here. Running IQ2 quants with llama.cpp really maximizes the unified memory bandwidth. Have you tried adjusting the n_parallel slots to find where quality starts degrading? Would be curious how far you can push it.
ezyz@reddit
Any reason for preferring llama.cpp over MLX? I've found using `mlx-lm.server` gives an easy 10-25% boost on speed, and that Unsloth-style mixed quants work when translated into MLX as well.
-dysangel-@reddit (OP)
I use mlx all the time for smaller models, but in my empirical testing IQ2_XXS UD has been great on all the larger models recently. I was actually just considering downloading an MLX Q3 to compare. But hey the Q2 is working great already.
Just_Lingonberry_352@reddit
what is your overall impression
-dysangel-@reddit (OP)
I'm very impressed so far. I've just compared back to back and despite being n^2 attention it is much faster overall for pp than the larger GLM 5.1. This is feeling like a model that could genuinely get me by as a smart-enough and fast-enough coding assistant if I could never use cloud again. Especially if I set up some more custom tweaks, like using a linear model for summarisation of long context windows.
Just_Lingonberry_352@reddit
that is enticing if you are using it for agentic coding. curious how it compares to gpt 5.4
-dysangel-@reddit (OP)
No idea sorry - I've not used OpenAI models for a long time
SoftwareProBono@reddit
I'm loving the personality! 🤣
-dysangel-@reddit (OP)
lmao yeah - I tried giving 2.5 a personality/name and its thoughts were "User referred to me as x, but I'm Minimax". Great workhorse, but not so good for a chilled out assistant.
SoftwareProBono@reddit
I was laughing out loud on this first test. I'm used to the open source models having less personality, but I found this terseness pretty hilarious.
rm-rf-rm@reddit
im on Minimax 2.5 right now on my M3 Ultra. I was thinking of waiting for the usual new model release issues to get cleared up and then downloading it after a week or so. But seems like it's good to go? Were you using 2.5 previously?
-dysangel-@reddit (OP)
I wasn't actually using 2.5 for interactive stuff before. It felt robust, but not that "smart" to me, so I was just using it for curating my assistant's memory and checking for hallucinations
unbannedfornothing@reddit
The main drawback of Minimax for me is that speed drops drastically with context size. Results at 0 context:
results @50000:
results @100000:
almost 5 times slower at just 100000 ctx. For comparison - Qwen 3.5 397B MXFP:
so already at 50000 it's slower than the almost 2x larger Qwen.
-dysangel-@reddit (OP)
yeah, n^2 attention is no joke. They were one of the first to release a model with linear layers, but then they gave up on it. I hope the success of GLM 5 and Qwen 3.5 convinces them to give it another go.
unbannedfornothing@reddit
Thought there might be a bug with mxfp quants, so tested Q4_K_XL; results are slightly worse for me:
ForsookComparison@reddit
IQ2_XXS on 10B active params would surprise me if it was remotely useful
-dysangel-@reddit (OP)
It has seemed very capable so far. Some IQ2_XXS quants are terrible. Some are not.
OffBeannie@reddit
How many gig is your ultra?
-dysangel-@reddit (OP)
AcaciaBlue@reddit
I think he's asking how much RAM the machine has overall, which is a very good question
Jackalzaq@reddit
how does it compare to qwen 3.5 397b? or glm 5.1? my experience with the minimax models is that they are very good for chatting with, but they seem to have issues with coding compared to those two.
for me glm5.1 is slow but catches the mistakes of all the other models i have. it also seems like it's the only good planner.
-dysangel-@reddit (OP)
I've been running a few tasks on my game engine just now and it's handling them well.
The smaller Qwen models are amazing for the size, but I was never really impressed with the larger ones considering I can already be running GLM 5 for that amount of RAM.
GLM is very good at planning and coding - I'd assume better than Minimax, but I haven't had enough time with 2.7 to put it through its paces properly yet.
Minimax has faster tps, but I expect at longer contexts that GLM probably wins out for prompt processing.
Jackalzaq@reddit
yeah the only reason i run the larger qwen models is cause of the image input capabilities. ill have to give 2.7 a try then, thanks!
deejeycris@reddit
As far as I understand, minimax is not very smart but good at tool calling.
-dysangel-@reddit (OP)
Yeah I'd heard the same on the previous models - but the benchmarks on M2.7 are close to frontier level, so I thought it was worth a try. It's around GLM 5.1 performance for 1/3 of the RAM costs
deejeycris@reddit
I have to try it more, currently testing with ollama cloud but kimi2.5 has been a bit faster. And GLM5.1 in zai/ollama.
LegacyRemaster@reddit
can you post the complete llama command?
-dysangel-@reddit (OP)
sure
LegacyRemaster@reddit
thx!
TheItalianDonkey@reddit
At that quantization, shouldn't Gemma4-31b be much better though?
Did you try?
What's your RAM on that M3?