2x Asus Ascent GX10 - MiniMax M2.7 AWQ - cloud providers are dead to me
Posted by t4a8945@reddit | LocalLLaMA | View on Reddit | 32 comments
Hello,
I've been on a quest to get something "close enough" to Opus 4.5 running locally for agentic coding, as a SWE with 15 years of experience.
I tried with one spark (yeah, I'm calling my Asus Ascent GX10s sparks - they're the same thing), with models like Qwen 3.5 122B-A10B, Qwen3-Coder-Next, M2.5-REAP, ... Nothing was scratching the itch, too much frustration. 128GB is simply not enough (for me) right now.
So I bought a second one (I paid 2800€ for the first, 2500€ for the second, plus a 60€ cable - 5360€ total, without VAT because it's a business expense, so I get the VAT back).
First I tried Qwen 3.5 397B-A17B, thinking it would be "it". But it's not. It's not bad, it's just not up to the task of being a reliable agentic coworker. I found it a bit eager to say "it's done!".
Then I tried MiniMax M2.5 AWQ: 130GB for the Q4 version, so lots of room for KV-cache. It's slower than Qwen 3.5 397B-A17B and doesn't have vision.
But oh boy is it a good agentic workhorse.
Then came M2.7 with its new license (which is clearly made to fight shady inference providers, which I agree with - not made to fight us), and while it's not night and day versus M2.5, it's the best model I've used.
I've set it up with my own harness (an OpenCode-like interface that I've customized for my use case), and as long as I give it a way to verify its work, it delivers (either through tests or through using the playwright-cli).
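To give an idea, the verification hook can be as simple as a script the agent must run after every change. A hypothetical sketch (made-up commands, not my actual harness):

```bash
# Hypothetical verify script exposed to the agent - illustrative only.
# The work only counts as "done" when this exits 0.
set -euo pipefail

npm test                                # unit tests gate the change
npx playwright test --reporter=line     # headless browser smoke test
```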
It's amazing at planning, understanding issues, developing new features, fixing bugs... All the things you'd expect.
Sure it's not perfect, but it IS close enough and fast enough. It does frustrate me from time to time, just like proprietary SOTA models do as well.
That does require readjusting your expectations a bit, though: you can't expect the same thoroughness as GPT-5.4 or the sheriff attitude of Opus 4.6. It's different, it's local, but it WORKS.
So I'm calling it, cloud providers are dead to me. 2x Spark is a great setup and with M2.7 I've got a solid agent working for me.
(They actually have quite bad thermals; stacking them is not optimal, so they now lie flat on a desk.)
PS: I have to pay my respects to the MiniMax team. They understand how to pack a great SWE into 229B parameters, while GLM-5.1 sits at 754B (40B active) and Kimi K2.5 at 1T (32B active). These guys understand compute. It's a win to be able to have such a smart agent in such a "small" footprint. They don't do it for us, they do it for themselves, to provide great inference without as much compute as OpenAI/Anthropic/ZAI/Moonshot.
---
References:
- Spark docker: https://github.com/eugr/spark-vllm-docker (recipe is https://github.com/eugr/spark-vllm-docker/blob/main/recipes/minimax-m2.5-awq.yaml with 2.5 replaced by 2.7, that's it - but I've tweaked it to use fp8 KV-cache and the full 196K context; see the flag sketch below the references)
- The quant I'm running: https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit/
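For the curious, that tweak boils down to standard vLLM flags. A rough sketch of the equivalent serve command (the recipe YAML wraps this differently; 196608 is just 196K written out):

```bash
# Equivalent vLLM flags for the tweaked recipe (a sketch, not the actual YAML)
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --kv-cache-dtype fp8 \
  --max-model-len 196608
```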
freehuntx@reddit
250gb/s lul
anzzax@reddit
273 GB/s :) But it has a 200GbE interface, so when connected in a cluster you run tensor parallel and throughput almost doubles. I have a single spark and now I want a second one. Even a single spark gives me more practical value than a PC with a 5090 and 96GB of RAM.
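For context, the usual way to do that with vLLM is a small Ray cluster over the 200GbE link, then tensor parallel across both machines. A rough sketch (IPs are placeholders, not a tested config):

```bash
# On spark 1 (head node), listening on the 200GbE interface:
ray start --head --port=6379
# On spark 2, joining the cluster over the direct link:
ray start --address=192.168.100.1:6379
# Back on spark 1, shard the model across both GPUs:
vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit --tensor-parallel-size 2
```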
Glittering-Call8746@reddit
Even if it fits on one spark, will 2 sparks double it? Give me a real-life use case of this exact scenario.
1ncehost@reddit
I have a Ryzen 395 laptop and I too found M2.7 to be the breakthrough model. Vibecoding in OpenCode locally with that at 30 tok/s is my "there" moment. If I had to end my subscriptions, it wouldn't be ideal, but I could make it work.
No_Mango7658@reddit
Omg I’d love a 395+ laptop paired with a 5090 mobile (work and play). I ended up buying the Framework desktop and using Tailscale to access it when I’m out…
Danmoreng@reddit
I dunno, 30 t/s for agentic tasks sounds too slow for my taste. I am currently using Codex in fast mode and often want more speed to be able to iterate faster. I haven't tried local models in an agentic harness yet, just the webui - and when I let small models (Gemma4 26B/Qwen3.5 35B) write code there at ~60 t/s, it feels "slow" to me.
FullOf_Bad_Ideas@reddit
That would mean that Claude Code with Opus 4.6 would be too slow for you. Which is OK, but most people like it.
For me, good prefill and 10 t/s would be borderline, but above that it's OK.
Ideally prefill and generation would obviously be 10k t/s, but we will have to wait a year or two for that.
RedParaglider@reddit
How is the speed? Does it make your eyes bleed?
sn2006gy@reddit
MiniMax isn't an agentic coding model; it's full of nonsensical tool training where it expects to control the environment. It has no sense of plan-do-act, or of Claude Code/OpenCode or some other agent tool taking the steering wheel and running with it.
get ready for glob glob glob glob glob all day long.
PS... I really wish MiniMax was better. I was able to wrangle qwen3-coder-next into working plan-do-act with a few days of building a harness, but MiniMax was just infinitely bad no matter what I tried. I mean, there were times it was running 20-30 mocked tool calls with symbolic training names instead of actually trying to find a file, and on a project with 5 files in it, it kept globbing over and over between open, read, write, update, and took 30 minutes to do something that finished in 5 seconds with other models.
I emailed MiniMax and kind of got back an "uhh yeah" response. Here's hoping for a 3.x...
I want open models to be killer... MiniMax is just frustratingly bad if you write code for a living.
jon23d@reddit
I’ve had the exact opposite experience. MiniMax 2.7 has been reliable, fast, and accurate. Also, I can run it easily at home. At Q8 it is doing as well for me as Sonnet 4.6 was, at least so far.
SeaDisk6624@reddit
I could run it in fp8; currently I'm using Qwen 3.5 397B nvfp4. What is your harness for it?
jon23d@reddit
Opencode, injecting skills at the top deterministically
colin_colout@reddit
what coding agent? did you set your own system prompt?
mrtime777@reddit
https://spark-arena.com/benchmark/b75bdb20-09cb-4c6a-b17d-8ce620961d3b
FalconX88@reddit
Our two sparks will arrive in a few days, definitely gonna try this. Sadly no nvfp4 yet?
Secure_Archer_1529@reddit
NVFP4 does not work on the spark. Community workarounds make it somewhat OK, but there are better quants than NVFP4 atm. Go have a look at the NVIDIA DGX Spark developer forum if you haven't already - plenty of great stuff there to turbocharge some builds and hit the ground running.
DOOMISHERE@reddit
I got my 2nd spark a few days ago, and today I got the cable!
Wasted a few hours on vLLM just to find out I can't run GGUF versions of MiniMax (was hoping to run Q5), and llama.cpp can't work with clusters...
Might try that quant you posted...
entsnack@reddit
llama.cpp support for RDMA is in the works! You can run it in a cluster, but it'll use ethernet right now.
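If you want to try that today, llama.cpp's RPC backend is the ethernet path. A rough sketch (build with GGML_RPC enabled first; the filename and IP are placeholders):

```bash
# On the second spark: expose its backend over the network
rpc-server -H 0.0.0.0 -p 50052
# On the first spark: offload layers to the remote machine via --rpc
llama-server -m minimax-m2.7-q5_k_m.gguf --rpc 192.168.100.2:50052 -ngl 99
```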
_reverse@reddit
+1 to sparkrun, it's great. It'll use the 4bit AWQ quant though. If you want to run a GGUF (for a 5bit quant or whatever) in a distributed setup, you'll be able to soon with the llama.cpp tensor parallelism functionality, but it'll probably be a couple of weeks until it's stable.
mrtime777@reddit
The Spark community has an excellent utility called `sparkrun` and an official recipe for this model; everything works out of the box:
`$ sparkrun run u/official/minimax-m2.7-awq4-vllm`
DOOMISHERE@reddit
Looks dope! I'll test it for sure.
DOOMISHERE@reddit
With support for multiple sparks?
pirateadventurespice@reddit
Was literally setting up my second spark (also two ascents) today and wondering what model to try. Loading this one up now.
Speed absolutely does not matter for me. I'm an academic, and I'd rather something run overnight and be correct than spit it out in real time, so I'm very excited to try this.
Ok-Measurement-1575@reddit
Is the PP still fairly strong clustered?
96GB here - MiniMax is my best model, too. I only trot it out for the trickier problems because of the ultra low PP.
pfn0@reddit
Why are people having such a problem calling vendor-variant sparks "sparks"? It says "Welcome to DGX Spark" when you log in. It's a spark.
VividLettuce777@reddit
For the price of one spark you can buy two AMD AI 395 powered mini-PCs. Just saying.
pfn0@reddit
That's false: the GX10 is about $3500 now, and the Minisforum MS-S2 (AI 395+) is $3300. You can get no-name 395+ machines for $2500 maybe (Bosgame?). There is a slight price premium if comparing like for like (1TB drives), but it's close if you're OK with 1TB. You also get 200GbE, which isn't available on AI Max 395+ machines.
Initial_Run3719@reddit
Nice setup you have there!
What about prompt processing times? I have a Strix Halo and my favourite LLM is Qwen 3.5 122B, but loading takes up to 8 minutes with full context (I set it to 120k). I know the sparks are much faster - does having two speed it up even more?
waiting_for_zban@reddit
How is PP with large context? If the spark had a PCIe x16 Gen5 slot it would have been a banger: slapping an RTX 6000 Pro on it would make it a perfect machine. Right now I still struggle to see the value added compared to the Strix Halo or the M5 chips. It only makes sense if you stitch 2 together, and even then PP might not be that convincing, unless it's a sparse MoE model.
jacek2023@reddit
I hope there will be a second generation of all these sparks at some point.
fallingdowndizzyvr@reddit
The Spark is like a spinoff of the Jetson. There have been plenty of Jetsons.