2x Asus Ascent GX10 - MiniMax M2.7 AWQ - cloud providers are dead to me
Posted by t4a8945@reddit | LocalLLaMA | View on Reddit | 98 comments
Hello,
I've been on a quest to get something "close enough" to Opus 4.5 running locally, for agentic coding, as a SWE with 15 years of experience.
I tried with one spark (yeah, I'm calling my Asus Ascent GX10s sparks - they're the same thing), with models like Qwen 3.5 122B-A10B, Qwen3-Coder-Next, M2.5-REAP, ... Nothing was scratching the itch - too much frustration. 128GB is simply not enough (for me) right now.
So I bought a second one (2800€ for the first, 2500€ for the second, plus a 60€ cable - 5360€ total - that's without VAT because it's a business expense, so I get the VAT back).
First I tried Qwen 3.5 397B-A17B, thinking it would be "it". But it's not. It's not bad, it's just not up to the task of being a reliable agentic coworker. I found it a bit too eager to say "it's done!".
Then I tried MiniMax M2.5 AWQ. 130GB for the Q4 version. Lots of room for KV-cache. It's slower than Qwen 3.5 397B-A17B and doesn't have vision.
But oh boy is it a good agentic workhorse.
Then came M2.7 with its new license (which is clearly aimed at shady inference providers - something I agree with - not at us), and while it's not night and day compared to M2.5, it's the best model I've used.
I've set it up with my own harness (an OpenCode-like interface that I've customized for my use case), and as long as I give it a way to verify its work (either through tests or through the playwright-cli), it delivers.
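To give an idea of what that verification loop looks like, here's a stripped-down sketch (not my actual harness code - the model name, endpoint and file paths are placeholders, and it assumes vLLM's OpenAI-compatible server is running):

```python
import subprocess
from openai import OpenAI  # vLLM exposes an OpenAI-compatible endpoint

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
messages = [{"role": "user", "content": "Fix the failing date parsing in utils/dates.py"}]

for attempt in range(5):
    reply = client.chat.completions.create(model="minimax-m2.7-awq", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    # ... the harness applies the proposed edits to the working tree here ...
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if tests.returncode == 0:
        break  # verified: the suite is green, so "it's done!" actually means done
    # feed the failure output back so the model can self-correct
    messages.append({"role": "user",
                     "content": f"Tests failed:\n{tests.stdout[-4000:]}\nFix it and try again."})
```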
It's amazing at planning, understanding issues, developing new features, fixing bugs... All the things you'd expect.
Sure it's not perfect, but it IS close enough and fast enough. It does frustrate me from time to time, just like proprietary SOTA models do as well.
That does require readjusting your expectations a bit though: you can't expect the same thoroughness as GPT-5.4 or the sheriff attitude of Opus 4.6. It's different, it's local, but it WORKS.
So I'm calling it, cloud providers are dead to me. 2x Spark is a great setup and with M2.7 I've got a solid agent working for me.
(They actually have quite bad thermals; stacking them is not optimal, so they now lay flat on a desk.)
PS: I have to pay my respects to the MiniMax team. They understand how to pack a great SWE into 229B parameters, while GLM-5.1 is at 754B (40B active) and Kimi K2.5 at 1T (32B active) - these guys understand compute. It's a win to be able to have such a smart agent in such a "small" footprint. They don't do it for us, they do it for themselves, to provide great inference without as much compute as OpenAI/Anthropic/ZAI/Moonshot.
---
References:
- Spark docker: https://github.com/eugr/spark-vllm-docker (recipe is https://github.com/eugr/spark-vllm-docker/blob/main/recipes/minimax-m2.5-awq.yaml with 2.5 replaced by 2.7, that's it - but I've tweaked it to use fp8 KV-cache and the full 196K context; see the vLLM sketch just below)
- The quant I'm running: https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit/
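For reference, this is roughly what the tweaked recipe boils down to, expressed through vLLM's Python API instead of the YAML (a sketch, not the recipe itself - the exact 196,608 figure, the Ray cluster spanning both nodes, and the memory fraction are assumptions on my side):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/MiniMax-M2.7-AWQ-4bit",
    quantization="awq",                  # the 4-bit quant linked above (~130GB)
    tensor_parallel_size=2,              # one GB10 per node, split over the 200GbE link
    distributed_executor_backend="ray",  # assumes a Ray cluster across both sparks is already up
    kv_cache_dtype="fp8",                # the fp8 KV-cache tweak
    max_model_len=196608,                # "full 196K context"
    gpu_memory_utilization=0.90,
)

out = llm.generate(["Explain what this recipe changes in two sentences."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```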
Fearless-Isopod-3231@reddit
The biggest bottleneck lately honestly isn’t even the models or the code anymore. it’s just finding affordable compute. Most of us experimenting outside of enterprise budgets are stuck mixing local hardware with rented VPS setups just to stay flexible. I’ve been looking more at independent providers like alexhosT lately rather than relying on the hyperscalers. Their Ryzen 9 nodes are actually pretty solid for memory-hard tasks, and since they own their own bunker in Moldova, you aren't paying that massive 'corporate overhead' markup. It’s a bit more of a 'manual' experience, but it’s one of the few places where the price-to-performance ratio still makes sense for heavy experimentation.
Fearless-Isopod-3231@reddit
Honestly, after jumping between several cheap hosts, I realized that many of them oversubscribe their nodes, and that's where the slowdowns come from. I moved a project to AlexHost recently to try it out and it's been noticeably more stable than other big names. Support isn't the fastest in the world, but the performance is consistent, and the fact that they have their own physical bunker in Moldova adds a privacy bonus you don't find with the typical hosts.
Iajah@reddit
where did you get them so cheap?
t4a8945@reddit (OP)
Here: https://www.topbiz.fr/pc-de-bureau/387397-asus-ascent-gx10-gx10-gg0003bn-gb10-4711636204439.html (they only deliver to France though)
(and as I said in the post: " that's without VAT because it's a business expense, so I get VAT back")
1ncehost@reddit
I have a Ryzen 395 laptop and I too found M2.7 to be the breakthrough model. Vibecoding in OpenCode locally with that at 30 tok/s is my "there" moment. If I had to end my subscriptions, it wouldn't be ideal but I could make it work.
philnm@reddit
M2.7 fits into 128GB RAM while still leaving enough space for context?
1ncehost@reddit
IQ3_S with Q8_0 KV FA
-dysangel-@reddit
if you haven't already tried it, I actually really like the IQ2_XXS. I tried the Q4 and it didn't seem any different, which this chart from Unsloth corroborates:
-dysangel-@reddit
The IQ2_XXS UD is only 65GB, and genuinely works well (I use it on my Mac)
sgmv@reddit
what kind of coding tasks do you have ? I found this model underwhelming, making lots of errors, not being smart at all. Tried q5, q8 as well as cloud. GLM 5.1 sooo much better
spaceman_@reddit
How are you running M2.7 on a Ryzen 395 laptop with enough memory left over to do actual work?
What is your prompt processing speed like?
No_Mango7658@reddit
Omg I’d love a 395+ laptop paired with a 5090 mobile (work and play). I ended up buying the Framework desktop and using Tailscale to access it when I’m out…
Danmoreng@reddit
I dunno, 30 t/s for agentic tasks sounds too slow for my taste. I am currently using Codex in fast mode and often want more speed to be able to iterate faster. Haven’t tried out local models in an agentic harness yet, just using webui - and when I let small models (Gemma4 26B/Qwen3.5 35B) write code there at ~60 t/s it feels „slow“ to me.
FullOf_Bad_Ideas@reddit
That would mean that Claude Code with Opus 4.6 would be too slow for you. Which is OK, but most people like it.
For me good prefill and 10 t/s would be borderline but above that it's OK.
Ideally obviously prefill and generation would be 10k t/s but we will have to wait a year or two for that.
jacek2023@reddit
I hope there will be second generation of all these sparks at some point
lucellent@reddit
If there is, it won't be this year most likely
at the earliest maybe CES27... one can only hope
unjustifiably_angry@reddit
With functional FP4.
NoahFect@reddit
Next step up is a big one, unfortunately.
Narrow-Belt-5030@reddit
Starting Price: $94,231.50
Jesus!
t4a8945@reddit (OP)
There'll always be a newer gen at some point, which will then be a perfect point for me to buy two extra and make my 4x spark cluster.
Just kidding, but yeah there will be innovation on this front, that's why if I'm happy today, that's enough. I don't need to future proof anything. Comparison is the thief of joy.
jacek2023@reddit
Could you post t/s?
t4a8945@reddit (OP)
I added the benchmark in the post
fallingdowndizzyvr@reddit
The Spark is like a spinoff of the Jetson. There have been plenty of Jetsons.
pfn0@reddit
Why are people having such a problem calling the vendor-variant sparks, sparks? It says "Welcome to DGX Spark" when you log in. It's a spark.
VividLettuce777@reddit
For price of one spark you can buy two amd ai 395 powered mini-PCs. Just saying
pfn0@reddit
That's false: the GX10 is about $3500 now, the Minisforum MS-S2 (AI 395+) is $3300. You can get no-name 395+ machines for maybe $2500 (Bosgame?). There is a slight price premium if comparing like for like (1TB drives), but it's close if you're OK with 1TB. You also get 200GbE, which isn't available on AI Max 395+ machines.
br_web@reddit
The z13 is $2700 Amazon/BB/etc
pfn0@reddit
oh, that's pretty good; seems the Asus MSRP is much higher than the actual street price. I'd pick one up.....
VividLettuce777@reddit
Yeah, you’re right, my data is a bit outdated. I only knew the prices at release of the dgx spark, didn’t expect it to have changed that much since.
florinandrei@reddit
The Asus clones remain cool and quiet even after hours at 100% load. They're pretty well built.
t4a8945@reddit (OP)
My bad, it's just that I bought the cheapest option and I have to pay for my cheapness somehow.
Serprotease@reddit
Spark is the marketing name of the Nvidia one. The actual name of these is GB10. (And the Nvidia ones are probably the worst price/value ratio.)
pfn0@reddit
It's been a little confusing, as I've seen the nvidia variant specifically sold as "Founders Edition". https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/dgx-spark/ I do see the distinction of the "DGX Spark" and "NVIDIA GB10" systems here though.
Logging in to any of these systems says "Welcome to NVIDIA DGX Spark" however.
unjustifiably_angry@reddit
Cloud providers don't need to be "dead", just dead enough you only need to call on them a few times a day so you don't need a subscription. Spread it across Claude, ChatGPT, Gemini, you've got enough free usage to solve tricky problems.
RedParaglider@reddit
How is the speed? Does it make your eyes bleed?
mrtime777@reddit
https://spark-arena.com/benchmark/b75bdb20-09cb-4c6a-b17d-8ce620961d3b
CalligrapherFar7833@reddit
Tests with 128/256k context ?
Aaronski1974@reddit
This is super cool.
sn2006gy@reddit
MiniMax isn't an agentic coding model; it's full of nonsensical tool training where it expects to control the environment. It has no sense of plan-do-act, or of Claude Code/OpenCode or some other agent tool taking the steering wheel and running with it.
get ready for glob glob glob glob glob all day long.
PS... I really wish MiniMax was better. I was able to wrangle qwen3-coder-next into working plan-do-act with a few days of building a harness, but MiniMax was just infinitely bad no matter what I tried. I mean, there were times where it ran 20-30 mocked tool calls with symbolic training names instead of actually trying to find a file, and with a project of 5 files it kept globbing over and over between open, read, write, update - it took 30 minutes to do something that finished in 5 seconds with other models.
I emailed MiniMax and kind of got back an "uhh yeah" response. Here's hoping for a 3.x...
I want open models to be killer... Minimax is just frustratingly bad if you write code for a living
segmond@reddit
MiniMax2.7 is all agentic, what are you yapping about? If you are having issues with MiniMax, you need to ramp up your skills.
sn2006gy@reddit
It sucks for agentic, but if y'all think it's good, keep using it. I already said why it sucks and you chose to call it a skill error.
segmond@reddit
It's a skill issue because you are struggling with it, but others are not. So you can either work hard and figure out how to make it work, or give up and find your comfort zone with another model. It's no big deal really. I say the best model is the one that brings you joy, the one you can run comfortably and the one that works for you. So enjoy - that's the fun thing with local LLMs. We have so many options we can now argue about them.
sn2006gy@reddit
Skill is knowing when a model is bad for the job and choosing the right model. Skill is knowing when the behaviors increase risk/liability more than they solve problems. Skill is writing a framework/tool/governor around said model and realizing it's just bad to begin with, can't be wrangled in, and can't be trusted.
If you just want to play around, use the model. I'll give MiniMax credit where credit is due - at least the model you pay for through their API layer does a lot to correct for the shortcomings of running it yourself.
Please. Don't try and teach me what a skill issue is when it's the lack of skills that shows people don't know what they're playing with here.
jon23d@reddit
I’ve had the exact opposite experience. Minimax 2.7 has been reliable, fast, and accurate. Also, I can run it easily at home. At q8 it is doing as well for me as sonnet 4.6 was, at least so far.
SeaDisk6624@reddit
I could run it in fp8, currently using qwen 3.5 397b nvfp4, what is your harness for it?
jon23d@reddit
Opencode, injecting skills at the top deterministically
colin_colout@reddit
what coding agent? did you set your own system prompt?
unjustifiably_angry@reddit
Hoping someone will do a modest REAP on 2.7; it's just slightly too big to fit in FP8 with a Q8 KV-cache, and since Sparks run FP8 at the same speed as INT4 it'd be a totally free upgrade. Less than a 5% REAP is required - it should be harmless if properly targeted.
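Back-of-envelope (all numbers are rough guesses, especially the usable memory per node and the KV geometry):

```python
total_params    = 229e9                 # M2.7 total parameter count (from the post)
fp8_weights     = total_params * 1      # ~1 byte/param at FP8 -> ~229 GB
usable_per_node = 120e9                 # guess at usable unified memory out of the 128 GB
budget          = 2 * usable_per_node   # two sparks under tensor parallel -> ~240 GB

# FP8/Q8-style KV-cache for the full window; layer/head numbers are placeholders
n_layers, n_kv_heads, head_dim, ctx = 60, 8, 128, 196_608
kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx   # K and V at 1 byte each -> ~24 GB

gap = fp8_weights + kv_cache - budget
print(f"{fp8_weights/1e9:.0f} GB weights + {kv_cache/1e9:.0f} GB KV vs ~{budget/1e9:.0f} GB "
      f"-> over by ~{gap/1e9:.0f} GB ({100 * gap / fp8_weights:.1f}% of the weights)")
```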
Tashimm@reddit
That hardware setup is absolutely mental—two GX10s is basically a private mini-datacenter.
It’s really interesting to see your qualitative take align so well with the recent benchmarks. People often overlook the SWE-Pro metrics, but seeing M2.7 hitting around 56% (matching GPT-5.3-Codex level) and 55.6% on VIBE-Pro (nearly Opus 4.6) really validates why you're finding it usable for actual agentic workflows rather than just being a 'chat' model.
The MoE architecture (229B total but only 10B active) is likely the secret sauce for why it doesn't feel like a sluggish behemoth despite the massive parameter count, making it much more viable for the 'fast enough' requirement you mentioned. Also, since you've built your own harness, you're actually leveraging exactly what MiniMax designed it for—that iterative, self-improvement loop.
Would love to hear how the 196K context window holds up when you start feeding it larger repo structures in your custom interface. Are you seeing any degradation in reasoning or 'hallucination' of completed tasks as the KV-cache fills up?
t4a8945@reddit (OP)
That's actually a nice side effect of the slower tg at higher context: it's not degrading, it's as smart as it is at 0 context.
200k is what I started with when Opus 4.5 came out, and it's more than enough to me. I sure enjoyed the 262k of Qwen, but let's face it, it's an illusion, just like 1M contexts.
The main agent can run sub agents for anything, which in turn saves context on the main agent.
Tashimm@reddit
That's a great point about the sub-agent architecture. Using a main agent to orchestrate specialized sub-agents is exactly how you bypass the 'lost in the middle' problem and keep the context window clean. It's much more efficient than just scaling the window to 1M and hoping for the best. The 'illusion' of massive context is real—if the reasoning doesn't hold up, the extra tokens are just noise. 200k is plenty if the delegation is handled well.
t4a8945@reddit (OP)
I don't know why, but I desperately need a brownie recipe right now.
Tashimm@reddit
Haha, I wish! 🍫 If only M2.7 had an iterative loop for brownie optimization like it does for its programming scaffold. For now, we're strictly stuck in the software engineering domain!
Toastti@reddit
If you were to simulate minimax 2.7 ability to provide a brownie recipe, what kind of example recipe would it provide?
Tashimm@reddit
That agentic approach is a solid way to handle it—using sub-agents to preserve the main agent's context is a great way to mitigate 'context rot.'
I'd argue, though, that the degradation isn't just about the latency/throughput (the 'slow tg' you mentioned). The real issue is the drop in actual reasoning quality and recall accuracy as the window fills up. Even if the model is still 'processing,' research shows that performance predictably degrades as more information is included, with the model becoming increasingly unreliable as it operates outside its training bounds. So, while 200k is a great practical limit, the 'illusion' of massive context windows is definitely a real technical problem.
matyhaty@reddit
It's more complex than that. You can run long experimental code overnight, etc. Try doing that on Claude subs - you burn money - and then it very much does become cost efficient.
ljubobratovicrelja@reddit
Thanks so much for the write-up. This really helps me - with one GX10, I'm currently investigating how to get off the hook as much as possible.
I've been writing agentic harnesses for general agentic tasks, which gave me insight into how powerful and honestly underrated they are in comparison to models. One thing I fear is that attaching a smaller model like Qwen3.5 122B-A10B to Claude Code or OpenCode isn't so trivial, as both of these harnesses are made for larger models. I haven't yet played with OpenHands, though aider really doesn't seem to cut it.
So I'm really curious if you could share anything regarding this customization you've made on top of OpenCode. Also, even though you finally settled on 2x sparks, which I believe I'll also do in due time, any advice on what to do in the meantime to maximize the potential of agentic coding with only one? Thanks!
t4a8945@reddit (OP)
I mean yeah with pleasure, the project is there: https://github.com/co-l/openfox
It's not a customization on top of OpenCode, it's a harness dedicated to local models.
I started working on it for 122B after noticing it was just always forgetting parts of the plan, so I made a workflow in OpenFox to have clear acceptance criteria written by the planner, and a validation of those criteria by a subagent to catch any discrepancy.
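Roughly, the loop looks like this (a simplified sketch, not the actual OpenFox code - model name, endpoint and task are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def ask(system: str, user: str) -> str:
    r = client.chat.completions.create(
        model="minimax-m2.7-awq",  # whatever name your server exposes
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}])
    return r.choices[0].message.content

task = "Add pagination to the /invoices endpoint"

# 1. Planner writes explicit, testable acceptance criteria up front.
plan = ask("You are a planner. Output one numbered, testable acceptance criterion per line.", task)
criteria = [c for c in plan.splitlines() if c.strip()]

# 2. The coding agent works against the task + criteria (edits applied by the harness, elided here).
ask("You are a coding agent. Implement the task so every criterion passes.",
    task + "\n" + "\n".join(criteria))

# 3. A fresh validator subagent checks each criterion, so nothing silently drops out of the plan.
for criterion in criteria:
    verdict = ask("You are a validator. Reply PASS or FAIL with one sentence of evidence.",
                  f"Criterion: {criterion}\nJudge it against the current diff and test output.")
    print(criterion, "->", verdict)
```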
I haven't polished the README file so it's a little blunt, but it launches a server which then gives you access either locally only or on your network (with a password) to the tool through the browser (you can "install" it as a PWA so it has its own wrapper).
Tell me if you end up using it, I'd appreciate your feedback.
ljubobratovicrelja@reddit
Oh this is amazing, thanks for open sourcing it! I most certainly will try it as soon as I can. I was reluctant to start something like this myself, but the more tests I did, the more obvious it was that it's necessary. I'm really happy to have bumped into you - I'd be really eager to use this and contribute to the project if you'd be interested. In any case, I'll get back to you as soon as I have a chance to test it out. Cheers!
Bootes-sphere@reddit
While cloud providers are convenient, the Asus Ascent GX10 + MiniMax M2.7 AWQ combo is a killer on-prem setup for running large language models. The performance and control you get is unbeatable. I'm running a similar rig with 4x A100s and it blows away any cloud instance I've tried, even the latest offerings. The only downside is the up-front investment, but long-term the TCO is lower. Curious to hear how your experience has been - any hiccups with setup or integration? I'd be happy to share some tips if you're still getting things dialed in.
t4a8945@reddit (OP)
Man you've got an awesome setup as well. That's much faster than mine.
Yes it wasn't straightforward, but with the help of the Spark community on the NVidia forum and AI it wasn't that difficult.
There are some quirks to be aware of, like some bugs in power delivery and shutdowns if thermals aren't watched. So now I downclock it automatically a little bit and run a benchmark before launching vLLM to make sure everything is running smoothly.
freehuntx@reddit
250gb/s lul
anzzax@reddit
273 GB/s :) But it has a 200GbE interface, so when connected in a cluster you run tensor parallel and throughput almost doubles. I have a single spark and now I want a second one. Even a single spark gives me more practical value than a PC with a 5090 and 96GB RAM.
SkyFeistyLlama8@reddit
You can put two Sparks or GB10s on a table without rewiring your home.
Glittering-Call8746@reddit
Even if it fits on one spark, will 2 sparks double it? Give me a real-life use case of this exact scenario.
Disastrous_Hope_9373@reddit
Dude you may have spent 5k in euros, but you never have to rely on cloud providers ever again, that's priceless :)
Ever since qwen3.5-122b-a10b came out, I basically leave 80% of all my LLM inference on my flow z13, and use the 20% for claude/gemini/grok. No more need to buy LLM subscriptions anymore, just run everything locally, and when I have a difficult problem, give it to claude, until the free usage runs out, then go to gemini, then grok.
But you have hardware powerful enough to do 100% LLM inference locally, which is fucking awesome.
MrHighVoltage@reddit
How are you guys paying for all of that? Two sparks set you back 7000€. I know that's not the point and I'm also into self-hosting models, but you can get so much Claude for that, and let's be honest, Opus 4.6 still wipes the floor with all the models we get locally. If it's not for privacy, and these Sparks aren't working 24/7, I really can't see the point of investing so much money.
t4a8945@reddit (OP)
Yeah it's not about cost savings at all.
It's about independence more than anything. Since I now rely on these tools to work, I dug myself out of the Claude dependency.
It was a business expense for me, so 7000€ for a consumer equates to 3500€ (I'm barely exaggerating) because it's pre-tax and without VAT.
MrHighVoltage@reddit
If this satisfies your business needs, then it is something completely different. All these companies can kick you off their services anytime they feel like it, so absolutely understandable. And I also guess that this is, in the end, actually a way to save some money if you use it for hours per day. I'm just never sure with some posts here when it's private and when it's business :D
FalconX88@reddit
Our two sparks will arrive in a few days, definitely gonna try this. Sadly no nvfp4 yet?
Secure_Archer_1529@reddit
NVFP4 does not work on the Spark. Community workarounds make it somewhat OK, but there are better quants than NVFP4 atm. Go and have a look at the Nvidia DGX Spark developer forum if you haven't already - plenty of great stuff to turbocharge some builds and hit the ground running.
florinandrei@reddit
What do you mean? The Spark has a Blackwell GPU, it should work very well with NVFP4.
t4a8945@reddit (OP)
Yeah I was under the illusion nvfp4 would be the quant I use for everything, sadly it's not it. Other quants are faster.
Aaronski1974@reddit
I’ve re-cast models to nvfp4 and got them working but yea, it’s not really worth it.
lemon07r@reddit
I've only gotten to talk to one or two of them, but I was trying to push them towards QAT so we can squeeze out even more quality at around q4/int4 sizes. Kimi K2.5 was really genius for that.
DOOMISHERE@reddit
I got my 2nd spark a few days ago, and today got the cable!
Wasted a few hours on vLLM just to find out I can't run GGUF versions of MiniMax (was hoping to run Q5), and llama.cpp can't work with clusters...
Might try that quant you posted...
t4a8945@reddit (OP)
Yeah, we'll see in a few months when the vLLM devs have stabilized GGUF support - maybe one day. But for the dual spark setup, vLLM has been the best inference engine I've tried. It sure limits you to Q4 or FP8, but that's OK for me.
entsnack@reddit
llama.cpp support for RDMA is in the works! You can run it in a cluster, but it'll use Ethernet right now.
_reverse@reddit
+1 to sparkrun, it’s great. It’ll use the 4-bit AWQ quant though. If you want to run a GGUF (for a 5-bit quant or whatever) in a distributed setup, you’ll be able to soon with the llama.cpp tensor parallelism functionality, but it’ll probably be a couple of weeks until it’s stable.
mrtime777@reddit
The Spark community has an excellent utility called `sparkrun` and an official recipe for this model... everything works out of the box
`$ sparkrun run u/official/minimax-m2.7-awq4-vllm`
DOOMISHERE@reddit
looks dope! ill test it for sure
DOOMISHERE@reddit
with support for multiple sparks ?
waiting_for_zban@reddit
How is pp with large context? If the spark had a PCIe x16 Gen5 slot it would have been a banger - slapping an RTX 6000 Pro on it would make it a perfect machine. Right now I still struggle to see the value added compared to the Strix Halo or the M5 chips. It only makes sense if you stitch 2 together, and even then pp might not be that convincing, unless it's a sparse MoE model.
t4a8945@reddit (OP)
For this particular model it starts at 3K tps, down to 1K at 100K context. So with a properly managed KV-cache it's just fast enough all the way.
Yes it's definitely a setup for MoE, not dense models.
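Part of what I mean by a properly managed KV-cache: keep the conversation prefix stable and append-only so vLLM's prefix caching can reuse the earlier blocks instead of re-prefilling everything each turn. A tiny sketch (assumes prefix caching is enabled on the server; model name and file are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
messages = [{"role": "system", "content": open("agent_context.md").read()}]  # stable prefix: system prompt + repo notes

def turn(user_text: str) -> str:
    # Append-only: earlier messages stay byte-identical, so their KV blocks are
    # reused by the server and only the new suffix pays the prefill cost.
    messages.append({"role": "user", "content": user_text})
    r = client.chat.completions.create(model="minimax-m2.7-awq", messages=messages)
    messages.append({"role": "assistant", "content": r.choices[0].message.content})
    return messages[-1]["content"]

print(turn("List the failing tests."))
print(turn("Propose a fix for test_parse_dates."))
```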
Aaronski1974@reddit
Are you running it in tensor parallel? I was under the impression the 200gb networking wasn’t fast enough for that.
t4a8945@reddit (OP)
Yes tp=2, it works well. I can't blame it for anything.
Endothermic_Nuke@reddit
OP or someone can you please explain why not an M4 Max 256?
t4a8945@reddit (OP)
I'm not in the Apple ecosystem, so it wasn't really on my radar. Also it's more expensive and has lower availability.
Aaronski1974@reddit
I’m having the same experience on one spark, using Unsloth's dynamic 2-bit quant. Did you try it on a single spark? I’m seriously tempted to buy a second one. This is the first model I’ve run locally that “gets it”. I have it working in a custom harness as well, with Claude Code keeping an eye on it - Claude seems to find its abilities impressive as well. I’m getting around 35 tps to start, dropping to about 20 at 40k tokens. I run OOM at around 60k tokens though. Do you find intelligence to be OK at higher token depths?
t4a8945@reddit (OP)
That's one of the things I noticed vs Qwen 3.5 397B: it stays smart even deep in context.
I don't regret buying a second one.
Also my experience with llama.cpp is worse than with vLLM. It's worth the extra work to get vLLM up and running, but yeah, that restricts you to Q4 or FP8 (I read they have experimental GGUF support, but that's too recent for me to want to try it when I have enough room for a Q4).
pirateadventurespice@reddit
Was literally setting up my second spark (also two ascents) today and wondering what model to try. Loading this one up now.
Speed absolutely does not matter for me. I’m an academic and I’d rather something run overnight and be correct than spit it out in real time; so, very excited to try this.
t4a8945@reddit (OP)
Nice! I found there is no "one model for all", especially at this size point. MiniMax M2.7 might be it for you, or not. For software engineering, it's a dream come true for me. For research and broader interpretation I'm not so sure - maybe Qwen 3.5 397B will be better.
Have fun! (but I know from experience the beginning of setting it up is NOT fun - go here if you need to debug anything precise forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10/dgx-spark-gb10/721 )
Ok-Measurement-1575@reddit
Is the PP still fairly strong clustered?
96GB here - MiniMax is my best model, too. I only trot it out for the trickier problems because of the ultra low PP.
t4a8945@reddit (OP)
I don't know exactly how clustered compares to non-clustered, but for this particular model it starts at 3K tps, down to 1K at 100K context. So with a properly managed KV-cache it's just fast enough all the way.
Blackdragon1400@reddit
I can’t speak for minimax but 397b gets over 1k t/s PP
Initial_Run3719@reddit
Nice setup you have there!
What about prompt processing times? I have a Strix Halo and my favourite LLM is Qwen 3.5 122B, but loading takes up to 8 minutes with full context (I set it to 120k). I know the sparks are much faster - does having two speed it up even more?
t4a8945@reddit (OP)
PP speed is great, it degrades linearly with the rest. The key is preserving KV-cache integrity and then it's just smooth.
Loading a 100K context from scratch takes about 60-80s, and after a reboot I prefer starting from a fresh context anyway - it's very good at getting up to speed on existing changes.
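That 60-80s lines up with the prefill speeds I mentioned earlier; quick back-of-envelope (assuming a roughly linear slowdown):

```python
start_tps, end_tps, tokens = 3000, 1000, 100_000   # PP ramps down roughly linearly with context
avg_tps = (start_tps + end_tps) / 2                # ~2K tps average over the ramp
print(f"~{tokens / avg_tps:.0f} s cold prefill")   # ~50 s, same ballpark as the measured 60-80 s
```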