Dual DGX Spark (Asus GX10) MiniMax M2.7 results
Posted by koibKop4@reddit | LocalLLaMA | 20 comments
Hi all,
I have dual 3090s and 8x MI50 32GB, and I was tired of the heat and noise of these machines. So, inspired by this post and others on the NVIDIA forum, I purchased dual Asus GX10s (DGX Spark) and I'm so happy.
Each GX10 consumes about 100W during inference.
Time to first token is quite high, but for me it's still a win.
Without any hassle I can run https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit/
I've used OpenCode and the Hermes agent: no errors, it just keeps going. I love it!
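If you want to eyeball time to first token yourself, a quick streaming check against an OpenAI-compatible endpoint looks roughly like this (the host, port, and serving stack behind it are placeholders, not my exact setup):

```python
# Quick check of time-to-first-token and streaming speed over an OpenAI-compatible API.
# base_url and model name are placeholders; point them at whatever serves the model.
import time
from openai import OpenAI

client = OpenAI(base_url="http://spark:8000/v1", api_key="none")

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="cyankiwi/MiniMax-M2.7-AWQ-4bit",
    messages=[{"role": "user", "content": "Explain AWQ quantization in two sentences."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter() - start   # time to first streamed token
        chunks += 1
print(f"TTFT: {first:.2f}s, ~{chunks / (time.perf_counter() - start):.1f} chunks/s")
```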
Here are my results:
| test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|----------------:|----------------:|-------------:|------------------:|------------------:|------------------:|
| pp2048 | 3452.05 ± 73.32 | | 626.82 ± 19.83 | 511.74 ± 19.83 | 626.84 ± 19.83 |
| tg32 | 38.84 ± 0.01 | 40.09 ± 0.01 | | | |
| pp2048 @ d4096 | 2848.85 ± 35.82 | | 2022.61 ± 28.98 | 1907.54 ± 28.98 | 2022.65 ± 28.98 |
| tg32 @ d4096 | 37.37 ± 0.23 | 38.57 ± 0.24 | | | |
| pp2048 @ d8192 | 2579.85 ± 18.26 | | 3523.69 ± 61.33 | 3408.62 ± 61.33 | 3523.73 ± 61.33 |
| tg32 @ d8192 | 36.27 ± 0.14 | 37.44 ± 0.15 | | | |
| pp2048 @ d16384 | 2411.34 ± 7.68 | | 6791.62 ± 57.14 | 6676.55 ± 57.14 | 6791.66 ± 57.14 |
| tg32 @ d16384 | 34.12 ± 0.11 | 35.23 ± 0.12 | | | |
| pp2048 @ d32768 | 1988.05 ± 12.95 | | 15512.61 ± 147.98 | 15397.54 ± 147.98 | 15512.65 ± 147.98 |
| tg32 @ d32768 | 30.72 ± 0.08 | 31.00 ± 0.00 | | | |
I'm starting to consider selling my MI50s ;)
ifheartsweregold@reddit
I have dual Sparks too, but I still can't find anything that comes close to Qwen 3.5 397B for speed and quality. MiniMax is just too slow in my opinion.
koibKop4@reddit (OP)
Would you share your results? Which quant exactly? What context can you fit?
ifheartsweregold@reddit
Same results as this recipe and benchmark: https://spark-arena.com/benchmark/70f41a43-9660-45f5-936d-f4426e7d2f18
Dzekoh@reddit
Hi u/ifheartsweregold, I’m currently setting up a dual Spark cluster (connected via DAC) specifically for an agentic coding workflow, and I'd love to test your setup with the Qwen 3.5 397B.
Unfortunately, the spark-arena link you provided seems to be dead (it just shows an 'Unknown Model / Failed to load' page).
Would you mind sharing the raw `launch-cluster.sh` command or the recipe YAML you used? I'm particularly interested in your vLLM flags for tensor parallelism and how you managed to squeeze a 256K context window into the remaining VRAM. Thanks in advance!
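For reference, the single-node starting point I'm iterating on looks roughly like the sketch below; the parallelism size, context length, and memory fraction are my own guesses for a two-Spark setup, not your recipe:

```python
# Rough sketch of the vLLM launch I'm iterating on.
# Every value below is an assumption for a 2x GX10 cluster, not a known-good recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/MiniMax-M2.7-AWQ-4bit",
    quantization="awq",              # 4-bit AWQ weights
    tensor_parallel_size=2,          # split across the two Sparks (needs a Ray cluster running)
    max_model_len=262144,            # the 256K context I'm hoping fits
    gpu_memory_utilization=0.90,     # leave a little headroom for the KV cache
)

out = llm.generate(["Write a haiku about DAC cables."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```

In practice I'd expose it with `vllm serve` and the equivalent flags so the coding agent can talk to it, but the knobs are the same.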
ifheartsweregold@reddit
256K context.
audioen@reddit
The only bad thing is that you are stuck with 4-bit inference. Isn't something like 6-bit just within reach if you split well? I would suggest using llama.cpp as much as possible, with higher-quality quants than are available on the AWQ side. This model is known to be severely degraded at 4-bit, at least in the GGUF world, and I suspect AWQ 4-bit is no better. Even if prompt processing took a severe hit, I would still look into running it with llama.cpp as a 6-bit, 2-way cluster, because I believe you won't get the full model quality without this.
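Back-of-the-envelope on whether 6-bit fits across two 128 GB Sparks; the parameter count below is only a placeholder, plug in the real one for this model:

```python
# Rough weight-memory estimate for splitting a higher-bit quant across two nodes.
# total_params is a placeholder assumption, not the actual MiniMax M2.7 size.
total_params = 230e9          # placeholder parameter count
bits_per_weight = 6           # e.g. a Q6-class GGUF quant
node_memory_gb = 128          # unified memory per GX10 / Spark
nodes = 2

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights vs {nodes * node_memory_gb} GB total")
# Whatever is left over has to hold the KV cache, so usable context shrinks accordingly.
```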
fallingdowndizzyvr@reddit
Why do you say that? It seems to hold up just fine at 4-bit.
spvn@reddit
What did t/s look like in actual use? For agentic coding in opencode for example?
t4a8945@reddit
It looks like roughly 38-20 t/s for token generation over 0-128K context, and incremental context is handled beautifully (as you can see from the inflated "PP t/s" numbers, which aren't adjusted to take the KV cache into account).
(Ignore the 12.9 t/s at the end; that's an outlier from when I had two sessions running.)
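Rough illustration of what I mean by "inflated" (all numbers here are made up for the example):

```python
# Why reported PP t/s looks inflated when most of the prompt is already in the KV cache.
# The numbers are illustrative assumptions, not measurements.
prompt_tokens = 100_000        # full conversation history sent with the request
cached_tokens = 98_000         # already sitting in the KV cache from previous turns
prefill_time_s = 2.0           # time actually spent prefilling the new tokens

new_tokens = prompt_tokens - cached_tokens
reported_pp_tps = prompt_tokens / prefill_time_s   # what a dashboard shows: 50,000 t/s
actual_pp_tps = new_tokens / prefill_time_s        # what the GPU really processed: 1,000 t/s
print(reported_pp_tps, actual_pp_tps)
```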
gxcreator@reddit
What is the name of that lovely stats UI?
t4a8945@reddit
It's my own harness https://github.com/co-l/openfox
I'm using it and improving it every day; it's quite solid now and I'd be happy to get feedback on it.
koibKop4@reddit (OP)
Results come from llama benchy; they represent real values, not a theoretical maximum. It's not llama-bench.
t4a8945@reddit
Hey! I'm the OP of the post you're referencing. Happy to see you happy!
I'm still loving it every day; I've been working only with it and it performs very well.
Most sessions are perfect, with an acceptable amount of back and forth to finalize things to my liking; when a session goes south for some reason (bad prompt, bad investigation, it happens), I'm quick to start a fresh one with the acquired knowledge and approach it from a different angle.
Enjoy OP!
unjustifiably_angry@reddit
I'm planning to get this running on my Sparks as well, hoping to use it as the "expert" to call in when my smaller, dumber, faster model can't figure something out.
havenoammo@reddit
What did t/s look like with the MI50s?
Ok-Measurement-1575@reddit
Awesome, thanks.
Can you do a 100k run, too?
koibKop4@reddit (OP)
Just added 100k depth to the table too.
xXprayerwarrior69Xx@reddit
Seconding that, please try high context.
madsheepPL@reddit
More benches straight from the trenches: https://spark-arena.com/leaderboard (you can filter for MiniMax results).
anzzax@reddit
Could you do a batched (n = 4 and 8) inference bench? I found AWQ on GB10 scales very well. I have a single GX10 and want a second one, but the price jump hurts.
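This is roughly how I test batch scaling on my single unit, assuming the model sits behind an OpenAI-compatible endpoint (the URL and model name are placeholders for whatever the server exposes):

```python
# Fire n concurrent requests at an OpenAI-compatible endpoint and report aggregate t/s.
# base_url and model name are placeholders; point them at your own server.
import asyncio, time
from openai import AsyncOpenAI

async def one_request(client):
    resp = await client.chat.completions.create(
        model="cyankiwi/MiniMax-M2.7-AWQ-4bit",
        messages=[{"role": "user", "content": "Summarize the DGX Spark in 200 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main():
    client = AsyncOpenAI(base_url="http://spark:8000/v1", api_key="none")
    for n in (1, 4, 8):
        start = time.perf_counter()
        tokens = await asyncio.gather(*[one_request(client) for _ in range(n)])
        elapsed = time.perf_counter() - start
        print(f"n={n}: {sum(tokens) / elapsed:.1f} aggregate completion t/s")

asyncio.run(main())
```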