Dual dgx spark (Asus GX10) MiniMax M2.7 results

Posted by koibKop4@reddit | LocalLLaMA | View on Reddit | 20 comments

hi all
I have dual 3090 and 8 x mi50 32gb and I was tired of heat and loudness of these machines. So inspired by this post and others on nvidia forum I've purchased dual Asus GX10 (dgx spark) and I'm so happy.
Each GX10 consumes about 100W during inference.

Time to first token is quite high but for me it's a win
Without a hassle I can run https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit/
I've used open code and hermes agent, no errors, just going - I love it!

Here are my results:

|            test |             t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|----------------:|----------------:|-------------:|------------------:|------------------:|------------------:|
|          pp2048 | 3452.05 ± 73.32 |              |    626.82 ± 19.83 |    511.74 ± 19.83 |    626.84 ± 19.83 |
|            tg32 |    38.84 ± 0.01 | 40.09 ± 0.01 |                   |                   |                   |
|  pp2048 @ d4096 | 2848.85 ± 35.82 |              |   2022.61 ± 28.98 |   1907.54 ± 28.98 |   2022.65 ± 28.98 |
|    tg32 @ d4096 |    37.37 ± 0.23 | 38.57 ± 0.24 |                   |                   |                   |
|  pp2048 @ d8192 | 2579.85 ± 18.26 |              |   3523.69 ± 61.33 |   3408.62 ± 61.33 |   3523.73 ± 61.33 |
|    tg32 @ d8192 |    36.27 ± 0.14 | 37.44 ± 0.15 |                   |                   |                   |
| pp2048 @ d16384 |  2411.34 ± 7.68 |              |   6791.62 ± 57.14 |   6676.55 ± 57.14 |   6791.66 ± 57.14 |
|   tg32 @ d16384 |    34.12 ± 0.11 | 35.23 ± 0.12 |                   |                   |                   |
| pp2048 @ d32768 | 1988.05 ± 12.95 |              | 15512.61 ± 147.98 | 15397.54 ± 147.98 | 15512.65 ± 147.98 |
|   tg32 @ d32768 |    30.72 ± 0.08 | 31.00 ± 0.00 |                   |                   |                   |

I start to consider selling my mi50 ;)

[-]

koibKop4@reddit (OP)

would you share your results? which quant exactly? what context can you fit?

[-]

ifheartsweregold@reddit

Same results as this recipe and benchmark: https://spark-arena.com/benchmark/70f41a43-9660-45f5-936d-f4426e7d2f18

[-]

Dzekoh@reddit

Hi u/ifheartsweregold, I’m currently setting up a dual Spark cluster (connected via DAC) specifically for an agentic coding workflow, and I'd love to test your setup with the Qwen 3.5 397B.

Unfortunately, the spark-arena link you provided seems to be dead (it just shows an 'Unknown Model / Failed to load' page).

Would you mind sharing the raw launch-cluster.sh command or the recipe YAML you used? I'm particularly interested in your vLLM flags for tensor parallelism and how you managed to squeeze a 256K context window into the remaining VRAM.

Thanks in advance

[-]

ifheartsweregold@reddit

256K context.

[-]

koibKop4@reddit (OP)

Results come form llama benchy, they represent real value, not theoretical maximum. It's not llama-bench.

[-]

koibKop4@reddit (OP)

just added 100k depth to table too

I have dual sparks too but still can’t find anything that comes close to Qwen 3.5 397B for speed and quality. Minimax is just too slow in my opinion.

audioen@reddit

The only thing bad is that you are stuck with 4-bit inference. Isn't something like 6 bit just within reach, if you split well? I would suggest using llama.cpp as much as possible and the higher quality model versions than are available in AWQ side. This model is known to be severely degraded at 4-bit, in at least GGUF world, and I suspect AWQ 4-bit is no better. Even if prompt processing took a severe hit, I would still look into running it with llama.cpp as 6-bit 2-way cluster, because I believe you will not be getting the full model quality without this.

fallingdowndizzyvr@reddit

This model is known to be severely degraded at 4-bit (at least in GGUF world)

Why do you say that? It seems to hold up at 4 bit right fine.

spvn@reddit

What did t/s look like in actual use? For agentic coding in opencode for example?

t4a8945@reddit

It looks like that 38-20 tps in token generation for 0-128K context, and incremental context is handled beautifully (as you can see by the inflated "PP T/S" numbers, which aren't adjusted for taking the KV-cache into account).

(ignore the 12.9t/s at the end, that's an outlier while I had two sessions running)

gxcreator@reddit

What is a name of that lovely stats UI?

It's my own harness https://github.com/co-l/openfox

I'm using it and improving it everyday ; it's quite solid now and I'd be happy to take feedback on it.

Hey! I'm the OP of the post you're referencing. Happy to see you be happy!

I'm still loving it every day, been working only with it and it performs very good.

Most session are perfect with acceptable back and forth to finalize to my liking ; when the session goes south for some reason (bad prompt, bad investigation, that happens), I'm quick to start a fresh one with the acquired knowledge and start from a different angle.

Enjoy OP!