Considering two Sparks for local coding
Posted by chikengunya@reddit | LocalLLaMA | 28 comments
I'm currently running a 4x RTX 3090 system (96 GB VRAM, DDR4-2133 RAM) and have tested opencode and pi.dev using Qwen3.5-122B-A10B (AWQ) up to 200k context for web app coding (HTML/JS/Python). I'm now seriously considering picking up two Sparks paired with MiniMax M2.7 for local inference.
Two units are needed to keep prompt processing at acceptable speeds. Output tokens/sec stays roughly the same either way (~15 tok/s at ~100k context, based on what I've seen here). The combined 2 x 128 GB = 256 GB of unified memory leaves headroom for future models (next MiniMax version, Qwen3.6-122B).
Idle power draw: ~50 W per Spark measured at the wall. My 4x 3090 rig idles at ~130 W (all cards power-limited to 275 W, 22 W idle per card in nvidia-smi; under full load with the 122B model it peaks at ~750 W).
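Back-of-the-envelope on those idle numbers (the electricity price is an assumption; plug in your own tariff):

```python
# Idle draw comparison using the wall measurements above.
# price_per_kwh is an assumed tariff, not a measured value.
price_per_kwh = 0.30  # USD/kWh, assumption

for name, watts in [("2x Spark", 2 * 50), ("4x 3090 rig", 130)]:
    kwh_per_year = watts * 24 * 365 / 1000
    print(f"{name}: {watts} W idle -> {kwh_per_year:.0f} kWh/yr "
          f"(~${kwh_per_year * price_per_kwh:.0f}/yr)")
```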
I need context up to ~120k tokens for coding sessions. Based on the numbers above, two Sparks with MiniMax M2.7 should deliver acceptable speeds in that range, which would be enough for me.
I can't properly benchmark MiniMax M2.7 on my current setup: 96 GB VRAM isn't enough to load it comfortably, and the slow DDR4-2133 RAM makes prompt processing a bottleneck anyway.
I'm curious what your experience is. How much better is MiniMax M2.7 than Qwen3.5-122B-A10B (AWQ) for real-world coding tasks (HTML/JS/Python)? Thanks in advance.
Invent80@reddit
I have a Spark and an RTX 6000 Pro. Get a second 6000 Pro; no-brainer. The Spark is slow. A single one is fine, but for larger models, unless you're OK with 10 t/s, I'd pass on it.
Ok-Measurement-1575@reddit
It is better but I'm not sure I'd drop 8 grand on two tortoises.
Just add one 6000 pro to the existing rig, perhaps.
unique-moi@reddit
The thing about the two tortoises is that once you've linked them you've got 256 GB of nearly unified memory, and a fairly good ecosystem with sparkrun & spark-arena.com
Karyo_Ten@reddit
A Pro 6000 will give a total of 192 GB of VRAM.
It has over 7x the memory bandwidth and 2.5x the compute of a Spark.
And the models OP wants to run would fit: 4x 3090 + 1x Pro 6000.
redpandafire@reddit
Noob here: does the 6000 do the heavy lifting of running the model's active layers while the 3090s do some math but are generally there to hold the rest of the model in VRAM? If so, I think that's a good plan too, rather than having one global speed limit on memory.
Karyo_Ten@reddit
I think the way to get max speed is to have the 4x RTX 3090 in a TP=4 block (tensor parallelism) and pipeline parallel the RTX Pro 6000.
I don't know whether the order (Pro 6000 -> 4x 3090) or (4x 3090 -> Pro 6000) is faster.
With TP=4, each 3090 processes tensors a quarter the size, which is a huge reduction in the memory bandwidth needed per card. Compute splits the same way: each matmul's weight matrix is sharded across the 4 GPUs, so each one does roughly 1/4 of the FLOPs (only one dimension of the multiplication shrinks, so the saving is 4x per GPU, not 4³).
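To make that concrete, here is the FLOP count for one column-parallel sharded matmul (a generic sketch; m, k, n stand for any layer's dimensions, not a specific model):

```latex
% One GEMM: an (m x k) activation block times a (k x n) weight matrix.
% Column-parallel TP=4 hands each GPU a (k x n/4) weight shard.
\text{FLOPs}_{\text{total}} = 2\,m\,k\,n
\qquad
\text{FLOPs}_{\text{per GPU}} = 2\,m\,k\,\tfrac{n}{4} = \tfrac{1}{4}\,\text{FLOPs}_{\text{total}}
```

Per-GPU weight storage and weight bandwidth drop by the same factor of 4.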
MelodicRecognition7@reddit
it is possible to divide the Pro 6000 into 4 virtual GPUs with 24 GB each (via MIG)
Karyo_Ten@reddit
For an inference server I don't see the use case.
g_rich@reddit
While true, you're overlooking the power requirements and other environmental factors such as noise and heat. Two DGX Sparks will give 256 GB of unified memory, take up very little space, use less than 500 watts of power, produce very little heat, and are nearly silent.
Inference speed will be slower than your proposed solution but will still be usable, and for some people the trade-offs are worth it.
chikengunya@reddit (OP)
If only it weren't that big and power-hungry :)
Sufficient_Prune3897@reddit
You can run it at 300 W, and it's more power-efficient due to just being faster. At least if you're actually using it for the time the PC is turned on.
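The efficiency point is really about energy per token (power draw divided by tok/s) rather than raw wattage. A quick sketch; only the 300 W cap comes from this thread, and both the Spark wattage and the throughputs are placeholder assumptions:

```python
# Energy per generated token = power draw / generation speed.
# The tok/s figures below are assumptions, not benchmarks.
setups = {
    "Pro 6000 capped at 300 W": {"watts": 300, "tok_s": 45},
    "2x Spark under load": {"watts": 200, "tok_s": 15},
}
for name, s in setups.items():
    print(f"{name}: {s['watts'] / s['tok_s']:.1f} J/token")
```

A faster card that finishes sooner can burn fewer joules per token despite the higher instantaneous draw, which is the point being made here.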
ego100trique@reddit
That's not how power consumption works mate
Such_Advantage_6949@reddit
You will regret getting a Spark. I own multiple 4090s, 3090s, and an RTX 6000 Pro. Agentic coding needs a lot of tokens at speed. You will feel the downgrade in speed, maybe half of what the 4x 3090s give.
ImportancePitiful795@reddit
If you can afford 2 Sparks and the ConnectX-7 cable needed to wire them, and you use vLLM, then you should be better off than with 4 RTX 3090s.
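For what it's worth, a two-Spark vLLM launch via the Python API might look roughly like this (a sketch only: the model ID is a placeholder, and it assumes vLLM's documented multi-node path, i.e. a Ray cluster already spanning both machines over the ConnectX-7 link):

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; assumes `ray start` has already joined the
# two Sparks into one cluster over the ConnectX-7 link.
llm = LLM(
    model="MiniMaxAI/MiniMax-M2.7",  # hypothetical HF repo name
    tensor_parallel_size=2,           # shard weights across the two Sparks
    max_model_len=131072,             # ~128k context, per OP's target
)
outs = llm.generate(
    ["Refactor this Flask route to use async handlers: ..."],
    SamplingParams(max_tokens=512),
)
print(outs[0].outputs[0].text)
```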
Just make sure you get those phase cooling platforms for laptops, to cool the devices. :P
There are a gazillion discussions on here about this.
Here is a good one by u/eugr
Real-world DGX Spark experiences after 1-2 months? Fine-tuning, stability, hidden pitfalls? : r/LocalLLaMA
g_rich@reddit
Two Sparks will use less power, produce less noise and heat, and take up a lot less space than a rig with 4x RTX 3090s.
Sometimes it’s not only about the raw power or tokens per second.
ImportancePitiful795@reddit
Yep. Completely agree :)
Also, on dual DGX you can run 200-300B models. :)
Also vLLM works amazingly well with DGX Spark, with exceptionally good concurrency speeds when using agents to talk to the "loaded" LLM.
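If you want to poke at that concurrency yourself: vLLM's server speaks the OpenAI API, so a sketch like this fires parallel requests (the endpoint URL and registered model name are placeholders):

```python
import asyncio

from openai import AsyncOpenAI  # pip install openai

# Placeholder endpoint and model name; vLLM's server is OpenAI-compatible.
client = AsyncOpenAI(base_url="http://spark:8000/v1", api_key="unused")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="minimax-m2.7",  # whatever name the server was launched with
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Eight concurrent "agents" hitting the same loaded model at once.
    tasks = [ask(f"Task {i}: summarize one Python gotcha.") for i in range(8)]
    print(len(await asyncio.gather(*tasks)), "responses")

asyncio.run(main())
```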
marscarsrars@reddit
Let me know if you are interested in checking out how two DGX Sparks work together.
We can help.
guai888@reddit
You can check out the leaderboard at https://spark-arena.com/. It should give you some idea of the performance of different Spark setups. The rule of thumb is that more parameters perform better, but you need to test your own use case. I personally like Qwen3.5-122B-A10B. You can get 50 tok/s with albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4: Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%)
unique-moi@reddit
Do you find the qwen3.5-122b to be better than the qwen3.6 27b & 35b models?
t4a8945@reddit
Not the person you're responding to, but I have two Sparks and tried the FP8 versions of 3.6 27B and 3.5 122B-A10B side by side; the 3.6 27B was smarter.
Not by a huge margin though.
rpkarma@reddit
Yep my evals have shown the same.
GCoderDCoder@reddit
I think Qwen 3.6 27B beats Qwen 3.5 122B at any given quant, and while I technically have MiniMax M2.7 Q6_K_XL running in my lab at 40 t/s, I find myself leaning on Qwen 3.6 27B Q8 more. I imagine 4x 3090s can push dense Qwen models better than something like the Spark.
MiniMax M2.7 is an agent for doing stuff, but it seems designed to keep token counts low, so it's not my favorite to talk to. It does make iterative improvements, which is great, but it's less likely to one-shot things that aren't procedural code.
MiniMax M2.7 is the model I give instructions to and set loose in the background. Qwen 3.6 27B is more well-rounded. Gemma 4 31B is my favorite coder, in a near tie with Qwen 3.6 27B. I use Qwen 3.6 27B and Gemma 4 31B to review each other's output (sketched below), and I've typically liked that more than MiniMax M2.7.
2x 3090s should fit each of the dense models at Q8 with a decent Q8 cache. If I only got to choose one, I'd choose Qwen 3.6 27B, which doesn't like low-bandwidth devices, but maybe tensor-parallel Sparks might be OK...?
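A minimal sketch of that cross-review loop, assuming each model sits behind its own OpenAI-compatible server (the ports and model names here are placeholders):

```python
from openai import OpenAI  # pip install openai

# Hypothetical local endpoints, one per model (e.g. llama.cpp or vLLM servers).
coder = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")
reviewer = OpenAI(base_url="http://localhost:8002/v1", api_key="unused")

def chat(client: OpenAI, model: str, prompt: str) -> str:
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

draft = chat(coder, "qwen3.6-27b",
             "Write a Python function that dedupes a list, keeping order.")
review = chat(reviewer, "gemma4-31b",
              f"Review this code for bugs and style:\n\n{draft}")
print(review)
```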
ThrowWeirdQuestion@reddit
I saved some money by buying two ASUS GX10s instead. They come with only a 1 TB SSD but run cooler, and at least here in Japan they were significantly cheaper.
So far the smaller SSD works for me and I really like the setup. I keep extra models on a NAS, and for the ones I'm actively experimenting with, 1 TB is plenty.
thesuperbob@reddit
I'm running MiniMax-M2.7-GGUF/UD-IQ4_XS on 4x MI50; it's... not fast, but usable if you just leave it alone and let it code. For example, giving it a project concatenated into a markdown file (~80 kB) and asking a question just now, processing that first request took about 3 min, and now I'm getting 12 t/s generation.
But yeah, it's surprisingly correct and consistent. I've mostly used it for Java so far; it can break down (reasonable) tasks, write tests, and correct based on test output... Not Claude-level stuff obviously, but compared to other things I've run locally, this is the first model that actually looks capable of fire-and-forget operation in the background. I'm using Kilo Code. I haven't really tried tuning it yet either, just giving llama.cpp the model file and going with the defaults in the GGUF, so there's some performance left on the table.
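For reference, thesuperbob is driving llama.cpp directly; the equivalent "defaults from the GGUF" setup through the llama-cpp-python binding would look roughly like this (the model path is a placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path; mirrors "give llama.cpp the file, keep the GGUF defaults".
llm = Llama(
    model_path="MiniMax-M2.7-UD-IQ4_XS.gguf",
    n_gpu_layers=-1,  # offload every layer the GPUs can hold
    n_ctx=0,          # 0 = use the context length stored in the GGUF
)
with open("project.md") as f:
    prompt = "Answer a question about this project:\n" + f.read()
out = llm(prompt, max_tokens=512)
print(out["choices"][0]["text"])
```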
braydon125@reddit
It's about memory bandwidth dude
guai888@reddit
What do you think about Opencode and Pi.dev? I ran a few tests and Pi.dev seems to do better than Opencode.
chikengunya@reddit (OP)
The context window is shown precisely in opencode, but in pi.dev it seems buggy. Both are fine actually; I like the pi.dev CLI more (on WSL).
jacek2023@reddit
I think Sparks are slower than 3090s. But then the RTX 6000 Pro is more expensive.