2 x DGX Spark! Give me your non-inference workloads
Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 130 comments
2 x DGX Spark with a 200Gbps interconnect.
I posted here when my first Spark came in and everyone responded with inference workloads. I still tested them, but inference monkeys please BTFO this time.
Give me your big-model non-inference workloads to test, something to push the 256GB unified memory. I have a few LoRA training ones from the last post to try. I already have nanochat pretraining running. GRPO without PEFT is planned.
FloofBoyTellEm@reddit
It's actually a 98 Gbps link. PCIE limitation.
entsnack@reddit (OP)
Not if you use GPU direct, which is what you should be doing.
FloofBoyTellEm@reddit
No. The NIC is connected to the board (including the GPU) by PCIe 5.0 x 4 (4 lanes... no 16 lanes or even 8 lanes). You will never get 200 GbE from a single port over anything on this system. This is a hardware limitation. You might be able to get 120 GbE if you could avoid 100% of the overhead, which GPU Direct can help with.
This is a pitfall of just believing the NVIDIA marketing vs. real world testing.
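As a sanity check on that ceiling claim, the theoretical PCIe 5.0 x4 limit works out to roughly 126 Gbit/s; a quick back-of-the-envelope sketch (protocol overhead beyond line coding is ignored, so real-world throughput lands lower):

```python
# Rough theoretical ceiling for a PCIe 5.0 x4 link.
# PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding.
GT_PER_S = 32e9          # transfers per second per lane
ENCODING = 128 / 130     # 128b/130b line-code efficiency
LANES = 4

bits_per_s = GT_PER_S * ENCODING * LANES
print(f"{bits_per_s / 1e9:.1f} Gbit/s")    # 126.0 Gbit/s
print(f"{bits_per_s / 8 / 1e9:.2f} GB/s")  # 15.75 GB/s
```

That ~126 Gbit/s ceiling lines up with the "maybe 120 GbE" figure above and the 127 Gbit/s bonded result reported later in the thread.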
entsnack@reddit (OP)
I'll take 180 GbE instead of 200GbE. How is being smart working out for you?
FloofBoyTellEm@reddit
Is this using 2 DACs vs 1? Confused why your rate is almost exactly double mine.
Since you're smarter than me, and you claim "not if you're using GPU Direct", but Nvidia stated GPU Direct doesn't work on DGX Sparks, how did you get it to work?
https://forums.developer.nvidia.com/t/dgx-spark-gb10-faq/347344#p-1694056-q-is-gpudirect-rdma-supported-on-dgx-spark-13
entsnack@reddit (OP)
I didn't do anything special, I just followed this tutorial: https://build.nvidia.com/spark/nccl/stacked-sparks. Make sure you don't connect both ports, that halves the bandwidth. I'm also using a QSFP56 cable that's not officially supported (they announced the supported cables well after I bought mine).
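For anyone reproducing this, the tutorial's bandwidth check boils down to running nccl-tests across the pair; a sketch, assuming nccl-tests was built with MPI support and passwordless SSH works between the nodes (the hostnames and interface name are placeholders):

```shell
# Sketch: run the all-gather bandwidth test across both Sparks.
# Assumes nccl-tests built with MPI=1; spark-0/spark-1 and the
# interface name below are placeholders for your own setup.
export NCCL_SOCKET_IFNAME=enp1s0f0np0   # pin NCCL to the QSFP56 port
mpirun -np 2 -H spark-0,spark-1 -x NCCL_SOCKET_IFNAME \
    ./build/all_gather_perf -b 8 -e 4G -f 2 -g 1
```

The `-b`/`-e`/`-f` flags sweep message sizes from 8 bytes to 4GB, doubling each step; `-g 1` uses one GPU per rank.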
FloofBoyTellEm@reddit
I know that's how I originally got the sparks going over nccl when I first got them, but will try that again.
Currently my test results mirror ServeTheHome's iperf results, and I get similar numbers (half of what you get) on the all_gather_perf test. I may just have lost an environment variable somewhere along the way that's breaking things.
I know it says to ignore the enP2 interfaces in that playbook, but I just did a lag (802.3ad) between [enp1s0f0np0, enP2p1s0f0np0] and went from 96 Gbit/s to 127 Gbit/s on my ZFS pool. So, at least when not using some form of nccl, that was necessary, and it looks like the ServeTheHome test was probably stuck to half the lanes. So, now I really am hitting the PCIe limitation where the data must go through the CPU.
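For reference, a netplan sketch of that 802.3ad bond (interface names as above; both Sparks need the mirror config, and the address is a placeholder):

```yaml
# Sketch of an 802.3ad (LACP) bond over both ConnectX-7 ports.
# Interface names copied from the comment above; verify yours first.
network:
  version: 2
  ethernets:
    enp1s0f0np0: {}
    enP2p1s0f0np0: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0np0, enP2p1s0f0np0]
      parameters:
        mode: 802.3ad
        lacp-rate: fast
        transmit-hash-policy: layer3+4
      addresses: [10.0.0.1/24]   # placeholder; peer gets 10.0.0.2/24
```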
Going to try again for the all_gather_perf using your link.
I have the official mellanox DAC. But ordered some shorter generics from FS.com today to see if that's part of the issue.
Can you do me a huge favor and share your results for these two commands? I'm wondering if I have a real issue or if this is a red herring.
sudo lspci -vv -s 0000:01:00.0 | egrep -i 'LnkCap|LnkSta'
sudo dmesg | grep mlx5_pcie_event
[ 2.787624] mlx5_core 0000:01:00.0: mlx5_pcie_event:296:(pid 162): Detected insufficient power on the PCIe slot (27W).
[ 3.374174] mlx5_core 0000:01:00.1: mlx5_pcie_event:296:(pid 162): Detected insufficient power on the PCIe slot (27W).
[ 3.960778] mlx5_core 0002:01:00.0: mlx5_pcie_event:296:(pid 408): Detected insufficient power on the PCIe slot (27W).
[ 4.571786] mlx5_core 0002:01:00.1: mlx5_pcie_event:296:(pid 380): Detected insufficient power on the PCIe slot (27W).
FloofBoyTellEm@reddit
You seem to be the anomaly.
https://forums.developer.nvidia.com/t/connectx-7-nic-in-dgx-spark/350417/42
nobody else can hit those numbers...
entsnack@reddit (OP)
There is someone with 22.9 GBps avg bus bw in that thread, higher than mine.
FloofBoyTellEm@reddit
They are using 2 DACs, are you?
Note the 4 interface bond.
entsnack@reddit (OP)
I see 4 messages about detecting insufficient power on the PCIe slot.
FloofBoyTellEm@reddit
Trying to LAG my interfaces now as a test; thinking that may be the real issue here. Really hoping to hit your bandwidth rate. Built an NVMe-oF TrueNAS Scale server this week, and I think this has (hopefully) been a misunderstanding on my part of how QSFP56 links actually work. It's clicking now.
Embarrassed_Win7667@reddit
Can anyone post an Unsloth example of how to use two over the connection? I just got one and don't know how to invoke both in one training run. The Nvidia guide just shows you how to test NCCL.
Analytics-Maken@reddit
The context window issue is real. I'm consolidating the data I need into a data warehouse using Windsor.ai, an ETL tool, to reduce token usage on MCP calls. So far it has been a better approach, but it requires upfront work making joins and calculations.
Eugr@reddit
Actually, inference benchmarks would be very interesting: both comparing the same model against single-node inference, and running something like Qwen3-235B in AWQ 4-bit.
Lot of people posted benchmarks for a single spark (including myself), but I haven't seen anything substantial for dual spark inference.
entsnack@reddit (OP)
sigh OK will do. Now you got me curious too but my expectations are low.
Eugr@reddit
Any luck with this?
Eugr@reddit
In theory, you should get some speedup with data-parallel or tensor-parallel on smaller models. Qwen3-235 should be able to run in 4-bit quant, but won't fit in FP8.
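A hedged sketch of what a dual-node tensor-parallel run could look like with vLLM (the model repo, the exact flag set, and whether a 4-bit checkpoint fits next to the KV cache are all assumptions to verify against the current vLLM release; vLLM's multi-node path goes through Ray):

```shell
# On spark-0 (head):   ray start --head --port=6379
# On spark-1 (worker): ray start --address=spark-0:6379
# Then launch from the head node, splitting the model across both GPUs.
# The model name is a placeholder for whichever 4-bit quant you use.
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
```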
SpecialistNumerous17@reddit
Yes please! It would be awesome to see benchmarks comparing performance of 1 vs 2 nodes inferencing the same models.
xxPoLyGLoTxx@reddit
Hey he said BTFO!
Wisepunter@reddit
Whats your experience so far training models you have tried. Is it decent performance? How does it compare to multiple consumer GPU etc?
entsnack@reddit (OP)
My use case is a bit niche: I need the Grace ARM CPU and the CX7 interconnect to test CUDA kernels for a GB200 that I rent time on. The Spark is a good machine to both learn and prototype on.
For pretraining nanochat, I can compare it to my H100 and 4090:
Eugr@reddit
The Spark doesn't have a downscaled Grace ARM CPU though; it uses a MediaTek one with a different architecture. Still ARM though.
rawdmon@reddit
Not sure where you've heard this, but GB10 literally stands for Grace Blackwell 10.
Eugr@reddit
I know what it stands for. But the architecture is still different, see below:
https://www.mediatek.com/press-room/newly-launched-nvidia-dgx-spark-features-gb10-superchip-co-designed-by-mediatek
https://www.servethehome.com/nvidia-dgx-spark-review-the-gb10-machine-is-so-freaking-cool/2/
entsnack@reddit (OP)
Yeah ARM is good enough + CX7 with GPU direct.
auradragon1@reddit
Um, isn't this the exact reason Nvidia released the Spark? It's a local machine for CUDA devs that need to deploy changes to enterprise Nvidia GPUs.
entsnack@reddit (OP)
It is, but I need to explain that on this sub because it's mostly inference monkeys who think this is a Mac Mini replacement.
SkyFeistyLlama8@reddit
Being said inference monkey who still wants a Spark on my desk... I salute you.
Wisepunter@reddit
I don't know a lot about it, but that's a nice uplift from a beefy 4090. I know RAM speed is a big issue with inference; what's the bottleneck with training that makes it so much better than a 4090?
entsnack@reddit (OP)
The 4090's low training performance is indeed strange; I still need to debug it. I relegated my 4090 to gaming a year ago though. 24GB of VRAM was enough in the BERT days but not anymore.
ProcessComplete5797@reddit
I want 2 so bad
braindeadtheory@reddit
Large scale aerial reconstruction using COLMAP x fVDB for GSPlat and TSDF mesh or metas new sparse / dense reconstruction transformer.
redblood252@reddit
Ffmpeg av1 transcoding?
Budget-Juggernaut-68@reddit
what can you finetune with that? 70b models?
txgsync@reddit
Read up on SeedLM on Arxiv. Try compressing a model using PRNG FP16 substitution. “Seed search” is a killer time sink across 16-bit space. I kept tripping over lack of comprehensive support for tensors on Mac. Can the DGX spark improve on it? Post some benchmarks.
entsnack@reddit (OP)
niiiice never heard of this and very interested to test, will post back
txgsync@reddit
Yeah, SeedLM is a very Apple way of approaching things. It’s been impractical on non-Apple platforms: the PCIe transit cost was too high between system RAM and GPU VRAM.
But now that both AMD and nvidia have gotten into unified memory, it seems like using CPU for PRNG matrix weights and GPU for tensors might be practical outside the Apple sandbox.
I will be noodling too. Let me know if you get stuck. I have not committed my code to GitHub for SeedLM yet; it’s very MLX-specific right now.
Wrong-Historian@reddit
Crysis
txgsync@reddit
Fuck, you are old.
somealusta@reddit
Tetris
CV514@reddit
Pong
txgsync@reddit
Space War
iamlazyboy@reddit
Have you ever thought that maybe we're not that old but you're young? The game isn't even 20 years old.
dogfighter75@reddit
Ancient millennials using their GPUs for graphics processing in 2025..
Advanced-Virus-2303@reddit
Don't speak the old magic to me, witch.
eloquentemu@reddit
People may age but a good meme never dies
Regular-Forever5876@reddit
Someone actually tried that https://youtu.be/6iVftb0cbnc?si=r204SUgxoQfFrkWK
entsnack@reddit (OP)
😂 every time, but still funny
nickpsecurity@reddit
Try pretraining with these for a real test. They're designed for single- or low-GPU setups. Use PG-19 dataset (or more of Gutenberg) instead of theirs so whatever you produce has no copyright issues. There's also no question of benchmark training or parroting modern stuff if the dataset considers the year 1919 "modern." ;)
Cramming Language Model
GPT2 From Scratch
PG-19 Benchmark
entsnack@reddit (OP)
nanochat pretraining benchmark compendium: https://www.reddit.com/r/LocalLLaMA/s/NLmbm2NelU
HumanDrone8721@reddit
Wow, congrats, beat me to the punch :), we have the same setup preparing to arrive, this time with Gigabyte ATOMs that are floating somewhere on the road :(.
I think there are many interesting suggestions posted here (along with Sturgeon's ratio of 90% garbage), but I have another suggestion:
HARDWARE RELIABILITY TESTING UNDER LOAD PLIZZ !!!
A few days ago there was an INTENSE astroturfing campaign: "The man, the legend, the idol programmer tested one and came to say that it only consumes 100W at max load and it crashes and reboots soon..." followed by more me-toos... followed by articles that were citing articles that were citing a Twitter post that was posting a screenshot of a "community post"... followed by smirks saying "you should have got a Strix, it can play vidya gamez as well..."
Anyway, please keep this post as a repository of knowledge about the mini-cluster of these and please do some hardware testing under load and post your methods and actual code so I can try to reproduce it here as well.
entsnack@reddit (OP)
I find that entire story weird. I HAVE made it crash, but I did it deliberately by setting the nvidia-smi boost-slider to 4 (it comes at 0 by default), which is an undocumented hack.
Also, the rated peak power draw is about 100W for the GPU and 140W for the rest of the components (CPU, network).
Not saying it's "better" than a Strix or a Mac; it depends on your use case. If you want to learn and flex your ability to optimize models for the NVL72 and other GB clusters, this is the only kit to learn on.
FullOf_Bad_Ideas@reddit
I posted this on previous thread and I repeat it again here.
entsnack@reddit (OP)
Will do. JFYI I was able to get distributed pretraining of nanochat working; the speed goes from 1,600 tok/sec with a single DGX Spark to 6,600 tok/sec with 2 DGX Sparks. Not sure why the speedup is super-linear.
nicko170@reddit
Yikes. I have it running on a single A40 at 4,500 tok/sec ;-)
entsnack@reddit (OP)
What is your --depth, maximum sequence length, and time to pretraining completion? Share it here: https://www.reddit.com/r/LocalLLaMA/s/OWpYwBpEng
SkyFeistyLlama8@reddit
How are they hooked up?
entsnack@reddit (OP)
You can also use Ethernet, but it will be significantly slower and also involve the CPU.
noctrex@reddit
Run some benchmarks on a MoE model and find out if the MXFP4 quant is faster than the normal Q4 one
entsnack@reddit (OP)
Hmm I've tried gpt-oss-120b but not a Q4 vs. MXFP4 test. The new 4-bit hype is for NVFP4.
Freonr2@reddit
Just a heads up, and not sure if this is what you were contemplating, but gpt-oss isn't going to be a great way to compare GGUF quants and MXFP4, because the GGUF quants don't change any of the MXFP4 layers at all. We don't have a bf16 version of gpt-oss to use as a basis for quantizing with different quantization algos.
ex.
https://huggingface.co/unsloth/gpt-oss-120b-GGUF/blob/main/Q2_K/gpt-oss-120b-Q2_K-00001-of-00002.gguf
The actual files aren't a lot smaller than what was originally distributed, and if you dive in and look at the layer dtypes, only a few layers are in GGUF formats; none of the FFN layers are changed from MXFP4, from my poking around.
I generally think requantizing from a 4 bit quant to some other type of 4 bit quant is likely to ruin the model anyway as there will be essentially rounding errors all over the place.
It would however be interesting to take a bf16 model and quantize it into GGUF, nvfp4, and mxfp4 and benchmark on various hardware.
entsnack@reddit (OP)
I'll admit I know very little about GGUF, it's not a format that's used much outside hobbyist circles, especially not on CUDA GPUs.
noctrex@reddit
Well, I can quantize any MoE model we want, I already have a bunch on my repo on hf.
It would be interesting to see also if the I-quants are better
Freonr2@reddit
https://www.youtube.com/watch?v=vW30o4U9BFE
noctrex@reddit
Yeah I've seen the hype, but I'm very curious about the MX one, maybe because I (shameless plug) quantize in it, and it would be interesting to see if there is any advantage on newer hardware with FP4 support.
johnkapolos@reddit
Bookmarked!
entsnack@reddit (OP)
holy shit, you’re a SERIOUS dude. gonna prioritize this request.
noctrex@reddit
no worries no hurries :) Just do your thing and when you have the time have a look at it. Nothing serious really, just my adhd brain read up on the FP4 quant and I went down the rabbit hole of quantizing
rinaldo23@reddit
Minecraft server
siegevjorn@reddit
Congrats... I'm jealous... How'd you slip an $8k+ DGX Spark setup into the house? Told your partner they're internet switches/routers?
entsnack@reddit (OP)
lmao no they're for "work", my only personal GPU is a 4090 I bought from a scalper during COVID. The DGX Sparks are the only work GPUs I get to keep at home.
Maleficent-Ad5999@reddit
Please tell me how to apply for this job
Daily_Heavy@reddit
Can you look in the BIOS menu to see if there is any way to adjust the LPDDR clock speed? If so, please post the min and max possible settings.
thereisnospooongeek@reddit
Can you do an OCR performance benchmark for OLMOCR2, DeepseekOCR, and ChandraOCR?
entsnack@reddit (OP)
DeepSeekOCR is a 3B model. Isn’t 240GB VRAM wasted on this?
thereisnospooongeek@reddit
It would still be great to know the output rate. I just want to know whether it would be a good investment. I need to OCR approx 1.2TB of PDF files, hence the request.
entsnack@reddit (OP)
Oh so I can try batching and tell you the total throughput. Will do.
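The arithmetic for turning a measured batch throughput into an ETA is simple; a sketch with made-up placeholder numbers (the pages-per-GB density and sustained pages/sec are illustrative, not measurements):

```python
def ocr_eta_days(corpus_gb: float, pages_per_gb: float, pages_per_sec: float) -> float:
    """Estimated wall-clock days to OCR a PDF corpus at a given sustained throughput."""
    total_pages = corpus_gb * pages_per_gb
    return total_pages / pages_per_sec / 86_400  # 86,400 seconds per day

# Placeholders: 1.2TB of PDFs, ~500 pages/GB, 5 pages/sec sustained with batching.
print(f"{ocr_eta_days(1200, 500, 5):.1f} days")  # 1.4 days
```

Once the batched throughput number comes back, plug it into `pages_per_sec` along with your corpus's actual page density.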
thereisnospooongeek@reddit
Thanks Mate
akram200272002@reddit
Honest to God, I wanna see this thing do a Cycles render
entsnack@reddit (OP)
I have no idea what this is. Link?
akram200272002@reddit
Google blender
entsnack@reddit (OP)
hmm I’ll have to hook this up to a monitor, it’s not near one right now. Will try.
Nic4Las@reddit
No need to set up a monitor just for this. https://opendata.blender.org/ blender is awesome and has a dedicated benchmark tool you can just run from the command line. Blender is probably one of the best open source professional tools ever created and the community online is great.
MaterialSuspect8286@reddit
Will Cycles be fast here? I thought render engines are limited by compute, rather than RAM?
GatePorters@reddit
😉
GatePorters@reddit
You can shoot all the rays at once, but hold on let me do the math to see where they all are aiming.
Alright now let’s see where they all hit their first point.
Alright now let’s pull all the normals for the first bounce and do the occlusion stuff.
Now let’s go ahead and do all the extra bounces to pump up that indirect lighting.
(After 2 minutes)
Alright are you ready to try the next timestep?
SlowFail2433@reddit
Blender moment
1T-context-window@reddit
Pihole \s
Denolien_@reddit
@op What software are you using to cluster or cross the devices?
Are you planning to use them clustered or spin up only when needed ?
entsnack@reddit (OP)
Just torchrun. I'm planning to use them clustered, mainly because I'm learning to develop for multinode Grace Blackwell clusters and need to understand NCCL and all that jazz.
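For anyone setting up the same thing, the two-node launch is roughly (a sketch; the rendezvous host, port, interface name, and script are placeholders):

```shell
# Node 0 (also the rendezvous host); node 1 runs the same command
# with --node-rank=1. One GPU per Spark, NCCL over the QSFP56 link.
export NCCL_SOCKET_IFNAME=enp1s0f0np0
torchrun --nnodes=2 --nproc-per-node=1 \
    --node-rank=0 \
    --rdzv-backend=c10d --rdzv-endpoint=spark-0:29500 \
    train.py
```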
MikeRoz@reddit
Trust us, we have plenty of inference workloads that can give 256 GB a thorough workout.
entsnack@reddit (OP)
IMHO the Spark is wasted on inference, a Mac would be more cost effective since CUDA isn't essential for this type of workload.
Ok_Demand_3197@reddit
Pre-train your own foundational model
entsnack@reddit (OP)
Not my own model, but I am pretraining Karpathy's nanochat. With 2 DGX Sparks, pretraining time goes down from 10 days (with a single Spark) to 4 days.
zdy1995@reddit
how long does it take to train nanochat? I am running it on an RTX 6000 Pro and it takes too much time... just not worth it..
entsnack@reddit (OP)
I've been working on this actively. With a single DGX Spark, depth=20 and device_batch_size=32, pretraining will complete in 10 days. With 2 DGX Sparks and all other parameters the same, pretraining will complete in 4 days. The RTX 6000 Pro is pretty fast, pretraining isn't supposed to be a quick thing like inference or fine-tuning.
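Those figures can be sanity-checked with simple arithmetic; a sketch that back-solves the token budget from the single-Spark run (derived from the numbers above, not from nanochat's actual config):

```python
def pretrain_days(total_tokens: float, tok_per_sec: float) -> float:
    """Wall-clock days to push a fixed token budget at a given throughput."""
    return total_tokens / tok_per_sec / 86_400  # 86,400 seconds per day

# Implied token budget from the single-Spark run: 1,600 tok/s for 10 days.
budget = 1_600 * 10 * 86_400
print(f"{budget / 1e9:.2f}B tokens")                      # 1.38B tokens
print(f"{pretrain_days(budget, 6_600):.1f} days on two")  # 2.4 days
```

Pure throughput arithmetic predicts ~2.4 days on two Sparks, so some of the reported 4 days presumably goes to warmup, eval, and checkpoint overheads the steady-state tok/sec figure doesn't capture.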
aiueka@reddit
Dinov3 finetune
Lumpy_Law_6463@reddit
RFDiffusion - generative protein system
https://docs.nvidia.com/nim/bionemo/rfdiffusion/latest/benchmarking.html
HansaCA@reddit
Can you mine 10 Bitcoins for me?
entsnack@reddit (OP)
You joke, but when I first got my 4090 my plan was to mine BTC and get back what I overpaid for it. I spent more on electricity than I made mining and lost like $50 doing this.
Igot1forya@reddit
My 3090 was earning $27-19/d at one point during COVID mining ETH. It lasted about 6 months before the bottom fell out. It more than paid for itself. It's crazy that I can still sell this card for $800 if I wanted.
entsnack@reddit (OP)
You made a good bet with timing and choice of crypto. Teach me your ways.
Igot1forya@reddit
I wish it was still going. It helped pay for my solar panels (making mining free) and paid off my and my wife's cars. I still do Chia, but it's pretty worthless now. I'm on the fence about selling everything.
decrement--@reddit
I looked at Chia, tried it for a long while, didn't make shit, then just closed it all out.
Igot1forya@reddit
I was premining Chia for 6 weeks before it went to mainnet on launch day. I won 4 XCH on day 1 when it was valued at $3500 each. In the 12 hours it took to sell, it had dropped to $1800 each. It paid for the NAS and drives in the first month, while my GPU was mining at the same time. It helped pay off my solar in a year. It's the ONLY reason I still have Chia, as the electricity costs me nothing thanks to the solar.
decrement--@reddit
Same. Wanted the 3090 for ML, couldn't find one, bought an Alienware PC with 3090, mined for a year, and my wallet is still over $6000 from that time, and I've spent at least $1500 in BTC, which was all mined.
ThenExtension9196@reddit
You fought the good fight and lost. There is honor in that.
Pro-editor-1105@reddit
5 more for me too!
highdimensionaldata@reddit
2.5 for me too!
Silver_Jaguar_24@reddit
1.25 please
nomorebuttsplz@reddit
what about HunyuanImage-3.0?
entsnack@reddit (OP)
dude come on I tolerated the inference monkeys in my last post
nomorebuttsplz@reddit
But it might be the best local setup for that giant model.
SlowFail2433@reddit
It is “only” 80B its not that big
nomorebuttsplz@reddit
I think fp16 is more worthwhile for image than llms personally
SlowFail2433@reddit
New model type so its unknown
SlowFail2433@reddit
Inference monkeys is best new term
Secure_Archer_1529@reddit
I appreciate that you offer your time and hardware to the community :)
entsnack@reddit (OP)
Just contributing back, I've learned a lot from others' posts here!
egomarker@reddit
Is there any kind of tldr results table of previous inference testing?
pmttyji@reddit
Please REAP-prune the models below.
EnergyNo8536@reddit
Thank you for your offer to ask!
Is it possible to fine-tune GLM-4.5V with this setup using this Unsloth notebook?
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_(8B)-Vision.ipynb
Would one DGX Spark be enough to finetune the Q4 quant cpatonn/GLM-4.5V-AWQ-4bit?
EnergyNo8536@reddit
And do you use the unsloth docker image for fine-tuning that can be accessed from the DGX Spark?
entsnack@reddit (OP)
No, I usually don't do PEFT because it doesn't play well with RL (until recently), but let me try it now. This thing can fine-tune a lot of big models without LoRA though.
RemarkableAd66@reddit
I'd be interested in training speed for image or video models. I can train them on my M3 Max MacBook, but the speed is slow compared to Nvidia hardware. Most people train LoRA or LoKr or similar adapters for image models.
Maybe Qwen-edit-2509?
Or possibly Flux Kontext?
I wouldn't know what video models people train.
entsnack@reddit (OP)
This is excellent, will do some research.
Excellent_Produce146@reddit
Train nanochat on these boxes.
see https://github.com/karpathy/nanochat/discussions/28#discussioncomment-14735913 - not yet mastered
entsnack@reddit (OP)
Already in progress!