2 x DGX Spark! Give me your non-inference workloads
Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 130 comments
2 x DGX Spark with a 200Gbps interconnect.
I posted here when my first Spark came in and everyone responded with inference workloads. I still tested them, but inference monkeys please BTFO this time.
Give me your big-model non-inference workloads to test, something to push the 256GB unified memory. I have a few LoRA training ones from the last post to try. I already have nanochat pretraining running. GRPO without PEFT is planned.
FloofBoyTellEm@reddit
It's actually a 98 Gbps link. PCIE limitation.
entsnack@reddit (OP)
Not if you use GPU direct, which is what you should be doing.
FloofBoyTellEm@reddit
No. The NIC is connected to the board (including the GPU) by PCIe 5.0 x 4 (4 lanes... no 16 lanes or even 8 lanes). You will never get 200 GbE from a single port over anything on this system. This is a hardware limitation. You might be able to get 120 GbE if you could avoid 100% of the overhead, which GPU Direct can help with.
This is a pitfall of just believing the NVIDIA marketing vs. real world testing.
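As a sanity check on that ceiling claim, the theoretical PCIe 5.0 x4 limit works out to roughly 126 Gbit/s; a quick back-of-the-envelope sketch (protocol overhead beyond line coding is ignored, so real-world throughput lands lower):

```python
# Rough theoretical ceiling for a PCIe 5.0 x4 link.
# PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding.
GT_PER_S = 32e9          # transfers per second per lane
ENCODING = 128 / 130     # 128b/130b line-code efficiency
LANES = 4

bits_per_s = GT_PER_S * ENCODING * LANES
print(f"{bits_per_s / 1e9:.1f} Gbit/s")    # 126.0 Gbit/s
print(f"{bits_per_s / 8 / 1e9:.2f} GB/s")  # 15.75 GB/s
```

That ~126 Gbit/s ceiling lines up with the "maybe 120 GbE" figure above and the 127 Gbit/s bonded result reported later in the thread.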
entsnack@reddit (OP)
I'll take 180 GbE instead of 200GbE. How is being smart working out for you?
FloofBoyTellEm@reddit
Is this using 2 DACs vs 1? Confused why your rate is almost exactly double mine.
Since you're smarter than me, and you claim "not if you're using GPU Direct", but Nvidia stated GPU Direct doesn't work on DGX Sparks, how did you get it to work?
https://forums.developer.nvidia.com/t/dgx-spark-gb10-faq/347344#p-1694056-q-is-gpudirect-rdma-supported-on-dgx-spark-13
entsnack@reddit (OP)
I didn't do anything special, I just followed this tutorial: https://build.nvidia.com/spark/nccl/stacked-sparks. Make sure you don't connect both ports, that halves the bandwidth. I'm also using a QSFP56 cable that's not officially supported (they announced the supported cables well after I bought mine).
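For anyone reproducing this, the tutorial's bandwidth check boils down to running nccl-tests across the pair; a sketch, assuming nccl-tests was built with MPI support and passwordless SSH works between the nodes (the hostnames and interface name are placeholders):

```shell
# Sketch: run the all-gather bandwidth test across both Sparks.
# Assumes nccl-tests built with MPI=1; spark-0/spark-1 and the
# interface name below are placeholders for your own setup.
export NCCL_SOCKET_IFNAME=enp1s0f0np0   # pin NCCL to the QSFP56 port
mpirun -np 2 -H spark-0,spark-1 -x NCCL_SOCKET_IFNAME \
    ./build/all_gather_perf -b 8 -e 4G -f 2 -g 1
```

The `-b`/`-e`/`-f` flags sweep message sizes from 8 bytes to 4GB, doubling each step; `-g 1` uses one GPU per rank.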
FloofBoyTellEm@reddit
I know that's how I originally got the sparks going over nccl when I first got them, but will try that again.
Currently my test results mirror ServeTheHome's iperf results, and I get similar numbers (half of what you get) on the all_gather_perf test. I may just have lost an environment variable somewhere along the way that's breaking things.
I know it says to ignore the enP2 interfaces in that playbook, but I just did a lag (802.3ad) between [enp1s0f0np0, enP2p1s0f0np0] and went from 96 Gbit/s to 127 Gbit/s on my ZFS pool. So, at least when not using some form of nccl, that was necessary, and it looks like the ServeTheHome test was probably stuck to half the lanes. So, now I really am hitting the PCIe limitation where the data must go through the CPU.
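For reference, a netplan sketch of that 802.3ad bond (interface names as above; both Sparks need the mirror config, and the address is a placeholder):

```yaml
# Sketch of an 802.3ad (LACP) bond over both ConnectX-7 ports.
# Interface names copied from the comment above; verify yours first.
network:
  version: 2
  ethernets:
    enp1s0f0np0: {}
    enP2p1s0f0np0: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0np0, enP2p1s0f0np0]
      parameters:
        mode: 802.3ad
        lacp-rate: fast
        transmit-hash-policy: layer3+4
      addresses: [10.0.0.1/24]   # placeholder; peer gets 10.0.0.2/24
```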
Going to try again for the all_gather_perf using your link.
I have the official mellanox DAC. But ordered some shorter generics from FS.com today to see if that's part of the issue.
Can you do me a huge favor and share your results for these two commands? I'm wondering if I have a real issue or if this is a red herring.
sudo lspci -vv -s 0000:01:00.0 | egrep -i 'LnkCap|LnkSta'
sudo dmesg | grep mlx5_pcie_event
[ 2.787624] mlx5_core 0000:01:00.0: mlx5_pcie_event:296:(pid 162): Detected insufficient power on the PCIe slot (27W).
[ 3.374174] mlx5_core 0000:01:00.1: mlx5_pcie_event:296:(pid 162): Detected insufficient power on the PCIe slot (27W).
[ 3.960778] mlx5_core 0002:01:00.0: mlx5_pcie_event:296:(pid 408): Detected insufficient power on the PCIe slot (27W).
[ 4.571786] mlx5_core 0002:01:00.1: mlx5_pcie_event:296:(pid 380): Detected insufficient power on the PCIe slot (27W).
FloofBoyTellEm@reddit
You seem to be the anomaly.
https://forums.developer.nvidia.com/t/connectx-7-nic-in-dgx-spark/350417/42
nobody else can hit those numbers...
entsnack@reddit (OP)
There is someone with 22.9 GBps avg bus bw in that thread, higher than mine.
FloofBoyTellEm@reddit
They are using 2 DACs, are you?
Note the 4 interface bond.
entsnack@reddit (OP)
I see 4 messages about detecting insufficient power on the PCIe slot.
FloofBoyTellEm@reddit
Trying to LAG my interfaces now as a test; thinking that may be the real issue here. Really hoping to hit your bandwidth rate. Built an NVMe-oF TrueNAS Scale server this week, and I think this has (hopefully) been a misunderstanding on my part of how QSFP56 links actually work. It's clicking now.
Embarrassed_Win7667@reddit
Can anyone post an Unsloth example of how to use two over the connection? I just got one and don't know how to invoke both in one training run. The Nvidia guide just shows you how to test NCCL.
Analytics-Maken@reddit
The context window issue is real. I'm consolidating the data I need into a data warehouse using Windsor.ai, an ETL tool, to reduce token usage on MCP calls. So far it has been a better approach, but it requires upfront work making joins and calculations.
Eugr@reddit
Actually, inference benchmarks would be very interesting: both comparing the same model against single-node inference, and running something like Qwen3-235B in AWQ 4-bit.
Lot of people posted benchmarks for a single spark (including myself), but I haven't seen anything substantial for dual spark inference.
entsnack@reddit (OP)
sigh OK will do. Now you got me curious too but my expectations are low.
Eugr@reddit
Any luck with this?
Eugr@reddit
In theory, you should get some speedup with data-parallel or tensor-parallel on smaller models. Qwen3-235 should be able to run in 4-bit quant, but won't fit in FP8.
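A hedged sketch of what a dual-node tensor-parallel run could look like with vLLM (the model repo, the exact flag set, and whether a 4-bit checkpoint fits next to the KV cache are all assumptions to verify against the current vLLM release; vLLM's multi-node path goes through Ray):

```shell
# On spark-0 (head):   ray start --head --port=6379
# On spark-1 (worker): ray start --address=spark-0:6379
# Then launch from the head node, splitting the model across both GPUs.
# The model name is a placeholder for whichever 4-bit quant you use.
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
```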
SpecialistNumerous17@reddit
Yes please! It would be awesome to see benchmarks comparing performance of 1 vs 2 nodes inferencing the same models.
xxPoLyGLoTxx@reddit
Hey he said BTFO!
Wisepunter@reddit
Whats your experience so far training models you have tried. Is it decent performance? How does it compare to multiple consumer GPU etc?
entsnack@reddit (OP)
My use case is a bit niche: I need the Grace ARM CPU and the CX7 interconnect to test CUDA kernels for a GB200 that I rent time on. The Spark is a good machine to both learn and prototype on.
For pretraining nanochat, I can compare it to my H100 and 4090:
Eugr@reddit
The Spark doesn't have a downscaled Grace ARM CPU though; it uses a MediaTek one with a different architecture. Still ARM though.
rawdmon@reddit
Not sure where you've heard this, but GB10 literally stands for Grace Blackwell 10.
Eugr@reddit
I know what it stands for. But the architecture is still different, see below:
https://www.mediatek.com/press-room/newly-launched-nvidia-dgx-spark-features-gb10-superchip-co-designed-by-mediatek
https://www.servethehome.com/nvidia-dgx-spark-review-the-gb10-machine-is-so-freaking-cool/2/
entsnack@reddit (OP)
Yeah ARM is good enough + CX7 with GPU direct.
auradragon1@reddit
Um, isn't this the exact reason Nvidia released the Spark? It's a local machine for CUDA devs that need to deploy changes to enterprise Nvidia GPUs.
entsnack@reddit (OP)
It is, but I need to explain that on this sub because it's mostly inference monkeys who think this is a Mac Mini replacement.
SkyFeistyLlama8@reddit
Being said inference monkey who still wants a Spark on my desk... I salute you.
Wisepunter@reddit
I don't know a lot about it, but that's a nice uplift from a beefy 4090. I know RAM speed is a big issue with inference; what's the bottleneck with training that makes it so much better than a 4090?
entsnack@reddit (OP)
The 4090's low training performance is indeed strange; I still need to debug it. I relegated my 4090 to gaming a year ago though. 24GB of VRAM was enough in the BERT days but not anymore.
ProcessComplete5797@reddit
I want 2 so bad
braindeadtheory@reddit
Large scale aerial reconstruction using COLMAP x fVDB for GSPlat and TSDF mesh or metas new sparse / dense reconstruction transformer.
redblood252@reddit
Ffmpeg av1 transcoding?
Budget-Juggernaut-68@reddit
what can you finetune with that? 70b models?
txgsync@reddit
Read up on SeedLM on Arxiv. Try compressing a model using PRNG FP16 substitution. “Seed search” is a killer time sink across 16-bit space. I kept tripping over lack of comprehensive support for tensors on Mac. Can the DGX spark improve on it? Post some benchmarks.
entsnack@reddit (OP)
niiiice never heard of this and very interested to test, will post back
txgsync@reddit
Yeah, SeedLM is a very Apple way of approaching things. It’s been impractical on non-Apple platforms: the PCIe transit cost was too high between system RAM and GPU VRAM.
But now that both AMD and nvidia have gotten into unified memory, it seems like using CPU for PRNG matrix weights and GPU for tensors might be practical outside the Apple sandbox.
I will be noodling too. Let me know if you get stuck. I have not committed my code to GitHub for SeedLM yet; it’s very MLX-specific right now.
Wrong-Historian@reddit
Crysis
txgsync@reddit
Fuck, you are old.
somealusta@reddit
Tetris
CV514@reddit
Pong
txgsync@reddit
Space War
iamlazyboy@reddit
Have you ever thought that maybe we're not that old but you're young? The game isn't even 20 years old.
dogfighter75@reddit
Ancient millennials using their GPUs for graphics processing in 2025..
Advanced-Virus-2303@reddit
Don't speak the old magic to me, witch.
eloquentemu@reddit
People may age but a good meme never dies
Regular-Forever5876@reddit
Someone actually tried that https://youtu.be/6iVftb0cbnc?si=r204SUgxoQfFrkWK
entsnack@reddit (OP)
😂 every time, but still funny
nickpsecurity@reddit
Try pretraining with these for a real test. They're designed for single- or low-GPU setups. Use PG-19 dataset (or more of Gutenberg) instead of theirs so whatever you produce has no copyright issues. There's also no question of benchmark training or parroting modern stuff if the dataset considers the year 1919 "modern." ;)
Cramming Language Model
GPT2 From Scratch
PG-19 Benchmark
entsnack@reddit (OP)
nanochat pretraining benchmark compendium: https://www.reddit.com/r/LocalLLaMA/s/NLmbm2NelU
HumanDrone8721@reddit
Wow, congrats, beat me to the punch :), we have the same setup preparing to arrive, this time with Gigabyte ATOMs that are floating somewhere on the road :(.
I think there are many interesting suggestions posted here (along with Sturgeon's ratio of 90% garbage), but I have another suggestion:
HARDWARE RELIABILITY TESTING UNDER LOAD PLIZZ !!!
A few days ago there was an INTENSE astroturfing campaign: "The man, the legend, the idol programmer tested one and came to say that it only consumes 100W at max load and it crashes and reboots soon..." followed by more me-toos... followed by articles that were citing articles that were citing a Twitter post that was posting a screenshot of a "community post"... followed by smirks saying "you should have got a Strix, it can play vidya gamez as well..."
Anyway, please keep this post as a repository of knowledge about the mini-cluster of these and please do some hardware testing under load and post your methods and actual code so I can try to reproduce it here as well.
entsnack@reddit (OP)
I find that entire story weird. I HAVE made it crash, but I did it deliberately by setting the nvidia-smi boost-slider to 4 (it comes at 0 by default), which is an undocumented hack.
Also, the rated peak power draw is about 100W for the GPU and 140W for the rest of the components (CPU, network).
Not saying it's "better" than a Strix or a Mac; it depends on your use case. If you want to learn and flex your ability to optimize models for the NVL72 and other GB clusters, this is the only kit to learn on.
FullOf_Bad_Ideas@reddit
I posted this on previous thread and I repeat it again here.
entsnack@reddit (OP)
Will do. JFYI I was able to get distributed pretraining of nanochat working; the speed goes from 1,600 tok/sec with a single DGX Spark to 6,600 tok/sec with 2 DGX Sparks. Not sure why the speedup is super-linear.
nicko170@reddit
Yikes. I have it running on a single A40 at 4,500 tok/sec ;-)
entsnack@reddit (OP)
What is your --depth, maximum sequence length, and time to pretraining completion? Share it here: https://www.reddit.com/r/LocalLLaMA/s/OWpYwBpEng
SkyFeistyLlama8@reddit
How are they hooked up?
entsnack@reddit (OP)
You can also use Ethernet, but it will be significantly slower and also involve the CPU.
noctrex@reddit
Run some benchmarks on a MoE model and find out if the MXFP4 quant is faster than the normal Q4 one
entsnack@reddit (OP)
Hmm I've tried gpt-oss-120b but not a Q4 vs. MXFP4 test. The new 4-bit hype is for NVFP4.
Freonr2@reddit
Just a heads up, and not sure if this is what you were contemplating, but gpt-oss isn't going to be a great way to compare GGUF quants and MXFP4, because the GGUF quants don't change any of the MXFP4 layers at all. We don't have a bf16 version of gpt-oss to use as a basis for quantizing with different quantization algos.
ex.
https://huggingface.co/unsloth/gpt-oss-120b-GGUF/blob/main/Q2_K/gpt-oss-120b-Q2_K-00001-of-00002.gguf
The actual files aren't a lot smaller than what was originally distributed, and if you dive in and look at the layer dtypes, only a few layers are in GGUF formats; none of the FFN layers are changed from MXFP4, from my poking around.
I generally think requantizing from a 4 bit quant to some other type of 4 bit quant is likely to ruin the model anyway as there will be essentially rounding errors all over the place.
It would however be interesting to take a bf16 model and quantize it into GGUF, nvfp4, and mxfp4 and benchmark on various hardware.
entsnack@reddit (OP)
I'll admit I know very little about GGUF, it's not a format that's used much outside hobbyist circles, especially not on CUDA GPUs.
noctrex@reddit
Well, I can quantize any MoE model we want, I already have a bunch on my repo on hf.
It would be interesting to see also if the I-quants are better
Freonr2@reddit
https://www.youtube.com/watch?v=vW30o4U9BFE
noctrex@reddit
Yeah I've seen the hype, but I'm very curious about the MX one, maybe because I (shameless plug) quantize in it, and it would be interesting to see if there is any advantage on newer hardware with FP4 support.
johnkapolos@reddit
Bookmarked!
entsnack@reddit (OP)
holy shit, you’re a SERIOUS dude. gonna prioritize this request.
noctrex@reddit
no worries no hurries :) Just do your thing and when you have the time have a look at it. Nothing serious really, just my adhd brain read up on the FP4 quant and I went down the rabbit hole of quantizing
rinaldo23@reddit
Minecraft server
siegevjorn@reddit
Congrats... I'm jealous... How'd you slip an $8k+ DGX Spark setup into the house? Told your partner they're internet switches/routers?
entsnack@reddit (OP)
lmao no they're for "work", my only personal GPU is a 4090 I bought from a scalper during COVID. The DGX Sparks are the only work GPUs I get to keep at home.
Maleficent-Ad5999@reddit
Please tell me how to apply for this job
Daily_Heavy@reddit
Can you look in the BIOS menu to see if there is any way to adjust the LPDDR clock speed? If so, please post the min and max possible settings.
thereisnospooongeek@reddit
Can you do an OCR performance benchmark for OLMOCR2, DeepseekOCR, and ChandraOCR?
entsnack@reddit (OP)
DeepSeekOCR is a 3B model. Isn’t 240GB VRAM wasted on this?
thereisnospooongeek@reddit
It would still be great to know the output rate. I just want to know whether it would be a good investment. I need to OCR approx 1.2TB of PDF files, hence the request.
entsnack@reddit (OP)
Oh so I can try batching and tell you the total throughput. Will do.
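The arithmetic for turning a measured batch throughput into an ETA is simple; a sketch with made-up placeholder numbers (the pages-per-GB density and sustained pages/sec are illustrative, not measurements):

```python
def ocr_eta_days(corpus_gb: float, pages_per_gb: float, pages_per_sec: float) -> float:
    """Estimated wall-clock days to OCR a PDF corpus at a given sustained throughput."""
    total_pages = corpus_gb * pages_per_gb
    return total_pages / pages_per_sec / 86_400  # 86,400 seconds per day

# Placeholders: 1.2TB of PDFs, ~500 pages/GB, 5 pages/sec sustained with batching.
print(f"{ocr_eta_days(1200, 500, 5):.1f} days")  # 1.4 days
```

Once the batched throughput number comes back, plug it into `pages_per_sec` along with your corpus's actual page density.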
thereisnospooongeek@reddit
Thanks Mate
akram200272002@reddit
Honest to God, I wanna see this thing do a Cycles render
entsnack@reddit (OP)
I have no idea what this is. Link?
akram200272002@reddit
Google blender
entsnack@reddit (OP)
hmm I’ll have to hook this up to a monitor, it’s not near one right now. Will try.
Nic4Las@reddit
No need to set up a monitor just for this. https://opendata.blender.org/ blender is awesome and has a dedicated benchmark tool you can just run from the command line. Blender is probably one of the best open source professional tools ever created and the community online is great.
MaterialSuspect8286@reddit
Will Cycles be fast here? I thought render engines are limited by compute, rather than RAM?
GatePorters@reddit
😉
GatePorters@reddit
You can shoot all the rays at once, but hold on let me do the math to see where they all are aiming.
Alright now let’s see where they all hit their first point.
Alright now let’s pull all the normals for the first bounce and do the occlusion stuff.
Now let’s go ahead and do all the extra bounces to pump up that indirect lighting.
(After 2 minutes)
Alright are you ready to try the next timestep?
SlowFail2433@reddit
Blender moment
1T-context-window@reddit
Pihole \s
Denolien_@reddit
@op What software are you using to cluster or cross the devices?
Are you planning to use them clustered or spin up only when needed ?
entsnack@reddit (OP)
Just torchrun. I'm planning to use them clustered, mainly because I'm learning to develop for multinode Grace Blackwell clusters and need to understand NCCL and all that jazz.
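For anyone setting up the same thing, the two-node launch is roughly (a sketch; the rendezvous host, port, interface name, and script are placeholders):

```shell
# Node 0 (also the rendezvous host); node 1 runs the same command
# with --node-rank=1. One GPU per Spark, NCCL over the QSFP56 link.
export NCCL_SOCKET_IFNAME=enp1s0f0np0
torchrun --nnodes=2 --nproc-per-node=1 \
    --node-rank=0 \
    --rdzv-backend=c10d --rdzv-endpoint=spark-0:29500 \
    train.py
```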
MikeRoz@reddit
Trust us, we have plenty of inference workloads that can give 256 GB a thorough workout.
entsnack@reddit (OP)
IMHO the Spark is wasted on inference, a Mac would be more cost effective since CUDA isn't essential for this type of workload.
Ok_Demand_3197@reddit
Pre-train your own foundational model
entsnack@reddit (OP)
Not my own model, but I am pretraining Karpathy's nanochat. With 2 DGX Sparks, pretraining time goes down from 10 days (with a single Spark) to 4 days.
zdy1995@reddit
how long does it take to train nanochat? I am running it on an RTX 6000 Pro and it takes too much time... just not worth it..
entsnack@reddit (OP)
I've been working on this actively. With a single DGX Spark, depth=20 and device_batch_size=32, pretraining will complete in 10 days. With 2 DGX Sparks and all other parameters the same, pretraining will complete in 4 days. The RTX 6000 Pro is pretty fast, pretraining isn't supposed to be a quick thing like inference or fine-tuning.
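Those figures can be sanity-checked with simple arithmetic; a sketch that back-solves the token budget from the single-Spark run (derived from the numbers above, not from nanochat's actual config):

```python
def pretrain_days(total_tokens: float, tok_per_sec: float) -> float:
    """Wall-clock days to push a fixed token budget at a given throughput."""
    return total_tokens / tok_per_sec / 86_400  # 86,400 seconds per day

# Implied token budget from the single-Spark run: 1,600 tok/s for 10 days.
budget = 1_600 * 10 * 86_400
print(f"{budget / 1e9:.2f}B tokens")                      # 1.38B tokens
print(f"{pretrain_days(budget, 6_600):.1f} days on two")  # 2.4 days
```

Pure throughput arithmetic predicts ~2.4 days on two Sparks, so some of the reported 4 days presumably goes to warmup, eval, and checkpoint overheads the steady-state tok/sec figure doesn't capture.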
aiueka@reddit
Dinov3 finetune
Lumpy_Law_6463@reddit
RFDiffusion - generative protein system
https://docs.nvidia.com/nim/bionemo/rfdiffusion/latest/benchmarking.html
HansaCA@reddit
Can you mine 10 Bitcoins for me?
entsnack@reddit (OP)
You joke, but when I first got my 4090 my plan was to mine BTC and get back what I overpaid for it. I spent more on electricity than I made mining and lost like $50 doing this.
Igot1forya@reddit
My 3090 was earning $27-19/d at one point during COVID mining ETH. It lasted about 6 months before the bottom fell out. It more than paid for itself. It's crazy that I can still sell this card for $800 if I wanted.
entsnack@reddit (OP)
You made a good bet with timing and choice of crypto. Teach me your ways.
Igot1forya@reddit
I wish it was still going. It helped pay for my solar panels (making mining free) and paid off my and my wife's cars. I still do Chia, but it's pretty worthless now. I'm on the fence about selling everything.
decrement--@reddit
I looked at Chia, tried it for a long while, didn't make shit, then just closed it all out.
Igot1forya@reddit
I was premining Chia for 6 weeks before it went to mainnet on launch day. I won 4 XCH on day 1 when it was valued at $3500 each. In the 12 hours it took to sell, it had dropped to $1800 each. It paid for the NAS and drives in the first month, while my GPU was mining at the same time. It helped pay off my solar in a year. It's the ONLY reason I still have Chia, as the electricity costs me nothing thanks to the solar.
decrement--@reddit
Same. Wanted the 3090 for ML, couldn't find one, bought an Alienware PC with 3090, mined for a year, and my wallet is still over $6000 from that time, and I've spent at least $1500 in BTC, which was all mined.
ThenExtension9196@reddit
You fought the good fight and lost. There is honor in that.
Pro-editor-1105@reddit
5 more for me too!
highdimensionaldata@reddit
2.5 for me too!
Silver_Jaguar_24@reddit
1.25 please
nomorebuttsplz@reddit
what about HunyuanImage-3.0?
entsnack@reddit (OP)
dude come on I tolerated the inference monkeys in my last post
nomorebuttsplz@reddit
But it might be the best local setup for that giant model.
SlowFail2433@reddit
It is “only” 80B its not that big
nomorebuttsplz@reddit
I think fp16 is more worthwhile for image than llms personally
SlowFail2433@reddit
New model type so its unknown
SlowFail2433@reddit
Inference monkeys is best new term
Secure_Archer_1529@reddit
I appreciate that you offer your time and hardware to the community :)
entsnack@reddit (OP)
Just contributing back, I've learned a lot from others' posts here!
egomarker@reddit
Is there any kind of tldr results table of previous inference testing?
pmttyji@reddit
Please REAP-prune the models below.
EnergyNo8536@reddit
Thank you for your offer to ask!
Is it possible to fine-tune GLM-4.5V with this setup using this Unsloth notebook?
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_(8B)-Vision.ipynb
Would one DGX Spark be enough to finetune the Q4 quant cpatonn/GLM-4.5V-AWQ-4bit?
EnergyNo8536@reddit
And do you use the unsloth docker image for fine-tuning that can be accessed from the DGX Spark?
entsnack@reddit (OP)
No, I usually don't do PEFT because it doesn't play well with RL (until recently), but let me try it now. This thing can fine-tune a lot of big models without LoRA though.
RemarkableAd66@reddit
I'd be interested in training speed for image or video models. I can train them on my M3 Max MacBook, but the speed is slow compared to Nvidia hardware. Most people train LoRA or LoKr or similar adapters for image models.
Maybe Qwen-edit-2509?
Or possibly Flux Kontext?
I wouldn't know what video models people train.
entsnack@reddit (OP)
This is excellent, will do some research.
Excellent_Produce146@reddit
Train nanochat on these boxes.
see https://github.com/karpathy/nanochat/discussions/28#discussioncomment-14735913 - not yet mastered
entsnack@reddit (OP)
Already in progress!