Is the DGX Spark worth the money?
Posted by Lorelabbestia@reddit | LocalLLaMA | View on Reddit | 54 comments
I've seen a lot of DGX Spark discussions here focused on inference performance, and yeah, if you compare it to 4x 3090s for running small models, the DGX loses both in price and performance.
The Spark actually excels for prototyping
Let me break it down:
I just finished CPT on Nemotron-3-Nano on a ~6B-token dataset.
I spent about a week on my two Sparks debugging everything: FP32 logit tensors that allocated 34 GB for a single tensor, parallelization, Triton kernel crashes on big batches on Blackwell, Mamba-2 backward pass race conditions, causal mask waste, among others. In total I fixed 10+ issues on the Sparks.
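One of the fixes listed above, the 34 GB FP32 logit tensor, is usually handled with the standard chunked-loss trick: never materialize the full [tokens × vocab] FP32 logit tensor at once, only one slice at a time. A minimal NumPy sketch of the idea (all shapes and names are illustrative, not the actual Nemotron code):

```python
import numpy as np

def full_ce_loss(hidden, w_vocab, targets):
    """Naive loss: materializes the full [T, V] FP32 logit tensor at once."""
    logits = hidden @ w_vocab                     # [T, V] -- huge at real vocab sizes
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_z = np.log(np.exp(logits).sum(axis=1))
    return (log_z - logits[np.arange(len(targets)), targets]).mean()

def chunked_ce_loss(hidden, w_vocab, targets, chunk=1024):
    """Same loss, but only a [chunk, V] slice of logits exists at any time."""
    total, n = 0.0, len(targets)
    for i in range(0, n, chunk):
        h, t = hidden[i:i + chunk], targets[i:i + chunk]
        logits = h @ w_vocab                      # [chunk, V] only
        logits -= logits.max(axis=1, keepdims=True)
        log_z = np.log(np.exp(logits).sum(axis=1))
        total += float((log_z - logits[np.arange(len(t)), t]).sum())
    return total / n

# Toy check that the chunked version matches the naive one
rng = np.random.default_rng(0)
T, D, V = 4096, 64, 1000
hidden = rng.standard_normal((T, D)).astype(np.float32)
w_vocab = rng.standard_normal((D, V)).astype(np.float32)
targets = rng.integers(0, V, T)
assert np.isclose(full_ce_loss(hidden, w_vocab, targets),
                  chunked_ce_loss(hidden, w_vocab, targets), rtol=1e-4)
```

Peak memory drops from O(T·V) to O(chunk·V) at the cost of a loop, which is exactly the kind of trade you want on a 128 GB unified-memory box.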
The Sparks ran stable at 1,130 tokens/sec after all patches. ETA for the full 6B-token run? 30 days! Not viable for production. So I tried the same setup on bigger Blackwell GPUs: the B200, actually 8x B200s.
Scaling to 8x B200
When I moved to 8x B200 on Verda (unbelievable spot pricing at €11.86/h), the whole setup took about 1 hour. All the patches, hyperparameters, and dataset format worked identically to the DGX setup; I just needed to scale. The Spark's 30-day run finished in about 8 hours on the B200s: 167x faster (see image).
For context, before Verda I tried Azure, but their quota approval process for high-end GPU instances takes too long. Verda instead let me spin up immediately on spot at roughly a quarter of what comparable on-demand instances cost elsewhere.
Cost analysis (see image)
If I had prototyped directly on cloud B200s at on-demand rates, it would have cost about ~€1,220 just for debugging and getting the complete model-dataset setup right. On the Spark? €0, since the hardware is mine.
Production run: €118. Total project cost: €118.
Cloud-only equivalent: €1,338 (with the same setup I used for training). That's 91% less by starting on the DGX first.
OK, the Spark itself has a price, but at ~€1,200 saved per prototyping cycle it pays for itself in about 6-7 serious training projects. And most importantly, you'll never get a bill while prototyping, figuring out the setup, and fixing bugs.
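The break-even claim can be reproduced from the numbers in the post. Note the per-unit Spark price below is my assumption (~€4,000 each, two units); the post doesn't state what was actually paid:

```python
# Figures from the post; the Spark unit price is an assumption, not a quoted number.
cloud_prototyping_cost = 1_220   # EUR, debugging directly on on-demand B200s
spark_prototyping_cost = 0       # EUR, hardware already owned
production_run_cost = 118        # EUR, ~8 hours of spot 8x B200

saved_per_cycle = cloud_prototyping_cost - spark_prototyping_cost
two_sparks = 2 * 4_000           # EUR, assumed price for the two-node cluster

payback_projects = two_sparks / saved_per_cycle
print(f"Saved per prototyping cycle: EUR {saved_per_cycle}")
print(f"Projects to amortize two Sparks: {payback_projects:.1f}")   # ~6.6
```

Under that price assumption the arithmetic lands right in the quoted 6-7 project range.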
The honest opinion
The DGX Spark is not an inference machine and it's not a training cluster. It's a prototyping and debugging workstation. If you're doing large training work and want to iterate locally before burning cloud credits, it makes a lot of sense. If you just want to run LLMs for single-turn or few-turns chatting, buy something like the 3090s or the latest Macs.
For anyone interested in more details and the process from starting on the DGX to deploying on the big Blackwell GPUs, you can find the whole write-up here.
Happy to answer any questions about the Spark, the 2-node cluster setup, and B200/B300 Blackwell deployment.
bad_detectiv3@reddit
Man, all this AI hardware is just too damn expensive. How the F will the world progress when costs are so high? It's as if all these companies want is quick profit for the sake of 'AI'.
Coast_Lopsided@reddit
compare it with the PC prices in the 80s
Lorelabbestia@reddit (OP)
Yeah, it's quite expensive and will continue to be, but if you look at the efficiency of B200/B300 compared to the H and A series, the price per token is waaaay cheaper (only to run; to buy, not really), especially if you consider the Vera Rubin CES claims.
I only use B200s for training and B300s for inference; the price per token is the cheapest you can get. Even though the GPUs themselves still cost a lot and are the priciest per hour, if you take a whole training pipeline or inference workload into account they are much, much faster, which in turn makes them the cheapest options available.
Emotional-Baker-490@reddit
Now include the price of the Spark itself, because it isn't free.
Lorelabbestia@reddit (OP)
Emotional-Baker-490@reddit
In short: no, lmao.
In long: this ignores the Spark being ENORMOUSLY slower.
According to the chart, one hour on the 8x B200 ($5) is equivalent to 167 hours on the DGX Spark; 167 hours / 24 hours per day = 6.969 (nice) days.
€11.86 = $13.76. A DGX Spark is $4,700. You would need to run that B200 cluster for 341.6 hours to break even between the two. That would be 9.4, not 6-7 (hehe), training runs.
Assuming free electricity, that the device can survive being loaded at max for this long, and that you can wait over half a year, the DGX Spark would take 6.5 years to pay for itself.
Lorelabbestia@reddit (OP)
The DGX pays for itself when you consider any other hardware for testing before deploying to enterprise NVIDIA GPUs.
Your math looks only at the value the DGX provides from a compute perspective, which is exactly what I'm trying to debunk here: that's not its purpose!
From a compute perspective it will never pay for itself; it pays for itself by being the cheapest NVIDIA option to set up, debug, and test full LLM workflows (CPT, SFT, you name it) before committing to cloud bills.
Take the Blackwell family: you could buy a 5090, but that's 32 GB of VRAM. You're not fitting even a 4B model for CPT in that; full training needs optimizer states, gradients, and activations on top of weights, easily 4x to 5x the memory footprint. I had to use two DGX Sparks (256 GB) just to fit a 4B model for CPT. There isn't any NVIDIA hardware currently available where you can prototype at this scale at this price. The only other option is going straight to cloud B200s.
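The footprint multiplier follows from standard mixed-precision Adam bookkeeping. A rough per-parameter estimate (rule-of-thumb byte counts; real numbers vary with optimizer choice, sharding, and activation checkpointing, which this deliberately ignores):

```python
def training_memory_gb(params_b, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Rule-of-thumb training memory, excluding activations.
    Defaults assume BF16 weights and grads, plus FP32 master weights
    and two FP32 Adam moments (4 + 4 + 4 = 12 bytes/param)."""
    n = params_b * 1e9
    return n * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

# A 4B model: ~8 GB of BF16 weights alone, but ~64 GB of training state
# before activations -- already far beyond a single 32 GB 5090.
print(training_memory_gb(4))   # 64.0
```

Activations come on top of that and scale with batch size and sequence length, which is why a 4B CPT run can spill across two 128 GB Sparks.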
Yes, of course you increase the cost by adding devices, but the cost increases much less than it would on any other hardware for the purpose presented here. Two Sparks give you 256 GB of memory; try getting that any other way on NVIDIA for less.
The DGX Spark and B200 are not competitors, they're complementary. You prototype locally on the Spark (no billing), then deploy the proven workflow to a B200/B300 for the real training. Comparing them head to head on compute misses the point. Also, your comparison assumes you either have the Spark or have nothing; instead you should compare the cost of owning a Spark plus the cost of training against the other options available to complete the same task.
A little help from Claude about other options, specifically AMD:
For deployment, AMD does have cloud GPUs (MI300X from ~$1.71/hr, MI350 coming), so you could in theory prototype on Strix Halo and deploy to MI300X. But here's the thing: Strix Halo runs RDNA and the MI300X runs CDNA, two completely different GPU architectures even within AMD, so your local workflow doesn't transfer 1:1 to the cloud the way DGX Spark → B200 does, where everything is Blackwell/CUDA and just works.
I would love to have a $500 machine with 256GB that does everything, but that's not what's out there, and no prototyping machine comes close to the DGX for this workflow.
I would be happy to hear from you which are the cheaper and better options from your point of view, u/Emotional-Baker-490.
Emotional-Baker-490@reddit
"From a compute perspective it will never pay itself, it pays itself by being the cheapest NVIDIA option to set up, debug and test full LLM workflows (CPT, SFT, you name it) before committing to cloud bills." You know you can pay per second, not per hour, right?
Lorelabbestia@reddit (OP)
Every provider charges on a different basis. Charging by the second, minute, or hour is mostly irrelevant, as prototyping takes a lot of time and effort. One second or even one hour over 40+ hours of prototyping is negligible.
Have you ever even deployed an S-tier instance?
Emotional-Baker-490@reddit
"Charging by the second, minute or hour is mostly irrelevant as prototyping takes a lot of time and effort. 1 second or even 1 hour over 40+ hours prototyping is negligible." Do you leave unused cloud GPUs idle while coding?
"Have you ever even deployed an S-tier instance?" Bot?
SexyAlienHotTubWater@reddit
Dude come on, it takes time to spin up an H100 cluster. Iterating on that hardware is extremely expensive.
Lorelabbestia@reddit (OP)
😂 You turn an instance on and off to code??? Then where do you test the code? Do you redeploy an instance each time to test? Say 15 minutes just to spin up and tear down each time, and say you do 50 iterations between dataset work, patches, optimizations, hyperparameter tuning, early checkpoint assessment, and the occasional bug: that's about 750 minutes, i.e. 12.5 hours of downtime.
I guess your time is worthless, but that clearly explains your reasoning.
Also, you know the majority of providers charge even while the instance is shut down?? Otherwise, each time you'd need to set everything up again from scratch, so realistically it would be more like 30+ minutes each time: 24 hours of downtime. Nice, good job. I'm pretty sure you don't have a company and have never worked for yourself; if you manage to get a job at all, I'd be ashamed to employ you. You disrespect yourself with all this gibberish.
I don't blame you, it's not your fault. Some people are just born like this.
Emotional-Baker-490@reddit
You jump to ad hominem AND are confidently incorrect, without even doing basic research on the topic you're pretending to be an expert in?
The time to spin up a serverless instance is usually less than 5 seconds, not 15 minutes. The fact that you don't know this means you've never tried it.
Serverless instances are billed by the second. The fact that you don't know this means you didn't even bother to read the front page; it's on several major hosts.
Wait a minute... You're that guy who got downvoted to oblivion because you said a model trained in int4 is not open weights because it was trained in int4!
IntelligentIsland477@reddit
Here, u/lorelabbestia. I was about to roast you another time, but believe it or not, God told me "Stay away from a fool, for you will not find knowledge on their lips."
This discussion has been very counterproductive, so I propose an idea: you set up both serverless and a reserved server for a big training run, like CPT and full SFT, and I provide the credits.
I can give you all the credits you may need; then you compare serverless vs. server cost on 8x B200 for CPT.
PS: only if you find serverless with 8x B200.
Let me know.
Emotional-Baker-490@reddit
OK, you're just trolling or overconfidently LARPing. Blocked. Keep pretending you actually work on LLMs while you haven't even taken a cursory glance at the hosting page and apparently can't understand the concept of doing multiple runs back to back for testing purposes.
Emotional-Baker-490@reddit
OK, you're just trolling or stupid. Blocked.
Emotional-Baker-490@reddit
OK, I'm just going to call you out as a bot, or a LARPer translating with a bot. You completely ignore that any time not spent on a test run doesn't need GPUs running, which someone who had actually done this would know, so you clearly haven't. And ending the post with a question as utterly nonsensical as "Have you ever even deployed an S-tier instance?" is something only an LLM could screw up, and only an LLM would EVER format a post like that.
Anarchaotic@reddit
OP doesn't understand how time works, maybe? Imagine tying up your machine for 30 days for a single training run, and it turns out you messed something up, so now you have to do the whole thing again? Insanity.
Lorelabbestia@reddit (OP)
Yes, insanity! That's why I used 8x B200s, one of the fastest and CHEAPEST (price/token) options available for my run, but only after setting everything up properly on the DGX.
Regarding the 30 days:
That's an ETA (estimated time of completion): how long it would take, not how long it took. I only ran long enough to get a couple of checkpoints, optimize for Blackwell, stabilize TPS, and estimate completion.
Also, if you are on a budget and have plenty of time, you can surely do a 30-day run. u/Anarchaotic, there's a thing called a checkpoint: you set how often the model being trained is saved. If shit happens, you simply continue training from that checkpoint, and while the model keeps training you can assess from the checkpoints whether it's behaving as you intended.
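The checkpoint-and-resume pattern described above, as a framework-agnostic sketch. In a real run you would dump model and optimizer state dicts; here the "state" is just a step counter, and the path and constants are illustrative:

```python
import json, os

CKPT = "ckpt_demo.json"   # illustrative checkpoint path
TOTAL_STEPS = 100
SAVE_EVERY = 10           # checkpoint frequency, as described above

def save_checkpoint(step):
    # A real run would also serialize model + optimizer state here.
    with open(CKPT, "w") as f:
        json.dump({"step": step}, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0  # fresh run

def train(crash_at=None):
    start = load_checkpoint()
    for step in range(start, TOTAL_STEPS):
        if step == crash_at:
            return start              # simulate a power cut mid-run
        # ... one real training step would go here ...
        if (step + 1) % SAVE_EVERY == 0:
            save_checkpoint(step + 1)
    return start

first_start = train(crash_at=57)      # dies at step 57; last checkpoint is step 50
resume_start = train()                # resumes from step 50 instead of step 0
os.remove(CKPT)                       # clean up the demo file
```

The crash costs you at most `SAVE_EVERY` steps of work, not the whole run, which is the point being made about 30-day jobs.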
SexyAlienHotTubWater@reddit
Reading through these comments is extremely bizarre. Having a machine that footprints a B200 cluster for pipeline development before you kick off your expensive run is... Clearly extremely fucking useful?
What do people not understand about your numbers lol
Lorelabbestia@reddit (OP)
It's not free; it's cheaper.
eliko613@reddit
Great breakdown on the prototyping vs production workflow. The cost analysis really shows the value of that hybrid approach.
One thing I've seen teams struggle with as they scale this kind of workflow is tracking costs across the different stages and providers. You mentioned moving from Spark → Verda B200s, and trying Azure - it becomes tricky to get unified visibility into where your LLM budget is actually going, especially when you're iterating on hyperparameters and scaling experiments.
The €118 production run cost is clean, but I imagine tracking the true cost per experiment (including failed runs, debugging time, different batch sizes) gets complex when you're bouncing between local and multiple cloud providers. We started testing zenllm.io to get insights around our production workflows and it's been decent so far.
How are you currently tracking costs across your different training runs? Are you just using the cloud provider dashboards, or have you built something more unified for cost attribution per experiment?
rhofield@reddit
I think there is more nuance here; in many situations speed is a bigger factor than raw dollar cost.
That's not how that works; you still need to pay for the hardware, and it has a tangible cost you need to account for, although there are subtleties there too, e.g. the money isn't simply gone: the machine still has intrinsic value.
Overall, if you're more price-sensitive than time-sensitive and need to run multiple prototyping sessions, it might (and only might) be worth considering. You'd have to compare against going another route, like Strix Halo (which will retain more value as a machine, so it "costs" less), building a local machine, or renting slower hardware.
Lorelabbestia@reddit (OP)
Yes, exactly like you said. You'll find the math in the full article.
You can't train on Strix Halo, or barely can, and it's still very limited. What machine other than the DGX will give me 128 GB, let me cluster multiple devices, and deploy to enterprise hardware?
Macs are amazing, but where will you deploy anything created on a Mac other than on another Mac?
Emotional-Baker-490@reddit
You're a LARPer, aren't you?
Lorelabbestia@reddit (OP)
Man, I just saw your comments on Reddit, hating on everybody. Sorry, but you can't be taken seriously; your reputation speaks for itself.
Everything I do is LLMs, boy; started small, now slowly growing. I strongly advise you to go back to your mom's basement, under your dusty 8 GB GPU; there you'll find the love you deserve.
Agusx1211@reddit
Opportunity cost is not real guys pack it up
Lorelabbestia@reddit (OP)
The opportunity cost of prototyping on the Spark is €0. The opportunity cost of debugging on 8x B200s at €11.86/h is €1,220. That's the whole point of the post.
ScrapEngineer_@reddit
Now factor in that your rig will be 167x more used; I'm betting you're losing a lot of time on this.
Lorelabbestia@reddit (OP)
The 167x is for the production run, which I never did on the Spark. The Spark is for prototyping. They're two different stages, not two alternatives.
Fuzzy-Chef@reddit
Lorelabbestia@reddit (OP)
Haha, true. What about the energy bill though? I wish I had em
Direct_Turn_1484@reddit
While you’re buying GPUs just drop a few hundred thousand on windmills and solar panels. Easy.
Fragrant_Ganache_9@reddit
Too old-style; just build a small nuclear power plant, it's all nuclear talk these days.
Lorelabbestia@reddit (OP)
Mom told me she's building one next week.
Lorelabbestia@reddit (OP)
I'll talk to Xi about that
MelodicRecognition7@reddit
upvoting, thanks for another confirmation that Spark is not worth the money.
Lorelabbestia@reddit (OP)
I feel like nobody is getting the point here.
Assuming the CPT scenario I presented here, can you find cheaper hardware to spend a whole week prototyping on before deploying to enterprise hardware (like the 8x B200)?
I bought two DGXs as they were the only viable option a couple of months ago. Since I'll need more compute soon, I'd like to hear if there are better options at the moment for my intensive and not-so-usual workload.
I see people judging the DGX while running on AMD and barely being able to produce a LoRA. As for the Mac Studio, the hardware is f amazing, in some aspects even better than the DGX, but once I got a prototype running on the Mac, where the hell am I supposed to deploy it?
Aggravating_Disk_280@reddit
I use one and it is slow; you can feel that the memory is the limit. In my use case each round takes 8-12 h, and for that the Spark is fine, because I don't need to power off the machine.
Lorelabbestia@reddit (OP)
8-12 h is still reasonable; I wouldn't go to external hardware unless you're time-constrained.
Anarchaotic@reddit
Wait so you're saying it takes your spark 30 continuous days to run a single training project? What if something happens like your power goes out or the Spark has a slight error?
You're saying you'd rather wait a full month than pay 118 Euro to get results the same-day so you can actually see if the project worked out?
What if instead of buying the spark in the first place, you rented a much cheaper 2-4 Euro/hr machine to do that troubleshooting?
Idk man, if you don't value time whatsoever then I can see the reasoning here.
Lorelabbestia@reddit (OP)
Yes, insanity! That's why I used 8x B200s, one of the fastest and CHEAPEST (price/token) options available for my run, but only after setting everything up properly on the DGX.
Regarding the 30 days:
That's an ETA (estimated time of completion): how long it would take, not how long it took. I only ran long enough to get a couple of checkpoints, optimize for Blackwell, stabilize TPS, and estimate completion.
Also, if you are on a budget and have plenty of time, you can surely do a 30-day run. u/Anarchaotic, there's a thing called a checkpoint: you set how often the model being trained is saved. If shit happens, you simply continue training from that checkpoint, and while the model keeps training you can assess from the checkpoints whether it's behaving as you intended.
pfn0@reddit
Where's your depreciation and electricity usage?
Lorelabbestia@reddit (OP)
It's in the full article I posted on Medium.
TL;DR: it's so low compared to what I'd spend without them that it's not worth bothering about.
As I said:
it pays for itself in 6-7 projects, which basically eats the depreciation. As for the electricity bill, two DGXs at 100% for a full month would cost me about €50/month.
harpysichordist@reddit
>"Is the cost of DGX Spark worth it?"
>excludes the most obvious cost in the chart featured prominently
Yes, "break it down," AI slop.
Lorelabbestia@reddit (OP)
I wish someone had done this research beforehand; it would've made things much easier. But out there, other than the labs, almost nobody is doing anything beyond LoRA and single-user inference.
Lorelabbestia@reddit (OP)
Yes, haha. Not AI slop though; AI inspiration. If I had copy-pasted what Claude gave me, nobody would even read it.
Believe it or not, I put a lot of heart into this one.
Primary-Wear-2460@reddit
Unless you really need an NVIDIA-specific memory appliance: no.
Lorelabbestia@reddit (OP)
For sure there are more affordable options, but if you factor in the time you'll spend porting, say, from ROCm to CUDA, it's not worth it. Once setup time > money saved (depending on your wage, of course), do some 10 runs and you'll pay as much in time as a brand-new DGX costs.
Also, a big selling point for NVIDIA is the efficiency of Blackwell compared to other GPUs; I did some math and the B200 is the cheapest to train on, and together with the B300 they are also the fastest.
Primary-Wear-2460@reddit
Depends what you are doing. If you specifically need CUDA, stay with Nvidia.
If you are running inference on a typical LLM you don't need CUDA. Without knowing exactly what you're doing, I can't say how problematic your use case would be.
Where things are today for mainstream use cases:
Nvidia and AMD are basically running close to even on text gen inference for same tier cards.
Nvidia is ahead in image gen but AMD has finally reached a decent usable stage.
AMD barely works for image gen training.
Lorelabbestia@reddit (OP)
Yeah, it depends a lot.
You seem to know AMD: how are they doing on newer architectures like Mamba-2 or even newer ones? Are they catching up there too?
Primary-Wear-2460@reddit
I have a bunch of cards from both vendors so I've got some perspective on the differences as a user.
I couldn't tell you regarding Mamba-2. I tend to stick to the more mainstream stuff like text, image, and video generation. My two R9700 Pros are my go-to for text and image gen right now.
baseketball@reddit
Curious to know what datasets you added to the CPT and how much that improved performance for your use case.
Lorelabbestia@reddit (OP)
Still trying that out! I'll post about it as soon as I have benchmarks comparing trained vs. baseline. I'm prepping the benchmarks and will run them before and after DPO to assess the difference with and without DPO.