DGX Spark vs RTX 5090 for local AI workflows (LLMs + diffusion) — overkill or real upgrade?
Posted by Bisnispter@reddit | LocalLLaMA | 12 comments
I’m evaluating hardware for a local AI setup that mixes diffusion workflows (image/video generation) with LLM inference, but in a non-production context. The goal isn’t to serve requests or maximize throughput, but to build, test, and iterate on workflows locally with as much flexibility and stability as possible.
The obvious baseline is a high-end consumer GPU like a 5090. It gives you a solid 32GB of VRAM, strong performance, and a very flexible environment where you can run pretty much anything: local LLMs, diffusion pipelines, custom tooling, etc. For most people that's already more than enough, and scaling beyond it usually means just adding more GPUs or moving to the cloud.
However, I’m considering whether something like a DGX Spark actually changes the equation. Not in terms of raw performance per dollar — which I assume is worse — but in terms of how the system behaves when you start combining different types of workloads. In my case, that means running diffusion pipelines (ComfyUI-style), doing some video generation, and also running local LLMs (via things like Ollama or LM Studio), sometimes within the same broader workflow.
What I’m trying to understand is whether DGX Spark provides any real advantage in that kind of mixed workload scenario. Does it actually improve stability, memory handling, or workflow orchestration when you’re juggling multiple models and processes? Or does it end up being essentially the same as a powerful consumer GPU, just more expensive and less flexible?
Another concern is how “open” the environment really is. A big part of working locally is being able to tweak everything — models, runtimes, pipelines, integrations — and I’m not sure if a DGX-style system helps with that or gets in the way compared to a standard Linux workstation with one or more GPUs.
So the core question is: for local AI work that combines LLMs and diffusion, but doesn’t require production-level throughput, does DGX Spark offer anything that justifies the jump from a 5090? Or is it mostly relevant once you move into multi-user or production-scale environments?
Would really appreciate input from anyone who has used DGX systems in practice, especially outside of strictly enterprise or production use cases.
Shot-Buffalo-2603@reddit
The main advantage would be the 128GB of unified RAM, so you can run more, even though it would be slower. If everything you want fits on a 5090, the 5090 is the better choice. If you're not comfortable using a Linux terminal, the Spark will also be a lot more difficult to use. To get the most out of it you should be using vLLM or SGLang, since it's basically designed as a dev device mirroring a real server stack at low power. You can likely get things like Ollama/LM Studio to run on it, but since it's ARM-based with a non-standard OS, I'd expect issues to pop up that are difficult to resolve. I have one and went straight to running vLLM in a Docker container.
Bisnispter@reddit (OP)
This is super helpful, especially the point about vLLM / sglang — that actually aligns more with what I’m trying to understand.
I think the confusion in my original question is that I’m not really evaluating this as “run a model locally”, but more as part of a system where different stages are chained (generation, processing, QA, etc).
So the tradeoff I’m trying to understand is not just VRAM vs speed, but whether having more memory (like in the Spark) actually helps when you want to keep multiple things available vs constantly loading/unloading on a single GPU.
From your experience, did you find that the unified memory made a practical difference for that kind of workflow, or did the lower compute basically cancel out the benefit?
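For what it's worth, the "keep everything resident vs. load/unload per stage" tradeoff is easy to frame as back-of-envelope arithmetic. This is a minimal sketch; the model sizes and load bandwidth are illustrative assumptions, not measurements from either machine:

```python
# Back-of-envelope cost of swapping models on a 32GB card vs. keeping
# them all resident in 128GB of unified memory. Numbers are illustrative.

def swap_seconds(model_gb: float, transfer_gbps: float) -> float:
    """Time to (re)load a model's weights at a given transfer rate (GB/s)."""
    return model_gb / transfer_gbps

# Assumed pipeline: an LLM + a diffusion model that don't fit together in 32GB.
llm_gb, diffusion_gb = 20.0, 14.0

# Assumed effective load bandwidth from NVMe/system RAM into VRAM (GB/s).
load_bw = 5.0

# On a 32GB card, each round trip through the pipeline evicts and reloads both
# models; on a 128GB box both stay loaded and this cost is ~zero.
per_cycle = swap_seconds(llm_gb, load_bw) + swap_seconds(diffusion_gb, load_bw)
print(f"~{per_cycle:.1f}s of reload latency per pipeline cycle")
```

Whether that latency matters depends entirely on how often your workflow crosses stage boundaries; for long batch runs it amortizes away, for interactive iteration it adds up fast.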
Miserable-Dare5090@reddit
The CPU on these little boxes is fast. I mean, it's surprising, because people mostly talk about the unified RAM. But the biggest draw is the 200Gb network card, which lets you hook up 2, 4, or 8 of these with 200Gb GPU-to-GPU bandwidth (no PCIe bus overhead, no DDR5 RAM speed overhead). So my 2-node cluster doubles the CUDA compute, and inference is about 1.5x. Now, that only comes to about the CUDA core count of a single 5090 for two Sparks together, but running models 8x larger.
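The memory math behind that "8x larger" claim is simple to check. A minimal sketch; the compute and inference scaling figures are the commenter's own rough numbers, not verified specs:

```python
# Arithmetic behind the 2-node Spark cluster claim. All scaling figures are
# the commenter's observations, treated here as assumptions.

spark_mem_gb = 128
gpu_5090_mem_gb = 32
nodes = 2

aggregate_mem_gb = spark_mem_gb * nodes                   # 256 GB for weights
model_size_ratio = aggregate_mem_gb / gpu_5090_mem_gb     # vs a single 5090

compute_scaling = 2.0    # aggregate CUDA compute doubles with 2 nodes
inference_scaling = 1.5  # observed end-to-end speedup (interconnect overhead)

print(f"2 Sparks: {aggregate_mem_gb}GB total, "
      f"{model_size_ratio:.0f}x larger models than one 5090")
```

The gap between 2.0x compute and 1.5x inference is the cost of going over the network instead of staying on one die, even with a fast interconnect.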
Serprotease@reddit
Just be aware that for ComfyUI, there seems to be a bug in the way it handles unified memory that effectively halves the amount of VRAM available.
It seems to be something between CUDA, Ubuntu, unified memory, and ComfyUI, and Nvidia has mentioned it should be solved with a soon-to-come kernel update. But promises only bind those who believe in them, so for all intents and purposes you only have 64GB of available memory with ComfyUI.
vllm-Omni could be a workaround, usable inside a ComfyUI workflow, but I have yet to try it.
Miserable-Dare5090@reddit
Haven't seen this issue on my Spark.
Shot-Buffalo-2603@reddit
I do training and single-user agentic tasks, mostly with one model at a time, so I can't speak from experience. If the models you want to run locally can each fit in the 5090 individually, you would almost certainly get better performance unloading/loading one at a time vs. having them all loaded at once on a Spark. In that situation I would only go the Spark route if I wanted to run something that wouldn't fit in my 5090. I would also weigh the trade-offs of a Mac Studio with similar VRAM instead of a Spark, depending on your interest in a developer-like setup vs. just being a user. On a Spark you will almost certainly be compiling CUDA kernels and spending hours debugging stuff, whereas a Mac will just work out of the box, be slightly slower at prefill, and have similar inference speeds.
Miserable-Dare5090@reddit
LM Studio runs great on the Spark.
putrasherni@reddit
I'd rather you buy another 5090 than a DGX.
dobkeratops@reddit
Diffusion workflows are probably better on a 5090.
The DGX Spark will handle larger MoE LLMs.
I have a Spark and a 4090 to compare. The Spark gets more use because I'm less worried about the electricity bill, fire risk, "is this jet-engine fan noise waking up the neighbours", etc.
ImportancePitiful795@reddit
Well, here's how you need to look at it.
A 5090 these days costs about as much as a DGX Spark.
So if 32GB of VRAM is enough for everything you want to do, get the 5090 (assuming you already have a desktop to put it in, and make sure you have an ATX 3.1 PSU; don't cheap out with an ATX 3.0 one).
The DGX Spark with 128GB of unified memory will run models that fit in 32GB VRAM more slowly, but it can fit 100GB+ models that cannot run on the 5090 at all.
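A quick way to check which side of that line a given model falls on. Back-of-envelope only; the 20% overhead factor is an assumption, and real KV-cache needs vary with context length:

```python
# Rough "does it fit?" check: weights-only footprint ≈ params x bytes/param,
# plus assumed headroom for KV cache and activations. Illustrative, not exact.

def fits(params_b: float, bytes_per_param: float, mem_gb: float,
         overhead: float = 1.2) -> bool:
    """True if a model of `params_b` billion parameters at the given
    quantization plausibly fits in `mem_gb` of (V)RAM, with ~20% overhead."""
    return params_b * bytes_per_param * overhead <= mem_gb

# A 70B model at 4-bit (~0.5 bytes/param) needs ~42GB with overhead:
print(fits(70, 0.5, 32))   # too big for a 5090's 32GB
print(fits(70, 0.5, 128))  # fits easily in the Spark's 128GB
```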
So, take your pick.
Also, if you do not NEED CUDA, consider an AMD 395-based miniPC/laptop with 128GB. 395/388 miniPCs with 128GB start from around $2000 (the Bosgame M5, for example), and the most expensive models aren't any faster. So don't buy the Framework or GMKtec X2, which are ridiculously priced; at that money the DGX makes more sense.
Performance-wise it's similar to the DGX Spark.
Also, if you do not NEED CUDA but want something in the middle with an option to upgrade, consider the R9700. You can get two of these for way less than a single 5090.
Especially if your current desktop can run its PCIe slots at x8/x8, happy days: you get 64GB of VRAM at respectable bandwidth, about 3x that of the DGX Spark/AMD 395, and can still fit models twice as large as a single 5090 can.
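Those bandwidth differences translate almost directly into decode speed for big dense models. A rough sketch; the Spark figure is the commonly cited ~273 GB/s spec, and the dual-R9700 number just applies the "3 times higher" claim above, so treat both as assumptions:

```python
# Upper-bound decode speed for a memory-bandwidth-bound dense model:
# each generated token reads roughly the full weight footprint once.
# (MoE models read only the active experts, so they'd do better than this.)

def decode_tps_upper_bound(mem_bw_gbs: float, model_gb: float) -> float:
    return mem_bw_gbs / model_gb

model_gb = 40.0  # e.g. ~70B params at 4-bit, weights only (illustrative)

spark_bw = 273.0              # assumed unified-memory bandwidth, GB/s
dual_r9700_bw = 3 * spark_bw  # per the "3 times higher" claim above

print(f"Spark/395: ~{decode_tps_upper_bound(spark_bw, model_gb):.0f} tok/s max")
print(f"2x R9700:  ~{decode_tps_upper_bound(dual_r9700_bw, model_gb):.0f} tok/s max")
```

Real numbers land below these bounds (attention, kernel overhead, interconnect), but the ratio between the two setups is roughly what you'd expect to see.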
It's all down to what you want at the end of the day and not having regrets.
So, to sum up: if you believe you'll regret not having CUDA and regret being restricted to 32GB of VRAM, get the DGX Spark. Otherwise look at AMD for more options, at less money, at comparable speeds.
I have a Bosgame M5 128GB (got it months ago for €1700) but also want a DGX Spark just to fiddle with. Just food for thought. In any case, I should be able to run a large model on each and use the A0 on my desktop to drive them together.
SC_W33DKILL3R@reddit
The DGX Spark is as open as anything else. It runs all the major apps, and most of the other stuff is in Python, which it comes with. You sometimes need to compile a Python app for the Spark, but it's Linux, so that's a given.
I have found the Spark fast enough for inference with larger models like Qwen3 Coder; it's a bit slower with the thinking models.
You can also look at the AMD Ryzen AI Max+ 395, as that comes with 128GB of RAM and is a little more of a consumer system.
mangoking1997@reddit
If it fits on a 5090, the 5090 is way, way faster. If it doesn't, it will struggle to run, depending on the implementation.
You won't be running different workloads at the same time regardless (unless you make some big compromises on the models); you simply don't have the performance with either option. The 5090 doesn't have the VRAM to keep them all loaded, and the DGX just isn't very fast; its only benefit is enough memory to run larger models.