GB10 / DGX Spark owners: is 128GB unified memory worth the slower token speed (on a max $4,000 budget)?
Posted by Soltan-007@reddit | LocalLLaMA | View on Reddit | 52 comments
I’m a full-stack web developer, very into “vibe coding” (building fast with AI), and I’m considering a GB10-based box (DGX Spark / ASUS GX10) as my main “AI team” for web & SaaS projects. My maximum budget is $4,000, so I’m choosing between this and a strong RTX workstation / Mac Studio. What I’m trying to understand from real users is the trade-off between unified memory and raw generation speed.

GB10 gives 128GB of unified memory, which should help with:

- Long-context work on large Laravel / web / SaaS projects with lots of files and services: keeping more of the codebase, docs, API schemas, and embeddings “in mind” at once.
- Running multiple agents/models in parallel (architect, coder, reviewer/QA, support/marketing bots) without running out of AI memory.

Competing setups (high-end RTX workstation or Mac Studio) usually have much faster token generation but less AI-usable memory than 128GB (VRAM / unified), so you’re more limited in:

- How big your models can be.
- How much context you can feed.
- How many agents/models you can keep loaded at the same time.

From people actually using GB10 / DGX Spark / ASUS GX10:

- On big web/Laravel projects with many files and multiple services, does 128GB unified memory really help with long context and understanding the whole project better than your previous setups?
- In practice, how does it compare to a strong RTX box or a Mac Studio with less memory but faster generation, especially under a ~$4k budget?
- When you run several agents at once (architect + coder + tester + support bot), do you feel the large unified memory is a real win, or does the slower inference kill the benefit?
- If you’ve used both (GB10 and RTX/Mac), in day-to-day “vibe coding”, do you prefer more memory + more concurrent agents, or less memory + faster tokens?
- Roughly how long does your setup take to generate usable code for small web apps (simple CRUD / small feature), medium apps (CRM, booking system, basic SaaS), and larger apps (serious e-commerce or multi-tenant SaaS)?

I care less about theoretical FLOPS and more about real workflow speed with long context + multiple agents, within a hard budget cap of $4,000. Any concrete experiences would really help.
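For scale, here's the rough memory math I'm working from (back-of-envelope only; every parameter value below is my own assumption, not a measured figure for any specific model):

    # Back-of-envelope: how much unified memory one model + one long context eats.
    # All parameter values are illustrative assumptions, not measured numbers.
    def model_mem_gb(params_b: float, bits: int = 4) -> float:
        """Weight memory for a quantized model (ignores runtime overhead)."""
        return params_b * 1e9 * (bits / 8) / 1e9

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    context: int, bytes_per_value: int = 2) -> float:
        """KV cache: 2 (K and V) x layers x kv_heads x head_dim x context."""
        return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

    weights = model_mem_gb(120)                # a ~120B model at 4-bit: ~60 GB
    cache = kv_cache_gb(36, 8, 128, 128_000)   # one 128k fp16 context: ~19 GB
    print(f"one coding agent: ~{weights + cache:.0f} GB of the 128 GB budget")

By that math, one big agent with a long context already eats most of the 128GB, so "several agents" probably means several smaller models rather than several big ones; that's exactly the kind of thing I'd like sanity-checked by real users.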
ImportancePitiful795@reddit
The DGX is NOT an at-home inference system.
Get an AMD 395-based 128GB mini PC (around $2,000 depending on the model), or build a system with 2x R9700s if you're at that sort of budget ($4,000).
bebetterinsomething@reddit
How is a Strix Halo around $2,000? All I see is Minisforum for $3.5K or Bosgame for $2.6K.
ImportancePitiful795@reddit
4 MONTHS ago, when I made that post, yes.
bebetterinsomething@reddit
Just 4 months ago they were $600 cheaper? WtH is with these prices???
ImportancePitiful795@reddit
It was actually $1,000 cheaper in September.
Well, seems like you've been living under a rock for the last 6 months? (joking) 🤣
Altman signed a deal with Samsung and SK Hynix to buy 50% of their raw DRAM wafer production at inflated prices for several years up front. RAM prices have gone up 500%-700% in just 6 months.
A €130 32GB DDR5-6400 kit from September is now close to €500. 128GB of DDR5-6400 is over €1000.
The 16x64GB DDR5-5600 RDIMM modules I bought last year for €3600 were going for €22,000 last time I checked.
At this point, buying a full GH200 server at €18,000, like the ARS-111GL-NHR-LCC, makes sense 🤣
bebetterinsomething@reddit
Yeah, I guess I'm a slowpoke (: How is the Bosgame M5? It's currently the cheapest option for me at $2.6K. Minisforum is $3.5K.
ImportancePitiful795@reddit
I love mine, with its quirks.
I put in 2x 2TB PCIe 5.0 NVMe drives (slow ones, 10GB/s) so they run at max speed (7.9GB/s) in the PCIe 4.0 slots it has, since the default drive is slow at 3GB/s.
I also replaced the thermal paste with MX7, improving the cooling and reducing the heat, which consequently improved performance. If you can find the Abee model at the same price, you won't have to do this, because it's fully watercooled.
Paste replacement is a modification quite a lot of people have done, even on the other models, since they all have the same PCB and almost all companies used laptop coolers.
Don't replace the thermal paste with anything less than MX7 or Honeywell TP. Almost everything else has been tested and performs worse than the default paste. Check the Strix Halo Discord channel.
You'll also find a lot of other modifications there, like hooking 2 together in a 2U case, custom cases, you name it. :)
bebetterinsomething@reddit
Thanks for sharing that! So the default drives are subpar and I'd have to put in even more money to optimize performance. Tinkering, putting everything into a separate case... With all that, the Framework motherboard doesn't sound like a bad idea.
ImportancePitiful795@reddit
The system is fine from the get-go. I replaced the drives, paste, etc. because I like to tinker with the systems I have.
As for the Framework Desktop, the PCB alone is over €3000, totally barebones.
So you'd still need to buy NVMe, fans, case, etc., which works out much more expensive than getting the Bosgame M5 and buying the cheapest PCIe 5.0 NVMe, IF you want it to be faster.
And at that price for the Framework Desktop PCB alone (a totally ridiculous price, double what it was last year), it's not that far off from buying the DGX Spark.
bebetterinsomething@reddit
My memory doesn't serve me well. You're totally right! That board itself is $3.0K, $400 more than Bosgame's full package. Why did I think it was $2.0K?
ImportancePitiful795@reddit
That was the launch price of the complete Framework Desktop, with case and NVMe.
Serprotease@reddit
The relatively poor bandwidth is what makes it a somewhat dubious choice. The AMD 395 has the same bandwidth AND 4x slower prefill speed. It's an even worse choice, just cheaper.
ImportancePitiful795@reddit
Yet it has the same, if not faster, token generation...
StardockEngineer@reddit
Nah, not for coding. Prefill is too slow.
DuragonYama@reddit
What did you end up doing?
aceangel3k@reddit
I have the DGX Spark, and even though it is an awesome and surprisingly capable machine, you will be asking too much of it. The MiniMax M2.1 model is very usable for coding, but you will only be running that one model at a time, since it takes up 110GB of RAM. A single Spark barely has enough memory; you'd want at least two of them. I agree that the money is better spent on cloud compute credits; that will be better for your use case.
reality_comes@reddit
Bro, space bar exists.
MonthlyWorn@reddit
lmao yeah that wall of text hit me like a truck too
definitely needs some paragraph breaks but the question's actually solid underneath all that
MarkoMarjamaa@reddit
Wow, really? How can I get there?
FBIFreezeNow@reddit
Where is it?
StardockEngineer@reddit
Below and to the left of his other missing key.
allSynthetic@reddit
FYI, I have one and I get reasonable results with gpt-oss-120b. It's NVFP4, so it's very efficient. This is not a GGUF model. It works wonders with Kilo Code.
SillyLilBear@reddit
I don't believe so. I have a Strix Halo and dual 6000 Pros. I found the Strix unusable for anything.
robberviet@reddit
I think people should get two things clear before buying one of these: whether it can run big models vs. whether it runs fast enough for their use cases.
Most people who expect it to be fast enough for code generation/vibe coding will be disappointed. When I work on anything useful, context is always around 50k to 150k.
SillyLilBear@reddit
I find it too slow for anything but just asking chat "how are you doing today".
Eugr@reddit
I have a Strix Halo and dual DGX Sparks. The Strix Halo is OK with gpt-oss-120b, but the DGX Spark has 2x+ faster prompt processing. Dual Sparks running vLLM with tensor parallel are surprisingly capable: not as good as dual 6000 Pros, but good enough to run MiniMax 2.1 in AWQ 4-bit at usable speeds.
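For reference, the launch looks roughly like this (a minimal sketch; the model ID is my guess at a Hugging Face repo, and a real dual-Spark run is two separate nodes, so it also needs vLLM's Ray-based multi-node setup on top of this):

    # Minimal vLLM tensor-parallel sketch. tensor_parallel_size=2 splits the
    # model across 2 GPUs; on two physical Sparks you'd add Ray multi-node config.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="MiniMaxAI/MiniMax-M2.1-AWQ",  # hypothetical repo id, check HF
        tensor_parallel_size=2,
        quantization="awq",                  # 4-bit AWQ weights
        max_model_len=131072,                # long context for coding agents
    )
    outputs = llm.generate(["Explain this Laravel service class..."],
                           SamplingParams(max_tokens=512))
    print(outputs[0].outputs[0].text)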
SillyLilBear@reddit
Strix is OK with the 120B, until you actually do something. Sure, I can get 50 t/s, but it is incredibly slow doing anything.
Eugr@reddit
41 t/s generation at small context; prefill is hard to pin down, as vLLM's reporting is all over the place and TTFT includes other latencies. Surprisingly usable in Cline/Copilot/Claude Code.
SillyLilBear@reddit
I assume the speed degrades very quickly. Even on an RTX 6000 it degrades rapidly with M2/M2.1, far faster than GLM.
Eugr@reddit
I was seeing 25 t/s at around 30% KV cache utilization. I'll run benchmarks when I get a chance.
tamerlanOne@reddit
In my opinion, you could build a hybrid setup: an AMD AI Max+ 395 for around $2,500 to work locally, plus cloud APIs for the serious problems, optimizing both your workload and budget. But it depends a lot on your work context and the complexity of the projects you develop.
MarkoMarjamaa@reddit
As always, it's about cost vs. speed/memory. If you had deep pockets, you'd buy a supercomputer.
So, questions for OP: How much are you planning to use it? How important is speed? If you can estimate how much compute you need and how much revenue you'd get from it, it's just math.
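A toy version of that math (every number here is a placeholder; plug in your own estimates):

    # Toy break-even: $4k local box vs paying per token in the cloud.
    # All inputs are placeholder assumptions; substitute your own usage.
    box_cost = 4000.0         # hardware budget, $
    power_per_month = 15.0    # electricity guess, $
    cloud_per_mtok = 1.0      # blended cloud price, $/million tokens (guess)
    tokens_per_month = 300e6  # tokens you actually generate per month (guess)

    cloud_per_month = tokens_per_month / 1e6 * cloud_per_mtok
    months = box_cost / (cloud_per_month - power_per_month)
    print(f"cloud: ${cloud_per_month:.0f}/month -> box pays off in {months:.1f} months")

If the payoff period is longer than the hardware stays useful, the cloud wins.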
With the Ryzen 395 there might be some problems with Python packages because of CUDA, but it's getting better. Yesterday I tried Microsoft's Florence-2 and had to install a special version of flash-attention for Triton. It's not always straightforward, but it's getting better.
I have a Ryzen 395 running gpt-oss-120b, and it's OK for now. One good point: when you're actually running it, you learn what you need and how much you need it. And if you need moar power, it will be easy to sell the Ryzen 395 and buy the next step up.
HealthyCommunicat@reddit
NO. DO NOT BUY. I foolishly bought the ROG Flow Z13 128GB AND the DGX Spark; they both have the same memory bandwidth. Even the token/s on the M4 Max at 500+GB/s is barely usable for me. If you are a production-level worker as a software dev, or even plan to be, be careful. I returned both and went for the M4 Max, and will be going for the M3 Ultra soon. Spare the money and go for the M3 Ultra: Micro Center carries a higher-end one with 256GB RAM for about $5,300, the same price as the 128GB M4 Max laptop.
StardockEngineer@reddit
Nah. At just 8k context the DGX can start and finish before the M3 Ultra even gets started on generating tokens.
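The back-of-envelope, if you want it (the t/s figures below are made up for illustration, not benchmarks of either box): total latency is prompt/prefill plus output/generation, and at agent-sized prompts the prefill term dominates:

    # Why prefill speed matters more than generation speed at coding context sizes.
    # Throughput numbers are illustrative placeholders, not measurements.
    def total_seconds(prompt_toks, out_toks, prefill_tps, gen_tps):
        return prompt_toks / prefill_tps + out_toks / gen_tps

    fast_prefill = total_seconds(8000, 500, prefill_tps=2000, gen_tps=30)
    fast_gen = total_seconds(8000, 500, prefill_tps=300, gen_tps=70)
    print(f"fast prefill, slow gen: {fast_prefill:.0f}s total")  # ~21s
    print(f"slow prefill, fast gen: {fast_gen:.0f}s total")      # ~34s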
HealthyCommunicat@reddit
You're right about this. I have a video from a few days ago, when I still had the DGX Spark, of it beating the M4 Max on speed with gpt-oss-120b. But any other model...
StardockEngineer@reddit
I’ve found this to be the case with every model I’ve tried. I also have an M4 Max.
HealthyCommunicat@reddit
You're saying every single model runs faster on the DGX Spark than on the M4 Max?
StardockEngineer@reddit
I’m saying I’ve tried a lot of models and that’s been my experience.
HealthyCommunicat@reddit
The same exact gpt-oss-120b at high reasoning, with the same exact prompt, did 42 token/s with 0.3s of prompt processing on the DGX Spark, while it did 70 token/s with a full 3 seconds of prompt processing on the M4 Max. This is due to the higher memory bandwidth; the M4 Max LITERALLY has nearly 2x the memory bandwidth. I think you should read up a bit on how to best utilize your M4 Max if you really think your DGX Spark beats it. It's pretty widely accepted that the Spark is a massive disappointment.
StardockEngineer@reddit
What was the context length? Because you didn’t provide any relevant details.
CodeSlave9000@reddit
The DGX Spark is not an inference machine; it's a training and prototyping lab for NVIDIA infrastructure. If you're building out DGX systems, then this is a great box: it's basically the development box. If you're looking to actually run LLMs, this is NOT the box for you. You will be frustrated; performance will be somewhere between an RTX 5060 Ti and an RTX 5070 at best. If you need that VRAM on a similar budget, go get a used GPU server and put an RTX 6000 Pro in it, or get a Mac.
Serprotease@reddit
Honestly, it's not as bad as it seems. Prefill is decent for a $3-4,000 price, and with MoE models, token generation is tolerable. If you plan to run 70-123B dense models (Devstral), then get an R9700 Pro or 4x 3090s instead.
A Mac is good, but it's better for purely chat-based interactions, where token generation matters most and the messages sent are not that big. For agent systems, the slow-ish prefill can be felt.
one-wandering-mind@reddit
I think any at-home LLM inference will be disappointing and come with compromises. The RTX 6000 Pro is the only thing with both fast prompt processing and fast throughput, but as far as I can tell it costs over $8k just for the card, if you can find it. The Spark is niche and early, but excels at FP4 compute. So for something like gpt-oss-120b, you get much better prompt processing than anything at its price, but generation throughput is limited by the slow memory bandwidth.
It still feels like running your own LLMs at home rarely makes financial sense. Doing it with small models you fine-tune does, I think, because otherwise you have to spin up cloud compute every time you want to run them. But if you are running models as-is, you only pay for the exact tokens you use and can take advantage of cloud economics and speed. For the extreme example, you can use Cerebras and get 2,000 tokens per second of generation throughput for gpt-oss-120b. That gives you a really fast feedback loop.
CodeSlave9000@reddit
Yeah, I agree. LoRA and fine-tunes are perfect for home running. Also, once your context size gets big, you're really paying a lot per token in the cloud. But in the end it depends on what your expectations are. The Blackwell cards are still maturing in software support, and I've had hiccups; FP4 is really only happening for training right now. You can get really good results with the 40-series Ada cards too: I see 100+ tokens/sec on a lot of MoE models. You won't get 128GB models at the DGX's price, but I'd think you'd probably be happy with a Strix Halo if you're really dead set on it. And for coding, you're spot on: you can get Gemini, Qwen, Amp, and a few others for basically nothing right now. Use them.
doradus_novae@reddit
I have 6000 Pros, 5090s, and Sparks. The difference is clear: the Spark CAN DO some things, but it's not a workhorse. It's too slow for most things, though it can put up decent numbers with certain models and certain work. Here's a thread about this from the other day: https://www.reddit.com/r/LocalLLaMA/comments/1ptakw0/2x_dgx_spark_vs_rtx_pro_6000_blackwell_for_local/
thatguyinline@reddit
I returned mine. Couldn't find anything it was particularly good for.
abnormal_human@reddit
OK, that was unreadable, but I do own a DGX Spark and I would not suggest using it for this. Put the money towards cloud APIs instead.
The smallest box I would consider using locally for "vibe coding" support is 4x 6000 Blackwells ($40k) with something like GLM 4.7, but that's still not as good as Opus or GPT-5 Codex, and honestly that's a stupid way to spend $40k, because the depreciation runs faster than you could ever spend in the cloud.
Illustrious_Matter_8@reddit
Forget it. I have a 3080 Ti 12GB and there are no good coding models in this memory range. The fun stuff starts at 20GB, and even that's small for the real models. You'll be able to run Stable Diffusion pictures and barely short movies.
bigh-aus@reddit
IMO the only use case for a GB10 is building out a stack to scale up to full-on Blackwell servers. For your use case, a Mac Studio would be a better way to go.
Based on how far you're intending to go, I'm concerned that your needs are going to far exceed a $4,000 budget. If I were you, I would rent some compute somewhere to try out different models. Once you know what model you want to run, let that set how much VRAM you need, then dial in speed based on budget. If you're looking to run a 200B+ model, unified RAM is your best bet, unless you have deep pockets and/or are willing to build out a 3090/4090/5090 rig.
Also, don't discount resale value. The secondhand market for a GB10 is going to be significantly smaller than that for a Mac Studio. That said, also note there are rumors of an M5 Max Studio in June (WWDC).
I would like to buy a local rig as well, but I'm holding off until June. In the meantime, the intent is to use Claude and experiment with a rackmount server I have at home (256GB RAM), understanding that it's going to be hellishly slow. Everything I've read says Claude is still superior to local models unless you're at a very high parameter count, but I would love to be proved wrong on that assumption.
That said, 6 months of Claude at $200 a month burns through $1,200 that could be used to buy a GPU.
teh_spazz@reddit
Put the $4,000 towards cloud APIs.
spiffco7@reddit
I think you'll be happier just using cloud inference plus getting the Mac. I own Macs with 96GB and 128GB unified memory, and a Spark too.
texasdude11@reddit
This will answer your question:
https://youtu.be/HliRC6qCkqk