Have the GB10 devices become the current "best value" for LLMs?
Posted by DiscombobulatedAdmin@reddit | LocalLLaMA | 53 comments
I want to buy some real hardware because I feel like I'm falling behind. 3090s are >$1,000 on eBay, and building out a server would be very expensive at current memory and storage prices. Macs are backordered for the next 5 months. I have no idea about the status of AMD or Intel products, but I don't want to fight driver and compatibility issues on top of trying to get models and harnesses running.
Are the GB10 variants the best value if you want to buy now? Is it better to try to wait on the M5 releases in 2-4 months? That seems like forever in today's fast-moving environment.
ketosoy@reddit
One Strix device at half the cost will usually keep pace with one GB10 on MoE decode (and usually lose on dense models, MoE prefill, and in a two-node topology; at three to four nodes it gets complicated again).
An Epyc RAM sled can get you better performance per dollar and access to bigger MoE models, and a used Mac M1 can get you really interesting performance per dollar on ~50B MoE models. I haven't modeled either of these out on dense models yet.
t4a8945@reddit
I own two Asus Ascent GX10s, paid 5,400€ total (including the cable), and I'm very happy with them. MiniMax M2.5 is where it's at for me for agentic coding.
The GB10 is a sweet spot for MoE, very bad for dense models.
Power consumption (and thus heat generation) is also very nice: around 240W for the cluster while working hard. People often forget this recurring cost of a beefy dual-GPU setup.
But of course, new tech will supersede it some day, as always. Then I'll make myself a four-unit cluster, haha.
Zyj@reddit
Which M2.5 quant do you use, and what's the performance like?
t4a8945@reddit
That's one full session of agentic coding with cyankiwi/MiniMax-M2.5-AWQ-4bit.
TG scales down roughly linearly from 42 t/s, down to 16 t/s at the far end of the context.
The prompt-processing numbers in the screenshot take the KV cache into account; they're not "incremental prompt processing speed".
Easy-Unit2087@reddit
I think the GB10 is the best value for local LLMs at two nodes (2x $3,400 Asus GX10 1TB, plus an $80 QSFP56 cable). Thanks to 200GbE, adding a second node nearly doubles both speed and the model size that will fit.
Two nodes run Intel/Qwen3.5-397B-A17B-int4-AutoRound at around 1,500 t/s PP and 30 t/s TG on vLLM, perfect for agentic work. They're also great at Stable Diffusion with ComfyUI (the worker node takes half of the images and half of the upscaling) and at fine-tuning small models.
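If you want to sanity-check those TG numbers yourself, a rough single-stream probe against vLLM's OpenAI-compatible endpoint looks something like this (the URL is vLLM's default port; the prompt and token count are arbitrary, and your served-model name may differ):

```python
# Rough single-stream TG probe against a vLLM OpenAI-compatible server.
# Assumes the default port; the served model name may differ in your deploy.
import time
import requests

BASE = "http://localhost:8000/v1"
MODEL = "Intel/Qwen3.5-397B-A17B-int4-AutoRound"

t0 = time.time()
resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Explain MoE routing in 200 words."}],
        "max_tokens": 300,
    },
    timeout=600,
).json()
elapsed = time.time() - t0

out = resp["usage"]["completion_tokens"]
# Note: elapsed includes prompt processing, so this slightly understates TG.
print(f"{out} tokens in {elapsed:.1f}s -> ~{out / elapsed:.1f} t/s")
```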
Igot1forya@reddit
I managed to get a 3090 connected via M.2-to-OCuLink to my Spark as well. Booting from USB works great, and models loaded via NAS/SAN work well too.
Easy-Unit2087@reddit
Whoa. Would you mind sharing how you did this? How do you use it together with the GB10?
Igot1forya@reddit
I posted pictures in a GitHub issue with some info: https://github.com/chappa-ai-llc/spark-smi/issues/1
StardockEngineer@reddit
lol wut. You’re a wild one. How do you use it? What’s the flow?
Igot1forya@reddit
I've found the 3090 is about 3-6x faster than the GB10, and it makes for a great model router for OpenClaw. I keep my smaller models on the 3090 and my large ones on the GB10, and it works incredibly well. The OCuLink is a bottleneck for sure, but having an extra GPU with fast memory is awesome.
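To make "model router" concrete, here's a minimal sketch of the idea (hostnames, ports, and model names are placeholders, not my actual setup; both boxes just expose OpenAI-compatible endpoints):

```python
# Hypothetical two-backend router: small/fast models on the 3090 box,
# big unified-memory models on the GB10. All names/URLs are placeholders.
import requests

BACKENDS = {
    "small-coder": "http://3090-box:8000/v1",  # fits in 24GB, fast decode
    "big-moe": "http://gb10-box:8000/v1",      # needs the 128GB UMA pool
}

def chat(model: str, prompt: str) -> str:
    """Forward an OpenAI-style chat completion to whichever box hosts the model."""
    resp = requests.post(
        f"{BACKENDS[model]}/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat("small-coder", "Summarize this diff in one line: ..."))
```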
hurdurdur7@reddit
I would not buy a GB10. It can fit big models and process prompts quite fast, but token generation is slow for big models because the memory bandwidth is low. On that point, the price-to-performance is just wrong.
I would wait for the M5 Ultra, or go for a multi-GPU rig.
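The bandwidth complaint is easy to quantify: batch-1 decode is memory-bound, so a ceiling on TG is roughly bandwidth divided by the bytes of active weights streamed per token. A back-of-envelope sketch (273GB/s is the GB10's published LPDDR5X spec; the model sizes are illustrative):

```python
# Decode ceiling: each generated token streams the active weights through
# memory once, so TG <= bandwidth / bytes_per_token. Ignores KV cache/overhead.
def max_tg(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

BW = 273  # GB10-class LPDDR5X, GB/s
print(f"70B dense @ 4-bit:     ~{max_tg(BW, 70, 0.5):.0f} t/s ceiling")  # ~8 t/s
print(f"17B-active MoE @ 4-bit: ~{max_tg(BW, 17, 0.5):.0f} t/s ceiling") # ~32 t/s
```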
last_llm_standing@reddit
What would your suggestion be for a multi-GPU rig under 2,000 USD?
cunasmoker69420@reddit
Chinese-modified 22GB RTX 2080 Tis you can find on eBay.
Tyme4Trouble@reddit
Under 2,000 USD? For anything other than inference you're really going to want 1:1 DRAM to VRAM, which in this economy is going to bite even with DDR4.
HCLB_@reddit
Why 1:1 DRAM to VRAM?
last_llm_standing@reddit
I have access to a hardware disposal dump from an MNC. They dump everything other than GPUs (sometimes one slips through), so I can pull anything that isn't soldered to the machine. I already found two sets of DDR5-6000 (2x16GB). I'm trying to build from the junkyard, so the GPU is the only thing I'm trying to buy.
laffer1@reddit
I just bought a 7800 XT for $400 new from Woot. I was upgrading from an Instinct MI25.
Performance on Microsoft's Phi-4 model:
- MI25: 16 t/s
- 5070: 30-34 t/s
- 9060 XT: 29 t/s
- 7800 XT: 50 t/s
The MI25 was a huge pain to get working due to its EOL status. The rest were painless.
HCLB_@reddit
16GB VRAM for the 7800 XT?
laffer1@reddit
All of them have 16GB except the 5070, which has 12GB.
pfn0@reddit
ATM it's the "easiest" path (2x) to get to 256GB of UMA with decent bandwidth. A Mac Pro is cheaper ($6,000 for 256GB), but you get incredibly slow prompt processing with it. Possibly faster token generation, though.
I have 2 GX10s. They're running Qwen3.5 397B at about 1,700 t/s PP and 26 t/s TG.
They're not my only AI hardware, either; I also have a 96GB Blackwell and a 5090. I use the Sparks the most for general-purpose inference tasks now. The Blackwell cards do more diffusion stuff lately, or run things like Gemma 31B, which kills the bandwidth on the Sparks.
Traditional-Gap-3313@reddit
Did you maybe test prompt processing on 2x Sparks with Gemma 4 31B at TP=2? A saturation test with vLLM?
My workload is mostly "read this heavy document, write a short analysis": mostly 5-15k tokens in -> 150 tokens out, and it's really parallel. So I'm wondering what that would look like with 2x Spark on Gemma 31B with vLLM; see the sketch below. I've looked at that Sparkrun site, but most people submitted 26B MoE results. On my benchmark, Gemma 31B is really close to both Opus and GPT 5.4; while the 26B flies, it's not accurate enough.
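Something like this is the saturation test I have in mind: fire a batch of document-analysis requests concurrently and measure aggregate throughput (the endpoint, served-model name, and document are placeholders):

```python
# Concurrent saturation probe for a vLLM endpoint; all names are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def analyze(doc: str) -> int:
    resp = await client.chat.completions.create(
        model="gemma-31b",  # placeholder served-model name
        messages=[{"role": "user", "content": f"Write a short analysis:\n{doc}"}],
        max_tokens=150,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    docs = ["<10k-token document here>"] * 32  # 32 requests in flight
    t0 = time.time()
    tokens = await asyncio.gather(*(analyze(d) for d in docs))
    dt = time.time() - t0
    print(f"{len(docs)} docs, {sum(tokens)} output tokens, {sum(tokens)/dt:.1f} t/s aggregate")

asyncio.run(main())
```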
pfn0@reddit
I'm mostly interested in single-stream performance, as I'm the primary user and I don't run much in the way of sub-agents. I took some measurements of 31B in single- and dual-node configurations.
dual spark:
Traditional-Gap-3313@reddit
Thank you, it does give me a rough idea.
These are llama.cpp?
pfn0@reddit
These are running vLLM, with benchmarks via llama-benchy.
pfn0@reddit
single spark:
Tyme4Trouble@reddit
We need to know what you want to do to answer this accurately. Do you just want something to run Ollama/llama.cpp on? Do you want to run SGLang or vLLM? Are you trying to run diffusion models, or fine-tune diffusion models? Do you want a platform to tinker with concepts like quantization or agentic harnesses?
The answers to these questions are going to have a big impact on our recommendations.
For example, if you're mostly interested in fine-tuning medium-sized models (32-70B), then a GB10 would be my pick.
DiscombobulatedAdmin@reddit (OP)
The answer to your scenarios above is "yes." :) I'm in IT and need to learn this, so I'll be experimenting with this hardware. It will mainly be inference and agentic use with OpenClaw-style harnesses, maybe some fine-tuning, understanding it could be slow. Maybe some attempts at experimenting with machine-learning models for PLCs and IoT to support my customers, but that's down the road and would just be to see if I can. If I get to that point and need hardware, then I'll be ready to buy more.
Tyme4Trouble@reddit
Then you should get a GB10 box, and if you can, get your work to pay for it. I've got a Spark, a Strix Halo system, an RTX 6000 Ada, and a Radeon Pro W7900 (listing these to preempt commenters telling me I don't know what I'm talking about because I haven't tried X).
You'll have far fewer headaches with the GB10, in part because most of what you want to learn is already in their playbooks: https://build.nvidia.com/spark
Everything you want to do can be done on a Strix box, but it will be harder software-wise. That said, my friends at AMD say they're working on their own playbooks mirroring Nvidia's to address this soon.
khudgins@reddit
I second this. I have a GB10, and it's pretty useful for general AI tinkering. Training models, fine-tuning, etc. all work fairly well. It's slow for inference, so if that's your primary use case, you'll do better building a rig.
For general-purpose use, it's actually pretty nice.
The one problem is that the hardware requires an odd CUDA version that isn't fully available, so it can take some experimentation to get off-the-shelf models to run when you veer away from Nvidia's playbooks. There hasn't been much I haven't been able to do yet, though.
Look_0ver_There@reddit
GB10 devices are well in excess of $4,000 nowadays. There may be one or two exceptions, but the DGX Spark was at $4,500 at my local Micro Center yesterday.
You can always pick up two or three Radeon AI PRO R9700 cards and install them: 32GB for $1,300. Two will give you 64GB and run rings around either the Strix Halo or the GB10 boxes for inference. Get three, and there are very few models a DGX Spark can run that you cannot, and you'll be running at well over twice the speed for both prompt processing and inference.
I see another commenter asking about squeezing in under the $2,000 mark. Unfortunately, in today's market that really isn't possible unless you luck out on some second-hand cards.
pfn0@reddit
No, 4TB GB10 devices are well in excess of $4,000. 1TB models are about $3,400.
Easy-Unit2087@reddit
At consumer prices, you can run small models really fast (dedicated GPU), bigger models at acceptable speeds (unified memory), or huge models slowly (at least you used to be able to; at current prices, the CPU + RAM route is out).
Arguing that the first option is better than the second isn't helpful. They each have their own use cases.
Look_0ver_There@reddit
The core problem, though, is that at current prices for unified-memory machines, the crossover is becoming increasingly blurred.
If UMA machines (Strix Halo, Mac Studio/MacBook Pro, GB10) remain meaningfully cheaper than a bunch of discrete GPUs, then yes, the two product segments don't really cross over and each has its own use case.
When a 128GB UMA machine costs as much as a discrete-GPU solution, and the GPU solution can run the exact same models at 2-3x the speed, then it very much does become a case of one approach being better than the other.
With today's memory prices driving up the cost of UMA devices, the "deployment space" where UMA solutions make economic sense has shrunk dramatically compared to even just four months ago.
If/when UMA solutions return to sensible prices, my point will no longer stand.
Easy-Unit2087@reddit
I said elsewhere in this thread that 2x GB10 is a sweet spot in category 2; after all, the 200GbE is a third of the value of those devices. It sits outside category 1's electricity and PCIe-slot territory.
I'll concede that below 128GB the picture is more muddled, but models under 64GB are limited in usefulness, IMO: not good enough for coding, even with agent skills and SOTA models writing the briefs.
hurdurdur7@reddit
I think your math is correct.
Monkey_1505@reddit
That RAM bandwidth doesn't exactly look optimal to me.
abnormal_human@reddit
Typing this from one now. No.
sleepingsysadmin@reddit
AMD Strix Halo is more or less the same thing at a much more affordable price.
Do the CUDA and NVFP4 support justify the high price?
How about the new Mac M5 with 128GB? Slightly more expensive but better, with something like 4x the memory bandwidth.
As for the 3090: its appeal right now is its unusual memory bandwidth and its ability to run dense models like Qwen3.5 27B. Maybe that's good enough for you and you're solid for two years off that; then you buy the DDR6 dip.
Flipside: it's probably an ideal time to sell a 3090. Grab a 5090 or an RTX Pro, because when DDR6 drops in a year, CPU compute will reach 3090 speeds, and offload setups on DDR6 will be mint.
oxygen_addiction@reddit
DDR6 dual-channel will be around 250GB/s, and there is no "dip" in two years. That's probably going to be the pricing top, as it starts entering the market in late 2026-2027.
The RTX 3090 is around 930GB/s, while the RTX 5090 is around 1.8TB/s, same as the Blackwell 6000.
Strix Halo is around 215-250GB/s.
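Those numbers fall straight out of per-pin data rate times bus width; a quick sanity check using published specs:

```python
# Peak memory bandwidth = per-pin data rate (Gbit/s) * bus width (bits) / 8.
def mem_bw(gbps_per_pin: float, bus_bits: int) -> float:
    return gbps_per_pin * bus_bits / 8  # GB/s

print(mem_bw(19.5, 384))  # RTX 3090, GDDR6X:         936 GB/s
print(mem_bw(28.0, 512))  # RTX 5090, GDDR7:         1792 GB/s (~1.8 TB/s)
print(mem_bw(8.0, 256))   # Strix Halo, LPDDR5X-8000: 256 GB/s
```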
sleepingsysadmin@reddit
>DDR6 Dual-Channel will be around 250GB/s and there is no "dip" in two years.
Who in the world is running dual-channel anything? We're only talking about quad- or 8-channel here, which is 500GB/s-1TB/s.
8-channel will be faster than a 3090.
oxygen_addiction@reddit
You'll need a server for that, which is going to be insanely expensive, no?
sleepingsysadmin@reddit
It's an expensive hobby.
54id56f34@reddit
If you're going to say that, then just buy HGX 🤪
Karyo_Ten@reddit
^ bait
Easy-Unit2087@reddit
Agentic use, i.e., spawning multiple agents and tools, has made PP speed (and KV-cache hit rate on the software side) much more important. On a 5-subagent code audit, for example, the GB10 will be finished before a Mac M5 even starts inference.
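The arithmetic behind that claim is just time-to-first-token scaling with prompt size; a sketch (the 1,500 t/s PP figure is from the two-node numbers above, while 150 t/s is an illustrative assumption for a slower-PP device):

```python
# Time-to-first-token is prompt_tokens / PP speed, so PP dominates when
# subagents each bring tens of thousands of context tokens.
def ttft(prompt_tokens: int, pp_tps: float) -> float:
    return prompt_tokens / pp_tps  # seconds

ctx = 30_000  # one subagent's code-audit context (illustrative)
print(f"GB10 cluster @ 1500 t/s PP: {ttft(ctx, 1500):.0f}s to first token")  # 20s
print(f"Slow-PP device @ 150 t/s:  {ttft(ctx, 150):.0f}s to first token")    # 200s
```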
Look_0ver_There@reddit
Have you ever used mlx_lm on a Mac? I have a work-supplied M4 Max MacBook Pro, and it can have an 80GB model (Qwen3-Coder-Next) up and ready to serve within 15s. I've never used a DGX Spark; how quickly can it get an 8-bit 80B model up and ready to go?
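For reference, the mlx_lm path is just two calls; a minimal sketch (the Hugging Face repo name is a placeholder for whichever 8-bit MLX quant you grab):

```python
# Minimal mlx_lm usage on Apple Silicon. The repo name below is a
# placeholder; substitute a real mlx-community 8-bit quant.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-Coder-Next-8bit")  # placeholder
text = generate(
    model,
    tokenizer,
    prompt="Write a Python function that reverses a linked list.",
    max_tokens=256,
    verbose=True,  # prints prompt and generation tokens/s
)
```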
bluelobsterai@reddit
Check Facebook Marketplace and find an old M1 with 64GB of RAM. It's the best place to start for budget inference.
jacek2023@reddit
I was considering a Spark as a second device next to my 3090s, but I have the impression these devices are SLOWER, not faster.
Tommonen@reddit
No. B70 if you want a GPU; Strix Halo or Mac if you want unified memory.
semangeIof@reddit
B70, hahaha. How many TPS are you getting? The memory bandwidth on that card is a joke. $1k USD to fit a Q6 of Gemma 4 31B, and you still have to quantize the KV cache, and it isn't even at human reading speed at zero context.
That card is not good. Maybe it will be better months from now, but right now it's too slow.
braydon125@reddit
Love my 3090s, lol. Just gotta snipe those sub-$900 ads.
__heroes_@reddit
eBay is expensive; I got my 3090 for less than $800 a few months ago.
segmond@reddit
What are your goals, and what's your budget?
Be realistic. You need to know those before you start figuring out GPUs.