16x Spark Cluster (Build Update)
Posted by Kurcide@reddit | LocalLLaMA | View on Reddit | 143 comments
Build is done. 16 DGX Sparks on the fabric, all hitting line rate.
Setup was time-consuming but honestly smoother than I expected. Each Spark runs Nvidia’s flavor of Ubuntu out of the box with mostly everything pre-installed and ready to go. For setup I had to rack them, power them on, create the same user/pass across all nodes, wait about 20 minutes per node for updates, then configure passwordless SSH, jumbo frames, IPs, etc., which I scripted to save time.
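The script is nothing exotic; a minimal sketch of that kind of loop (hostnames, username, and NIC name here are placeholders, not my actual values):

```bash
# Push the same baseline config to every node from the management server.
for i in $(seq -w 1 16); do
  host="spark${i}"
  ssh-copy-id "user@${host}"                               # passwordless SSH
  ssh "user@${host}" "sudo apt-get update && sudo apt-get -y upgrade"
  ssh "user@${host}" "sudo ip link set dev eth0 mtu 9000"  # jumbo frames
done
```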
Each Spark connects to the FS N8510 switch with a single QSFP56 cable. The DGX Spark bonds its two NIC interfaces over that one cable, so you get dual rail into a single switch port. I'm seeing 100 to 111 Gbps per rail, which aggregates to the advertised 200 Gbps.
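You can sanity-check per-rail throughput with plain TCP, though RDMA perftest tools (ib_write_bw) get closer to line rate; a rough sketch with placeholder IPs:

```bash
# On the receiving Spark:
iperf3 -s
# On the sending Spark: 8 parallel streams at the receiver's rail-0 IP.
iperf3 -c 192.168.100.1 -P 8 -t 30
```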
Why this over H100s or a GB300?
Unified memory. The whole point is maximizing unified memory capacity within the Nvidia ecosystem. With 8 nodes I was serving GLM-5.1-NVFP4 (434GB) at TP=8. Next I'm going to test DeepSeek and Kimi.
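For anyone wanting to try something similar, one common recipe for tensor parallelism across single-GPU nodes is vLLM on top of a Ray cluster. This is a generic sketch, not necessarily my exact stack, and the model id is a placeholder:

```bash
# Head node: start the Ray cluster.
ray start --head --port=6379
# Each of the other 7 nodes: join it.
ray start --address='<head-ip>:6379'
# Head node: serve with tensor parallelism across all 8 GPUs.
vllm serve some-org/GLM-5.1-NVFP4 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray
```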
The longer-term plan is a prefill/decode split. The Spark cluster handles prefill (massive parallel throughput), and once the M5 Ultra Mac Studios drop I'll add 2 to 4 to the rack for decode.
—
Full rack, top to bottom:
- 1U Brush Panel
- OPNSense Firewall
- Mikrotik 10Gb switch (internet uplink)
- Mikrotik 100Gb switch (HPC to NAS)
- 1U Brush Panel
- QNAP 374TB all U.2 NAS
- Management Server
- Dual 4090 Workstation
- Backup Dual 4090 Workstation (identical specs)
- FS 200Gbps QSFP56 Fabric Switch (Spark cluster)
- 1U Brush Panel
- 8x DGX Spark Shelf One
- 8x DGX Spark Shelf Two
- 2U Spacer Panel
- SuperMicro 4x H100 NVL Station
- GH200
Raredisarray@reddit
Yo you weren’t fuckin around! I thought for sure you were showing the stock at the store you work at 😂😭
__JockY__@reddit
How ~~fast~~ slow is it?
TheRealSol4ra@reddit
Ok bro, you got slap-your-dick-in-my-face money, but can I ask why this over, like, 8 RTX 6000 Pros? That's 768GB of VRAM, more than enough to run these models at FP8 or Q6. Sure, you absolutely can run any model now, but you'll top out at like 15-25 t/s, right? Which is fine, but compared to the 6000 Pro it's nothing.
NotumRobotics@reddit
In our experience, less; more like 5-7 t/s.
Eugr@reddit
It's meaningless to talk about performance without mentioning model/quant/cluster size.
starkruzr@reddit
no, Ziskind did it at about 21.
TheRealSol4ra@reddit
Yeah that's rough man… 80 grand to get less than 10 t/s. Hopefully they got a good return policy 💀
PutMyDickOnYourHead@reddit
You've got what?
Such_Advantage_6949@reddit
Please share some statistics on how fast it runs.
kmouratidis@reddit
Considering OP never answered similar questions, the regrets/second are way higher than the tokens/second.
Irythros@reddit
I mean, this has already been known. Not for the new models like GLM 5.1, but we can infer from the known speeds of old models on this hardware vs. other known hardware running the same models.
https://www.youtube.com/watch?v=QJqKqxQR36Y
8 nodes running Qwen 3.5-397B-A17B hit 24.21 tg/s and 1498 pp/s.
Honestly I would just spend the money on RTX 6000s. Less memory, but god damn the Sparks are slow.
Ok_Top9254@reddit
This guy knows shit about LLMs. "Write a 1000-word story" is not a benchmark. The fact that you use him as a source makes your statement worthless.
2Norn@reddit
spark, studio, ai max
it's not their job to be fast, it's usually LPDDR5X after all; it's just that they're readily clusterable, considerably cheaper on size/power consumption, and way more portable
8x Spark costs less than 4x 6000, especially if you go MSI/Dell/Asus etc., which is exactly the same thing
but on top of all this, 4x 6000 will require a special workstation built for them as well
you're looking at a $40k vs $60k type of situation, plus the extra power bill after that
depends on the type of work, whether it's a 24/7 openclaw type of thing or something else
Irythros@reddit
I can get one Spark for $4,700. I can get a 6000 for $9,300. So I can get 4x 6000s ($37.2k) for $400 less than 8x Sparks ($37.6k).
A Spark is rated for 240W. An RTX 6000 is rated for 600W. Previous testing posted here in /r/LocalLLaMA has shown that reducing a card's power draw to 80% has effectively zero impact on performance (reducing t/s from 30 to 29). That would be 480W per card, the same power draw as 2x Sparks.
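For anyone replicating that power-limit test, it's a one-liner per card:

```bash
# Cap an RTX 6000 at 80% of its 600 W TDP (the near-zero-loss point
# from the testing mentioned above):
sudo nvidia-smi -pm 1     # enable persistence mode first
sudo nvidia-smi -pl 480   # set the power limit to 480 W
```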
Qwen3-VL-32B is 6 t/s on 2x Sparks. On 1x RTX 6000 it is 20.37 t/s ( https://www.reddit.com/r/LocalLLaMA/comments/1kvf8d2/nvidia_rtx_pro_6000_workstation_96gb_benchmarks/ )
Effectively the same price for compute, effectively the same power usage, and over 3x the token speed.
2Norn@reddit
i'm not gonna repeat what i already explained. that's just wrong.
StardockEngineer@reddit
You can get that speed on just two nodes, though. So there's either a diminishing-returns situation or an optimization problem. Nvidia themselves only recommend linking two.
Eugr@reddit
Yes, but the number above is for BF16 version. Otherwise, 4-bit quant runs well on 2 nodes.
flobernd@reddit
Yeah, tg is mostly memory-bandwidth limited while prefill is compute-bound. Unified memory is slow (compared to regular VRAM), and the slow interconnect (200G NIC) also hurts when doing TP inference.
OP's plan is to use this only for prefill and outsource tg to a Mac, if I understood correctly.
Irythros@reddit
It's diminishing, and pretty heavily. He provides speeds for 1, 2, 4, and 8 nodes.
djdeniro@reddit
Qwen3.5-397B-A17B-MXFP4 with vLLM on 8x R9700 got 32 t/s tg and 3000 t/s pp, with 170k max model len and 80k KV cache. But with 4 concurrent requests it got 100+ t/s aggregate generation.
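The concurrency gain is easy to reproduce against vLLM's OpenAI-compatible endpoint; a rough sketch (URL and model name are placeholders):

```bash
# Fire 4 completion requests concurrently; vLLM batches them, so the
# aggregate t/s lands well above the single-stream number.
for i in 1 2 3 4; do
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen3.5-397b-a17b", "prompt": "Hello", "max_tokens": 256}' &
done
wait
```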
Eugr@reddit
The number above, I believe, was for a BF16 version, not quantized.
-dysangel-@reddit
The sparks are way faster than Macs for prefill at least, and it turns out exo can link the two: https://blog.exolabs.net/nvidia-dgx-spark/ .
I think this combo is going to be hard to beat for a compact, low power solution until the M5 Ultra comes out. One thing this stack has over an M5 Ultra (other than that it isn't out yet, and who knows when it will be!!) is it also lets you play around with CUDA only projects.
DistanceSolar1449@reddit
Meh, it's SM121
It doesn't support the SM100 features that make CUDA worth using. Basically SM121 isn't real Blackwell. You don't have CUTLASS. You don't get FlashAttention-4.
-dysangel-@reddit
Yeah I understand it's not the same thing as server Blackwell. There's a bunch of stuff I'd like to try which is CUDA only though, such as Isaac Gym. Plus many new models or smaller repos are CUDA only, so it will open some fun doors vs my current Mac-only setup.
I don't mind tinkering on kernels here and there when necessary too. If anything it's all just a good learning experience, which is one of the most important parts of this for me. If it were just about performance, API is cheaper and faster.
starkruzr@reddit
24 t/s generation is honestly not bad at all for a model that size when you're spending $32K to run it. That's 3 RTX Blackwells and change, and that isn't anywhere close to enough VRAM.
flobernd@reddit
If I recall correctly, I observed around 6000 t/s prefill and 150-200 t/s generation for that model when I tested on 4x RTX Pro 6000.
Historical-Internal3@reddit
I own two sparks and love them and would defend them but this comment made me audibly laugh - hard.
Charming-Author4877@reddit
That thing looks like a lasting memory for a lifetime, something you can tell your grandkids about :)
PassengerPigeon343@reddit
They processed the comment very quickly, but it’s taking them a while to finish typing their response
Such_Advantage_6949@reddit
Yea sounds like it
Freonr2@reddit
There are many knobs that beg to be tweaked, and I imagine it will take significant fiddling with them to get the best performance. It's an unusual setup compared to a normal DC cluster so it's not like there's some recipe waiting to be read and trivially implemented.
Probably need to give OP another few days or week to figure out how to squeeze the juice.
ResidentPositive4122@reddit
It runs nowhere; that rack will make sure of it :)
Such_Advantage_6949@reddit
I meant GLM 5.1. I'm really wondering how fast it runs; sounds like a really solid setup to run big models.
AppleBottmBeans@reddit
I think the guy above you was pretty clear that the rack OP got is a pretty solid structure. So not only will the rack not be able to run away, but neither will the software (like GLM 5.1).
validol322@reddit
What are your primary use cases and industry field where you operate?
kaliku@reddit
Look at OP's profile; he's a rich dude playing. Or at least that's the vibe I get from his posts. Anyway, OP, don't take this comment to heart. If I had the money I'd prolly do the same. Hell... at my level, I in fact did. I splurged on an RTX Pro 6000 because I wanted to learn and not be restricted by hardware. And I could afford the 6000.
Some people like fast cars, others like GPUs. OP likes both haha. Green with envy I am. Peace.
Kurcide@reddit (OP)
I don’t get offended by these comments. It’s all just fancy “playing”. I’m using them to test an agentic layer I’m building on top of LLM harnesses, and I’m going to use them to support my engineering teams, which is what the whole rack has always been used for.
The 6000 Pro is great; if I didn’t have the H100s I would have gotten some.
pirateadventurespice@reddit
But do you get tired of the people blatantly unable to read your four sentences of explanation?
The folks who seem obsessed with t/s and ignore that you've already said (repeatedly) these are meant to handle prefill long-term are just exhausting.
At my own scale, I've been enjoying two sparks with an m3 ultra. The concept is solid and works, you're just taking it to a whole new level.
kaliku@reddit
Thanks for sharing with us
Maleficent-Ad5999@reddit
Shhh.. we don’t discuss use cases here.. we just brag about our builds
MisticRain69@reddit
I have noticed that pretty much any time anyone asks how they afford such an ungodly amount of hardware, it's complete radio silence. Not even one peep about what the use case is. Why so secretive?
Maleficent-Ad5999@reddit
Either they all do something shady that they're embarrassed to admit, or they must have signed an NDA at their workplace not to reveal stuff... or some are just cruel.
Polite_Jello_377@reddit
I think the more likely reason is they dumped a lot of money into something that they got interested in but don't actually have meaningful use-cases for it
Suitable-Economy-346@reddit
I always thought it was mostly employees at companies and they like the praise and don't want to admit they're broke asses like the rest of us.
xienze@reddit
There's a good chance it's just someone who made a lot of money on Bitcoin and likes tech for the sake of it. Sorta like the guys you see on r/homelab who have entire 42U racks full of gear but like six ethernet ports actually in use.
Dany0@reddit
The industry is orphan-crushing machines and orphan-crushing machine accessories unless specified otherwise.
We must bully the GPU-rich into submission. Post use case or face the wrath of leddit.
Prof_ChaosGeography@reddit
Like you, I used to want to know when I was building a system, but I found out the hard way... for many it ain't software development, even if that's their day job; people got weird kinks they use image gen and roleplay for.
Shot-Buffalo-2603@reddit
I’m sure some people do this, but there has to be a better way than getting GLM to run on 16 sparks to goon
temperature_5@reddit
💸>🧠, but glad you're having fun.
Birdinhandandbush@reddit
so that's like 80-100k on the Sparks alone, are we gonna call this a local home build :) :) :)
Kurcide@reddit (OP)
It’s locally in my home
Birdinhandandbush@reddit
No need for central heating, am I right
fyrn@reddit
I can see my thoughts in your rack ... "I like how this Sliger looks but you can barely see it because it's black, now that I need a second one maybe I should pick one of these colors?" :)
(Except I just went for black again, but stuck RGB fans behind it, the white looks awesome though.)
Party-Special-5177@reddit
Just popping in to show some love.
I completely adore the main thesis behind this build (iirc, semi-solving the Mac prefill issues with a properly fat cluster of GB10s).
IrisColt@reddit
Honestly, just seeing that this can be done is an experience unto itself.
pixelpoet_nz@reddit
Yeah there's a lot of obvious stuff to be said about money etc, but you have to appreciate dedication to the game / vision. Well played, and I would love to have such a setup for ridiculous 3D rendering (using custom Vulkan code).
IrisColt@reddit
Mind-blowing! I'm easily wowed, but this is something else.
Ok-Measurement-1575@reddit
How are you planning to split PP / TG?
I didn't realise this was a supported option.
bick_nyers@reddit
SGLang can do it (probably not with Mac though?). It's called PD (prefill-decode) disaggregation.
It's great for when you want to drive latency (TTFT) down
Ok-Measurement-1575@reddit
That's very interesting, thanks. Never imagined this would be possible.
bick_nyers@reddit
At scale we use it to dynamically change the ratio of how many GPUs are used for prompt processing vs. decode. When the average context length of users' prompts increases throughout the day, shift some GPUs from decode to prompt processing. When prompt lengths level back out, shift some GPUs back to decode to give users faster token speeds.
No_Afternoon_4260@reddit
SGLang allows you to change the number of GPUs allocated for P and G dynamically? Do you have any documentation by any chance?
bick_nyers@reddit
So you do it at the cluster/orchestration level, not necessarily in SGLang.
You set up a prefill cluster and a decode cluster, each with workers (GPUs) attached to it.
Then you monitor externally and programmatically shut down a worker in one cluster and spin up a worker in the other. SGLang can be set up to do service discovery, auto-detecting that a worker was added to the cluster.
https://docs.sglang.io/docs/advanced_features/sgl_model_gateway#pd-mode-discovery
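Shape-wise it looks roughly like this; the flag spellings follow the PD docs above but vary by SGLang version, and the model id is just an example:

```bash
# Prefill worker (one per prefill GPU/node):
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
  --disaggregation-mode prefill
# Decode worker (one per decode GPU/node):
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
  --disaggregation-mode decode
# A gateway/router in PD mode pairs prefill and decode workers; with
# service discovery on, it picks up workers the orchestrator adds.
```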
No_Afternoon_4260@reddit
Wow, really interesting, I need to look into that mechanic.
By cluster do you mean network?
the320x200@reddit
Somewhat off topic, but how likely do you think it is that the major providers shift quant levels throughout the day to balance load?
bick_nyers@reddit
I think it's highly likely. I wish there was some kind of fingerprinting/guarantees as a user in that regard
Ok-Measurement-1575@reddit
Awesome.
KingMitsubishi@reddit
Yea, I am wondering about this too…
flobernd@reddit
I got your point about prefill, split gen, and memory, but did you consider 8x RTX Pro 6000 Blackwell? Might have been the easier solution (single host) at a similar price point. Power usage is a bit on the higher side, but it runs Kimi26, GLM51-nvfp4 etc. with very good prefill and 100+ t/s regardless of the PCIe bottleneck (which you also kinda have with the Sparks in the form of the 200G NICs).
moorsh@reddit
Because 8x 96GB RTX Pro is only 768GB while his setup is over 2TB. You need over 1.5TB for some of the flagship open-source models.
flobernd@reddit
Granted, there is less VRAM! My suggestion was based on the models he mentioned. K26 runs at full precision, GLM51 requires a 4-bit quant. DS4 Pro will be unusable on his cluster (performance is already very bad on 8x RTX Pro 6000), and this will most likely be the case for all MoEs with many active params.
Fun experiment anyway (if you have the money).
segmond@reddit
what's the performance for DSv4Pro on 8xpro6k?
xienze@reddit
Sure, but sometimes you gotta ask yourself if it's better to run a faster, smaller model than a larger model that's so slow it's practically unusable.
bnm777@reddit
Yeah, isn't the tokens-per-second rate far higher with 6000s vs Sparks?
flobernd@reddit
I’d guess so! Let’s see what OP reports :-)
NickCanCode@reddit
https://www.youtube.com/watch?v=QJqKqxQR36Y
Someone already tried 8 DGX Sparks a few months ago.
vLLM has probably improved in the past two months, so numbers should be a little higher now. I think ~40 t/s is OK for normal use cases. For coding, RTX Pro with NVLink would be much faster and more enjoyable.
flobernd@reddit
Unfortunately the RTX Pro does not have NVLink. But regardless of that, it gets 100-150 t/s for K26, for example, on a Turin server.
FaustAg@reddit
the new server Blackwell 6000 version has 2x 200GbE network connections on each card; they take the place of NVLink. I have the regular OG Blackwell 6000 and mine doesn't have those
flobernd@reddit
It’s not comparable. 2x 200G is roughly on par with PCIe 5.0 x16 on the RTX Pro 6000 speed-wise. NVLink is like 900 GB/s; it's on a completely different scale.
FaustAg@reddit
yes, that's true, and the lack of tcgen05 pisses me off so much, but it's better than just PCIe. I only have one Blackwell and I run out of VRAM trying to train the things I want, so if I could I'd still trade it in for two server versions.
NickCanCode@reddit
Oh I see. Didn't expect even pro cards to lack NVLink these days.
flobernd@reddit
Yeah, NVIDIA basically ran a scam with these cards. They are not "real" Blackwell (sm100). They use sm120/sm121, which lacks several architectural features like tmem and also NVLink. The Spark uses the same architecture. NVIDIA did that on purpose for market segmentation, I guess.
StardockEngineer@reddit
We know that two nodes can do 28 tok/s via Spark Arena https://spark-arena.com/leaderboard
philmarcracken@reddit
He mentioned this, if it's the same guy. These are for prefill, and some M4 Mac or whatever is for token gen. I mind-blank on Apple trash so it's whatever.
holdthefridge@reddit
Try using DFlash to get throughput up, and in the future, if you run out of the 2TB of RAM, use turboquant. Let us know the tokens/s once you get DFlash working on all of them.
cusspvz@reddit
What’s your configuration and stack to run these in a cluster?
Seventh_monkey@reddit
Humor me: can you describe what exactly you will use it for ("to support my engineering teams" is as broad a description as it gets), so that this is an investment that will pay off?
Pleasant-Shallot-707@reddit
If this is a hobby…zoiks!
bick_nyers@reddit
If you can batch your workflow even a little bit, I would be curious whether expert parallelism gives you better numbers.
DiscussionAncient626@reddit
Address?
DiscussionAncient626@reddit
Sorry, worst joke. Amazing setup.
Chris279m@reddit
I have one spark 😔
IndividualGold4667@reddit
How much did this cost?
Kurcide@reddit (OP)
Sparks all in around $70k, $13k for the switch, $2k for the cables.
If you mean the whole rack… a lot more
Eyelbee@reddit
Couldn't you just build an 8x B200 node at this point?
a_slay_nub@reddit
Dunno about B200, but we got quoted $400k for 8x H200.
Kurcide@reddit (OP)
Yup, and it's around $300k for refurbished H200s. I considered it, but it was too big of a jump.
starkruzr@reddit
not even close. not even close to close.
debackerl@reddit
I'll buy the cables
IndividualGold4667@reddit
Sweet set-up! Congratulations!
urazyjazzy@reddit
Everybody here is talking about prefill speed being the main purpose of this system. Didn't the M5 Max solve this, as the experiments showed? That means the M5 Ultra will be 2x faster still. So buying 4 M5 Ultras will be a much better deal than this in the future. Token generation on a single M5 Max is 3x faster than a single Spark, so by that math an M5 Ultra will be 6x faster, and 4 M5 Ultras would be worth 24x Sparks; on top of being faster, it would have 2TB of memory and faster pp speed. You would be able to run the largest models at 45 to 60 t/s. All hypothetically, of course. I believe if the new Mac Studios come with 1TB of RAM they would be even more appealing for home users or even small labs 🙏🏼
somerussianbear@reddit
I can smell something burning already
Turbulent-Walk-8973@reddit
How about cooling? I had a single DGX Spark and was having some issues with it.
Kurcide@reddit (OP)
I have some 3U fans that I'm going to try to mount in front of them to force air through.
ilarson007@reddit
What kind of private jet do you own?
Kurcide@reddit (OP)
can’t afford one, spent all my money on compute
InnocenceIsBliss@reddit
So, where are those guys who called this out as just a store clerk posting for clout? 😆
Kurcide@reddit (OP)
quiet, I guess
neuralnomad@reddit
Are there even any games for this? 🤭
New_Zone5490@reddit
i wish i had a powerful ai girlfriend like this too
Osi32@reddit
At least he can produce just dance Vance videos faster than the rest of us….
Themotionalman@reddit
My gosh, this is the life bro. How many kidneys did you have to sell?
-dysangel-@reddit
harvest*
PhonicUK@reddit
I thought I was baller with my 4x sparks but this is something else. You'd have probably been better off getting a DGX Station!
mayong13@reddit
Just seeing this picture gets my mouth watering. The local LLM of my dreams.
Klarts@reddit
Dude that’s so sick! Hope you’re having a blast and it’s living up to your expectations!
Effective_Motor_4398@reddit
Wow. Legend.
RoomyRoots@reddit
I wish I could be this trifling with money.
Annual_Award1260@reddit
I’m working on setting up a 3-node cluster, and due to the DDoS attacks on Ubuntu I have quite a few broken packages now.
-dysangel-@reddit
Kudos on the 16x setup, that is nuts! Thanks for making me/us aware the DGX/Mac split was possible with your last post.
I'm not balling out like you, but I've got a single Spark arriving today to boost prefill for my M3 Ultra. Should accelerate my prefill to M5 Ultra speeds - and buying 2 Sparks might even be cheaper than a 256GB M5 Ultra, but with the benefit that you can also play around with the CUDA stack.
pureskill1tapnokill@reddit
Why not another 2 GH200s? Don't they have the unified RAM of like 4 Sparks plus the VRAM of 1? So 3 GH200 ~ 15 Sparks?
conockrad@reddit
FP4 I guess
True-Lychee@reddit
$500k worth of kit
xXy4bb4d4bb4d00Xx@reddit
This is interesting. I have a cluster of 21 nodes with 8x RTX 6000 each, and I'm currently experimenting with the GB10 for a new cluster.
Please post your results, and if you're interested in consulting / being paid to share your setup, please DM me.
Long_comment_san@reddit
So...did you build anything?
One-Pain6799@reddit
That's great, I'm looking forward to your projects.
humanoid64@reddit
Amazing! Is this at home or in a data center?
shALKE@reddit
My OCD is kicking off, the Sparks aren't spaced evenly.
ZubZeleni@reddit
Won’t you have issues with heating? Don’t you need some free space between each Spark?
VonDenBerg@reddit
sheesh, jelly. please tell me you have a business use case, and if so, why not colo instead?
yeahbuddyia@reddit
Very nicely done. How are you planning to handle the split between the Macs and DGX Sparks? I tried it recently with 4x M3 Ultra 256GB and 2 DGX Sparks with Exo, and they don't have that working yet.
IngenuityNo1411@reddit
tk/s when
(another appropriate question, alongside "gguf when")
Optimal_Guava5390@reddit
that's a hell of a way to spend $200K+. Very, very impressive setup.
koushd@reddit
what was your prefill and decode on GLM 5.1 NVFP4?
fairydreaming@reddit
obviously not as impressive as the photo
LegacyRemaster@reddit
How many watts?
no-adz@reddit
Sick, nice build!
unluckybitch18@reddit
following for more updates
Only_Situation_4713@reddit
Speed? Thinking about 8x
SM8085@reddit
The cool part is you can ask your Sparky-16 to do that work now.
That's pretty nuts.
DeepSeek V4 actually seems really nice. I won't be able to run Pro locally, but there's a chance I could run Flash.
adt@reddit
The brush panels are nice, never seen those before.
Kurcide@reddit (OP)
got them on Amazon, definitely helps make it look nice and manage dust