16x Spark Cluster (Build Update)
Posted by Kurcide@reddit | LocalLLaMA | View on Reddit | 143 comments
Build is done. 16 DGX Sparks on the fabric, all hitting line rate.
Setup was time-consuming but honestly smoother than I expected. Each Spark runs Nvidia’s flavor of Ubuntu out of the box with mostly everything pre-installed and ready to go. For setup I had to rack them, power them on, create the same user/pass across all nodes, wait about 20 minutes per node for updates, then configure passwordless SSH, jumbo frames, IPs, etc., which I scripted to save time.
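The script is nothing exotic; a minimal sketch of that kind of loop (hostnames, username, and NIC name here are placeholders, not my actual values):

```bash
# Push the same baseline config to every node from the management server.
for i in $(seq -w 1 16); do
  host="spark${i}"
  ssh-copy-id "user@${host}"                               # passwordless SSH
  ssh "user@${host}" "sudo apt-get update && sudo apt-get -y upgrade"
  ssh "user@${host}" "sudo ip link set dev eth0 mtu 9000"  # jumbo frames
done
```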
Each Spark connects to the FS N8510 switch with a single QSFP56 cable. The DGX Spark bonds its two NIC interfaces over that one cable, so you get dual rail into a single switch port. I'm seeing 100 to 111 Gbps per rail, which aggregates to the advertised 200 Gbps.
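You can sanity-check per-rail throughput with plain TCP, though RDMA perftest tools (ib_write_bw) get closer to line rate; a rough sketch with placeholder IPs:

```bash
# On the receiving Spark:
iperf3 -s
# On the sending Spark: 8 parallel streams at the receiver's rail-0 IP.
iperf3 -c 192.168.100.1 -P 8 -t 30
```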
Why this over H100s or a GB300?
Unified memory. The whole point is maximizing unified memory capacity within the Nvidia ecosystem. With 8 nodes I was serving GLM-5.1-NVFP4 (434GB) at TP=8. Next I'm going to test DeepSeek and Kimi.
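For anyone wanting to try something similar, one common recipe for tensor parallelism across single-GPU nodes is vLLM on top of a Ray cluster. This is a generic sketch, not necessarily my exact stack, and the model id is a placeholder:

```bash
# Head node: start the Ray cluster.
ray start --head --port=6379
# Each of the other 7 nodes: join it.
ray start --address='<head-ip>:6379'
# Head node: serve with tensor parallelism across all 8 GPUs.
vllm serve some-org/GLM-5.1-NVFP4 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray
```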
The longer-term plan is a prefill/decode split. The Spark cluster handles prefill (massive parallel throughput), and once the M5 Ultra Mac Studios drop I'll add 2 to 4 to the rack for decode.
—
Full rack, top to bottom:
- 1U Brush Panel
- OPNSense Firewall
- Mikrotik 10Gb switch (internet uplink)
- Mikrotik 100Gb switch (HPC to NAS)
- 1U Brush Panel
- QNAP 374TB all U.2 NAS
- Management Server
- Dual 4090 Workstation
- Backup Dual 4090 Workstation (identical specs)
- FS 200Gbps QSFP56 Fabric Switch (Spark cluster)
- 1U Brush Panel
- 8x DGX Spark Shelf One
- 8x DGX Spark Shelf Two
- 2U Spacer Panel
- SuperMicro 4x H100 NVL Station
- GH200
Raredisarray@reddit
Yo you weren’t fuckin around! I thought for sure you were showing the stock at the store you work at 😂😭
__JockY__@reddit
How ~~fast~~ slow is it?
TheRealSol4ra@reddit
Ok bro, you got slap-your-dick-in-my-face money, but can I ask why this over, like, 8 RTX 6000 Pros? That's 768GB of VRAM, more than enough to run these models at FP8 or Q6. Sure, you absolutely can run any model now, but you'll top out at like 15-25 t/s, right? Which is fine, but compared to the 6000 Pro it's nothing.
NotumRobotics@reddit
In our experience, less; more like 5-7 t/s.
Eugr@reddit
It's meaningless to talk about performance without mentioning model/quant/cluster size.
starkruzr@reddit
no, Ziskind did it at about 21.
TheRealSol4ra@reddit
Yeah that's rough man… 80 grand to get less than 10 t/s. Hopefully they got a good return policy 💀
PutMyDickOnYourHead@reddit
You've got what?
Such_Advantage_6949@reddit
Please share some statistics on how fast it runs.
kmouratidis@reddit
Considering OP never answered similar questions, the regrets/second are way higher than the tokens/second.
Irythros@reddit
I mean, this has already been known. Not for the new models like GLM 5.1, but we can infer from the known speeds of old models on this hardware vs. other known hardware running the same models.
https://www.youtube.com/watch?v=QJqKqxQR36Y
8 nodes running Qwen 3.5-397B-A17B hit 24.21 tg/s and 1498 pp/s.
Honestly I would just spend the money on RTX 6000s. Less memory, but god damn the Sparks are slow.
Ok_Top9254@reddit
This guy knows shit about LLMs. "Write a 1000-word story" is not a benchmark. The fact that you use him as a source makes your statement worthless.
2Norn@reddit
spark, studio, ai max
it's not their job to be fast, it's usually LPDDR5X after all; it's just that they're readily clusterable, considerably cheaper on size/power consumption, and way more portable
8x Spark costs less than 4x 6000, especially if you go MSI/Dell/Asus etc., which is exactly the same thing
but on top of all this, 4x 6000 will require a special workstation built for them as well
you're looking at a $40k vs $60k type of situation, plus the extra power bill after that
depends on the type of work, whether it's a 24/7 openclaw type of thing or something else
Irythros@reddit
I can get one Spark for $4,700. I can get a 6000 for $9,300. So I can get 4x 6000s ($37.2k) for $400 less than 8x Sparks ($37.6k).
A Spark is rated for 240W. An RTX 6000 is rated for 600W. Previous testing posted here in /r/LocalLLaMA has shown that reducing a card's power draw to 80% has effectively zero impact on performance (reducing t/s from 30 to 29). That would be 480W per card, the same power draw as 2x Sparks.
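For anyone replicating that power-limit test, it's a one-liner per card:

```bash
# Cap an RTX 6000 at 80% of its 600 W TDP (the near-zero-loss point
# from the testing mentioned above):
sudo nvidia-smi -pm 1     # enable persistence mode first
sudo nvidia-smi -pl 480   # set the power limit to 480 W
```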
Qwen3-VL-32B is 6 t/s on 2x Sparks. On 1x RTX 6000 it is 20.37 t/s ( https://www.reddit.com/r/LocalLLaMA/comments/1kvf8d2/nvidia_rtx_pro_6000_workstation_96gb_benchmarks/ )
Effectively the same price for compute, effectively the same power usage, and over 3x the token speed.
2Norn@reddit
i'm not gonna repeat what i already explained. that's just wrong.
StardockEngineer@reddit
You can get that speed on just two nodes, though. So there's either a diminishing-returns situation or an optimization problem. Nvidia themselves only recommend linking two.
Eugr@reddit
Yes, but the number above is for BF16 version. Otherwise, 4-bit quant runs well on 2 nodes.
flobernd@reddit
Yeah, tg is mostly memory-bandwidth limited while prefill is compute-bound. Unified memory is slow (compared to regular VRAM), and the slow interconnect (200G NIC) also hurts when doing TP inference.
OP's plan is to use this only for prefill and outsource tg to a Mac, if I understood correctly.
Irythros@reddit
It's diminishing, and pretty heavily. He provides speeds for 1, 2, 4, and 8 nodes.
djdeniro@reddit
Qwen3.5-397B-A17B-MXFP4 with vLLM on 8x R9700 got 32 t/s tg and 3000 t/s pp, with 170k max model len and 80k KV cache. But with 4 concurrent requests it got 100+ t/s aggregate generation.
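The concurrency gain is easy to reproduce against vLLM's OpenAI-compatible endpoint; a rough sketch (URL and model name are placeholders):

```bash
# Fire 4 completion requests concurrently; vLLM batches them, so the
# aggregate t/s lands well above the single-stream number.
for i in 1 2 3 4; do
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen3.5-397b-a17b", "prompt": "Hello", "max_tokens": 256}' &
done
wait
```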
Eugr@reddit
The number above, I believe, was for a BF16 version, not quantized.
-dysangel-@reddit
The sparks are way faster than Macs for prefill at least, and it turns out exo can link the two: https://blog.exolabs.net/nvidia-dgx-spark/ .
I think this combo is going to be hard to beat for a compact, low power solution until the M5 Ultra comes out. One thing this stack has over an M5 Ultra (other than that it isn't out yet, and who knows when it will be!!) is it also lets you play around with CUDA only projects.
DistanceSolar1449@reddit
Meh, it's SM121
It doesn't support the SM100 features that make CUDA worth using. Basically SM121 isn't real Blackwell. You don't have CUTLASS. You don't get FlashAttention-4.
-dysangel-@reddit
Yeah I understand it's not the same thing as server Blackwell. There's a bunch of stuff I'd like to try which is CUDA only though, such as Isaac Gym. Plus many new models or smaller repos are CUDA only, so it will open some fun doors vs my current Mac-only setup.
I don't mind tinkering on kernels here and there when necessary too. If anything it's all just a good learning experience, which is one of the most important parts of this for me. If it were just about performance, API is cheaper and faster.
starkruzr@reddit
24 t/s generation is honestly not bad at all for a model that size when you're spending $32K to run it. That's 3 RTX Blackwells and change, and that isn't anywhere close to enough VRAM.
flobernd@reddit
If I recall correctly, I observed around 6000 t/s prefill and 150-200 t/s generation for that model when I tested on 4x RTX Pro 6000.
Historical-Internal3@reddit
I own two sparks and love them and would defend them but this comment made me audibly laugh - hard.
Charming-Author4877@reddit
That thing looks like a lasting memory for a lifetime, something you can tell your grandkids about :)
PassengerPigeon343@reddit
They processed the comment very quickly, but it’s taking them a while to finish typing their response
Such_Advantage_6949@reddit
Yea sounds like it
Freonr2@reddit
There are many knobs that beg to be tweaked, and I imagine it will take significant fiddling with them to get the best performance. It's an unusual setup compared to a normal DC cluster so it's not like there's some recipe waiting to be read and trivially implemented.
Probably need to give OP another few days or week to figure out how to squeeze the juice.
ResidentPositive4122@reddit
It runs nowhere; that rack will make sure of it :)
Such_Advantage_6949@reddit
I meant GLM 5.1. I'm really wondering how fast it runs; sounds like a really solid setup to run big models.
AppleBottmBeans@reddit
I think the guy above you was pretty clear that the rack OP got is a pretty solid structure. So not only will the rack not be able to run away, but neither will the software (like GLM 5.1).
validol322@reddit
What are your primary use cases and industry field where you operate?
kaliku@reddit
Look at OP's profile; he's a rich dude playing. Or at least that's the vibe I get from his posts. Anyway, OP, don't take this comment to heart. If I had the money I'd prolly do the same. Hell... at my level, I in fact did. I splurged on an RTX Pro 6000 because I wanted to learn and not be restricted by hardware. And I could afford the 6000.
Some people like fast cars, others like GPUs. OP likes both haha. Green with envy I am. Peace.
Kurcide@reddit (OP)
I don’t get offended by these comments. It’s all just fancy “playing”. I’m using them to test an agentic layer I’m building on top of LLM harnesses, and I’m going to use them to support my engineering teams, which is what the whole rack has always been used for.
The 6000 Pro is great; if I didn’t have the H100s I would have gotten some.
pirateadventurespice@reddit
But do you get tired of the people blatantly unable to read your four sentences of explanation?
The folks who seem obsessed with t/s and ignore that you've already said (repeatedly) these are meant to handle prefill long-term are just exhausting.
At my own scale, I've been enjoying two sparks with an m3 ultra. The concept is solid and works, you're just taking it to a whole new level.
kaliku@reddit
Thanks for sharing with us
Maleficent-Ad5999@reddit
Shhh.. we don’t discuss use cases here.. we just brag about our builds
MisticRain69@reddit
I have noticed that pretty much any time anyone asks how they afford such an ungodly amount of hardware, it's complete radio silence. Not even one peep about what the use case is. Why so secretive?
Maleficent-Ad5999@reddit
Either they all do something shady that they're embarrassed to admit, or they must have signed an NDA at their workplace not to reveal stuff... or some are just cruel.
Polite_Jello_377@reddit
I think the more likely reason is they dumped a lot of money into something that they got interested in but don't actually have meaningful use-cases for it
Suitable-Economy-346@reddit
I always thought it was mostly employees at companies and they like the praise and don't want to admit they're broke asses like the rest of us.
xienze@reddit
There's a good chance it's just someone who made a lot of money on Bitcoin and likes tech for the sake of it. Sorta like the guys you see on r/homelab who have entire 42U racks full of gear but like six ethernet ports actually in use.
Dany0@reddit
The industry is orphan-crushing machines and orphan-crushing machine accessories unless specified otherwise.
We must bully the GPU-rich into submission. Post use case or face the wrath of leddit.
Prof_ChaosGeography@reddit
Like you, I used to want to know when I was building a system, but I found out the hard way... for many it ain't software development, even if that's their day job; people got weird kinks they use image gen and roleplay for.
Shot-Buffalo-2603@reddit
I’m sure some people do this, but there has to be a better way than getting GLM to run on 16 sparks to goon
temperature_5@reddit
💸>🧠, but glad you're having fun.
Birdinhandandbush@reddit
so that's like 80-100k on the Sparks alone, are we gonna call this a local home build :) :) :)
Kurcide@reddit (OP)
It’s locally in my home
Birdinhandandbush@reddit
No need for central heating, am I right
fyrn@reddit
I can see my thoughts in your rack ... "I like how this Sliger looks but you can barely see it because it's black, now that I need a second one maybe I should pick one of these colors?" :)
(Except I just went for black again, but stuck RGB fans behind it, the white looks awesome though.)
Party-Special-5177@reddit
Just popping in to show some love.
I completely adore the main thesis behind this build (iirc, semi-solving the Mac prefill issues with a properly fat cluster of GB10s).
IrisColt@reddit
Honestly, just seeing that this can be done is an experience unto itself.
pixelpoet_nz@reddit
Yeah there's a lot of obvious stuff to be said about money etc, but you have to appreciate dedication to the game / vision. Well played, and I would love to have such a setup for ridiculous 3D rendering (using custom Vulkan code).
IrisColt@reddit
Mind-blowing! I'm easily wowed, but this is something else.
Ok-Measurement-1575@reddit
How are you planning to split PP / TG?
I didn't realise this was a supported option.
bick_nyers@reddit
SGLang can do it (probably not with Mac though?). It's called PD (prefill-decode) disaggregation.
It's great for when you want to drive latency (TTFT) down
Ok-Measurement-1575@reddit
That's very interesting, thanks. Never imagined this would be possible.
bick_nyers@reddit
At scale we use it to dynamically change the ratio of how many GPUs are used for prompt processing vs. decode. When the average context length of users' prompts increases throughout the day, shift some GPUs from decode to prompt processing. When prompt lengths level back out, shift some GPUs back to decode to give users faster token speeds.
No_Afternoon_4260@reddit
SGLang allows you to change the number of GPUs allocated for P and G dynamically? Do you have any documentation by any chance?
bick_nyers@reddit
So you do it at the cluster/orchestration level, not necessarily in SGLang.
You set up a prefill cluster and a decode cluster, each with workers (GPUs) attached to it.
Then you monitor externally and programmatically shut down a worker in one cluster and spin up a worker in the other. SGLang can be set up to do service discovery, auto-detecting that a worker was added to the cluster.
https://docs.sglang.io/docs/advanced_features/sgl_model_gateway#pd-mode-discovery
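Shape-wise it looks roughly like this; the flag spellings follow the PD docs above but vary by SGLang version, and the model id is just an example:

```bash
# Prefill worker (one per prefill GPU/node):
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
  --disaggregation-mode prefill
# Decode worker (one per decode GPU/node):
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
  --disaggregation-mode decode
# A gateway/router in PD mode pairs prefill and decode workers; with
# service discovery on, it picks up workers the orchestrator adds.
```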
No_Afternoon_4260@reddit
Wow, really interesting, I need to look into that mechanic.
By cluster do you mean network?
the320x200@reddit
Somewhat off topic, but how likely do you think it is that the major providers shift quant levels throughout the day to balance load?
bick_nyers@reddit
I think it's highly likely. I wish there was some kind of fingerprinting/guarantees as a user in that regard
Ok-Measurement-1575@reddit
Awesome.
KingMitsubishi@reddit
Yea, I am wondering about this too…
flobernd@reddit
I got your point about prefill, split gen, and memory, but did you consider 8x RTX Pro 6000 Blackwell? Might have been the easier solution (single host) at a similar price point. Power usage is a bit on the higher side, but it runs Kimi26, GLM51-nvfp4 etc. with very good prefill and 100+ t/s regardless of the PCIe bottleneck (which you also kinda have with the Sparks in the form of the 200G NICs).
moorsh@reddit
Because 8x 96GB RTX Pro is only 768GB while his setup is over 2TB. You need over 1.5TB for some of the flagship open-source models.
flobernd@reddit
Granted, there is less VRAM! My suggestion was based on the models he mentioned. K26 runs at full precision, GLM51 requires a 4-bit quant. DS4 Pro will be unusable on his cluster (performance is already very bad on 8x RTX Pro 6000), and this will most likely be the case for all MoEs with many active params.
Fun experiment anyway (if you have the money).
segmond@reddit
what's the performance for DSv4Pro on 8xpro6k?
xienze@reddit
Sure, but sometimes you gotta ask yourself if it's better to run a faster, smaller model than a larger model that's so slow it's practically unusable.
bnm777@reddit
Yeah, isn't the tokens-per-second rate far higher with 6000s vs Sparks?
flobernd@reddit
I’d guess so! Let’s see what OP reports :-)
NickCanCode@reddit
https://www.youtube.com/watch?v=QJqKqxQR36Y
Someone already tried 8 DGX Sparks a few months ago.
vLLM has probably improved in the past two months, so numbers should be a little higher now. I think ~40 t/s is OK for normal use cases. For coding, RTX Pro with NVLink would be much faster and more enjoyable.
flobernd@reddit
Unfortunately the RTX Pro does not have NVLink. But regardless of that, it gets 100-150 t/s for K26, for example, on a Turin server.
FaustAg@reddit
the new server Blackwell 6000 version has 2x 200GbE network connections on each card; they take the place of NVLink. I have the regular OG Blackwell 6000 and mine doesn't have those
flobernd@reddit
It’s not comparable. 2x 200G is roughly on par with PCIe 5.0 x16 on the RTX Pro 6000 speed-wise. NVLink is like 900 GB/s; it's on a completely different scale.
FaustAg@reddit
yes, that's true, and the lack of tcgen05 pisses me off so much, but it's better than just PCIe. I only have one Blackwell and I run out of VRAM trying to train the things I want, so if I could I'd still trade it in for two server versions.
NickCanCode@reddit
Oh I see. Didn't expect even pro cards to lack NVLink these days.
flobernd@reddit
Yeah, NVIDIA basically ran a scam with these cards. They are not "real" Blackwell (sm100). They use sm120/sm121, which lacks several architectural features like tmem and also NVLink. The Spark uses the same architecture. NVIDIA did that on purpose for market segmentation, I guess.
StardockEngineer@reddit
We know that two nodes can do 28 tok/s via Spark Arena https://spark-arena.com/leaderboard
philmarcracken@reddit
He mentioned this, if it's the same guy. These are for prefill, and some M4 Mac or whatever is for token gen. I mind-blank on Apple trash so it's whatever.
holdthefridge@reddit
Try using DFlash to get throughput up, and in the future, if you run out of the 2TB of RAM, use turboquant. Let us know the tokens/s once you get DFlash working on all of them.
cusspvz@reddit
What’s your configuration and stack to run these in a cluster?
Seventh_monkey@reddit
Humor me: can you describe what exactly you will use it for ("to support my engineering teams" is as broad a description as it gets), so that this is an investment that will pay off?
Pleasant-Shallot-707@reddit
If this is a hobby…zoiks!
bick_nyers@reddit
If you can batch your workflow even a little bit, I would be curious whether expert parallelism gives you better numbers.
DiscussionAncient626@reddit
Address?
DiscussionAncient626@reddit
Sorry, worst joke. Amazing setup.
Chris279m@reddit
I have one spark 😔
IndividualGold4667@reddit
How much did this cost?
Kurcide@reddit (OP)
Sparks all in around $70k, $13k for the switch, $2k for the cables.
If you mean the whole rack… a lot more
Eyelbee@reddit
Couldn't you just build an 8x B200 node at this point?
a_slay_nub@reddit
Dunno about B200, but we got quoted $400k for 8x H200.
Kurcide@reddit (OP)
Yup, and it's around $300k for refurbished H200s. I considered it, but it was too big of a jump.
starkruzr@reddit
not even close. not even close to close.
debackerl@reddit
I'll buy the cables
IndividualGold4667@reddit
Sweet set-up! Congratulations!
urazyjazzy@reddit
Everybody here is talking about prefill speed being the main purpose of this system. Didn't the M5 Max solve this, as the experiments showed? That means the M5 Ultra will be 2x faster still. So buying 4 M5 Ultras will be a much better deal than this in the future. Token generation on a single M5 Max is 3x faster than a single Spark, so by that math an M5 Ultra will be 6x faster, and 4 M5 Ultras would be worth 24x Sparks; on top of being faster, it would have 2TB of memory and faster pp speed. You would be able to run the largest models at 45 to 60 t/s. All hypothetically, of course. I believe if the new Mac Studios come with 1TB of RAM they would be even more appealing for home users or even small labs 🙏🏼
somerussianbear@reddit
I can smell something burning already
Turbulent-Walk-8973@reddit
How about cooling? I had a single DGX Spark and was having some issues with it.
Kurcide@reddit (OP)
I have some 3U fans that I'm going to try to mount in front of them to force air through.
ilarson007@reddit
What kind of private jet do you own?
Kurcide@reddit (OP)
can’t afford one, spent all my money on compute
InnocenceIsBliss@reddit
So, where are those guys who called this out as just a store clerk posting for clout? 😆
Kurcide@reddit (OP)
quiet, I guess
neuralnomad@reddit
Are there even any games for this? 🤭
New_Zone5490@reddit
i wish i had a powerful ai girlfriend like this too
Osi32@reddit
At least he can produce just dance Vance videos faster than the rest of us….
Themotionalman@reddit
My gosh, this is the life bro. How many kidneys did you have to sell?
-dysangel-@reddit
harvest*
PhonicUK@reddit
I thought I was baller with my 4x sparks but this is something else. You'd have probably been better off getting a DGX Station!
mayong13@reddit
Just seeing this picture gets my mouth watering. The local LLM of my dreams.
Klarts@reddit
Dude that’s so sick! Hope you’re having a blast and it’s living up to your expectations!
Effective_Motor_4398@reddit
Wow. Legend.
RoomyRoots@reddit
I wish I could be this trifling with money.
Annual_Award1260@reddit
I’m working on setting up a 3-node cluster, and due to the DDoS attacks on Ubuntu I have quite a few broken packages now.
-dysangel-@reddit
Kudos on the 16x setup, that is nuts! Thanks for making me/us aware the DGX/Mac split was possible with your last post.
I'm not balling out like you, but I've got a single Spark arriving today to boost prefill for my M3 Ultra. Should accelerate my prefill to M5 Ultra speeds - and buying 2 Sparks might even be cheaper than a 256GB M5 Ultra, but with the benefit that you can also play around with the CUDA stack.
pureskill1tapnokill@reddit
Why not another 2 GH200s? Don't they have the unified RAM of like 4 Sparks plus the VRAM of 1? So 3 GH200 ~ 15 Sparks?
conockrad@reddit
FP4 I guess
True-Lychee@reddit
$500k worth of kit
xXy4bb4d4bb4d00Xx@reddit
This is interesting. I have a cluster of 21 nodes with 8x RTX 6000 each, and I'm currently experimenting with the GB10 for a new cluster.
Please post your results, and if you're interested in consulting / being paid to share your setup, please DM me.
Long_comment_san@reddit
So...did you build anything?
One-Pain6799@reddit
That's great, I'm looking forward to your projects.
humanoid64@reddit
Amazing! Is this at home or in a data center?
shALKE@reddit
My OCD is kicking off, the Sparks aren't spaced evenly.
ZubZeleni@reddit
Won’t you have issues with heating? Don’t you need some free space between each Spark?
VonDenBerg@reddit
sheesh, jelly. please tell me you have a business use case, and if so, why not colo instead?
yeahbuddyia@reddit
Very nicely done. How are you planning to handle the split between the Macs and DGX Sparks? I tried it recently with 4x M3 Ultra 256GB and 2 DGX Sparks with Exo, and they don't have that working yet.
IngenuityNo1411@reddit
tk/s when
(another appropriate question, alongside "gguf when")
Optimal_Guava5390@reddit
that's a hell of a way to spend $200K+. Very, very impressive setup.
koushd@reddit
what was your prefill and decode on GLM 5.1 NVFP4?
fairydreaming@reddit
obviously not as impressive as the photo
LegacyRemaster@reddit
How many watts?
no-adz@reddit
Sick, nice build!
unluckybitch18@reddit
following for more updates
Only_Situation_4713@reddit
Speed? Thinking about 8x
SM8085@reddit
The cool part is you can ask your Sparky-16 to do that work now.
That's pretty nuts.
DeepSeek V4 actually seems really nice. I won't be able to run Pro locally, but there's a chance I could run Flash.
adt@reddit
The brush panels are nice, never seen those before.
Kurcide@reddit (OP)
got them on Amazon, definitely helps make it look nice and manage dust