Local 405B Model on 3 DGX Spark units.

Posted by elephantgif@reddit | LocalLLaMA | View on Reddit | 27 comments

I've pre ordered 3 Spark units which will be connected via infiniband at 200 GB/s. While not cheap, all other options that are comperable seem to be much more expensive. AMD's max+ is cheaper, but also less capable, particularly with interconnect. Mac's equivalent has much better memory bandwidth, but that's about it. Tenstorrent's Blackhole is tempting, but lack of literature is too much of a risk for me. I just wanted to check to see if I was missing a better option.

[-]

eloquentemu@reddit

I'm assuming that your 200 GB/s is Gb/s? I'm curious where you've seen support for that... Yeah, it has a ConnectX-7 and that should, I've only seen the connectivity advertised as 10GbE though PHY says 100GbE but ConnectX-7 should support 400GbE so I'm not sure what to believe (why is this not clearly stated!?).

Anyways, to sanity check it looks like $12k to run the model at q4 at ~3.5 t/s if tensor parallelism works perfectly? (Is Llama 405B still worth that vs Deepseek or something?)

As I understand it, you need 2 or 4 GPUs to run tensor parallelism but I haven't used it extensively. So if you want to actually get your 3.5t/s (which would be 4.6t/s). You could always always run in layer/pipeline parallel mode but that wouldn't multiply your memory bandwidth / inference speed. You'd be able to batch multiple inferences but would get a peak of ~1.2 t/s.
What's your concern with the Mac M3 Ultra? It's not a powerhouse (e.g. if you wanted to run diffusion or something) but has comparable memory bandwidth and is cheaper.
The AMD AI Max is probably out as the connectivity isn't great. It only has PCIe 4 (x4 in the the implementations I've seen) so you're limited to like 40GbE if you can make that work.
If you're only planning on running layer/pipeline parallel, you could match the 273GBps you could meet that spec with an 8 channel DDR5-4800 server system $12k buys one hell of a server and even a 5090 or two.

Of course, they have good compute and connectivity but if you want to run 405B you're basically going to need to plan around memory bandwidth.

[-]

elephantgif@reddit (OP)

The main script will be running on an EPYC Gennoa which will connect to the Sparks via a Melenox Infiniband port. What appeals to me about this setup is the cost/performance and molecularity.If I want to integrate more Sparks to add a DM or LRM down the road, it would be simple. I will check out the server system option you mentioned, though.

[-]

eloquentemu@reddit

If you have an Epyc Genoa then just run 405B on that. They may be better on PP but on TG the Genoa will run at basically the same speed unless you get tensor parallelism working which would require 4 units. (They may run a little faster or a lot slower depending on how close to theoretical bandwidth they can use and how much the infiniband hurts).

Don't get me wrong, I can see it being an interesting project, just don't expect usable performance out of it on 405B at least. Honestly, I would suggest starting with 2 units and running tensor parallel on 405B on a higher quant. That would let you get your setup working and save you $4k if you hate it :).

FYI, here's the performance of my Genoa machine (12ch @ 5200) with a 4090 to assist for the CUDA results:

model	size	params	backend	ngl	test	t/s
llama 405B Q4_K - Medium	226.37 GiB	405.85 B	CUDA	0	pp512	35.29 ± 4.87
llama 405B Q4_K - Medium	226.37 GiB	405.85 B	CUDA	0	tg128	1.05 ± 0.29
llama 405B Q4_K - Medium	226.37 GiB	405.85 B	CPU	-	pp512	4.15 ± 0.00
llama 405B Q4_K - Medium	226.37 GiB	405.85 B	CPU	-	tg128	1.06 ± 0.00

So not great, but on paper a DGX Spark maxes out it's tg128 at ~1.2

[-]

elephantgif@reddit (OP)

I'll probably try two at first. I hope that they magnify each other. I'm getting conflicting opinions about that. If it truly doesn't work the way I think it will, I'll bite the bullet on a 6000 and use the two sparks as complimentary DMs. More than I want to spend, but I think things might get really interesting there. I'd have to research a lighter model, but I'd hate to put this much into something that gives me 1 t/s.

[-]

eloquentemu@reddit

I hope that they magnify each other. I'm getting conflicting opinions about that.

Yeah, I mean it's unreleased and most people attempting distributed inference fail from what I've seen. However, they're using using Macs or something. The Spark does seems like it has an intended use as a distributed inference machine to at least some extent. (Though like, there are no public docs so AFAICT it only supports 10GbE.) So there's a chance but it's definitely unknown territory.

I'd hate to put this much into something that gives me 1 t/s.

Understandable, but the best case is you get 1.2t/s per Spark. I'm not sure why you would spend $8k to get <2.4 t/s when you could spend $0 and get 1 t/s on your current CPU server. Maybe if you get 4 for $16k and get <5 t/s that's... okay?

I'm not sure if you have any special interest in Llama 405B, but that model is really hard/expensive to run. You could get 10x the TG with a large MoE like Deepseek 671B on your server and it would be more forgiving for the Sparks too (though it wouldn't fit with q4 on 3 Sparks...)

[-]

YRUTROLLINGURSELF@reddit

3x DGX = 384@200 for $9k m3 ultra = 512@800 for $9.5k

are my numbers wrong or else why would you choose this

[-]

elephantgif@reddit (OP)

M3 hast the edge in bandwidth, but the Spark processors have way more raw compute power. Plus expandability.If I wanted to introduce models to run concurrently down the road, infiniband is far superior to Thunderbolt for integration.

[-]

Baldur-Norddahl@reddit

Bandwidth is a hard cap on how fast you can run a model. You will find the DGX Spark to be unusable for this project. DGX Spark has a memory bandwidth of 273 GB/s. Which means 273/405 = 0.67 t/s and that is before you factor in the interconnect. You won't be able to run the Sparks in parallel. It will run some of the layers on the first Spark, then you have to wait for transfer to the next one and then it will run more layers there etc.

[-]

elephantgif@reddit (OP)

The scripts aren't going to be running on the Spark units, they will run on a Gennoa CPU. The LLM will be on the Spark units. I dont see why they couldnt run in parallel, orchestrated by the CPU.

[-]

Baldur-Norddahl@reddit

I am not sure I understand your use case then. A single DGX Spark has a maximum of 128 GB of memory. That is not enough to run a 405b model. You therefore need to split the model among the Sparks just to load it. I am just saying that it won't work like you think it will.

There is something called tensor parallelism. It is hard to get running, very bleeding edge and requires very high bandwidth between the units. But that is what you need to do this. It would be a very interesting experiment for sure and we would all love to hear about it :-). Just don't expect it to be easy or even successful. Also it is very likely limited to just two devices.

What will happen instead is that you will get serial execution. Three Sparks will be no faster than the single Spark and in fact quite slower, because it needs to wait for data transfer. Only one Spark will be actually doing anything at a given time.

[-]

YRUTROLLINGURSELF@reddit

I feel like the timeframe would have to be very small, like 1-2 years tops, with how fast things are changing

[-]

auradragon1@reddit

M3 hast the edge in bandwidth, but the Spark processors have way more raw compute power. Plus expandability.If I wanted to introduce models to run concurrently down the road, infiniband is far superior to Thunderbolt for integration.

I have some questions:

Have you tested something like M3 Ultra vs 4x DGX for DS R1?
What models are you running where 3x DGX is faster than M3 Ultra?
If you expand to 4 or 5 DGX, what models can you actually run with a max of 200GB/s bandwidth?

[-]

Conscious_Cut_6144@reddit

I have no idea what 3 dgx sparks will do irl, But 405b will bring a m3 ultra to its knees. Much harder to run than deepseek.

(Theoretical max speed for 405b mlx 4bit with 800GB/s is under 3.5 T/s, that is with 0 context length)

[-]

auradragon1@reddit

Yes but theoretical for 3x DGX is less than 1 t/s.

[-]

Conscious_Cut_6144@reddit

Theoretical is 1.2T/s for PP or 3.6 for TP. (I’ve never seen TP working across 3 devices, but it’s theoretically possible)

[-]

Rich_Repeat_22@reddit

Stop only looking bandwidth, M3 Ultra is extremely slow chip to do this job.

[-]

FireWoIf@reddit

Seems silly when the RTX PRO 6000 exists

[-]

elephantgif@reddit (OP)

It would take three times as many, and each 6000 is twice as much. Plus, I'd have to house them.

[-]

Conscious_Cut_6144@reddit

Pro 6000’s are 96GB each, you just need 4 of them total. Grab the max-q’s and you could do in a regular atx desktop / server.

4 pro 6000’s is going to be an order of magnitude faster and better supported.

If you actually plan on running 405b for some reason, realize that model is not moe and will run quite slow on the m3 ultra and the dgx spark.

That said there are like 5 400b class moe models that would run fine on a Mac or dgx spark.

[-]

bick_nyers@reddit

It's the difference between $36k and $9k.

[-]

Conscious_Cut_6144@reddit

Not really, DGX spark is just a bad hardware choice for 405B.
You could get better 405B speeds from a DDR5 server + single GPU (that costs 1/2 as much as that 9k)

[-]

elephantgif@reddit (OP)

Ive got an EPYC 9354P that will handle orchestrating, and was planning to connect three Spark units to that via infiniband. That should run the 405B well. And at about a third of the cost of 4 Pro 6000's. If it's slow, I'll look at other models.

[-]

Conscious_Cut_6144@reddit

Ya it won’t run well.

DGX spark is 273GB/s Memory bandwidth. Even ignoring overhead and compute and assume perfect parallelism, 3 of those is only 819GB/s / 230GB/t = 3.5 T/s

And that is wildly optimistic for something split across 3 devices over network. I would guess 1T/s is realistic.

Now if your use case allows for multiple simultaneous chats, you could have 10+ each doing ~1T/s, but that doesn’t sound particularly useful.

[-]

colin_colout@reddit

but I'm here for it. I preordered the framework desktop even though m4 exists.

We could all just use openai or anthropic for cheaper AND better quality models than trying to run it ourselves.

[-]