Collected the infinity stones

What LLM do you actually run on that consistently? I'm just curious what you are getting out of all that power? I'm rather happy with the models I can run on my 4090 but I don't know what I would even do with all that memory.

[-]

Electrical-Ad-9808@reddit

Haha, good enough to run the latter half of B2B SAAS turned agents.

[-]

CollegeStock8249@reddit

Awesomeness

[-]

Strict-Opinion2895@reddit

This is the way.

[-]

YOMUMSOBIG@reddit

I'm here to say that I'm just jealous.

[-]

mantafloppy@reddit

So, you have money, a camera and an idea of a concept?

Ppl are posting and upvoting the dumbest shit....

[-]

icnocop@reddit

He also has a RØDE microphone!

[-]

human_bean_@reddit

Not RTX Pro 6000?

[-]

Street-Buyer-2428@reddit (OP)

No. Only 5000 72gb

[-]

ReasonableDust8268@reddit

Still faster than those macs, idk why macbros don't understand the power of cuda

[-]

Vas1le@reddit

"Only"

[-]

tcx00@reddit

Damn with blackwells, it must be nice all that money for gadgets

[-]

Street-Buyer-2428@reddit (OP)

yeah

[-]

PattF@reddit

And I’m over here trying my hardest to figure out how to run 27B on my Mac’s 16GB of usable memory. It’s fiiiiine. 😂😂😂😢

[-]

GonzoDCarne@reddit

The easiest way is to buy a 32GB Mac on installments to play with mxfp4, or a 64GB one to use mxfp8 with decent context and a couple of tabs open in Chrome.

[-]

Arkenstonish@reddit

You actually can fit iq4_xs with q8 cache and still have ~30k ctx (at 1 slot and no vision ofc)

For info: it's nothing. First prompt from qwen code CLI takes like 56% of usage 💀

source: running qwen3.6 27b on 5070 ti, 1600/50
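For anyone wanting to sanity-check a fit like this, here is a rough back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the layer count, KV head count, and head dimension are placeholders (not the real architecture of any model named here), IQ4_XS is taken as roughly 4.25 bits per weight and a Q8_0 cache as roughly 8.5 bits per value, and the 1 GiB overhead is a guess. Real GGUF files mix quant types, so actual sizes will differ.

```python
GIB = 2**30


def weight_bytes(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_bytes_per_token(layers, kv_heads, head_dim, bits_per_value):
    """Bytes of K plus V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bits_per_value / 8


def max_context(budget_gib, params_billion, bits_per_weight,
                layers, kv_heads, head_dim, kv_bits, overhead_gib=1.0):
    """Tokens of KV cache that fit after weights and a fixed overhead."""
    free = (budget_gib - overhead_gib) * GIB - weight_bytes(params_billion, bits_per_weight)
    if free <= 0:
        return 0
    return int(free // kv_bytes_per_token(layers, kv_heads, head_dim, kv_bits))


if __name__ == "__main__":
    # Placeholder shape (hypothetical): 64 layers, 4 KV heads, head_dim 128.
    shape = dict(layers=64, kv_heads=4, head_dim=128)
    for label, kv_bits in (("f16 cache", 16), ("q8_0 cache", 8.5)):
        ctx = max_context(16, 27, 4.25, kv_bits=kv_bits, **shape)
        print(f"{label}: about {ctx} tokens")
```

The point of the sketch is only the shape of the tradeoff: once the weights eat most of the budget, halving the bits per cached value roughly doubles the context that fits.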

[-]

FatheredPuma81@reddit

Isn't that just dropping the image gguf and running 3 bit with Q5_1 KV Cache?

[-]

PattF@reddit

Image.. fine. 3bit and Q5? 😬

[-]

FatheredPuma81@reddit

Well your only other option is not running it at all sooo... KV Cache Quantization is optional I guess but you'll be stuck at like 8192 context or moving closer to 2 bit.

[-]

allenasm@reddit

Which tools are you using? I'm using 'inferencer', a fairly new Mac app for multi-Mac inference (I have two 512GB Studios now). I know vLLM works too, but it's a lot pickier to set up.

[-]

Street-Buyer-2428@reddit (OP)

Awesome! I’m actually looking for new people to test out a new tool I created called r1o.ai . Send me a PM if interested! I’m using it myself and i find it to be really helpful

[-]

stormy1one@reddit

What are you planning on running with this?

[-]

Street-Buyer-2428@reddit (OP)

All the deepseek quants, Kimi 2.6, Glm 5.1 and imma try to use turboquant, dflash etc.

[-]

Alternative_News_732@reddit

To do what, if it's not too personal, sir?

[-]

Street-Buyer-2428@reddit (OP)

Scour the internet and find more studios

[-]

East-Tea6193@reddit

Do not follow the bright lights.

[-]

AdeptnessRound9618@reddit

Lmao

[-]

koushd@reddit

who is we

[-]

ShutUpAndDoTheLift@reddit

You telling me you don't say "we" after 8 hours of orchestrating agents and answering sub-agent decision choices?

[-]

FatheredPuma81@reddit

That's hour 1. By hour 8 I've long since transitioned to "you idiots" 😄

[-]

Street-Buyer-2428@reddit (OP)

Especially if you use voice

[-]

East-Tea6193@reddit

When Whspr tells you it now has a voice print for you, and in reality, it is hours of you talking to and swearing at another AI agent, which it then uses to personalize your tone to everyone...

[-]

RoboErectus@reddit

I feel personally attacked

[-]

Street-Buyer-2428@reddit (OP)

Me and the voices in my head

[-]

Thatisverytrue54321@reddit

Voices in your computers *

[-]

Jords13xx@reddit

Yeah, those voices are probably just trying to make sense of all those cores and RAM. It's a wild setup you got there!

[-]

throwawayacc201711@reddit

A ghost in the shell if you will

[-]

peterox@reddit

Do they hum silently 🤔

[-]

Street-Buyer-2428@reddit (OP)

You’re absolutely right!

[-]

No_Mango7658@reddit

Soon

[-]

Girafferage@reddit

bro if you are already running multiple models in your head why do you need these!?

[-]

LilPsychoPanda@reddit

Ghost in the machine

[-]

FatheredPuma81@reddit

Can I be one of the voices? "Give it to this guy you want to give it all to this guy for free"

[-]

Equivalent-Repair488@reddit

I hear voices in my head They counsel me They understand They talk to me

[-]

Vicar_of_Wibbly@reddit

How does one configure an inference stack to do prefill on GPU and decode on CPU?

[-]

Street-Buyer-2428@reddit (OP)

I’m trying to do prefill on the blackwell and decode on the studios bandwidth

[-]

Vicar_of_Wibbly@reddit

I know. My question was “how”? I’m familiar with vLLM but as far as I know it’s not an option. How are you doing this?

[-]

Street-Buyer-2428@reddit (OP)

Sorry, what I meant to say was that I’m trying to use Apple’s new standalone JACCL library to make it happen.

[-]

JockY@reddit

Ok, but how? It’s easy to say “use vLLM-mlx” but when it’s not a supported feature how are you going to do this?

I would love to know how to reproduce it.

[-]

Street-Buyer-2428@reddit (OP)

Pm your github. I’ll gladly send over the progress I have thus far in my implementation. Are you trying to collaborate?

[-]

JockY@reddit

Thanks, but while I have a pile of RTX 6000 Pros and some Macs, I don’t have the offboard eGPU gear. If y’all get a PoC working I might invest. I’m pretty old school, worked on everything from OS/2 to Linux kernel dev, so I might be handy for this. Gonna want to see some progress first though 😎

[-]

East-Tea6193@reddit

Os/2 that is a blast from the past.

Interviewed a SA guy who had got his PhD in Pascal something or other back in 1990 - still has work around the world on industrial systems running floppy disks that need fixing.

[-]

Vicar_of_Wibbly@reddit

I’m still confused. Can you show us how to copy this configuration? Having separate prefill and decode hardware would be amazing.

[-]

C0smo777@reddit

I don't think it's possible, is the answer. The only ways I know of need the full model in both places.

[-]

Street-Buyer-2428@reddit (OP)

I have a Group.split() on jaccl on my mlx fork

[-]

Street-Buyer-2428@reddit (OP)

Pm your gh

[-]

DinoAmino@reddit

You can do this on vllm with LMCache.

https://docs.lmcache.ai/

https://docs.vllm.ai/en/stable/examples/others/lmcache/

[-]

Vicar_of_Wibbly@reddit

My understanding is that LMCache is used for extending the capacity of the KV cache by offloading the computed values, not for reassigning which piece of hardware the computation takes place on.

[-]

DinoAmino@reddit

Offloading is one thing it can do. Another is the LMCache server runs on one LLM instance and can share the kv cache with any other LLM instances. Check out the links.

[-]

Street-Buyer-2428@reddit (OP)

There are a couple of implementations. Look up vllm-mlx

[-]

AttitudeImportant585@reddit

Have you calculated the KV cache transfer speeds you need for your model? 40Gbps is pretty slow for anything useful, unless the Mac has some way other than Thunderbolt to connect PCIe?

[-]

Street-Buyer-2428@reddit (OP)

tb5 = 120

[-]

Possible-Pirate9097@reddit

That's only the boosted mode which wouldn't work here.

[-]

Front_Eagle739@reddit

Alright, 80Gbps then. Or you could connect multiple ports and stripe 'em.
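To put rough numbers on the KV cache transfer question, a back-of-envelope sketch in Python follows. The formula assumes plain grouped-query attention with an fp16 cache; the layer count, KV heads, and head dimension are made-up placeholders, not the shape of any model in this thread, and models with compressed or hybrid attention will come out very differently. Link speeds are treated as ideal wire rates, which real Thunderbolt networking will not reach.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the K and V tensors for `tokens` positions."""
    # Factor 2 = one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes, link_gbps, efficiency=1.0):
    """Ideal wire time; an `efficiency` below 1 models protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    size = kv_cache_bytes(tokens=32_768, layers=60, kv_heads=8, head_dim=128)
    print(f"KV cache: {size / 2**30:.1f} GiB")
    for gbps in (40, 80, 120):
        print(f"{gbps:>3} Gbps: {transfer_seconds(size, gbps):.2f} s")
```

Swap in the real layer and head counts for the model in question before drawing any conclusion from it.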

[-]

evil0sheep@reddit

You should honestly post a detailed plan to get feedback from the community. I think you might be seriously underestimating the complexity of making this work. Are you planning on duplicating the model params and kv cache across both the Blackwell VRAM and the Mac studios? If so what’s the point of using the Mac studios at all? If not, how are you gonna do prefill on the Blackwell GPUs without the model params and the KV cache? Also how are you gonna get the Nvidia cards to do RDMA over thunderbolt? Do they even have driver support for that?

[-]

Possible-Pirate9097@reddit

Alex Ziskind seems to have whipped Claude into providing him a working solution. He talks through it in his latest video. Would be better with RTX 6000 Pros obviously.

[-]

Street-Buyer-2428@reddit (OP)

Please look at my other posts.

[-]

Adventurous_Pin6281@reddit

its a question you only answer with 20k in fuck you money

[-]

dlarsen5@reddit

also am interested in how

[-]

Street-Buyer-2428@reddit (OP)

I’m trying to use the tinygrad driver and the standalone JACCL library that recently came out to see if I can pipe that in. I’m using Ghidra to see if I can find where the hell Apple hides the API they have for distributed.

[-]

scottjgo@reddit

This isn't exactly the same, but I recently implemented PCI passthrough on QEMU on macOS, so it's possible to "pass through" an Nvidia GPU to a Linux VM running on top of macOS and do AI inference that way. I wrote a blog about it here: https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/

there's instructions how to set it up in my qemu fork: https://github.com/scottjg/qemu-vfio-a…

i wonder if you could install exo in the vm and cluster it somehow that way? i've never attempted a configuration like that.

[-]

Vicar_of_Wibbly@reddit

Oh this is such a good idea. Holy shit. Kudos for getting it to work.

[-]

Street-Buyer-2428@reddit (OP)

This is the type of response I need. Pm

[-]

habachilles@reddit

I’m so curious to see whether transferring that sort of data kills the benefits of doing this or not. I am really looking forward to your updates.

[-]

Vas1le@reddit

Just ask claude to search for it

[-]

Street-Buyer-2428@reddit (OP)

He found it

[-]

dbenc@reddit

claude take the wheel

[-]

AttitudeImportant585@reddit

disaggregated prefill. old concept but not widely supported. vllm and sglang currently have limited support

[-]

Mundane_Discount_164@reddit

It's called vibe inference.

[-]

Street-Buyer-2428@reddit (OP)

huh?

[-]

LordHenry8@reddit

So now that you have this what on earth are you going to do with it?

[-]

_Kinging@reddit

What do you use this for?

[-]

Torodaddy@reddit

Nerd penis measuring contest

[-]

wayfaast@reddit

And what are you actually doing with it?

[-]

anitricks@reddit

This… I mean, like, what’s the end goal? Half of the posts on this sub are just people buying Mac Studios and figuring out the configuration, and then it’s just slop or porn generation.

[-]

manituana@reddit

Wait, are there other use cases?

[-]

AdeptnessRound9618@reddit

If they don’t clearly state it in the post, the answer is always porn

[-]

mlucasl@reddit

With the price of all of that, you could be building an AI server instead of relying on slowish pipelines.

[-]

Savantskie1@reddit

He’s not relying on the Studios for prefill. He’s using a Blackwell card for prefill. TG on these is really good.

[-]

mlucasl@reddit

Still, the pipeline between any of those will be slower than in any purpose-built server. Splitting a model across those machines would make the communication between them one of the bottlenecks.

[-]

Savantskie1@reddit

Over thunderbolt that’s negligible which is probably what he’s doing and how these machines network the best.

[-]

Othvin@reddit

Change the power LED indicators to each be a different powerstone color!

[-]

DizzyExpedience@reddit

All that money without any specific task at hand. That's a lot of money just for fun.

[-]

ctanna5@reddit

What can you run locally with this? Like how big do you think

[-]

Muscleandgains@reddit

What kind of things can you do with this? This is something I might want to do in the future: get a cluster to create a powerful machine.

[-]

bennyb0y@reddit

Op will break even in 2039

[-]

Torodaddy@reddit

Before or after christ returns to the earth?

[-]

TronAres25@reddit

Never did

[-]

stefano_dev@reddit

You forgot a zero

[-]

DreadStallion@reddit

02039

[-]

kobraca@reddit

You missed perfect opportunity to add "You are absolutely right! Here is the correct number:" before that

[-]

Inaeipathy@reddit

It's not just a correction - it's the perfect opportunity

[-]

Amazing_Brother_3529@reddit

and Honesty.. there is nothing wrong with it.

[-]

Murky-Bullfrog8273@reddit

😁 happy??

[-]

CoolstaConnor@reddit

What cost is considered fair to purchase these?

[-]

Street-Buyer-2428@reddit (OP)

$90k

[-]

CoolstaConnor@reddit

Individually? Sorry I'm not very well versed in Mac Studios.

[-]

ezyz@reddit

How much of a speedup do you get with tensor parallelism with larger models like K2.6 or GLM 5.1?

On a single M3 Ultra, I've been able to optimize to ~220 prefill / 20 decode, but most of the public benchmarks for Exo I found aren't that much higher. So I've always assumed the main benefit is running at higher precision or distributing workloads across instances.

And for split prefill, does the Blackwell's VRAM limit the size of model you can run?

[-]

NinjaWK@reddit

How much did you spend? What model, what setting and how many tokens per sec?

[-]

gordo_Tibio@reddit

I won’t pay 1200 a year for AI when I can run it free locally!

Spends 15k on 4 Mac Studios

[-]

domus_seniorum@reddit

Just today a calculation like that went through my head too 😎😃

[-]

Street-Buyer-2428@reddit (OP)

If only you knew how much I was spending in the cloud

[-]

techdevjp@reddit

There was a post about this on here a few months back:

https://www.reddit.com/r/LocalLLaMA/comments/1o7k6e5/nvidia_dgx_spark_apple_mac_studio_4x_faster_llm/

There's also a YouTuber who posted about doing this. I'm not sure if he did it or just spoke about it. I'll see if I can find the video.

[-]

Street-Buyer-2428@reddit (OP)

Thats not on prod

[-]

techdevjp@reddit

Found the YouTube video:

https://www.youtube.com/watch?v=D2oZHzC_M28

[-]

ElementNumber6@reddit

This is a maximum Vram build. Not a maximum speed build.

[-]

techdevjp@reddit

> This is a maximum Vram build. Not a maximum speed build.

OP says in a comment, "I’m trying to do prefill on the blackwell and decode on the studios bandwidth". That's what you do to try and maximize performance. Except it doesn't work because the time saved with the faster prefill is lost due to even 50GbE (and probably 100GbE) being too slow.

[-]

ElementNumber6@reddit

That may be what he's trying to do, but that isn't the primary point of a 2TB Mac Studio build.

Also, is the model usage not restricted by the vram on the blackwell card?

[-]

techdevjp@reddit

He wants to run large models on the Mac Studios (and get the benefit of the Studio's combined memory capacity & high memory bandwidth) without the downside of the slow prefill in the current Mac Studios.

Other people have had this idea, too. The Macs run the show with this, but will send your context to the blackwell card for prefill/prompt processing. Then the blackwell will send the processed result back to the Macs, and the Macs will spit out inference tokens using their higher memory bandwidth.

It's a great idea in concept, but as shown in the video I linked above, even with 50Gbps networking between the Macs and the blackwell, the overhead of those transfers eats up the time saved by having the much more powerful blackwell do the prompt processing.

And to answer your other question, no, the model isn't restricted in size by the vram on the blackwell card.

I really suggest you watch the video I linked above. He talks in quite a bit of detail about this, including showing performance comparisons and explaining why it's a good idea in concept (and does work!) but it's let down by "only" having 50Gbps between the machines. Even at 100Gbps it is unlikely to be significantly faster than just using the Macs alone.
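The tradeoff described above can be written as a tiny model: offloading prefill only wins if the time saved on prompt processing exceeds the time spent shipping the KV cache back. This is a hedged sketch, not a measurement. It assumes the transfer is not overlapped with compute, ignores the cost of sending the prompt itself, and all function names and numbers are placeholders to be replaced with your own benchmarks.

```python
def ttft_local(prompt_tokens, local_prefill_tps):
    """Time to first token when the decode machine also does prefill."""
    return prompt_tokens / local_prefill_tps


def ttft_disaggregated(prompt_tokens, remote_prefill_tps,
                       kv_bytes_per_token, link_gbps):
    """Remote prefill time plus the time to ship the resulting KV cache."""
    prefill = prompt_tokens / remote_prefill_tps
    transfer = prompt_tokens * kv_bytes_per_token * 8 / (link_gbps * 1e9)
    return prefill + transfer


def breakeven_link_gbps(local_prefill_tps, remote_prefill_tps, kv_bytes_per_token):
    """Link speed at which both paths tie; below it, offloading loses."""
    saved_per_token = 1 / local_prefill_tps - 1 / remote_prefill_tps
    if saved_per_token <= 0:
        return float("inf")  # the remote side is not faster, so it never pays off
    return kv_bytes_per_token * 8 / saved_per_token / 1e9


if __name__ == "__main__":
    # Placeholder inputs, not measurements from any setup in this thread.
    local_tps, remote_tps, kv_per_tok, link = 200, 1000, 250_000, 50
    print("local        :", ttft_local(30_000, local_tps), "s")
    print("disaggregated:", ttft_disaggregated(30_000, remote_tps, kv_per_tok, link), "s")
    print("break-even   :", breakeven_link_gbps(local_tps, remote_tps, kv_per_tok), "Gbps")
```

The answer flips depending on the inputs, so it is worth plugging in measured prefill rates and the real per-token cache size. The model also leaves out per-request software overhead, which may matter a lot in practice.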

[-]

ElementNumber6@reddit

I had watched that video, actually, before you had shared it.

And if I recall correctly, the same model (different quantization level, however) did need to be loaded on the GPU for this to work.

[-]

techdevjp@reddit

Yes, in that way there is a limit to what is possible. The blackwells can run much smaller quant versions of the same model, but the bigger the drop in quantization, the more likely it is that problems will appear in the final output from the LLM.

So /u/Street-Buyer-2428 could run FP16 on the Macs (I guess?) and Q4 on the blackwells, but it will probably impact the quality of the output. The better the quant on the blackwells the lower the chance for problems. Q6 would be ideal. Q5 probably okay. Q4 will probably show some degradation. Smaller than Q4 and things will fall apart fast.

It's also a matter of finding models that have appropriately sized quants for both machines. The 'tuber had that problem with some models.

So it's a fun engineering exercise but not really a serious way to run LLMs today. Hopefully the M5 Ultra will have much better prefill capabilities to go with the much higher bandwidth expected from its memory.

[-]

Street-Buyer-2428@reddit (OP)

The video is great but it only touches the surface.

[-]

techdevjp@reddit

It's enough to know roughly how he did it, how fast the networking was, and that it still wasn't fast enough to make it worthwhile.

By all means, do a better job of it and release more details. It's how everyone learns. But it's going to be an engineering exercise rather than a path to greater performance. If Macs could support 400GbE then it might work out, but Thunderbolt 5 tops out at 120Gbps so that idea is basically DOA.

[-]

Street-Buyer-2428@reddit (OP)

There are definitely paths to better performance.

[-]

techdevjp@reddit

There are many paths to better performance, but I don't think this is one of them. Not yet.

[-]

Street-Buyer-2428@reddit (OP)

Will update 🔜

[-]

techdevjp@reddit

I'm really not sure what you mean by this. Obviously I understand what "prod" means, but anything like this is going to be very experimental and probably somewhat unstable.

[-]

fpodunedin@reddit

What are these devices?? Something apple im guessing

[-]

Street-Buyer-2428@reddit (OP)

Satechi dock

[-]

arananet@reddit

Nice cluster 😊

[-]

jkstaples@reddit

Just watch the Alex Ziskind video about this, he does the exact same thing

[-]

globetrot31@reddit

Here is the reason Apple says there is a chip shortage. Hope that setup can compete with Claude Code.

[-]

Formally-Fresh@reddit

Cheers man im about 6 months away from getting there myself!

Hopefully I can stand on your shoulders!

[-]

Street-Buyer-2428@reddit (OP)

Try r1o.ai !

[-]

ibishitl@reddit

If I spend the same amount in just Deepseek api, how much would it be? And how long until I use it all? hahaha

[-]

Own_Dimension_4513@reddit

At this point just get a Mac Studio lol — but respect for the commitment.

[-]

Street-Buyer-2428@reddit (OP)

Hello Everybody! I just launched the App I use for all the observability, launching and rdma management on local models. r1o.ai is the website, Take a look!!

[-]

Street-Buyer-2428@reddit (OP)

r1o.ai

[-]

pinkwar@reddit

This is 4 years of Claude max.

[-]

AshuraBaron@reddit

Look son, it’s $20k on that person’s desk.

[-]

DayPrevious7239@reddit

My all-time earnings

[-]

Torodaddy@reddit

Asking for a hardware failure from overheating by placing them like that

[-]

PeachOk54@reddit

How much did it cost?

[-]

SeaweedBrain_0711@reddit

that's sick bro. I'm jealous

[-]

oceanbreakersftw@reddit

The guy who does Mac LLM tests on yt did an EXO cluster with Mac and DGX iirc

[-]

s9p5t@reddit

how is it 2.3 TB of Ram bro? Mac studio max unified RAM is 96GB. 96x4 << 2300. What else have you added?

[-]

Trick-Assignment-828@reddit

There goes a kidney

[-]

sathi006@reddit

Install HART OS and give a taste of your compute for the Hive OS

[-]

kentrich@reddit

So, are you stacking them to make a griddle?

We have two and stacking seems like a really bad heat management structure.

[-]

Street-Buyer-2428@reddit (OP)

I 3D printed a couple of brackets that I screwed onto the drywall, but having all those Thunderbolt cables hanging on a wall like that was pretty ridiculous. Maybe it works with a shorter cable? idk

[-]

ComplexType568@reddit

nice dog

[-]

boutell@reddit

DLM (Dog Language Model)

half-bit quant

[-]

Street-Buyer-2428@reddit (OP)

thanks

[-]

sooki10@reddit

I hope you didn't use PLA!

[-]

Street-Buyer-2428@reddit (OP)

PET-G

[-]

kaafivikrant@reddit

Post benchmarks dude

[-]

Juulk9087@reddit

Slow but can load big models. There is your benchmark. Lol

[-]

No_Algae1753@reddit

What makes it slow ? Is it due to the thunderbolt connection to each Mac ?

[-]

Everyone_Is_MC@reddit

Memory bandwidth
Raw gpu compute power

[-]

Zolty@reddit

Accurate.

[-]

Toastti@reddit

If they end up being able to use the Blackwell GPU for the prefill portion, it should actually be quite snappy for large contexts.

[-]

IliasHad@reddit

How much power does this pull running?

[-]

-dysangel-@reddit

I think this would be the first heterogeneous cluster.

Actually I implemented disaggregated prefill on my Spark/Mac in the last few days (not kidding) ;p but it's only 1 M3 Ultra and 1 spark.

You don't need RDMA or TinyGPU to just send your prefilled KV cache over the network btw. You just need enough bandwidth to get the job done quickly - latency and drivers etc don't matter as much. You just need to make sure the KV cache is compatible, such as using llama.cpp or mlx on both ends (Spark can do mlx apparently, I've just been using llama.cpp though)
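A minimal sketch of the "just send it over the network" approach, assuming the KV cache has already been serialized to bytes by whatever runtime produced it. The helper names here are hypothetical and this is not llama.cpp's or MLX's API; it only shows length-prefixed framing over a plain stream socket so the receiver knows how many bytes to expect.

```python
import socket
import struct

HEADER = struct.Struct("!Q")  # 8-byte big-endian payload length


def send_blob(sock, payload):
    """Send one length-prefixed blob (e.g. a serialized KV cache)."""
    sock.sendall(HEADER.pack(len(payload)))
    sock.sendall(payload)


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return partial data."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        count = sock.recv_into(view[got:])
        if count == 0:
            raise ConnectionError("peer closed before the full payload arrived")
        got += count
    return bytes(buf)


def recv_blob(sock):
    """Receive one blob that was written by send_blob."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


if __name__ == "__main__":
    # Local demo over a connected socket pair; a real setup would use TCP
    # between the prefill box and the decode box.
    left, right = socket.socketpair()
    send_blob(left, b"\x00" * 4096)
    print(len(recv_blob(right)), "bytes received")
    left.close()
    right.close()
```

The framing is deliberately boring. The hard part stays where the comment says it is: making sure both ends agree on the cache layout.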

[-]

freddycheeba@reddit

Please tell me you're going to connect them all together with Thunderbolt and enable DMA.

[-]

Street-Buyer-2428@reddit (OP)

Thats exactly what I’m going to do

[-]

dbzunicorn@reddit

all for 25 tokens per second and 2 mins pp!!

[-]

Street-Buyer-2428@reddit (OP)

But… concurrency 😭

[-]

HeadtripVee@reddit

Was that a double post referencing concurrency on purpose? If it was i flipping love you.

[-]

killerjurist@reddit

~~Inception~~ Concurrency

[-]

Street-Buyer-2428@reddit (OP)

you get it. 🤣

[-]

Street-Buyer-2428@reddit (OP)

But… concurrency 😭

[-]

Important_Coach9717@reddit

All this to generate anime porn …

[-]

toptier4093@reddit

What the fuck. That's lit

[-]

MaximKiselev@reddit

Hello, is it a Mac mini? Does it have a direct connector like SLI? Is it better than a DGX in performance per watt?

[-]

Dismal-Particular545@reddit

OP, would it be possible to connect a MacBook Pro and a Mac Studio for the same combined unified memory effect?

[-]

power97992@reddit

U must have a good job? Will u upgrade to m5 ultra?

[-]

codehamr@reddit

That split makes sense from my own runs. I went from M3 Ultra 512GB to RTX 6000 Pro 96GB. Prefill on long context was night and day, roughly 5x faster. Decode on the Mac mesh is fine. Prefill is where Apple silicon falls behind.

[-]

saltyourhash@reddit

Alex Ziskind got you hyped?

[-]

Kinky_No_Bit@reddit

https://www.youtube.com/shorts/EiAOY-lIzTk

Here's the song I picture OP singing.

[-]

nojukuramu@reddit

If you ever got tired of it, send it to me

[-]

David_Fetta@reddit

And still a 20 dollar subscription will outplay these with ease

[-]

Street-Buyer-2428@reddit (OP)

good luck

[-]

Funny_Working_7490@reddit

which model you play with this toy??

[-]

Looz-Ashae@reddit

Nice. SWE's goals

[-]

Flimsy-Researcher-46@reddit

I’ll give you $20 for em when the M5 ultra comes out

[-]

Street-Buyer-2428@reddit (OP)

If you solve the issue i’m having ill give u one for free (not really)

[-]

Flimsy-Researcher-46@reddit

You should try asking claude (I’ll take the studio now tyvm)

[-]

DR4G0NH3ART@reddit

I have all the information now, I have formatted the response and sent a DM. Please send the Mac at address shared.

[-]

curious-guy-5529@reddit

Would you mind telling us what you have built/ are building with this super power?

[-]

val_in_tech@reddit

That is one small PiPi. Who cares. You can't even run opencode without waiting 10 mins for a good model to start working. Macs are a total piece of shit for any real work unless you're in a cult and need to make some point by downvoting stuff.

[-]

openSourcerer9000@reddit

Open code doesn't have prompt caching, only token maxing. Try kon

[-]

a9udn9u@reddit

How's it 2.3TB? 512x4 = 2048 = exactly 2TB, am I wrong?

[-]

Street-Buyer-2428@reddit (OP)

2x macbook pros, and the 72gb blackwell

[-]

a9udn9u@reddit

Awesome

[-]

_mayuk@reddit

Give me one don’t be greedy :(

[-]

spense01@reddit

Why not just use Exxos?

[-]

gravybender@reddit

my 128gb studio comes on tuesday finally. been waiting 8 weeks. can finally migrate off my 24gb mini

[-]

Allenite@reddit

Very nice. What do you plan to run on this?

[-]

pizzaiolo2@reddit

How much was this?

[-]

pacman829@reddit

I'm jealous. Congrats

[-]

FormalAd7367@reddit

isn’t it cheaper to just build a used server rig….

[-]

Street-Buyer-2428@reddit (OP)

Not for the price I got these

[-]

FormalAd7367@reddit

how much please?

[-]

Street-Buyer-2428@reddit (OP)

Refurb prices from september

[-]

openSourcerer9000@reddit

Not the first, this may be helpful:

https://blog.exolabs.net/nvidia-dgx-spark/

[-]

nmrk@reddit

Well, maybe the second or third heterogeneous cluster at best.

https://www.youtube.com/watch?v=D2oZHzC_M28

[-]

Street-Buyer-2428@reddit (OP)

Not with RDMA

[-]

nmrk@reddit

Dude RDMA is exactly how exo works, that's what it's made for. Maybe you should watch the video, it will save you a lot of work.

[-]

Street-Buyer-2428@reddit (OP)

Exo currently wraps mlx onto their processes. As with JACCL, there's still work to be done on it. I'm currently in contact with the maintainer to collaborate and possibly improve. Really nice guy. I have a stable rust backend that broadcasts rdma status and gives some observability onto entire clusters. If you would like further information let me know. Have you tried any of this or did you just watch the video?

[-]

Vegetable-Score-3915@reddit

Sent you a msg. Definitely keen to understand your perspective.

[-]

Vegetable-Score-3915@reddit

You would get the best speed between the studios using both rdma and thunderbolt.

[-]

Street-Buyer-2428@reddit (OP)

Or maybe 2 thunderbolts at once for double the speedz

[-]

Vegetable-Score-3915@reddit

You need to connect each studio to each other with the thunderbolt 5 cables. Don't use the slot right next to the Ethernet port, apparently that one is unreliable. Make sure you have rdma enabled on each of the devices. Have done it myself. I have 3 m3 ultras. Each one is using 2 thunderbolt cables connecting them to the other two.

Note exo is great for getting started, but security-wise it isn't great. Happy to discuss on this thread.

Recommend watching one of the youtube vids listed on exo's website with ppl setting up rdma with these devices, will save some time and small issues.
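The cabling described above is a full mesh: every machine has a direct link to every other one. A small sketch of how the cable and port counts grow, with hypothetical node names:

```python
from itertools import combinations


def full_mesh(nodes, cables_per_pair=1):
    """Return (list of cable endpoints, ports needed on each node)."""
    links = [pair for pair in combinations(nodes, 2)
             for _ in range(cables_per_pair)]
    ports_per_node = (len(nodes) - 1) * cables_per_pair
    return links, ports_per_node


if __name__ == "__main__":
    for count in (3, 4):
        names = [f"studio{i}" for i in range(1, count + 1)]
        links, ports = full_mesh(names)
        print(f"{count} nodes: {len(links)} cables, {ports} ports each")
```

With three machines that works out to two cables per machine, matching the setup described, and a fourth machine brings it to three ports each and six cables in total.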

[-]

Street-Buyer-2428@reddit (OP)

Oh ive already set this up and have been working on it since december. Given that you have these ultras, I would like to share something with you that I think could be useful for you and would like your opinion on. PM if you can

[-]

Vegetable-Score-3915@reddit

Sure. Happy to. Much appreciated.

[-]

Vegetable-Score-3915@reddit

Pretty sure one of those dudes who did a YouTube vid on linking 4 M3 Ultras also did a Strix Halo pairing with an M3 Ultra. I think it was Alex Ziskind.

[-]

chensium@reddit

Have you tried llm-d or Exo for heterogeneous inference?

[-]

Rkozak@reddit

I think you are missing a stone.

[-]

Street-Buyer-2428@reddit (OP)

which

[-]

Rkozak@reddit

Ahhh I didn’t see the MacBook. You got all 5

[-]

Vancecookcobain@reddit

You try it with DeepSeek v4 pro? If so how many tps are you getting out that thing??

[-]

Street-Buyer-2428@reddit (OP)

Will soon for sure!

[-]

juzatypicaltroll@reddit

Don’t get it. Nvidia and Macs? Thought they don’t play well together.

[-]

frostyplanet@reddit

What brand is this device?

[-]

idkfawin32@reddit

What'd you do let them roll around in the back of a truck? Buff them scuffs out!(Mostly the third and first one from the bottom)

[-]

AccomplishedFix3476@reddit

2.3 tb of ram for prefill is a flex i didnt know was on the table for a homelab tbh. the rdma over to blackwells for decode is the part that feels like a server room from 2027 instead of 2026 ngl. wattage at full load is gonna be the real story

[-]

Street-Buyer-2428@reddit (OP)

Yeah. I’m gonna try and undervolt it so my neighbors wont sue me

[-]

devnullopinions@reddit

TTFT with a large input?

[-]

AdSignificant2058@reddit

I don't think Tinygrad eGPU is what you want. It's cute that it works. But it's very slow and not optimized. Your goal is prefill speed. What you probably want is a DGX spark or two or an RTX 6000 Pro on a Linux machine. Linux has proper drivers to run Nvidia metal.

[-]