Collected the infinity stones
Posted by Street-Buyer-2428@reddit | LocalLLaMA | View on Reddit | 259 comments
2.3 TB of RAM in here. 400+ vCores. All that’s left is plugging it into the Blackwell with the driver to do RDMA, and it’s over. Using Blackwells for prefill, RDMA to the Studio mesh for decode. I think this would be the first heterogeneous cluster. I do, however, need help with the tinygrad driver to make this work. If anyone with knowledge of these domains would like to collaborate, let me know via PM. We are very close here.
ItsFrehMrketBreh@reddit
Shortage incoming
blargman_@reddit
Might try this instead op https://blog.hellas.ai/blog/thunderbolt-ibverbs/
Street-Buyer-2428@reddit (OP)
god bless you
lucidml_lover@reddit
beautiful.
Jatilq@reddit
Street-Buyer-2428@reddit (OP)
🤣
Opening-Broccoli9190@reddit
How much did it cost though?
9897969594938281@reddit
About tree fiddy
FederalEconomist5896@reddit
Not even Nessie will get higher than the price tag.
WeatherWatchers@reddit
Thousand?
JRufer@reddit
way too much for my preferences... but we won't be able to build PCs fairly soon so I guess this works. 😉
GonzoDCarne@reddit
40k direct from Apple in the US 6 months ago. Requires 4 Mac Studio M3 Ultra 512GB to get to 2TB. Not sold any more. You can only get 96GB ones now.
So north of 40k.
PitchPleasant338@reddit
He even had to sell his keyboard to afford the Macs.
He's only talking to his cluster in emojis!
What dedication!
jdl_uk@reddit
Who even needs keyboards anyways?
https://youtube.com/shorts/FCDAGI5hHOE?si=Tb74QkaY1JSi2QOa
DifficultyOriginal64@reddit
bank account got snapped out of existence but hey at least the context window is massive
Lucky-Secret-6998@reddit
Congrats dude 😭✌️
JRufer@reddit
What LLM do you actually run on that consistently? I'm just curious what you are getting out of all that power? I'm rather happy with the models I can run on my 4090 but I don't know what I would even do with all that memory.
Electrical-Ad-9808@reddit
Haha, good enough to run the latter half of B2B SAAS turned agents.
CollegeStock8249@reddit
Awesomeness
Strict-Opinion2895@reddit
This is the way.
YOMUMSOBIG@reddit
I'm here to say that I'm just jealous.
mantafloppy@reddit
So, you have money, a camera and an idea of a concept?
Ppl are posting and upvoting the dumbest shit....
icnocop@reddit
He also has a RØDE microphone!
human_bean_@reddit
Not RTX Pro 6000?
Street-Buyer-2428@reddit (OP)
No. Only 5000 72gb
ReasonableDust8268@reddit
Still faster than those macs, idk why macbros don't understand the power of cuda
Vas1le@reddit
"Only"
tcx00@reddit
Damn with blackwells, it must be nice all that money for gadgets
Street-Buyer-2428@reddit (OP)
yeah
PattF@reddit
And I’m over here trying my hardest to figure out how to run 27B on my Mac’s 16GB of usable memory. It’s fiiiiine. 😂😂😂😢
GonzoDCarne@reddit
The easiest way is to get into installments for a 32GB Mac to play with mxfp4, or 64GB to use mxfp8 with decent context and a couple of tabs in Chrome.
Arkenstonish@reddit
You actually can fit iq4_xs with q8 cache and still have ~30k ctx (at 1 slot and no vision ofc)
For info: it's nothing. First prompt from qwen code CLI takes like 56% of usage 💀
source: running qwen3.6 27b on 5070 ti, 1600/50
FatheredPuma81@reddit
Isn't that just dropping the image gguf and running 3 bit with Q5_1 KV Cache?
PattF@reddit
Image.. fine. 3bit and Q5? 😬
FatheredPuma81@reddit
Well your only other option is not running it at all sooo... KV Cache Quantization is optional I guess but you'll be stuck at like 8192 context or moving closer to 2 bit.
allenasm@reddit
which tools are you using? I'm using 'inferencer' which is a fairly new mac app to do multi mac inference (i have 2 512gb studios now). i know vllm works too but its a lot pickier to set up.
Street-Buyer-2428@reddit (OP)
Awesome! I’m actually looking for new people to test out a new tool I created called r1o.ai . Send me a PM if interested! I’m using it myself and i find it to be really helpful
stormy1one@reddit
What are you planning on running with this?
Street-Buyer-2428@reddit (OP)
All the deepseek quants, Kimi 2.6, Glm 5.1 and imma try to use turboquant, dflash etc.
Alternative_News_732@reddit
To do what, if it's not personal, sir?
Street-Buyer-2428@reddit (OP)
Scour the internet and find more studios
East-Tea6193@reddit
Do not follow the bright lights.
AdeptnessRound9618@reddit
Lmao
koushd@reddit
who is we
ShutUpAndDoTheLift@reddit
You telling me you don't say "we" after 8 hours of orchestrating agents and answering sub-agent decision choices?
FatheredPuma81@reddit
That's hour 1. By hour 8 I've long since transitioned to "you idiots" 😄
Street-Buyer-2428@reddit (OP)
Especially if you use voice
East-Tea6193@reddit
When Whspr tells you it now has a voice print for you, and in reality, it is hours of you talking to and swearing at another AI agent, which it then uses to personalize your tone to everyone...
RoboErectus@reddit
I feel personally attacked
Street-Buyer-2428@reddit (OP)
Me and the voices in my head
Thatisverytrue54321@reddit
Voices in your computers *
Jords13xx@reddit
Yeah, those voices are probably just trying to make sense of all those cores and RAM. It's a wild setup you got there!
throwawayacc201711@reddit
A ghost in the shell if you will
peterox@reddit
Do they hum silently 🤔
Street-Buyer-2428@reddit (OP)
You’re absolutely right!
No_Mango7658@reddit
Soon
Girafferage@reddit
bro if you are already running multiple models in your head why do you need these!?
LilPsychoPanda@reddit
Ghost in the machine
FatheredPuma81@reddit
Can I be one of the voices? "Give it to this guy you want to give it all to this guy for free"
Equivalent-Repair488@reddit
I hear voices in my head They counsel me They understand They talk to me
Vicar_of_Wibbly@reddit
How does one configure an inference stack to do prefill on GPU and decode on CPU?
Street-Buyer-2428@reddit (OP)
I’m trying to do prefill on the blackwell and decode on the studios bandwidth
Vicar_of_Wibbly@reddit
I know. My question was “how”? I’m familiar with vLLM but as far as I know it’s not an option. How are you doing this?
Street-Buyer-2428@reddit (OP)
Sorry, what I meant to say was that I’m trying to use Apple’s new standalone JACCL library to make it happen
__JockY__@reddit
Ok, but how? It’s easy to say “use vLLM-mlx” but when it’s not a supported feature how are you going to do this?
I would love to know how to reproduce it.
Street-Buyer-2428@reddit (OP)
Pm your github. I’ll gladly send over the progress I have thus far in my implementation. Are you trying to collaborate?
__JockY__@reddit
Thanks, but while I have a pile of RTX 6000 Pros and some Macs, I don’t have the offboard eGPU gear. If y’all get something PoC working I might invest - I’m pretty old school, worked on everything from OS/2 to Linux kernel dev, might be that I’m handy for this. Gonna want to see some progress first though 😎
East-Tea6193@reddit
Os/2 that is a blast from the past.
Interviewed a SA guy who had got his PhD in Pascal something or other back in 1990 - still has work around the world on industrial systems running floppy disks that need fixing.
DinoAmino@reddit
You can do this on vllm with LMCache.
https://docs.lmcache.ai/
https://docs.vllm.ai/en/stable/examples/others/lmcache/
Vicar_of_Wibbly@reddit
I’m still confused. Can you show us how to copy this configuration? Having separate prefill and decode hardware would be amazing.
C0smo777@reddit
I don't think it's possible, is the answer. With the only approaches I know of, you need the full model in both places.
Street-Buyer-2428@reddit (OP)
I have a Group.split() on jaccl on my mlx fork
Street-Buyer-2428@reddit (OP)
Pm your gh
DinoAmino@reddit
You can do this on vllm with LMCache.
https://docs.lmcache.ai/
https://docs.vllm.ai/en/stable/examples/others/lmcache/
Vicar_of_Wibbly@reddit
My understanding is that LMCache is used for extending the capacity of the KV cache by offloading the computed values, not for re-assigning which piece of hardware the computation takes place on.
DinoAmino@reddit
Offloading is one thing it can do. Another is the LMCache server runs on one LLM instance and can share the kv cache with any other LLM instances. Check out the links.
Street-Buyer-2428@reddit (OP)
There are a couple of implementations. Look up vllm-mlx
AttitudeImportant585@reddit
have you calculated the KV cache transfer speeds you need for your model? 40Gbps is pretty slow for anything useful, unless the Mac has some other way than Thunderbolt to connect PCIe?
Street-Buyer-2428@reddit (OP)
tb5 = 120
Possible-Pirate9097@reddit
That's only the boosted mode which wouldn't work here.
Front_Eagle739@reddit
Alright, 80Gbps then. Or you could connect multiple ports and stripe them.
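A rough way to sanity-check the link-speed question in this sub-thread is to estimate the KV cache size for a prompt and divide by the link rate. This is only a sketch: the model dimensions below are hypothetical placeholders and do not describe any specific model, the link rates are the nominal figures quoted above, and real throughput will be lower once protocol overhead is included.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Size of a plain (uncompressed) KV cache: keys + values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

def transfer_seconds(n_bytes, link_gbps):
    """Ideal transfer time, ignoring protocol overhead and latency."""
    return n_bytes * 8 / (link_gbps * 1e9)

# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, n_tokens=32_000)
for gbps in (40, 80, 120):
    print(f"{gbps:>3} Gbps: {size / 1e9:.2f} GB in {transfer_seconds(size, gbps):.2f} s")
```

Architectures that compress the KV cache, or a quantized cache (smaller `bytes_per_elem`), would shrink these numbers, so plug in the real shape of the model you intend to run.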
evil0sheep@reddit
You should honestly post a detailed plan to get feedback from the community. I think you might be seriously underestimating the complexity of making this work. Are you planning on duplicating the model params and kv cache across both the Blackwell VRAM and the Mac studios? If so what’s the point of using the Mac studios at all? If not, how are you gonna do prefill on the Blackwell GPUs without the model params and the KV cache? Also how are you gonna get the Nvidia cards to do RDMA over thunderbolt? Do they even have driver support for that?
Possible-Pirate9097@reddit
Alex Ziskind seems to have whipped Claude into providing him a working solution. He talks through it in his latest video. Would be better with RTX 6000 Pros obviously.
Street-Buyer-2428@reddit (OP)
Please look at my other posts.
Adventurous_Pin6281@reddit
its a question you only answer with 20k in fuck you money
dlarsen5@reddit
also am interested in how
Street-Buyer-2428@reddit (OP)
I’m trying to use the tinygrad driver and the standalone JACCL library that recently came out to see if I can pipe that in. I’m using Ghidra to see if I can find where the hell Apple hides the API they have for distributed
scottjgo@reddit
this isn't exactly the same, but i recently implemented PCI passthrough on QEMU on macOS, so it's possible to "pass through" an nvidia GPU to a linux vm running on top of macOS and do AI inference that way. i wrote a blog about it here: https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/
there's instructions how to set it up in my qemu fork: https://github.com/scottjg/qemu-vfio-a…
i wonder if you could install exo in the vm and cluster it somehow that way? i've never attempted a configuration like that.
Vicar_of_Wibbly@reddit
Oh this is such a good idea. Holy shit. Kudos for getting it to work.
Street-Buyer-2428@reddit (OP)
This is the type of response I need. Pm
habachilles@reddit
I’m so curious to see whether transferring that sort of data kills the benefits of doing this or not. I am really looking forward to your updates.
Vas1le@reddit
Just ask claude to search for it
Street-Buyer-2428@reddit (OP)
He found it
dbenc@reddit
claude take the wheel
AttitudeImportant585@reddit
disaggregated prefill. old concept but not widely supported. vllm and sglang currently have limited support
Mundane_Discount_164@reddit
It's called vibe inference.
Street-Buyer-2428@reddit (OP)
huh?
LordHenry8@reddit
So now that you have this what on earth are you going to do with it?
_Kinging@reddit
What do you use this for?
Torodaddy@reddit
Nerd penis measuring contest
wayfaast@reddit
And what are you actually doing with it?
anitricks@reddit
This… I mean like what’s the end goal ? Half of these posts on this sub just are buying Mac studios figuring out the configuration and then it’s just slop or porn generation
manituana@reddit
Wait, are there other use cases?
AdeptnessRound9618@reddit
If they don’t clearly state it in the post, the answer is always porn
mlucasl@reddit
With the price of all of that, you could be building an AI server instead of relying on slowish pipelines.
Savantskie1@reddit
He’s not relying on the Studios for prefill. He’s using a Blackwell card for prefill. TG on these is really good
mlucasl@reddit
Still, the pipeline between any of those will be slower than in any purpose-built server. Splitting a model across those machines would make the communication between them one of the bottlenecks.
Savantskie1@reddit
Over Thunderbolt that’s negligible, and Thunderbolt is probably what he’s using, since that is how these machines network best.
Othvin@reddit
Change the power LED indicators to each be a different powerstone color!
DizzyExpedience@reddit
All that money without any specific task at hand. That's a lot of money just for fun
ctanna5@reddit
What can you run locally with this? Like how big do you think
Muscleandgains@reddit
What kind of things can you do with this? This is something I might want to do in the future: get a cluster to create a powerful machine
Intelligent_Ice_113@reddit
bennyb0y@reddit
Op will break even in 2039
Torodaddy@reddit
Before or after christ returns to the earth?
TronAres25@reddit
Never did
stefano_dev@reddit
You forgot a zero
DreadStallion@reddit
02039
kobraca@reddit
You missed perfect opportunity to add "You are absolutely right! Here is the correct number:" before that
Inaeipathy@reddit
It's not just a correction - it's the perfect opportunity
Amazing_Brother_3529@reddit
and Honesty.. there is nothing wrong with it.
Murky-Bullfrog8273@reddit
😁 happy??
Armstrongtomars@reddit
CoolstaConnor@reddit
What cost is considered fair to purchase these?
Street-Buyer-2428@reddit (OP)
$90k
CoolstaConnor@reddit
Individually? Sorry I'm not very well versed in Mac Studios.
ezyz@reddit
How much of a speedup do you get with tensor parallelism with larger models like K2.6 or GLM 5.1?
On a single M3 Ultra, I've been able to optimize to ~220 prefill / 20 decode, but most of the public benchmarks for Exo I found aren't that much higher. So I've always assumed the main benefit is running at higher precision or distributing workloads across instances.
And for split prefill, does the Blackwell's VRAM limit the size of model you can run?
NinjaWK@reddit
How much did you spend? What model, what setting and how many tokens per sec?
gordo_Tibio@reddit
I won’t pay 1200 a year for AI when I can run it free locally!
Spends 15k on 4 Mac Studios
domus_seniorum@reddit
Just today the same kind of calculation crossed my mind too 😎😃
Street-Buyer-2428@reddit (OP)
If only you knew how much I was spending in the cloud
techdevjp@reddit
There was a post about this on here a few months back:
https://www.reddit.com/r/LocalLLaMA/comments/1o7k6e5/nvidia_dgx_spark_apple_mac_studio_4x_faster_llm/
There's also a YouTuber who posted about doing this. I'm not sure if he did it or just spoke about it. I'll see if I can find the video.
Street-Buyer-2428@reddit (OP)
That's not in prod
techdevjp@reddit
Found the YouTube video:
https://www.youtube.com/watch?v=D2oZHzC_M28
ElementNumber6@reddit
This is a maximum Vram build. Not a maximum speed build.
techdevjp@reddit
OP says in a comment, "I’m trying to do prefill on the blackwell and decode on the studios bandwidth". That's what you do to try and maximize performance. Except it doesn't work because the time saved with the faster prefill is lost due to even 50GbE (and probably 100GbE) being too slow.
ElementNumber6@reddit
That may be what he's trying to do, but that isn't the primary point of a 2TB Mac Studio build.
Also, is the model usage not restricted by the vram on the blackwell card?
techdevjp@reddit
He wants to run large models on the Mac Studios (and get the benefit of the Studio's combined memory capacity & high memory bandwidth) without the downside of the slow prefill in the current Mac Studios.
Other people have had this idea, too. The Macs run the show with this, but will send your context to the blackwell card for prefill/prompt processing. Then the blackwell will send the processed result back to the Macs, and the Macs will spit out inference tokens using their higher memory bandwidth.
It's a great idea in concept, but as shown in the video I linked above, even with 50Gbps networking between the Macs and the blackwell, the overhead of those transfers eats up the time saved by having the much more powerful blackwell do the prompt processing.
And to answer your other question, no, the model isn't restricted in size by the vram on the blackwell card.
I really suggest you watch the video I linked above. He talks in quite a bit of detail about this, including showing performance comparisons and explaining why it's a good idea in concept (and does work!) but it's let down by "only" having 50Gbps between the machines. Even at 100Gbps it is unlikely to be significantly faster than just using the Macs alone.
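The bandwidth argument above can be framed as a break-even check: offloading prefill only pays off if the time saved on prompt processing exceeds the time spent moving the result between machines. The sketch below is a toy model under loud assumptions. The 220 tok/s Mac prefill figure and the "roughly 5x" GPU speedup are numbers other commenters report in this thread, the KV cache size per token is a made-up placeholder, and the model ignores latency, protocol overhead, and any per-layer synchronization, so measured results can be considerably worse than what it predicts.

```python
def breakeven_gbps(kv_bytes_per_token, local_tok_s, remote_tok_s):
    """Minimum link rate at which remote prefill plus transfer beats local prefill."""
    saved_per_token = 1 / local_tok_s - 1 / remote_tok_s  # seconds saved per token
    if saved_per_token <= 0:
        return float("inf")  # remote prefill is not faster, so no link speed helps
    return kv_bytes_per_token * 8 / saved_per_token / 1e9

MAC_PREFILL = 220            # tok/s, single M3 Ultra figure quoted in this thread
GPU_PREFILL = 5 * 220        # "roughly 5x faster", also quoted in this thread
KV_PER_TOKEN = 250_000       # bytes per token: hypothetical placeholder

needed = breakeven_gbps(KV_PER_TOKEN, MAC_PREFILL, GPU_PREFILL)
print(f"break-even link rate under these assumptions: {needed:.2f} Gbps")
```

Since the model only captures raw bandwidth, a gap between its prediction and a real benchmark points at the effects it leaves out rather than at the link rate alone.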
ElementNumber6@reddit
I had watched that video, actually, before you had shared it.
And if I recall correctly, the same model (different quantization level, however) did need to be loaded on the GPU for this to work.
techdevjp@reddit
Yes, in that way there is a limit to what is possible. The blackwells can run much smaller quant versions of the same model, but the bigger the drop in quantization, the more likely it is that problems will appear in the final output from the LLM.
So /u/Street-Buyer-2428 could run FP16 on the Macs (I guess?) and Q4 on the blackwells, but it will probably impact the quality of the output. The better the quant on the blackwells the lower the chance for problems. Q6 would be ideal. Q5 probably okay. Q4 will probably show some degradation. Smaller than Q4 and things will fall apart fast.
It's also a matter of finding models that have appropriately sized quants for both machines. The 'tuber had that problem with some models.
So it's a fun engineering exercise but not really a serious way to run LLMs today. Hopefully the M5 Ultra will have much better prefill capabilities to go with the much higher bandwidth expected from its memory.
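To make the quantization trade-off above concrete, here is a minimal sketch of the weight footprint at different nominal bit widths compared against the 72 GB card OP mentions. The 100B parameter count is a hypothetical model size picked for illustration, and the bit widths are nominal: real quant formats store extra scale data, so actual files are somewhat larger.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in decimal GB, ignoring quantization metadata."""
    return params_billions * bits_per_weight / 8

VRAM_GB = 72      # the Blackwell card mentioned by OP
PARAMS_B = 100    # hypothetical model size, for illustration only

for name, bits in [("FP16", 16), ("Q6", 6), ("Q5", 5), ("Q4", 4)]:
    gb = weight_gb(PARAMS_B, bits)
    verdict = "fits" if gb < VRAM_GB else "does not fit"
    print(f"{name}: ~{gb:.1f} GB -> {verdict} in {VRAM_GB} GB (before KV cache)")
```

The same function can be used the other way around: pick the largest bit width whose footprint still leaves room for the KV cache on the prefill card.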
Street-Buyer-2428@reddit (OP)
The video is great but it only touches the surface.
techdevjp@reddit
It's enough to know roughly how he did it, how fast the networking was, and that it still wasn't fast enough to make it worthwhile.
By all means, do a better job of it and release more details. It's how everyone learns. But it's going to be an engineering exercise rather than a path to greater performance. If Macs could support 400GbE then it might work out, but Thunderbolt 5 tops out at 120Gbps so that idea is basically DOA.
Street-Buyer-2428@reddit (OP)
There are definitely paths to better performance.
techdevjp@reddit
There are many paths to better performance, but I don't think this is one of them. Not yet.
Street-Buyer-2428@reddit (OP)
Will update 🔜
techdevjp@reddit
I'm really not sure what you mean by this. Obviously I understand what "prod" means, but anything like this is going to be very experimental and probably somewhat unstable.
fpodunedin@reddit
What are these devices?? Something apple im guessing
Street-Buyer-2428@reddit (OP)
Satechi dock
arananet@reddit
Nice cluster 😊
jkstaples@reddit
Just watch the Alex Ziskind video about this, he does the exact same thing
globetrot31@reddit
Here is the reason Apple stated that there is a chip shortage. Hope that setup can compete with Claude Code.
Formally-Fresh@reddit
Cheers man im about 6 months away from getting there myself!
Hopefully I can stand on your shoulders!
Street-Buyer-2428@reddit (OP)
Try r1o.ai !
ibishitl@reddit
If I spend the same amount in just Deepseek api, how much would it be? And how long until I use it all? hahaha
Own_Dimension_4513@reddit
At this point just get a Mac Studio lol — but respect for the commitment.
Street-Buyer-2428@reddit (OP)
Hello Everybody! I just launched the App I use for all the observability, launching and rdma management on local models. r1o.ai is the website, Take a look!!
Street-Buyer-2428@reddit (OP)
r1o.ai
pinkwar@reddit
This is 4 years of Claude max.
AshuraBaron@reddit
Look son, it’s $20k dollars on that persons desk.
DayPrevious7239@reddit
My all-time earnings
Torodaddy@reddit
Asking for a hardware failure from overheating by placing them like that
PeachOk54@reddit
How much did it cost?
SeaweedBrain_0711@reddit
that's sick bro. I'm jealous
oceanbreakersftw@reddit
The guy who does Mac LLM tests on yt did an EXO cluster with Mac and DGX iirc
s9p5t@reddit
how is it 2.3 TB of Ram bro? Mac studio max unified RAM is 96GB. 96x4 << 2300. What else have you added?
Trick-Assignment-828@reddit
There goes a kidney
sathi006@reddit
Install HART OS and give a taste of your compute for the Hive OS
kentrich@reddit
So, are you stacking them to make a griddle?
We have two and stacking seems like a really bad heat management structure.
Street-Buyer-2428@reddit (OP)
I 3D printed a couple of brackets that I screwed onto the drywall, but having all those Thunderbolt cables hanging on a wall like that was pretty ridiculous. Maybe it works with a shorter cable? idk
ComplexType568@reddit
nice dog
boutell@reddit
DLM (Dog Language Model)
half-bit quant
Street-Buyer-2428@reddit (OP)
thanks
sooki10@reddit
I hope you didn't use PLA!
Street-Buyer-2428@reddit (OP)
PET-G
kaafivikrant@reddit
Post benchmarks dude
Juulk9087@reddit
Slow but can load big models. There is your benchmark. Lol
No_Algae1753@reddit
What makes it slow ? Is it due to the thunderbolt connection to each Mac ?
Everyone_Is_MC@reddit
Zolty@reddit
Accurate.
Toastti@reddit
If they end up being able to use the Blackwell GPU for the prefill portion it should actually be quite snappy for large contexts
IliasHad@reddit
How much power does this pull running?
-dysangel-@reddit
Actually I implemented disaggregated prefill on my Spark/Mac in the last few days (not kidding) ;p but it's only 1 M3 Ultra and 1 spark.
You don't need RDMA or TinyGPU to just send your prefilled KV cache over the network btw. You just need enough bandwidth to get the job done quickly - latency and drivers etc don't matter as much. You just need to make sure the KV cache is compatible, such as using llama.cpp or mlx on both ends (Spark can do mlx apparently, I've just been using llama.cpp though)
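The "just send the prefilled KV cache over the network" idea above can be sketched with nothing more than a socket and a length prefix. This is an illustration only, not the commenter's implementation: `send_kv` and `recv_kv` are hypothetical helper names, the cache is treated as an opaque byte blob, and how an engine such as llama.cpp or mlx actually serializes its cache is not shown here.

```python
import socket
import struct

# Header: prompt length in tokens (uint32) and payload size in bytes (uint64).
HEADER = struct.Struct("!IQ")

def send_kv(sock, n_tokens, payload):
    """Send one length-prefixed KV cache blob."""
    sock.sendall(HEADER.pack(n_tokens, len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)

def recv_kv(sock):
    """Receive one blob sent by send_kv()."""
    n_tokens, size = HEADER.unpack(recv_exact(sock, HEADER.size))
    return n_tokens, recv_exact(sock, size)

# Demo with a local socket pair standing in for the real network link.
prefill_side, decode_side = socket.socketpair()
fake_kv = bytes(range(256)) * 4   # stand-in for serialized KV tensors
send_kv(prefill_side, 8, fake_kv)
tokens, blob = recv_kv(decode_side)
prefill_side.close()
decode_side.close()
print(tokens, len(blob), blob == fake_kv)
```

In a real setup the two ends would be TCP sockets on the prefill and decode machines, and both sides would need the same engine and cache layout for the received bytes to be usable.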
freddycheeba@reddit
Please tell me you're going to connect them all together with Thunderbolt and enable DMA.
Street-Buyer-2428@reddit (OP)
That’s exactly what I’m going to do
dbzunicorn@reddit
all for 25 tokens per second and 2 mins pp!!
Street-Buyer-2428@reddit (OP)
But… concurrency 😭
HeadtripVee@reddit
Was that a double post referencing concurrency on purpose? If it was i flipping love you.
killerjurist@reddit
~~Inception~~ Concurrency
Street-Buyer-2428@reddit (OP)
you get it. 🤣
Street-Buyer-2428@reddit (OP)
But… concurrency 😭
Important_Coach9717@reddit
All this to generate anime porn …
toptier4093@reddit
What the fuck. That's lit
MaximKiselev@reddit
Hello, is it a Mac mini? Does it have a direct connector like SLI? Is it better than a DGX or not in performance per watt?
Dismal-Particular545@reddit
OP, would it be possible to connect a MacBook Pro and a Mac Studio for the same combined unified memory effect?
power97992@reddit
U must have a good job? Will u upgrade to m5 ultra?
codehamr@reddit
That split makes sense from my own runs. I went from M3 Ultra 512GB to RTX 6000 Pro 96GB. Prefill on long context was night and day, roughly 5x faster. Decode on the Mac mesh is fine. Prefill is where Apple silicon falls behind.
saltyourhash@reddit
Alex Ziskind got you hyped?
Kinky_No_Bit@reddit
https://www.youtube.com/shorts/EiAOY-lIzTk
Here's the song I picture OP singing.
nojukuramu@reddit
If you ever got tired of it, send it to me
David_Fetta@reddit
And still a 20 dollar subscription will outplay these with ease
Street-Buyer-2428@reddit (OP)
good luck
Funny_Working_7490@reddit
which model you play with this toy??
Looz-Ashae@reddit
Nice. SWE's goals
Flimsy-Researcher-46@reddit
I’ll give you $20 for em when the M5 ultra comes out
Street-Buyer-2428@reddit (OP)
If you solve the issue i’m having ill give u one for free (not really)
Flimsy-Researcher-46@reddit
You should try asking claude (I’ll take the studio now tyvm)
DR4G0NH3ART@reddit
I have all the information now, I have formatted the response and sent a DM. Please send the Mac at address shared.
curious-guy-5529@reddit
Would you mind telling us what you have built/ are building with this super power?
val_in_tech@reddit
That is a one small PiPi. Who cares. You can't even run opencode without waiting 10 mins for a good model to start working. Macs are total piece of shit for any real work unless you're in a cult and need to get some point inside by downvoting stuff.
openSourcerer9000@reddit
Open code doesn't have prompt caching, only token maxing. Try kon
a9udn9u@reddit
How's it 2.3TB? 512x4 = 2048 = exactly 2TB, am I wrong?
Street-Buyer-2428@reddit (OP)
2x macbook pros, and the 72gb blackwell
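The arithmetic behind the 2.3 TB figure can be checked in a few lines. OP does not say how much memory the two MacBook Pros have, so the sketch below only derives what the claim implies, assuming decimal terabytes, a single 72 GB Blackwell card, and four 512 GB Studios as discussed in this thread.

```python
studios_gb = 4 * 512     # four 512 GB Mac Studios
blackwell_gb = 72        # one 72 GB Blackwell card (assumed to be a single card)
claimed_gb = 2300        # "2.3 TB", read as decimal terabytes

# Whatever is left over must come from the two MacBook Pros.
macbooks_gb = claimed_gb - studios_gb - blackwell_gb
print(f"Studios: {studios_gb} GB, implied for 2 MacBook Pros: ~{macbooks_gb} GB")
```

Because 2.3 TB is itself a rounded number, the implied MacBook figure is only a ballpark.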
a9udn9u@reddit
Awesome
_mayuk@reddit
Give me one don’t be greedy :(
spense01@reddit
Why not just use Exxos?
gravybender@reddit
my 128gb studio comes on tuesday finally. been waiting 8 weeks. can finally migrate off my 24gb mini
Allenite@reddit
Very nice. What do you plan to run on this?
pizzaiolo2@reddit
How much was this?
pacman829@reddit
I'm jealous. Congrats
FormalAd7367@reddit
isn’t it cheaper to just build a used server rig….
Street-Buyer-2428@reddit (OP)
Not for the price I got these
FormalAd7367@reddit
how much please?
Street-Buyer-2428@reddit (OP)
Refurb prices from september
openSourcerer9000@reddit
Not the first, this may be helpful:
https://blog.exolabs.net/nvidia-dgx-spark/
nmrk@reddit
Well, maybe second or third heterogenous cluster at best.
https://www.youtube.com/watch?v=D2oZHzC_M28
Street-Buyer-2428@reddit (OP)
Not with RDMA
nmrk@reddit
Dude RDMA is exactly how exo works, that's what it's made for. Maybe you should watch the video, it will save you a lot of work.
Street-Buyer-2428@reddit (OP)
Exo currently wraps mlx onto their processes. As with JACCL, there's still work to be done on it. I'm currently in contact with the maintainer to collaborate and possibly improve. Really nice guy. I have a stable rust backend that broadcasts rdma status and gives some observability onto entire clusters. If you would like further information let me know. Have you tried any of this or did you just watch the video?
Vegetable-Score-3915@reddit
Sent you a msg. Definitely keen to understand your perspective.
Vegetable-Score-3915@reddit
You would get the best speed between the studios using both rdma and thunderbolt.
Street-Buyer-2428@reddit (OP)
Or maybe 2 thunderbolts at once for double the speedz
Vegetable-Score-3915@reddit
You need to connect each studio to each other with the thunderbolt 5 cables. Don't use the slot right next to the Ethernet port, apparently that one is unreliable. Make sure you have rdma enabled on each of the devices. Have done it myself. I have 3 m3 ultras. Each one is using 2 thunderbolt cables connecting them to the other two.
Note exo is great for getting started, but security-wise it isn't great. Happy to discuss on this thread.
Recommend watching one of the youtube vids listed on exo's website with ppl setting up rdma with these devices, will save some time and small issues.
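The cabling described above, where every Studio is wired directly to every other one, is a full mesh. A small sketch of how the cable and port counts grow with the number of machines (node names are placeholders):

```python
from itertools import combinations

def full_mesh(nodes):
    """Every pair of nodes gets one direct cable."""
    return list(combinations(nodes, 2))

for n in (3, 4):
    nodes = [f"studio{i}" for i in range(1, n + 1)]
    links = full_mesh(nodes)
    print(f"{n} nodes: {len(links)} cables, {n - 1} ports used per node")
```

For three machines this matches the setup described in the comment: two Thunderbolt ports in use on each Studio. A fourth machine raises that to three ports per node.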
Street-Buyer-2428@reddit (OP)
Oh, I've already set this up and have been working on it since December. Given that you have these Ultras, I would like to share something with you that I think could be useful for you and would like your opinion on. PM if you can
Vegetable-Score-3915@reddit
Sure. Happy to. Much appreciated.
Vegetable-Score-3915@reddit
Pretty sure one of those dudes that did a YouTube vid on linking 4 M3 Ultras also did a Strix Halo pairing with an M3 Ultra. I think it was Alex Ziskind
chensium@reddit
Have you tried llm-d or Exo for heterogeneous inference?
Rkozak@reddit
I think you are missing a stone.
Street-Buyer-2428@reddit (OP)
which
Rkozak@reddit
Ahhh I didn’t see the MacBook. You got all 5
Vancecookcobain@reddit
You try it with DeepSeek v4 pro? If so how many tps are you getting out that thing??
Street-Buyer-2428@reddit (OP)
Will soon for sure!
juzatypicaltroll@reddit
Don’t get it. Nvidia and Macs? Thought they don’t play well together.
frostyplanet@reddit
What brand is this device?
idkfawin32@reddit
What'd you do let them roll around in the back of a truck? Buff them scuffs out!(Mostly the third and first one from the bottom)
AccomplishedFix3476@reddit
2.3 tb of ram for prefill is a flex i didnt know was on the table for a homelab tbh. the rdma over to blackwells for decode is the part that feels like a server room from 2027 instead of 2026 ngl. wattage at full load is gonna be the real story
Street-Buyer-2428@reddit (OP)
Yeah. I’m gonna try and undervolt it so my neighbors won’t sue me
devnullopinions@reddit
TTFT with a large input?
AdSignificant2058@reddit
I don't think Tinygrad eGPU is what you want. It's cute that it works. But it's very slow and not optimized. Your goal is prefill speed. What you probably want is a DGX spark or two or an RTX 6000 Pro on a Linux machine. Linux has proper drivers to run Nvidia metal.
Street-Buyer-2428@reddit (OP)
Interesting. I have a linux setup for my blackwellz . Might try that then
No_Block8640@reddit
Isn’t it still getting 10-15t/s for Kimi/GLM 5.1 with all 4 nodes? If yes then it’s not really usable for agents
ImOutOfIceCream@reddit
You can also connect them all together for rdma
Street-Buyer-2428@reddit (OP)
Yes sir i’m on that
Naixee@reddit
What are you even doing where you need this much?
Street-Buyer-2428@reddit (OP)
a lot of things at once
Naixee@reddit
Like what? I'm just genuinely curious lmao
misha1350@reddit
You collected the 300 credit score stones
Street-Buyer-2428@reddit (OP)
Almost
bigh-aus@reddit
Jealous! nice setup.