2x 512gb ram M3 Ultra mac studios

[-]

FoxiPanda@reddit

I'm up to 3 M3 Ultra mac studios now (512,256,256) and so I feel pretty qualified to answer your question/assertion.

Here's the real good and bad of them compared to NVIDIA hardware:

The VRAM availability is real. 819GB/s memory bandwidth is real. You can even cluster them together using RDMA over TB5 and while it can get a little unstable, you can even run tensor parallelism across multiple macs with shared memory pools. 1TB of coherent VRAM is pretty cool
I do in fact get like 25tok/s on a 4 bit quant of GLM-5.1 (slows down to about 19 at higher contexts).
The power isn't quite a ceiling fan, but at absolute full tilt, each mac studio uses about 160-190W. This means all 3 of my studios still use less than my RTX 5090 combined
The 1 hour of unboxing thing is maybe a little ambitious because of how damn long it takes to update MacOS, but otherwise, my AI has written scripts to bring another node up in a zero touch provisioning type methodology and so once MacOS is updated and I have manually enabled RDMA on TB5 (recovery console), I can bring up another Mac Studio roughly at the speed of an 80Gbps TB5 transfer of a model from one mac to another.
New models every week -- I actually say new models every day - genuinely, I have a script that runs and checks for new models and ranks them based on their capability and history to give me a report of how high of a probability that a new model will replace one of my existing local models that I run. So yeah, I'm greatly looking forward to ~300-400GB quants of Kimi-2.6 today. Today.

Here's the stuff that sucks:

Prompt processing is bad on Mac M3 Ultras. My RTX 5090 is like 8x as fast on PP and for models that fit in 32GB, easily 4-5x as fast on TG.
MacOS quirks are ... MacOS... lol. It's not quite linux but it's not that bad...
M3 Ultra Mac Studios are virtually not available anymore in any real sense. Working in the industry, this is purely RAM shortage and not Apple prepping to release the M4/M5 Ultra. RAM is really stupidly expensive and virtually unavailable right now...even to Apple. So yeah, everyone and their mother did go out and buy them.

[-]

Whats the point of running a 30k setup if its generating 21k/sec. I doubt those huge models will provide a better result than a condensed smaller model would they ? But then again im pretty sure someoen willing to spend that much on hardware knows what theyre doing

[-]

FoxiPanda@reddit

Honestly my setup is kind of out of control lol... I don't really run GLM at 21tok/s on my home setup much. I do run like 13 different models currently and a custom written harness that I wrote for me with features I want. As for big models being better - they really are if you can avoid quantizing them to death. However, a highly capable 30B model can do day to day work with one of those big models doing batch work or non-real time work that needs high accuracy but not necessarily real time interactivity.

[-]

taylorhou@reddit (OP)

hey u/FoxiPanda can i convince you to put your studio's on the teale.com distributed inference network? most of the devices only have 16gb so the network recommends hermes llama for openclaw/hermes agent inference demand but your studios could join mine in making kimi k2.6 and other near opus4.6 level models available to the world.

[-]

PurringBurrito@reddit

Did you go all out on storage as well, or you've kept it at the minimal (1TB if I am not mistaken?) for these Mac Studios?
I know some models can have a bigger footprint on the storage, and it can clutter up if you download multiple models, so was just wondering.

Thanks!

[-]

taylorhou@reddit (OP)

i didn't realize mmap could offload to SSD so one of mine only has 1TB of SSD but nvme via USBC is like 95% as fast as SSD so i expanded with a 4TB crucial for $1k including the housing

[-]

FoxiPanda@reddit

I went with 4TB on the 512 and I went with “whatever I could get” on the 256 which turned out to be 2TB on both of those. Beggars and choosers and such. TB5 external SSDs are great though and having a big NAS to archive and backup to is recommended

[-]

eclipsegum@reddit

Any chance you would sell your 512?

[-]

taylorhou@reddit (OP)

legit this is the most recent comparable. sold 1 day ago for $26,600 not including taxes which makes this likely $29k all in.

[-]

FoxiPanda@reddit

Lol, for a kings ransom, perhaps. Short of that, no, legitimately no.

[-]

eclipsegum@reddit

Any reason why you went 512/256/256 instead of just 2 512s? Would have been less costly too right?

[-]

FoxiPanda@reddit

512s were no longer available when I decided to scale, BUT there is actually a good reason if tensor parallelism gets to be more stable. If you can have multiple GPUs working on the same prompt with all of them having large memory pools, you can conceivably get more speed (though not quite linear scaling). So in theory, 4x256 would be better than 2x512 if the parallelism over TB5 works right in a cluster....that is unfortunately a big IF still...I haven't quite gotten it stable.

[-]

spvn@reddit

I do in fact get like 25tok/s on a 4 bit quant of GLM-5.1 (slows down to about 19 at higher contexts).

Do you feel like this is actually usable for agentic coding? sounds awfully slow (especially considering processing the prompt itself probably takes all day?)

[-]

FoxiPanda@reddit

So KV caching does a lot of good here. You do have to get over the hump of the first prompt injection, which can legitimately take 7 minutes if you're using a heavyweight harness that hasn't had it's first-prompt-injected files pruned lately (looking at you 25k token first prompt harnesses...). After that though, the PP isn't abysmal. And if you're the only user? Your cache isn't getting constantly invalidated, so PP isn't crazy each prompt after that first one.

After that initialization phase though? Yeah it's actually pretty usable. It's not some 100tok/s speed demon, but I can personally read at like ~20-25tok/s and so if you aren't trying to do 15 tasks in parallel across multiple sessions simultaneously, you'll have a pretty good time. Is it comparable to an NVL72 running thousands of tokens per second? No. Is it legitimately usable at your house for 180 watts? Yeah.

[-]

michaelsoft__binbows@reddit

ouch. even with theoretically 4x more PP speed from having tensor units in the upcoming M5 Ultra (maybe we can expect 5x due to all other architectural leaps?) that's still over one minute of prefill to wait for.

[-]

-dysangel-@reddit

well, you don't need to be running a large model for every prompt either. With 512GB of RAM you can afford to have a small model processing large contexts and summarising for the larger model. I haven't bothered to build this yet, though I did make a custom version of OpenCode that lets you handle /compact with the small utility model at least (I'm surprised this isn't an out of the box option)

[-]

michaelsoft__binbows@reddit

I would say compaction operations are very important and make little sense to ever delegate to a dumber model. Since hitting cache is important for performance the only sensible thing is to have the current model perform the compaction since what is being compacted would already be in cache.

[-]

-dysangel-@reddit

For me a large model is something that takes over 200GB. The "dumber model" is still going to be a model like Qwen 3.6 35B or 27B for example. I'm fine to trust them doing summaries rather than wait 20 minutes for OpenCode to do prompt processing again (for some reason they don't append the summarisation request to the *end* of the history). This is more of an OpenCode caching issue than a model issue, but yeah.

[-]

ahjorth@reddit

I am having to run a lot of data through local models (for GDPR-reasons) for a research project, and I literally sat down this morning to draft a post asking for real-life experience with this exact setup: 512 + 256 + 256. I already have an M3, and given the lack of 512s I'm considering buying two more 256 and running them with tensor sharding on Exo. I have some questions, and I'd love answers if you have time!

I looked through Exo's code when they launched V1, and at the time they didn't support parallel/batched inference. For my use case that's a deal breaker, but I see that they do now, and that their batched code extends directly on mlx-lm.

* How reliable is batched inference with exo?

* Does it scale as well as single inference when doing tensor sharding?

* Do you use Exo as a server, or are you using its Python api directly? If the latter, does it keep up with mlx-lm changes or does it lag (significantly) behind?

* I built a small structured outputs-package using outlines to create logits processors that I pass into mlx-lm's `BatchGenerator` on a per-prompt/stream basis (which mlx-lm supports now since Dec 2025). Do you have any experience with structured outputs on Exo?

All of these questions (except structured outputs) are answered on Exo's on page, but I can't totally tell how much I trust their marketing material...

[-]

FoxiPanda@reddit

There's like 20 questions in here that all require some significant writing, so please hold for semi-meaningful answers. I will try to get back to you in the next 24-36 hours lol

[-]

ahjorth@reddit

Haha, thank you. And really, if you don’t have time, don’t worry about it!

[-]

FoxiPanda@reddit

So let me try a short version now:

If you're interested in batched inference, I might be the wrong person to fully answer your questions. I have done a little bit of it, but not enough to be genuinely well versed. I think these guys did some work on it, but I haven't tried their stuff at all, so YMMV: https://old.reddit.com/r/MacStudio/comments/1rvgyin/you_probably_have_no_idea_how_much_throughput/

Regarding exo - I have a love/hate relationship with exo. Sometimes it works fine and sometimes I fight with it for hours. I've tried both the GUI and the API and I think one of the struggles with exo is that I personally prefer to build things from source and grab the latest fixes... but exo's latest fixes often break things that they claim are working in the .dmg releases...so sometimes I have to use an older version of exo just to get things working (looking at you RDMA-over-TB5). It is not seamless. When it DOES work, it's pretty cool and I genuinely have seen multi-node speeds and that's awesome.

Does it scale as well as single inference when doing tensor sharding? No and neither does NVIDIA on PCIe. There's a reason they spent tens of billions developing NVLink. Realistically, you get like ~1.6-1.8x what you would going from one node to two and then 2.4x-2.6x going from two to three nodes in my experience. However, it's not both PP and TG. TG goes up, PP can sometimes suffer. There may be bugs there, but that may also be the nature of the beast.

Regarding exo batchgenerator - I didn't even know this existed :D

Net net - I think your use case is probably different enough from mine that you'd have to do some exploration. One area where Macs are handicapped is in prompt processing, so if you're initiating large prompt sized batch inferences on small models, you might seriously have a better experience with NVIDIA RTX Pro 6000s. They shine in prompt processing and concurrency thanks to their much higher compute power...as long as you're willing to accept smaller models than what can fit in a Mac Studio's large unified memory VRAM pool.

[-]

ahjorth@reddit

Thank you so, so much for this:

for the link to the batched inference thread. I will look into that. I'm pretty much only interested in throughput and this might be an option.

for the link to the batched inference thread. I'm seeing the same kinds of uplifts of throughput with massively parallel inference (on MLX I'm running 600 in parallel on my 512GB). I'll re-read it later to see if I have missed any optimizations.

for your thoughts and experiences on Exo. I am already running my head against enough technical issues running what is essentially a server on MacOS which - as you say - is very much not meant to be a server. Why oh why can I not just run Linux on it... sigh. So I think I'll hold off on all this, because I don't think my mind can manage one more thing that kinda-but-not-really works reliably. But I might just have to, given the amount of data we have and the deadlines we are trying to achieve.

I have an old M2. It doesn't let me do RDMA but it would at least give me a playground for seeing how flexible Exo's BatchGenerator (https://github.com/exo-explore/exo/blob/main/src/exo/worker/engines/mlx/generator/batch_generate.py#L90) is. As I mentioned, I get twice as high throughput with mlx-lm's BatchGenerator compared to running it through mlx-lm's server. So if I can get that to work, i might try it... sigh. Yeah, it feels like a big decision. Do I contradict myself? Very well, then I contract myself...

Unfortunately NVIDIA is not practically an option because of the way insurance interacts with the national procurement agreements for Danish universities (technical, legal boring stuff), that's why I'm already running on Mac.

Again, thank you!!

[-]

eclipsegum@reddit

What’s your take on the RAM shortage and if new studios will come out in June? Do you think 512s will be available? What kind of price and waitlist?

[-]

FoxiPanda@reddit

I do not believe we'll see M5 Ultras in June. Probably Q4 at the earliest. IMO, if a 512GB M5 Ultra happens at all, it'll be $20000+.

[-]

PinkySwearNotABot@reddit

maybe the new CEO will cut the prices in half to get the stock price going ;)

[-]

FoxiPanda@reddit

Ah yes, I will also take those free lunches, $10 lambo 2-for-1 deals, and enjoy the champagne that rains from the heavens every morning.

[-]

eclipsegum@reddit

I would still click preorder so fast, that seems like a steal to have unlimited access to SOTA models forever privately. It does stuff an employee would cost 10-20x more for just for a year of employment

[-]

michaelsoft__binbows@reddit

I've thought this and it's easy to justify it that way but is it really the responsible choice when...

you could, in theory, if your environment is suitable, install some solar to offset GPU power consumption
probably in a few years there will be an oversupply of GPUs and e.g. H100/H200 to be cheaply had
during these AI bubble times you can get massive discounts on e.g. codex accounts: massively subsidized frontier level inference. At the rate I work, I only need to use 3, rarely 4x $20 tier subscription accounts worth of inference (caveat this point with lack of privacy)
1T class open models can also be had very cheap on subscriptions (e.g. z.ai, nano-gpt etc) (again, caveat on privacy)

As you see, I have little need for privacy with llm inference and I seem to have done a good job of talking myself out of it but you bet your ass I'm still going to be loading up that M5 Ultra 256GB order page and feeling some type of way.

[-]

FoxiPanda@reddit

I do not disagree. There's a reason I have 3 M3 Ultras lol.

[-]

PinkySwearNotABot@reddit

New models every week -- I actually say new models every day - genuinely, I have a script that runs and checks for new models and ranks them based on their capability and history to give me a report of how high of a probability that a new model will replace one of my existing local models that I run. So yeah, I'm greatly looking forward to \~300-400GB quants of Kimi-2.6 today. Today.

can you share this script? i need to adopt it for my 64GB model instead of checking huggingface and manually updating my local LLM inventory (although some steps are automated)

[-]

FoxiPanda@reddit

Strongly advise writing your own with your LLM. Mine is deeply embedded into my own workflow, so you probably don't have an overnight benchmark pipeline, multiple nodes, model routing, model incumbents, etc etc so it would be useless to you.

[-]

EugenePopcorn@reddit

With all the news about Apple drivers for Nvidia eGPUs lately, do you think a 5090 could work well for batched prefill in that kind of setup?

[-]

FoxiPanda@reddit

Yes, absolutely, if tinygrad can get their drivers working well. Currently they are very slow .. I actually am testing this currently. I'll buy a dedicated RTX 5090 for each of my M3 Ultras if I can get it to work reliably. It will be the best of worlds.

[-]

PinkySwearNotABot@reddit

can you give us a short briefing on how we can connect 2 Mac Minis together to double the unified RAM? what software/hardware would I need if say, I had two (2) M5 Max's with 256GB each?

[-]

FoxiPanda@reddit

https://www.jeffgeerling.com/blog/2025/15-tb-vram-on-mac-studio-rdma-over-thunderbolt-5/

No idea if it works on Minis.

[-]

spambait-aspaaaragus@reddit

Have to ask, do you do this for work? For fun? This is awesome

[-]

FoxiPanda@reddit

Both :)

[-]

Turbulent-Week1136@reddit

I have a M3 Ultra with 512 GB as well. I use it for image processing and the M3 Ultra is about 2-3x slower than my RTX 5070 Ti. The only advantage is that it can load much larger models but things that I hoped I could do, like have it handle multiple API calls to it at the same time is not possible because even a single call takes up almost 75-85% of the GPU, rendering it useless for multi-tasking queries.

[-]

joblesspirate@reddit

Yea... I'ma need that script.

I have an M3 Ultra with 512GB and love the damn thing.

[-]

thnok@reddit

Just curious what do you do with so much hardware? I get you can try out new models, but you could do that via inference and on cloud.

[-]

FoxiPanda@reddit

People ask me this a lot and the answer is pretty simple - I am building a layer of abstraction between myself and the rest of the world. Think of it like an automated Family Office that a billionaire would have... but I have it for myself and it's customized and tuned for me.

[-]

Beginning-Sport9217@reddit

Random Q. I recently got a M3 Mac Studio and I tried out a video model only to have it come out like blurry nonsense. Researching, it seemed like a potential culprit was that video models are hard to run on M3. Have you found this true?

[-]

FoxiPanda@reddit

I have to be honest I have not messed with video gen models at all

[-]

Beginning-Sport9217@reddit

Ah okay well thanks for the reply anyways

[-]

ihatecascardo@reddit

I love to try out your script that grabs and rates models.

[-]

kweglinski@reddit

what do you use to run models? omlx? mlx directly? lm studio? something else?

[-]

FoxiPanda@reddit

Llama.cpp for most things, weird forks of llama.cpp for experimental stuff, mlx sometimes. I’ve messed with omlx a bit too.

[-]

kweglinski@reddit

interesting, thank you. While llama.cpp just works I was never able to squeeze max performance out of it. Recently started playing with omlx and it's caching made night and day difference, esp for coding.

[-]

MrTacoSauces@reddit

Since you seem very knowledgeable how are things going with the Nvidia driver for having a egpu part of the setup? Would it be possible to fix the prompt processing speed by hooking in a 5090 to your setup?

My first thought is the 32gb of fast processing probably would get bottlenecked but I haven't really seen anyone talking much about the new Nvidia driver beyond the news posts.

[-]

FoxiPanda@reddit

TBD honestly. Even the Tinygrad folks writing that driver probably don’t have a good answer to this yet. Theoretically, it could be great…

[-]

SkyFeistyLlama8@reddit

The 5090 being 8x faster on PP hurts. Maybe the DGX Spark is 4x faster but that speed matters when dealing with large document RAG, doing document synthesis or when working with big chunks of code.

I can say this as someone who runs a much slower unified RAM rig: PP matters a lot more than TG for most workloads, especially when MOE models can run on a lot of RAM at high speed because of the low active parameter count.

[-]

FoxiPanda@reddit

The problem with the spark is that it is faster on PP but is 3.5x slower in TG because of its pedestrian 273GB/s memory bandwidth. :/

[-]

sthote@reddit

How does everybody even afford all of this lol? Do you guys just have steady jobs or does the local llms actually make you money?

[-]

FoxiPanda@reddit

I work in tech in AI, personally.

[-]

Illustrious-Yard-871@reddit

What quirks...?

[-]

Caffdy@reddit

The power isn't quite a ceiling fan, but at absolute full tilt, each mac studio uses about 160-190W

AFAIK, the M5 Ultra can reach up to 280W of power draw

[-]

worldburger@reddit

Post your provisioning script to GitHub :)

[-]

FoxiPanda@reddit

Lol, I honestly might do that eventually, but right now it's hardcoded AF to a bunch of random crap unique to my network, so it'll take a little abstraction before it would make sense to anyone but me...but it's a legitimate ask. I'll add it to my (unfortunately ever growing) todo list.

[-]

reto-wyss@reddit

How well does it scale with concurrent requests?

2x Pro 6k, I can get 15x to 20x throughput on Qwen3.5-122b-a10b (scaling is even better with the 30b dense models up to like 50x) if I load it up until total kv-cache is exhausted, maybe I could get better for batch one with MTP, but it seriously dunks on throughput so I typically don't use it.

[-]

FoxiPanda@reddit

If you're doing concurrent requests, you're going to want better PP than a Mac Studio can provide IMO. You're barking up the right trees with the RTX Pro 6000s for that task imo. I can reasonably get up to like 4 parallel requests but after that... things start to really slow down.

[-]

IrisColt@reddit

why didn't everyone here get one the second they were available?

That actually happened, heh

[-]

-SunGod-@reddit

I can say from experience that trying to run local LLMs on a Mac Mini M4 Pro w 24GB RAM and 1TB drive is just on the outer edge of usable. I guess it'll do if it's dedicated to using for a handful of targeted LLMs, but I really regret not going with a 2GB drive or 32GB version now.

Next time I need to update a desktop Mac, I'll just go straight to a Mini with an Ultra in it to ensure I've got more computational and storage headroom.

[-]

segmond@reddit

priorities? I could spend the money, but you know, I got other bills too.

[-]

eclipsegum@reddit

I would have ate ramen for a year to get one of these last year if I had known what was coming lol

[-]

segmond@reddit

In retrospect, I sometimes wish I bought it too. But I also feel and believe that cheaper hardware is going to come around. The AMD Ryzen AI 395+ max is a preview. Just add a few more cores to that, bump it to 512gb, give it a bit more lanes so the system can accept 2-4 PCI slots and you get a system that will crush Mac studio. The tech is there, it's just that the WILL and strategy seems to be lacking from AMD. Furthermore, models are getting much better and smaller as we see with qwen3.6, qwen3.5-122b, qwen3codernext, gpt-oss-120b. So it might be that in a few years 128-256gb range will be more than enough.

[-]

Far-Low-4705@reddit

The real reason: I don’t have $25k

[-]

Blackdragon1400@reddit

Everyone here didn’t get one because most of Reddit doesn’t have $25k in discretionary income lol.

[-]

Gloomy_Letterhead395@reddit

I dont believe that Considering the depreciation of the product you could easily afford online full scale models at lower subscription compared to full purchase

[-]

SmartCustard9944@reddit

I would buy it if M5 Ultra was not just around the corner. It will be insanely expensive though.

[-]

SexyAlienHotTubWater@reddit

Everyone didn't get one because they're $25k - you paid $25k to get slower output on a worse model than the service that costs $2.4k per year.

[-]

ChocomelP@reddit

SOTA models like qwen 3.5-397b

[-]

GregoryfromtheHood@reddit

People always talk about token generation speed, but I've noticed that for any kind of real work, especially agentic stuff, prompt processing matters way more. People get great token gen speeds on CPU and unified memory systems, but the prompt processing is usually pretty slow.

I have tried models that generate about the same speed but if one has higher prompt processing, it's way more usable. I'd even take 15t/s gen speed as long as I have 5kt/s-10kt/s prompt processing. Generation speed actually doesn't really matter all too much most of the time for me.

Not sure if my use cases of workflows are different to people just chatting and getting answers or something, but that's just what I've noticed. Slow prompt processing absolutely kills any usability for me more than generation speed.

[-]

FerLuisxd@reddit

How much is the power draw per month in dollars? 👀

[-]

No_Run8812@reddit

you are not missing out, these machines are insanely slow in large models, and who needs them run small models. just get a 128gig m5 you will be happier.

[-]

eclipsegum@reddit

But how else can you load up models in the range 400B-1T

[-]

Deep90@reddit

What do you even do with the speeds you get?

Imo I can wait a couple years to run it at a level and price that's worth it. 25k is just too much.

[-]

JockY@reddit

You can when the models are heavily quantized, but even then it’s painfully slow and hard to imagine anyone using a setup like this for realtime interactive work over the long term without getting pretty frustrated by the experience. It just wouldn’t be fast enough, surely?

[-]

No_Run8812@reddit

Even if you could load the model they are useless on m3 ultra maxed out, mine one give 8tk/sec and 145tk/sec profile. This is slow, it takes hours to do something meaningful.

New Gemma 4 31B dense model is really good, I am using it for openclaw it works fine.

[-]

FinalTap@reddit

The big question will you get the M5 Ultra/Fusion/Max Studio when it comes out?

[-]

Healthy_Albatross_73@reddit

What can you do with the profitably? Invest all the money for what?

[-]

eclipsegum@reddit

My line of thinking is that we are ALL totally addicted to using LLMs now. And this is the only practical way to use it privately with unfettered access. It’s like having your own solar panels, well water, etc. it feels like self sufficiency, privacy, and complete ownership of the most important asset - intelligence

[-]

davewolfs@reddit

The larger models don’t run well on the hardware even if they can be loaded.

[-]

Southern_Sun_2106@reddit

They run well for general chat. But that's it. If someone is OK with just a general chat, with a large model, in total privacy and off-the-grid, that's OK.

[-]

davewolfs@reddit

When I got my 96GB Ultra I thought a lot about if I had made the right choice. 2 years later - no regrets. For the M5 Ultra I will definitely be stepping up if/when it becomes available.

[-]

flyingbanana1234@reddit

i believe because the m3 ultras were a bit slow for daily inference on big models

m5 ultra would have solved that issue

[-]

coder543@reddit

If all of that is actually true... why didn't everyone here get one the second they were available? What am I missing

Two main reasons: that’s a lot of money, and the prompt processing is very slow.

The M5 Ultra Mac Studio will drastically improve prompt processing speeds later this year thanks to the M5’s neural accelerators in the GPU. (Not the same as NPU.)

Still, it’s a lot of money. It’s very hard to justify $10k+ in hardware when cloud subscriptions are this cheap. I have a $4k DGX Spark for experimentation purposes, and that’s really the limit for me unless someone offers to double my salary.

[-]

tristanbrotherton@reddit

What’s that screen in a suitcase and the rover wheels in the background?

[-]

taylorhou@reddit (OP)

woof woof

[-]

tristanbrotherton@reddit

Very cool - i hope you've seen this: The early days....

[-]

taylorhou@reddit (OP)

hahaha

[-]

taylorhou@reddit (OP)

i'm in robotics. quadruped robot dogs for security and patrol.

[-]

tristanbrotherton@reddit

Fun!

[-]

limesoda1@reddit

If you're able to get any of the GLMs (4.6 or later) running tensor parallelism across both, I'd love to hear how. I did not enjoy exo when I tried it out.

[-]

taylorhou@reddit (OP)

i was able to get glm 5.1 running with tensor parallelism. but their launch was overshadowed by kimi k2.6 - i was using exo v1.0.70

[-]

Longjumping_Crow_597@reddit

Hey, exo maintainer here.

Sad to hear you didn't enjoy exo.

Can I ask what you didn't like about it? I know there were some issues with GLM tensor parallelism recently, and we pushed some fixes in 1.0.70 (https://github.com/exo-explore/exo/pull/1529).

[-]

Sergei-_@reddit

hah you even have this portable lg or whatever tv. are you from llt by any chance?

[-]

taylorhou@reddit (OP)

dunno what llt is

[-]

frankiebev@reddit

Hopefully Apple releases m5 soon and secondhand market dumps a lot of these m4 max and m3 ultra studios

[-]

taylorhou@reddit (OP)

unlikely the price will be competitive. ram supply is still 2+ years behind. these were about $20/gb ($10k for 512gb ram) even nvidia chips are coming out with 700gb ram at $150k so literally $200/gb or 10x the price. i don't think we'll see $20/gb for a LONG time

[-]

aero-spike@reddit

Bro could run Kimi K2 at FP8.

[-]

taylorhou@reddit (OP)

exactly what is running K2.6 8bit. =)

[-]

talk_nerdy_to_m3@reddit

How many months of Claude max subscription would it take to equal the cost of this? I mean, yea I bought a 4090 but that plays video games. What do you do with something like this besides run local models? Seems like such a waste

[-]

taylorhou@reddit (OP)

i was fortunate and had the foresight to buy these retail. for me with 250+ employees, my breakeven on inference is 2 months... i have a unique situation though

[-]

AdOk3759@reddit

Exactly, idk why you’re being downvoted. 25k dollars and you can’t even run SOTA models. It would take, what? 10 years just to break even??

[-]

Upset-Fact2738@reddit

25k but you can sell it after using

[-]

One-Adhesiveness-643@reddit

For a fraction of the price.

[-]

NoahFect@reddit

We'll see.

[-]

jonydevidson@reddit

However many months it is, it will be running exactly what you spin up on it, without silent nerfs when user numbers surge or as new models get tested in the back.

It will always be the same. New models will come out in time.

Mac Studio's issue is GPU compute. M5 architecture being chiplets points to Apple seeing this as a potential solution where they can perhaps offer M5 Studio with M5 Ultra that has 3x or even 4x the number of GPU cores compared to M5 Max, not just 2x.

They are well positioned to pull this off.

You also get complete privacy and batch inference is much more efficient so if you have 2-4 people working on one such node, at $200/m x 4, the payoff point is much quicker.

For a single person doing this, this isn't about economics at all. It's pure enthusiasm. The only things that could make it worth are the privacy and consistency aspects, and that will have a different worth to everyone.

[-]

Caffdy@reddit

How many months of Claude max subscription would it take to equal the cost of this?

I mean, reading on their sub how many people are getting rate-limited on the 5X or 20X plans even in mere minutes of use, these hardware purchases doesn't seem that outrageous

[-]

Mickenfox@reddit

Probably the closest comparison would be actually renting a server.

For example, for $25k you could pay for 18 months of a 40-core, 961GiB server, assuming you're fine with CPU inference.

...actually the local hardware does seem like a better investment now.

[-]

tarruda@reddit

Claude is very expensive if you use the API. I tried last week and easily burned through $100 in a couple of days of moderate coding.

[-]

Southern_Sun_2106@reddit

You can play games on these too. The selection is more limited, but whatever is available, runs really smooth.

Claude is going through a very rough time right now. And that can happen to any cloud-based model. Plus this is local llama, so people here tend to be more focused on locally-run AI.

[-]

vex_humanssucks@reddit

With that much unified memory the thing I'd really want to test is long-context coherence at the tail end of a 128k+ window — not just whether it stays on topic, but whether it maintains consistent references to things mentioned early in the document. Most benchmarks skip over the degradation that happens in the last 20% of the context window and it matters a lot for real document-heavy workflows. Curious how DeepSeek V3.2 Q8 holds up past 100k tokens on something like a long codebase analysis.

[-]

taylorhou@reddit (OP)

would love to have you try exactly that and report back. currently i have kimi k2.6 on the machines because my engineering team uses them for coding but if you DM me, we can coordinate and i'll load deepseek v3.2 or even the newest deepseek v4 and you can test all you want for free.

[-]

theologi@reddit

KIMI K2.6 DUDE

[-]

taylorhou@reddit (OP)

i have kimi k2.6 8bit available for free on teale.com - join the distributed inference network and try them for yourself!

[-]

bruhhhhhhhhhhhh_h@reddit

Nifty. Could you finetune models ? What's the it/s like? are you using unsloth?

[-]

taylorhou@reddit (OP)

unsloth seems to be the fastest in coming out with quantized models and models specific to backends like MLX, GGUF, etc... i've been thinking about finetuning...

[-]

SkyFeistyLlama8@reddit

Are you running any extra cooling to keep these from frying themselves? LLM inference isn't kind to hardware.

[-]

taylorhou@reddit (OP)

so far haven't had anyone actually use them 24/7. hop on teale.com where they are powering kimi k2.6 for free!

[-]

FederalAnalysis420@reddit

i'm new in here, how does this actually work? do they sync up and act as one unit with more compute?

[-]

taylorhou@reddit (OP)

yup exactly this. i was tempted to get 4x but i saw some initial reports that you get diminishing returns as every request has to make a roundtrip between all devices connected in a cluster.

[-]

getmevodka@reddit

Works via project exo for example. Both studios need the full model in their harddrive though afaik, but then will load only accordingly to each other and math out their part.

[-]

NotTodayGlowies@reddit

Hey OP, what's up with the monitor in the suitcase?

[-]

taylorhou@reddit (OP)

i go to conferences a lot so that makes for a fantastic monitor you can baggage claim. it's literally been around the world many times. it also goes vertical too, has an internal battery and technically is a smartTV as well.

[-]

TiK4D@reddit

Thing of dreams, just spent A$5k on 64GB VRAM 2x R9700's

[-]

TiK4D@reddit

+ $1300 on 64GB RAM ouch

[-]

Southern_Change9193@reddit

How is that $5000? It is $1700 on B&H:

https://www.bhphotovideo.com/c/product/1928519-REG/asus_turbo_ai_pro_r9700_32g_turbo_radeon_ai_pro.html

I bought two last month for $1499 each.

[-]

TiK4D@reddit

Australian dollar

[-]

nomorebuttsplz@reddit

kimi k2.6 prompt processing speeds

[-]

softwareweaver@reddit

How noisy is it when running DeepSeek or GLM for inference, tokens per second and power consumption.

Thinking of making the switch when M5 Studios comes out. Thanks

[-]

spaceman3000@reddit

Studios are dead quiet

[-]

parano666@reddit

I've been wondering myself, to fix the downside of my m3u 512gb, if this is THE fix and THE big futur proof upgradable solution https://www.youtube.com/watch?v=C4KWsmezXm4

[-]

parano666@reddit

Here: https://docs.tinygrad.org/tinygpu/egpu for mac - tinygrad docs

[-]

misha1350@reddit

Sell one and use Qwen3.5 397B A17B instead. Should be good enough. Exo is a crutch, you'll go bankrupt way sooner than you can get any sort of ROI just to break even by using 2x Mac Studios instead of just 1x Mac Studio at full speed.

[-]

_VirtualCosmos_@reddit

Ok but is that a robot with a box over it? and a case-monitor? lel

[-]

GKN777@reddit

Really jealous mate

[-]

MuzafferMahi@reddit

LocalLLaMA final boss ass setup

[-]

Alarming_Bluebird648@reddit

This is soooooo coool, I wish I had these 🥹

[-]

BlueSky4200@reddit

Would love to hear more about your progress with GLM 5.1 and what context sizes you can archieve.

[-]

bigh-aus@reddit

Kimi k 2.6

[-]

BP041@reddit

Insane setup! With that much RAM, you're in the rare position to test true multi-agent efficiency. Raw tokens per second is one thing, but I’d love to see how this handles complex agentic workflows (like OpenClaw or Claude Code) running multiple local models for different tasks (planning vs. coding vs. debugging) simultaneously. DeepSeek v3.2 is great, but seeing if the Exo backend can effectively distribute a swarm of smaller, specialized agents across those Studios would be a legendary benchmark.

[-]

michael_p@reddit

I am so excited to hear about how Kimi works on these compared to Claude code

[-]

ortegaalfredo@reddit

It looks like the final boss that you have to destroy to save the earth, congrats.