Apple releases new Mac Studio with M4 Max and M3 Ultra, and up to 512GB unified memory

Posted by iCruiser7@reddit | LocalLLaMA | View on Reddit | 478 comments


[-]

anonynousasdfg@reddit

Mac Mini M4 Pro (12/16) 48GB vs Mac Studio M4 Max 36GB (14/32): which one would you choose, assuming you will use <=32B 4-bit quantized (MLX) LLMs with a max 16K context size? According to my experiments with, say, QwQ, one will output approximately 11-13 t/s, while the other will output either 17-19 or 22-24 t/s. That depends on the memory bandwidth: I'm not sure whether this entry-level M4 Max Mac Studio has a memory bandwidth of 410 or 546 GB/s.

Reply

[-]

MusingsOfASoul@reddit

isn't that the difference between binned and not binned?

Reply

[-]

iCruiser7@reddit (OP)

https://preview.redd.it/agq4au1sqvme1.jpeg?width=1290&format=pjpg&auto=webp&s=3b02abc558a7fe519500d1303b37fac24f7992ff "Testing conducted by Apple in January and February 2025 using preproduction Mac Studio systems with Apple M3 Ultra, 32-core CPU, 80-core GPU, and 512GB of RAM, production Mac Studio systems with Apple M2 Ultra, 24-core CPU, 76-core GPU, and 192GB of RAM, and production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, and 128GB of RAM, each configured with 8TB SSD. LM Studio v0.3.9 tested by measuring token rate using a 174.63GB model. Mac Studio systems tested with an attached 5K display. Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac Studio."

Reply

[-]

Chelono@reddit

Don't forget that without setting `iogpu.wired_limit_mb`, the M2 Ultra only has about 144GB available to the GPU by default, meaning it doesn't fully run a 174GB model on the GPU but uses the CPU for the rest, even if it doesn't have to use swap like the M1 Ultra with 128GB. These results are skewed; wait for reviews...
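To make the point above concrete, here is a minimal sketch of the fit check. It assumes the default GPU-wired limit is roughly 75% of unified memory, which is only inferred from the "about 144GB" figure for a 192GB machine and is not an Apple-documented constant. The helper names are hypothetical; the only real identifier is the `iogpu.wired_limit_mb` sysctl named in the comment.

```python
# Rough check of whether a model fits in the GPU-wired memory of a Mac.
# ASSUMPTION: the default limit is ~75% of unified memory, inferred from the
# "about 144GB" figure quoted for a 192GB M2 Ultra (144 / 192 = 0.75).
DEFAULT_WIRED_FRACTION = 0.75


def default_wired_limit_gb(total_gb: float) -> float:
    """Estimated default GPU-wired limit for a machine with total_gb of RAM."""
    return total_gb * DEFAULT_WIRED_FRACTION


def fits_on_gpu(model_gb: float, wired_limit_gb: float) -> bool:
    """True if the model weights fit under the wired limit (ignores KV cache)."""
    return model_gb <= wired_limit_gb


def sysctl_command(limit_gb: int) -> str:
    """Build the sysctl invocation that raises the limit (the value is in MB)."""
    return f"sudo sysctl iogpu.wired_limit_mb={limit_gb * 1024}"


model_gb = 174.63                           # model size from Apple's footnote
default_gb = default_wired_limit_gb(192)    # M2 Ultra with 192GB
print(default_gb, fits_on_gpu(model_gb, default_gb))
print(sysctl_command(180))                  # 180GB is an arbitrary example value
print(fits_on_gpu(model_gb, 180))
```

Under these assumptions the 174.63GB model does not fit in the default limit on a 192GB machine but does once the limit is raised, which is why the comparison in the chart is hard to interpret.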

Reply

[-]

pkmxtw@reddit

I thought Apple would be better at not doing these kinds of misleading benchmarks, and yet here we are.

Reply

[-]

b0tbuilder@reddit

From what past experience with Apple do you conclude that they would not produce a misleading benchmark?

Reply

[-]

Chelono@reddit

I can't fault them. Everyone is doing it. At least Apple compares against itself. I disliked AMD marketing comparing Strix Halo to Nvidia GPUs even more. Also, it works. Screenshots like this are always shared massively on social media and news pages. Besides some nerds, no one is gonna bother to fact-check things, and if enough people see it, some will believe it. Probably also has to do with investors; the same thing applies there.

Reply

[-]

Cergorach@reddit

Yep, and some people preorder a $10k computer because of it... I'll wait for the reviews and the independent benchmarks with details about how they tested.

Reply

[-]

smith7018@reddit

The worst was NVIDIA's 5090 marketing that compared an FP8 on 4090 against an FP4 on 5090. That was unbelievably disingenuous.

Reply

[-]

fullouterjoin@reddit

Wait till they get FP0 support!

Reply

[-]

Yes_but_I_think@reddit

You totally nailed it. If they test with an 80GB model, it will be no different from the M2 Ultra. Why are these idiots comparing memory-overflow cases with within-memory cases? As if we want to test the usability of higher RAM.

Reply

[-]

TastesLikeOwlbear@reddit

> Why are these idiots comparing memory overflow with within memory cases?

Marketing.

Reply

[-]

fallingdowndizzyvr@reddit

> If they test with a 80GB model it will be a no different from M2 Ultra.

I wouldn't say that, since the M2 Ultra is faster than the M1 Ultra even though they have the same memory bandwidth. Until now, there's been more memory bandwidth than the M1 can use. Time will tell if it's the same with the M2. So the M3 can be faster.

Reply

[-]

dinerburgeryum@reddit

Can the wired limit be adjusted at boot? Seems like an easy problem to conquer if true.

Reply

[-]

fallingdowndizzyvr@reddit

Yes.

Reply

[-]

dinerburgeryum@reddit

Oh nice. So a non-problem if your general use case is inference. That’s a good note thank you.

Reply

[-]

fallingdowndizzyvr@reddit

There's no reason to not set it there even if your general use case isn't inference. It's not like on an AMD system where it's a hard limit. It's not reserved on a Mac. That just sets the limit that the GPU can use if it needs to. If it doesn't, that memory is available for the CPU to use for anything. On a Mac, it's dynamic. It's not static like it is on an AMD system.

Reply

[-]

siegevjorn@reddit

You're right. This benchmark result is just garbage.

Reply

[-]

2str8_njag@reddit

32-core CPU? All I can see in the store is 28-core; maybe it's a regional thing?

Reply

[-]

SubstantialSock8002@reddit

Since we're given such a specific model size (174.63GB), can anyone figure out which one?

Reply

[-]

Careless_Garlic1438@reddit

Well, one M2 Ultra did 14 tokens/s with the 1.58-bit dynamic quant, and 2 Ultras with EXO did 4-bit, also at around 14 tokens/s … so if the 2x between M2 and M3 holds true, then brrrrr, 30 tokens/s are in reach 🤯

Reply

[-]

GreatBigJerk@reddit

Now accepting pre-orders using first born children as payment.

Reply

[-]

Remote_Cap_@reddit

Why first born specifically?

Reply

[-]

darth_chewbacca@reddit

They have less time until they can be shoved down in the mines. A child can reasonably be utilized in mining operations once they hit the age of 8, so if you take a first-born at 6 years old vs a second-born at 4, it's an extra 2 years before Tim Apple can see an increase to his coal mining investment. First borns also tend to be more compliant than subsequent children. The middle children are especially difficult to manage, often wanting higher portions of food, and slacking on the job to "play with friends." Apple has found that second borns cost an average of 18% more on disciplinary actions. Overall, first borns just make more financial sense.

Reply

[-]

b0tbuilder@reddit

Small children fit in the mines better

Reply

[-]

FreezeS@reddit

Completely false, this is not the real reason. The first born is first in line for succession so he will inherit it and they could sell it again after 20.. 50 years.

Reply

[-]

darth_chewbacca@reddit

Disagree strongly. Having the line of succession is a "nice to have," but the idea that it's the primary motivator is a complete fake news conspiracy theory. You see, the mortality rate is 86% by the time the mine worker reaches the age of 12, and 94% by the time the mine worker reaches 18; so inheritance usually isn't collected. Add to this that the family selling the firstborn is doing this because they are poor (and ugly, but that's beside the point), and the 6% inheritance collection isn't the primary motivator of Tim Apple. It is an important aspect, just not the primary motivator. Tim is honest when he says "I want to send your rat children down into the mines! You filthy ugly beasts. Buy my Apples bitches!"

Reply

[-]

Everlier@reddit

Imagine all the LLMs that will see replies to your message in their training data

Reply

[-]

Cergorach@reddit

Parents will do better with the second... ;)

Reply

[-]

GreatBigJerk@reddit

Subsequent children will imitate their older siblings, and thus will no longer "think different".

Reply

[-]

-oshino_shinobu-@reddit

Everything after the original are cheap replicas.

Reply

[-]

bfume@reddit

They’re worth more than the subsequent “accidents”

Reply

[-]

catgirl_liker@reddit

They taste better

Reply

[-]

SecuredStealth@reddit

Jerk

Reply

[-]

geekgodOG@reddit

Pricing: 256GB = 5.6K 512GB = 9.5K

Reply

[-]

Tadpole5050@reddit

128 GB = 3.5k M4 Max, 546 GB/s memory bandwidth. Probably a Nvidia Digits competitor at that price point and bandwidth.

Reply

[-]

pkmxtw@reddit

Damn, now if digits come with less than 500 GB/s it would be pretty much DOA.

Reply

[-]

mxforest@reddit

Rumored to be 256 GBps. RIP if true.

Reply

[-]

b0tbuilder@reddit

This is already confirmed

Reply

[-]

Paganator@reddit

Rumors from where? My guess is that Nvidia hasn't communicated the bandwidth because they wanted to see what they could get away with. Now that AMD and Apple are releasing directly competing products, Nvidia will feel more pressure to offer more bandwidth.

Reply

[-]

emprahsFury@reddit

The rumor is just that what's advertised matches closely (but not perfectly) with a Blackwell config that has 256 GB/s.

Reply

[-]

ReginaldBundy@reddit

Yeah, it would need either higher bandwidth or lower price. At the current setting (as far as we know) it's dead.

Reply

[-]

perelmanych@reddit

It is DOA for those who just want to use models. But not for anyone who is doing training, as it has full CUDA support in such an incredibly small form factor.

Reply

[-]

FullOf_Bad_Ideas@reddit

Which framework supports training with ARM CPU like what GH200 has? Compute wise, it's gonna be at single 3090 level. That's not as powerful as you might think.

Reply

[-]

bigmanbananas@reddit

My 3090s have gone up in value 50% in the last few months. I'd better sell them before the DIGITS arrive.

Reply

[-]

FullOf_Bad_Ideas@reddit

To buy them back later when Digits is unavailable?

Reply

[-]

fallingdowndizzyvr@reddit

And lower compute.

Reply

[-]

FullOf_Bad_Ideas@reddit

lower compute than single 3090? I think it should be around equal. It's almost 1000 FP4 sparse TOPS, once you convert it to real FP16 non-sparse you get 125 FP16 TFLOPS. 3090 has 142 FP16 TFLOPS as per [GA102 Whitepaper](https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf) that's 3090-level compute I mentioned earlier. It's a bit lower, so your statement is true, and who knows if it will throttle, but it's very similar.
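The conversion in the comment above can be written out step by step. This is a sketch under the commenter's implied assumptions: structured sparsity doubles the quoted number, and each halving of precision (FP16 to FP8 to FP4) doubles it again. Real hardware may not scale this cleanly, and the 142 TFLOPS figure for the 3090 is the one cited in the comment.

```python
def fp4_sparse_to_fp16_dense(fp4_sparse_tops: float) -> float:
    """Convert a marketing 'FP4 sparse' figure into dense FP16 TFLOPS.

    ASSUMPTION: sparsity and each precision step are exact 2x factors.
    """
    dense_fp4 = fp4_sparse_tops / 2   # remove the 2:1 sparsity speedup
    dense_fp8 = dense_fp4 / 2         # FP4 -> FP8 halves throughput
    dense_fp16 = dense_fp8 / 2        # FP8 -> FP16 halves it again
    return dense_fp16


digits_fp16 = fp4_sparse_to_fp16_dense(1000)   # "almost 1000" FP4 sparse TOPS
rtx3090_fp16 = 142.0                           # figure cited in the comment
ratio = digits_fp16 / rtx3090_fp16

print(f"DIGITS estimate: {digits_fp16} FP16 TFLOPS")
print(f"Relative to a 3090: {ratio:.2f}x")
```

With these factors the estimate lands slightly below the quoted 3090 number, which matches the "a bit lower, but very similar" conclusion.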

Reply

[-]

fallingdowndizzyvr@reddit

That's "up to" 1000 FP4. Not exactly reassuring. Like an "up to" 90% off sale. Nvidia has been very light on real specs.

Reply

[-]

FullOf_Bad_Ideas@reddit

I think they have it listed as just 1 PFLOP in the specs.

> AI Performance 1 PFLOP FP4

https://www.nvidia.com/en-us/project-digits/

Probably not more than a few percent off. On their marketing slide it's also without the "up to" thingy.

Reply

[-]

fallingdowndizzyvr@reddit

It absolutely says "up to". From your link. "Experience **up to** 1 petaflop of AI performance at FP4 precision with the Grace Blackwell architecture."

Reply

[-]

FullOf_Bad_Ideas@reddit

In one place yes, but in two others it just says 1 PF AI TOPS. Marketing can't decide on exact words it seems like.

Reply

[-]

florinandrei@reddit

Many things are "possible". Only one thing is real - and we don't have that yet.

Reply

[-]

bigmanbananas@reddit

But. So much more ram than the 48GB vram

Reply

[-]

FullOf_Bad_Ideas@reddit

For sure. In some usecases, you want more VRAM. Sometimes you want more compute. I've been in both. I hope DIGITS will be good, but I think I'll be sticking with normal GPUs. Or if I make a switch, it will be to PCI-E NPUs. Something like what Tenstorrent is doing.

Reply

[-]

perelmanych@reddit

I am not the most qualified person to answer that, but I believe almost any framework that comes in the form of source files that you can compile for any platform with a recent enough implementation of C++.

Reply

[-]

noiserr@reddit

You're not going to be doing training on Digits, unless you're talking about super small models or just fine tunes on small models.

Reply

[-]

perelmanych@reddit

The idea is to use DIGITS for a proof of concept on, let's say, 1% of the dataset before opening a wallet.

Reply

[-]

Cergorach@reddit

Right tool for the job. I think it's great news for everyone if that's true. If DIGITS is worse at inference than the new Mac stuff (even the M4 Pro has more bandwidth), people can buy Macs, which have better availability than Nvidia stuff anyway. For the folks doing training, that means there is less of a run on the DIGITS product and they might actually get it at a normal price...

Reply

[-]

amhotw@reddit

Do we know anything about the training speed on DIGITS? I haven't seen any benchmarks but I remember that the expectation was that it would be slower than 5090.

Reply

[-]

fallingdowndizzyvr@reddit

Training with 256GB/s of memory bandwidth and corresponding low compute. I guess if you have all the time in the world to wait.

Reply

[-]

perelmanych@reddit

Not proper training. Running training on 1% of dataset, to see if the code works correctly and how gradient behaves.

Reply

[-]

Enough-Meringue4745@reddit

Too slow of bandwidth for training

Reply

[-]

perelmanych@reddit

I think it is used more as a proof of concept, before renting monster GPUs on the cloud.

Reply

[-]

indicava@reddit

POC’ing on rented GPUs isn’t that bad either; I regularly rent 4x4090 machines for about $1.20 an hour. I do my experiments locally on my MBP M2 or on my gaming rig (my precious) that has a 3070 and then POC in the cloud, usually no more than 12 hours for a test train. (Then for the full training I dip into the 2xH200 $7-$8 an hour machines.)

Reply

[-]

Ok_Warning2146@reddit

Even at 500GB/s, it still can't compete. 2xDIGITS is 6k. But M3 Ultra 256GB is 5.6k.

Reply

[-]

WhiteHorseTito@reddit

I’m waiting to see how DIGITS benchmarks against this Studio lineup, and I’ll probably pull the trigger on one of them by end of May.

Reply

[-]

DirectAd1674@reddit

https://preview.redd.it/2xp30jeptvme1.jpeg?width=1284&format=pjpg&auto=webp&s=9a49db03d927828d1870406827a8e367c6aeb2e8 500? It says 800+

Reply

[-]

Tadpole5050@reddit

M3 Ultra is 800+, M4 Max is up to 546

Reply

[-]

Abject_Radio4179@reddit

Not all M4 will be 800+ though.

Reply

[-]

fallingdowndizzyvr@reddit

> M3 Ultra is 800+

There is no M3 Ultra. The last Ultra is M2.

Reply

[-]

DirectAd1674@reddit

You're right, I overlooked that you mentioned M4, not M3. Either way, I'm excited to see how the live test results turn out. Thanks for clarifying!

Reply

[-]

-6h0st-@reddit

That's the M3 Ultra with the 80-core GPU; the 60-core GPU model would have lower bandwidth.

Reply

[-]

Yes_but_I_think@reddit

So no difference from m2 ultra in bandwidth?

Reply

[-]

animealt46@reddit

19GB/s more. Might have fewer memory controllers IDK.

Reply

[-]

power97992@reddit

819GB/s

Reply

[-]

sage-longhorn@reddit

Don't trust AI

Reply

[-]

fallingdowndizzyvr@reddit

Better than DIGITS, since you can use a Mac as a general-purpose computer.

Reply

[-]

TheElectroPrince@reddit

DIGITS also runs a custom version of Ubuntu, and it has HDMI and USB ports, so you can definitely use it like a normal computer.

Reply

[-]

fallingdowndizzyvr@reddit

How are you running Microsoft Office on it? How are you running DaVinci Resolve? How are you running the vast library of software on both Windows and macOS that people use for general computing?

Reply

[-]

Caffeine_Monster@reddit

You can get near that bandwidth with any modern ddr5 compatible server - can cost a lot less too.

Reply

[-]

AXYZE8@reddit

Single stick of DDR5-6000 is 48GB/s buddy, you're far off with your calculations, especially with costs.

Reply

[-]

Desm0nt@reddit

12-channel epyc = 48\*12=576GB/s. He is not far off. He was talking about SERVER pc, not consumer one.
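The per-stick and 12-channel figures in this exchange follow from the usual peak-bandwidth arithmetic: transfers per second times bytes per transfer times channel count. A minimal sketch, assuming a 64-bit (8-byte) data path per DDR5 channel and ignoring real-world efficiency losses; the function name is made up for illustration.

```python
def ddr5_peak_gb_s(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s.

    mt_per_s  : data rate in mega-transfers per second (e.g. 6000 for DDR5-6000)
    channels  : number of populated memory channels
    bus_bytes : bytes moved per transfer per channel (ASSUMED 8, i.e. 64-bit)
    """
    return mt_per_s * bus_bytes * channels / 1000


print(ddr5_peak_gb_s(6000, 1))    # one DDR5-6000 channel
print(ddr5_peak_gb_s(6000, 2))    # typical dual-channel desktop
print(ddr5_peak_gb_s(6000, 12))   # 12-channel server board
```

These are ceilings; measured bandwidth is lower and depends on the CPU as well as the memory.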

Reply

[-]

AXYZE8@reddit

Intel offerings are 8 channel, AMD Genoa drops to 4800MHz if you use 12 channels. 80% of "DDR5 compatible servers" are out of the question to start with. If you want to have 12channels on DDR5-6000MHz you need to use AMD Turin. Single CCD read memory bandwidth on Turin is 106GB/s https://preview.redd.it/bdaahicwawme1.png?width=1055&format=png&auto=webp&s=96c31ad8f461c3c3e85b53eed99d7cca14c5469a You need 5x CCD to go above 500GB/s. Cheapest one that has that is EPYC 9355P, it costs $2998. :) So there you go with "cost a lot less too".
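The CCD argument above can be sketched as a simple "minimum of two limits" model. It assumes read bandwidth scales linearly with CCD count at the quoted 106GB/s per CCD and is capped by the DRAM channels' peak; both are simplifications, and the function names are hypothetical.

```python
import math

PER_CCD_READ_GB_S = 106.0   # single-CCD read bandwidth quoted in the comment


def ccds_needed(target_gb_s: float, per_ccd_gb_s: float = PER_CCD_READ_GB_S) -> int:
    """Smallest CCD count whose combined read bandwidth reaches the target."""
    return math.ceil(target_gb_s / per_ccd_gb_s)


def effective_read_gb_s(dram_peak_gb_s: float, ccds: int,
                        per_ccd_gb_s: float = PER_CCD_READ_GB_S) -> float:
    """ASSUMED model: the slower of the DRAM channels and the CCD links wins."""
    return min(dram_peak_gb_s, ccds * per_ccd_gb_s)


print(ccds_needed(500))                 # CCDs required to exceed 500 GB/s
print(effective_read_gb_s(576.0, 4))    # 4 CCDs cannot use all 12 channels
print(effective_read_gb_s(576.0, 8))    # 8 CCDs are limited by DRAM instead
```

The takeaway matches the comment: populating 12 channels is not enough if the chosen CPU has too few CCDs to consume the bandwidth.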

Reply

[-]

noiserr@reddit

You can get dual P boards with 24 channels with Genoa or better.

Reply

[-]

AXYZE8@reddit

Of course you can! Now instead of $2998 for CPU you need to get two 9275F, two of these cost $7k. If you use ktransformers (no-brainer for CPU inference) you also need to load weights twice, therefore instead of 512GB RAM you'll need 1TB. Go ahead! :)

Reply

[-]

noiserr@reddit

why do you need to load weights twice?

Reply

[-]

AXYZE8@reddit

Because that's how ktransformers works. "copy model into RAM *twice* for big dual socket systems (as cross NUMA nodes is bottleneck)" [https://github.com/ubergarm/r1-ktransformers-guide/blob/main/README.md](https://github.com/ubergarm/r1-ktransformers-guide/blob/main/README.md)

Reply

[-]

noiserr@reddit

> `# ONLY IF you have Intel dual socket and >1TB RAM to hold 2x copies of entire model in RAM (one copy per socket)`
> `# Dual socket AMD EPYC NPS0 probably makes this not needed?`
> `# $ export USE_NUMA=1`

Also this is an MoE model, not a dense model. You shouldn't need to load it twice. Even if you had to manually designate experts, surely you could split the model instead of loading it twice.

Reply

[-]

AXYZE8@reddit

Do you have a couple of minutes? [https://github.com/ggml-org/llama.cpp/discussions/11733](https://github.com/ggml-org/llama.cpp/discussions/11733) Go here and teach them. One person gets 102.2% performance, while another has a better result: "only 105% compared to a single CPU benchmark run".

Reply

[-]

noiserr@reddit

Wish I had time :(

Reply

[-]

Caffeine_Monster@reddit

> "cost a lot less too".

About 2/3 the price of the $10k Mac M4, or at least it was. I will admit there now seems to be a big shortage of server memory. The Mac will be a bit faster, but you will get ecosystem locked, so it is kind of swings and roundabouts.

Reply

[-]

AXYZE8@reddit

That $10k Mac Studio has **M3 Ultra** chip that does more than **800GB/s**. It's not "a bit faster", Mac is **40% faster**. This is in best case scenario for server, as Mac has beefy GPU that will massively speed up prompt eval, effectively making responses like **2x faster** if you have longer context. All that while eating like 1/4 of the power and sitting in random place on your desk barely making any sound. I don't know why you underestimate the Mac so much, table has flipped and now Apple is the value king for performance across a lot of workloads. I'm not commenting just about high end, this goes all way to bottom end, where on PC you're 2 generations behind on CPU (Ryzen 5 5500), 2 generation behind on GPU (RTX3050 6GB) and that 7nm+8nm+DDR4 combo is supposed to compete with 3nm M4 Mac Mini that costs $529 right now at Amazon.

Reply

[-]

FullOf_Bad_Ideas@reddit

Lots of moving parts need to go right to get this kind of speed. More than dual-channel is rare in the consumer world at consumer prices. Even with quad channel and higher, silly things like some internal CPU die specs on AMD CPUs matter and drop your bandwidth below what you should get. Also, AMD CPUs seem to get less bandwidth out of the theoretical maximum for some reason. Reasonably priced DIY computers have up to around 200GB/s of bandwidth; beyond that, costs are similar to what you'd be paying for Macs/DIGITS.

Reply

[-]

rorowhat@reddit

Digits all the way

Reply

[-]

-6h0st-@reddit

I think still nvidia will win with much higher TOPS.

Reply

[-]

Zyj@reddit

It states the memory bandwidth of the M4 Max at 410 GB/s. Where did you get 546GB/s?

Reply

[-]

Playful_Accident8990@reddit

I was tiring of all the precious gems in my house anyways!

Reply

[-]

mxforest@reddit

Those precious gems are dead weight. Why buy shiny stuff when you can actually buy intelligence. Make the wise choice.

Reply

[-]

Individual_Aside7554@reddit

Except in two years the m4 Max could be dead weight (given the pace of tech progress) and the gems' value will appreciate :)

Reply

[-]

tothatl@reddit

It depends on what you want it for. If it's to be your DeepSeek R1/R2 backend and to make it work and produce income, it can be totally justifiable economically, regardless of whether it will become obsolete in a few years. That's why people keep buying work machines and computers. But if it's just for fun, jewels and the Mac with M4 Max are just a matter of taste.

Reply

[-]

poli-cya@reddit

Produce income? Unless you're talking about a programmer using it for work, I can't imagine what that'd be. And even then, it'd be so glacially slow compared to an API, I just can't see it. If you were trying to run it to serve an actual service to customers, you're not going to get the Studio IMO... so this purchase comes down to interest in LLMs and whether you can justify using the Mac for something else as well.

Reply

[-]

Forgot_Password_Dude@reddit

It's slow? I heard the memory is fast? I guess it's not as fast as Nvidia?

Reply

[-]

poli-cya@reddit

It's gonna be insanely slow compared to online services, and extremely cost ineffective. This is for if you're doing something you absolutely don't want sent to any outsider or as a hobby. Still cool it exists, but anyone with space for a server and this kind of money might be best served to go that route with older GPUs or GPUs mixed with RAM- at least to my understanding.

Reply

[-]

zxyzyxz@reddit

Yep, imagine how many cloud AI tokens you can get for ~10k USD, I know this is /r/LocalLLaMA but the economics don't necessarily make sense.

Reply

[-]

b0tbuilder@reddit

They make sense when you need to run a large context model doing RAG on highly sensitive information

Reply

[-]

tothatl@reddit

I doubt DeepSeek is going to go the Western route of just adding moar layers and parameters. Maybe they do, maybe they don't. What I expect is they will find better algorithms and optimizations to run rationalizing multimodal models, probably with *less* parameters or execution overhead.

Reply

[-]

WhyIsSocialMedia@reddit

DS1/2 will be like the brick cellphone to a smartphone at this rate.

Reply

[-]

llamabott@reddit

Same goes for one's redundant liver!

Reply

[-]

xor_2@reddit

And the best thing is that when you start 'buying intelligence' by buying Apple products then you probably need that artificial intelligence :D

Reply

[-]

SkyFeistyLlama8@reddit

Gems? How many kidneys and lungs do you think you actually need?

Reply

[-]

Playful_Accident8990@reddit

All of them! 🫴

Reply

[-]

SomeOddCodeGuy@reddit

I hope that some streamer or another shows us what running a larger model looks like on this machine. $10k for a q5\_K\_M of Deepseek R1 may not be, from my perspective, a particularly terrible deal as long as it runs at any form of an acceptable speed.

Reply

[-]

joninco@reddit

I'm taking the plunge, will let you know.

Reply

[-]

Lyuseefur@reddit

RemindMe! 7 days “Polar Mac Plunge”

Reply

[-]

joninco@reddit

Mar 17 delivery date.

Reply

[-]

DeSibyl@reddit

How’d it go?

Reply

[-]

joninco@reddit

Oh, it went the same as all the reviews out there. Basically 18t/s on deepseek r1. One thing I was impressed with that I didn't see mentioned is the speed of the nvme storage. Deepseek was loading at over 10GB/sec and loads fairly quickly, which surprised me.

Reply

[-]

joninco@reddit

I’m actually going to return it. The 900GB/s bandwidth is nice, but the RTX 6000 Pro with 96GB of GDDR7 is more what I’m looking for: the ability to run smaller, denser models at speed, rather than R1 at 18 tk/sec.

Reply

[-]

ComingInSideways@reddit

RemindMe! 20 days “Mac for AI #2”

Reply

[-]

DeSibyl@reddit

RemindMe! 14 days

Reply

[-]

thrownawaymane@reddit

When’s the one you ordered for me getting here? (I look forward to your tests)

Reply

[-]

joninco@reddit

Sorry, I ran out of babies to sell.

Reply

[-]

man_and_a_symbol@reddit

Fr plz post tests I’m really curious as to how it performs

Reply

[-]

joninco@reddit

What should I test? Just load up LM Studio and R1 and see what it do

Reply

[-]

man_and_a_symbol@reddit

Yeah I think R1 is probably the best one to test; also try it with big context & different quants. If it gives a decent speed with big context and some decent quants, I bet people will be really interested. Also try some big non MoEs, maybe the Llama's just to see how they perform although I assume a dense 405B will be extremely slow

Reply

[-]

Lyuseefur@reddit

RemindMe! 14 days “Polar Mac Plunge 2: The Plunganing”

Reply

[-]

RemindMeBot@reddit

I will be messaging you in 7 days on [**2025-03-12 15:59:51 UTC**](http://www.wolframalpha.com/input/?i=2025-03-12%2015:59:51%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1j43us5/apple_releases_new_mac_studio_with_m4_max_and_m3/mg5xyuw/?context=3).

Reply

[-]

joninco@reddit

Ran the 4bit MLX deepseek r1. Long story short, 18t/s like everyone else found out. But the longer the context, the longer the TTFT. That prompt processing is slow. How can I get an exact benchmark for that besides arbitrary contexts?
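One way to get repeatable numbers for the question above is to time the stream yourself and separate time to first token (TTFT) from decode speed. The sketch below is runtime-agnostic: `generate` is a hypothetical callable that yields tokens, to be replaced with a wrapper around whatever streaming API is actually in use, and `prompt_tokens` has to come from that runtime's tokenizer. The fake generator exists only so the example runs on its own.

```python
import time
from typing import Callable, Dict, Iterable


def benchmark_stream(generate: Callable[[str], Iterable[str]],
                     prompt: str, prompt_tokens: int) -> Dict[str, float]:
    """Time a token stream and split prompt processing from decoding."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate(prompt):
        if first is None:
            first = time.perf_counter()   # first token marks the end of prefill
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("generator produced no tokens")
    ttft = first - start
    decode_time = end - first
    return {
        "ttft_s": ttft,
        "prompt_tok_per_s": prompt_tokens / ttft if ttft > 0 else float("inf"),
        "decode_tok_per_s": (count - 1) / decode_time
        if count > 1 and decode_time > 0 else 0.0,
        "generated_tokens": float(count),
    }


def fake_generate(prompt: str) -> Iterable[str]:
    """Stand-in for a real model: 50 ms of 'prefill', then 5 slow tokens."""
    time.sleep(0.05)
    for i in range(5):
        yield f"tok{i}"
        time.sleep(0.01)


result = benchmark_stream(fake_generate, "hello", prompt_tokens=1000)
print(result)
```

Running this with a fixed set of prompt lengths (for example 1K, 4K, 16K tokens) and the same generation length each time gives a comparable prompt-processing figure instead of one-off impressions.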

Reply

[-]

joninco@reddit

What is the ideal deepseek R1 quantization to run with 512GB ram? a Q4\_K\_M or Q5?

Reply

[-]

AstroZombie138@reddit

What config did you get? I'm wondering if the +$1500 for more cores is worth it. Otherwise I will go with the 256GB memory and 4TB storage (which hurts, but I think I'd eventually need that much storage). Remember you can get \~$600 off with a student discount if you qualify.

Reply

[-]

joninco@reddit

Apple M3 Ultra chip with 32-core CPU, 80‑core GPU, 32-core Neural Engine; 512GB unified memory; 2TB SSD storage

Reply

[-]

EternalOptimister@reddit

RemindMe! Blood diamonds

Reply

[-]

EvilPencil@reddit

My money would be on Alex Ziskind being first to market on that...

Reply

[-]

power97992@reddit

Two maxed-out M2 Ultras run Q4 at 17 tokens/s; expect less than that for Q5 or Q4 on one Mac. Maybe 8-10 tokens/s, due to less memory bandwidth, though with a faster interconnect and higher FLOPS.

Reply

[-]

Careless_Garlic1438@reddit

2 M2 Ultra’s with EXO runs it at 14 tokens / s so if the 2x holds then we are looking at 30 🤞

Reply

[-]

Hoodfu@reddit

This new m3 ultra isn't twice as fast as an M2 Ultra. It's roughly the same.

Reply

[-]

Ok_Warning2146@reddit

GPU compute is 2x faster than M2 Ultra and 2.6x faster than M1 Ultra per the press release. I also have doubt on this. But we just have to wait for more tests to confirm.

Reply

[-]

Hoodfu@reddit

with LLMs it's really just about that memory speed. losing 20% compared to the m4 ultra that it was supposed to be was a big letdown when I saw the news.

Reply

[-]

fallingdowndizzyvr@reddit

> with LLMs it's really just about that memory speed.

That's completely not true. If that were the case, then an RX580 would be competitive with a 4060. It's not. It's only about memory speed if you have enough compute to use it. On Mac Ultras that hasn't been the case. They have more memory bandwidth than they have compute to use it. That's why the M2 Ultra is faster than the M1 Ultra even though they have the same memory bandwidth. There's no reason to believe that the M2 Ultra is using all available memory bandwidth. If not, then the M3 Ultra will be faster.

Reply

[-]

-6h0st-@reddit

No, it won’t do acceptable speed. You need GPU processing power too: on Macs, the bigger the model, the slower the prompt processing, and they can’t handle a big context window. So it's pointless for local usage.

Reply

[-]

chespirito2@reddit

Why do you say it can't handle big context window?

Reply

[-]

bullerwins@reddit

819GB/s memory bandwidth for the M3 Ultra. Do you think llama.cpp on mac would run faster than ktransformers on a server with 512GB ram and 1-4 GPUs?

Reply

[-]

Its_Powerful_Bonus@reddit

Both m3 ultra variants have 800gb/s? Also 28cpu/60gpu? M3 max 14/30 afair had 300gb/s …

Reply

[-]

Yes_but_I_think@reddit

So 2 tokens/s on 400GB sized model (R1 Q6K)

Reply

[-]

perelmanych@reddit

No, since R1 is not a monolithic model, but MOE and only 37B parameters are activated. Should be more close to 10t/s.
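The dense-versus-MoE disagreement here comes down to how many bytes must be read per generated token. A back-of-the-envelope sketch, assuming decoding is purely memory-bandwidth bound and that only the active parameters are read for each token. The 671B total parameter count for R1 is not stated in this thread and is supplied here as an assumption; the 400GB model size, 37B active parameters and 819GB/s figures are taken from the comments.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if every token requires reading this many GB."""
    return bandwidth_gb_s / gb_read_per_token


bandwidth_gb_s = 819.0     # M3 Ultra figure quoted in the thread
model_gb = 400.0           # R1 Q6K size quoted in the thread
total_params_b = 671.0     # ASSUMPTION: total R1 parameters, in billions
active_params_b = 37.0     # active parameters per token, from the comment

# Dense reading: the whole file is touched for every token.
dense_ceiling = decode_ceiling_tok_s(bandwidth_gb_s, model_gb)

# MoE reading: only the active share of the weights is touched per token.
active_gb = model_gb * active_params_b / total_params_b
moe_ceiling = decode_ceiling_tok_s(bandwidth_gb_s, active_gb)

print(f"dense ceiling: {dense_ceiling:.1f} tok/s")
print(f"MoE ceiling:   {moe_ceiling:.1f} tok/s")
```

Both values are theoretical ceilings. Measured speeds are lower because compute, expert routing and KV-cache reads also cost time, which is consistent with the more conservative estimates given in the replies.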

Reply

[-]

Healthy-Nebula-3603@reddit

10? Lol. No, rather 20 t/s if the memory has 500+ GB/s.

Reply

[-]

perelmanych@reddit

Yes, yes I know you learnt how to use division)) And now look here [https://github.com/ggml-org/llama.cpp/discussions/4167](https://github.com/ggml-org/llama.cpp/discussions/4167)

Reply

[-]

joninco@reddit

It's MoE -- isn't it just a matter of being able to keep R1 in vram and then it runs whichever 36B model relatively quickly or am I missing something for decent speeds?

Reply

[-]

teachersecret@reddit

Someone ran R1 on 2x 192gb macs at about 17 tokens/second at a decent quant, so yeah, should be possible to get good usable speed with one of these 512gb rigs.

Reply

[-]

joninco@reddit

This should be fun, maybe the biggest bonus will be having full context in ram.

Reply

[-]

martinerous@reddit

And it's important to make sure they try it with large text. One thing is when you ask about QStarberries and another - if you want to work with code or text summaries.

Reply

[-]

SomeOddCodeGuy@reddit

Agreed. With that said, I do want to make a post not long from now discussing this in a bit more detail. I've always been hard on Macs for the prompt processing speed; I don't mind waiting, but other people do, and if you look at my profile I've made sure to pin a post showing the real numbers of what large context looks like on an M2 Ultra. With that said, I decided to test out ChatGPT's Deep Research by asking it to find me the ms-per-token numbers of inferencing 70b models on an A6000 (not the Ada), and interestingly, it came back with results showing several posts putting the inference around 5ms per token in prompt processing. I recently got my hands on the more powerful M2 Ultra, the 76-core GPU version, and it processes prompts on Qwen2.5 72b at 10ms per token. It's 2x slower on the Mac, but that's not as bad as what I think a lot of folks were imagining. And with speculative decoding it was a much smaller gap for prompt writing, so I want to try to do a bit more research and get a conversation going about just how big of a difference there is in response times on BIG models at big contexts between certain CUDA cards and a Mac.
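To put the 5ms and 10ms per-token figures in terms of waiting time, the sketch below multiplies them out for a few context sizes. It assumes prompt processing cost is linear in the number of prompt tokens, which is a simplification (attention cost grows with context length), so treat the results as rough lower bounds. The context sizes are arbitrary examples.

```python
def prompt_seconds(context_tokens: int, ms_per_token: float) -> float:
    """Estimated prompt-processing time, ASSUMING a constant cost per token."""
    return context_tokens * ms_per_token / 1000


A6000_MS_PER_TOKEN = 5.0      # figure reported in the comment for 70b models
M2_ULTRA_MS_PER_TOKEN = 10.0  # figure reported in the comment for Qwen2.5 72b

for ctx in (4096, 16384, 32768):
    gpu = prompt_seconds(ctx, A6000_MS_PER_TOKEN)
    mac = prompt_seconds(ctx, M2_ULTRA_MS_PER_TOKEN)
    print(f"{ctx:>6} tokens: A6000 ~{gpu:.0f}s, M2 Ultra ~{mac:.0f}s")
```

Seen this way, the 2x gap is the difference between waiting roughly a minute and a half and waiting close to three minutes on a 16K-token prompt, before any output appears.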

Reply

[-]

TyraVex@reddit

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/37

IQ3_M/IQ4_XS is all you need for V3/R1. I believe that a 3k RAM server with a 3090 and ktransformers would do about equally, around 13-15 tok/s. I may be wrong.

Reply

[-]

IlIllIlllIlllIllllI@reddit

I could probably get $10k for my car, why would I need transportation when I can have a Mac?

Reply

[-]

ykoech@reddit

Time to sell that car 🚗

Reply

[-]

Harvard_Med_USMLE267@reddit

Haha, my car is worth less than my RTX 4090, it’s certainly not going to pay for this!

Reply

[-]

thunk_stuff@reddit

On the positive side, the overpriced SSD upgrades start to feel like rounding errors when you hit $10k.

Reply

[-]

fotiro@reddit

But would you like a stand for only $1,500?

Reply

[-]

Ok_Warning2146@reddit

tb5 is fast enough that using cheap external SSD is just as good.

Reply

[-]

cafedude@reddit

Cars are trouble anyway. And it's not like you need to go anywhere anymore.

Reply

[-]

Rich_Repeat_22@reddit

And the actual cost of that RAM is barely $300.

Reply

[-]

Karyo_Ten@reddit

What? 512GB/s bandwidth RAM is not exactly cheap, be it on GPU or 12-channel ECC RAM.

Reply

[-]

Rich_Repeat_22@reddit

This thing is using LPDDR5 (not X) 6400, whose street price is $1.8 per GB; let's say $2 per GB (on 12/16/18GB modules). So $921-$1024 for 512GB of LPDDR5 6400. Apple is selling the 256GB for $5000.

Reply

[-]

ResolveSea9089@reddit

Can I ask a stupid question: why aren't there others offering that kind of RAM at that price point? Do you suspect there will be some going forward? Just no demand until now? Also, does this apply to VRAM? I'm a bit of a tech novice and trying to understand; for these LLMs we want VRAM, right?

Reply

[-]

Rich_Repeat_22@reddit

Yes. Because there wasn't much demand up to now. And corporations are kinda slow to make decisions; they've always been a minimum of 6-12 months behind. Also, some manufacturers will be extremely reluctant to do that, like AMD & NVIDIA, because it would cut into their professional and accelerator markets.

Reply

[-]

Karyo_Ten@reddit

> Apple is selling $5000 the 256GB.

The CPU+GPU aren't $5.6k - $5000 = $600. 512GB ECC RAM @6000MHz is $1.7k on newegg: https://www.newegg.com/p/1X5-0009-00A03

Reply

[-]

Rich_Repeat_22@reddit

You confuse ECC DDR5 for servers and workstations with LPDDR5.

Reply

[-]

Karyo_Ten@reddit

I'm not, if you want to build a machine with 256GB DDR5 yourself, either you stack GPUs or you stack ECC DDR5, you can't do LPDDR5 yourself.

Reply

[-]

Rich_Repeat_22@reddit

The Apple is using LPDDR5.

Reply

[-]

Karyo_Ten@reddit

And you can't build a 256GB machine with LPDDR5 yourself from parts. You should compare with what you pay if you build it yourself.

Reply

[-]

Rich_Repeat_22@reddit

Yes we can. If we can get our hands on the 512GB BIOS of the M3 128GB, we can replace the VRAM, if the PCB is the same, for less than $600.

Reply

[-]

Karyo_Ten@reddit

> If we can get our hands on

So is this BIOS in the room with us right now?

Reply

[-]

Rich_Repeat_22@reddit

🤦‍♂️

Reply

[-]

fullouterjoin@reddit

If I had spent every dollar on apple stock as I spent on apple products, could probably afford a plane or a nice cabin.

Reply

[-]

Rich_Repeat_22@reddit

256GB don't cost $5000.

Reply

[-]

Karyo_Ten@reddit

Are you arguing that the CPU is $5.6k - $5000 = $600?

Reply

[-]

Rich_Repeat_22@reddit

I am arguing that 512GB costs 10K and 256GB 5.6K. So Apple is selling 256GB for 5K while they cost barely $512 in LPDDR5 6400 modules.

Reply

[-]

fullouterjoin@reddit

It does when put into that machine.

Reply

[-]

Kavor@reddit

That's the price you pay to have ~~a cold soulless robot put it in and solder it in place~~ an Apple engineering expert handcraft and personally sign the ram stick in a process that takes days and the uttermost love and then carefully place it in the slot.

Reply

[-]

jabblack@reddit

Cost about as much as a 5090

Reply

[-]

rorowhat@reddit

Apple pricing is the worst!

Reply

[-]

angry_queef_master@reddit

Damn, that is how much I paid for my 2010 honda civic

Reply

[-]

the_Luik@reddit

How much is that in liver

Reply

[-]

YearnMar10@reddit

12.5k in € for 512gb

Reply

[-]

zoe934@reddit

But you can't find a 12.5k GPU with 512gb VRAM~

Reply

[-]

YearnMar10@reddit

No one said that, just wanted to inform

Reply

[-]

zoe934@reddit

SAME!

Reply

[-]

Rich_Repeat_22@reddit

I wonder if we can get our hands on the 512GB BIOS and the PCB is the same with the cheapest version (64/32GB), if we can replace the LPDDR5 6400 modules with 512GB ones. It would cost less than $1000 to buy 512GB worth of modules, replace the cheapest version and flash the bios 🤔

Reply

[-]

Forgot_Password_Dude@reddit

Oh hell yes 512, but damn almost double the price

Reply

[-]

wen_mars@reddit

9.5k is getting close to dual socket epyc territory. Nice that Apple gives people the option.

Reply

[-]

StoneyCalzoney@reddit

It's worth noting that the edu discount drops the 512GB price down to ~$8.6k

Reply

[-]

BigMagnut@reddit

It's not worth 10k.

Reply

[-]

candre23@reddit

Might as well just buy real GPUs for that money.

Reply

[-]

ASYMT0TIC@reddit

Which GPU with 512GB of VRAM are you buying for under $10k?

Reply

[-]

candre23@reddit

That will buy more than a dozen 3090s, which would run rings around the mac. Like order-of-magnitude faster. 512GB in a "unified memory" machine like this with laughable GPU cores is objectively pointless. Even with 123b models at moderate bpw you're only looking at about 96GB memory needed, and the mac would already be horrifically bandwidth and compute bound at that point. You load up a model that actually needs 512GB of memory and that mac will be lucky to produce more than a dozen tokens *per minute*.

Reply

[-]

indicava@reddit

Yea, cause the costs of building and running a 12x3090 rig end with the GPUs, right?

Reply

[-]

candre23@reddit

You wouldn't buy 12 3090s. You'd buy a reasonable number like 4. The point here is that it's factually impossible to actually take advantage of the "512GB" of memory in the mac. It's too slow in several metrics to run models that large at anything approaching usable speeds.

Reply

[-]

mxforest@reddit

It's not bad for what it is. A small 10-20 person startup can host local R1. That comes out to roughly $20 per person per month over 2 yrs.
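A quick sketch of that amortization. The $10,000 price is a rounded assumption based on the figures discussed in this thread, and the result ignores electricity, maintenance, and resale value.

```python
def cost_per_person_month(price_usd: float, people: int, months: int) -> float:
    """Hardware price spread evenly over a team and an amortization period."""
    return price_usd / people / months


PRICE_USD = 10_000  # rounded assumption for the 512GB configuration
MONTHS = 24         # the 2-year window from the comment

for team_size in (10, 20):
    monthly = cost_per_person_month(PRICE_USD, team_size, MONTHS)
    print(f"{team_size} people: ${monthly:.2f} per person per month")
```

With 20 people this works out to about $20.83 per person per month, and about $41.67 with 10 people.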

Reply

[-]

Wildcard355@reddit

Very good point. It's likely that we'll get cheaper local options in the coming years (months 🙏), but given that most startups are about this size, their budgets could probably accommodate one local LLM this way.

Reply

[-]

-6h0st-@reddit

5.8k I see here in the UK for the 256GB 28/60 core version, which doesn’t have the full bandwidth of 800GB/s (25% lower?)

Reply

[-]

Healthy-Nebula-3603@reddit

Still very worth it!!!

Reply

[-]

tangoshukudai@reddit

cheap.

Reply

[-]

roshanpr@reddit

FUCK MY LIFE

Reply

[-]

mxforest@reddit

512 GB holy hell. Great machine for local R1.

Reply

[-]

DirectAd1674@reddit

https://preview.redd.it/es871ccxsvme1.jpeg?width=1284&format=pjpg&auto=webp&s=d73170646c8281250a3c4219264efc2faad8d9d5 The wording here certainly aims to suggest that

Reply

[-]

half_a_pony@reddit

it's funny to mention apple intelligence here because apple intelligence models are tiny. going to be a drop in a bucket in all of that memory

Reply

[-]

2016YamR6@reddit

Deepseek R1 Distill Siri

Reply

[-]

nexusprime2015@reddit

DeepSiri

Reply

[-]

getmevodka@reddit

💀🤭

Reply

[-]

ready-eddy@reddit

Really curious about the performance of diffusion models. Stable Diffusion is running much better than I thought it would on my 24GB Mac mini... 512GB sounds… tasty

Reply

[-]

Background-Hour1153@reddit

If I'm not mistaken, diffusion models are compute bound, so as long as the diffusion model fits in the RAM/VRAM (most image diffusion models fit in 24 gb of RAM), you shouldn't get faster generation if it's the same exact GPU.

Reply

[-]

tomz17@reddit

Is it tho? For $10k you can buy a proper 12-channel DDR5 system with similar memory BW, expandability (i.e. an nvidia card for prompt processing, more than 512GB RAM), and far more CPU compute power. -or- you can just rent $10k of actual cloud on a proper hopper, blackwell, etc. system and get orders of magnitude the throughput. I mean it's priced competitively to that once you factor in the apple tax, but it's not exactly a game changer in that price range.

Reply

[-]

Zyj@reddit

A 12-channel DDR5-6000 system provides a mere 576GB/s
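The 576GB/s figure follows from bus width times transfer rate. Here is a minimal sketch of that arithmetic. The 64-bit DDR5 channel width is standard; the Apple bus widths and memory speeds below (1024-bit LPDDR5-6400 for the M3 Ultra, 512-bit LPDDR5X-8533 for the full M4 Max) are assumptions chosen because they reproduce the 819GB/s and 546GB/s numbers quoted in this thread. These are theoretical peaks, not measured throughput.

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s: bytes per transfer times MT/s."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000  # MB/s -> GB/s


# 12 channels of DDR5-6000, each 64 bits wide (one socket)
epyc_12ch = peak_bandwidth_gbs(12 * 64, 6000)

# Assumed Apple configurations (not stated in the thread)
m3_ultra = peak_bandwidth_gbs(1024, 6400)
m4_max = peak_bandwidth_gbs(512, 8533)

print(f"12-channel DDR5-6000: {epyc_12ch:.1f} GB/s")
print(f"M3 Ultra (assumed):   {m3_ultra:.1f} GB/s")
print(f"M4 Max (assumed):     {m4_max:.1f} GB/s")
```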

Reply

[-]

mxforest@reddit

That's theoretical though. The more kits you have, the higher the chance that they will run at lower clocks. I won't be surprised if 12 modules result in it barely managing 5000-5200.

Reply

[-]

tomz17@reddit

Not "theoretical". DDR5 6000 is the spec for 5th gen Epyc parts, you WILL get exactly that speed.

Reply

[-]

Zyj@reddit

Well, DDR5-6000 modules past 32GB are still pretty rare. There's Kingston https://www.kingston.com/unitedkingdom/de/memory/search/?partid=KVR64A52BD8-64 but I'm not sure if UDIMMs are officially supported

Reply

[-]

tomz17@reddit

It's not a gaming PC. If you are buying a workstation or server class CPU you just look at the QVL for that CPU + motherboard and buy one of the Samsung or Hynix part numbers they actually tested in that config. Everything else is ymmv.

Reply

[-]

Zyj@reddit

Right. The QVL often doesn't list DDR5 6000 at higher capacities

Reply

[-]

tomz17@reddit

Huh? Anyone selling you a Turin-compatible board / system certainly does. How else would they sell the thing? e.g. H13-SSL-N lists

* MEM-DR516MB-ER64 **16GB** DDR5-6400 1Rx8 LP (16Gb) ECC RDIMM
* MEM-DR532MD-ER64 **32GB** DDR5-6400 2Rx8 (16Gb) ECC RDIMM
* MEM-DR512PC-ER64 **128GB** DDR5-6400 2Rx4 LP (32Gb) ECC RDIMM,RoHS

Reply

[-]

Zyj@reddit

You're right. With my ASUS Pro WS WRX80E-Sage SE WIFI mainboard, the QVL lacked any 8x32GB memories at DDR4-3200 speeds so you were on your own.

Reply

[-]

Zyj@reddit

Right! I'm wondering if Dual EPYC systems provide twice the memory bandwidth in a useful way (i.e. something that increases tokens per second)?

Reply

[-]

dinerburgeryum@reddit

It’s tough because you really have to watch NUMA parameters at that point. Ktransformers makes shadow copies of critical matrices at each NUMA node to prevent this, but that kind of tuning is not generally available for all models.

Reply

[-]

tomz17@reddit

> A 12-channel DDR5-6000 system provides a mere 576GB/s

**per socket**

Reply

[-]

BumbleSlob@reddit

> you can just rent $10k of actual cloud on a proper hopper, blackwell, etc. system and get orders of magnitude the throughput.

sir this is /r/localllama

Reply

[-]

calcium@reddit

Apple tax? At this point when comparing workstations to one another they remain pretty competitive.

Reply

[-]

gandhi_theft@reddit

Can it fit R1? I read that it's ~700GB

Reply

[-]

mxforest@reddit

It can run a quantized version. 700GB is assuming 8 bits. You can run 4 bit which is half the size.
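A rough sketch of the size arithmetic. It uses the 671B parameter count mentioned elsewhere in this thread and assumes a uniform bit width for every weight. Real quantized files are usually somewhat larger, since some tensors are kept at higher precision and quantization formats carry scaling metadata, and the KV cache needs memory on top of the weights.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, assuming a uniform bit width."""
    return params_billion * bits_per_weight / 8


R1_PARAMS_BILLION = 671  # total parameter count cited in the thread

for bits in (8, 4):
    size = weights_gb(R1_PARAMS_BILLION, bits)
    print(f"{bits}-bit: about {size:.1f} GB of weights")
```

At 8 bits this gives about 671 GB, and at 4 bits about 335.5 GB, which is why a 4-bit quant is the one that fits in 512GB.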

Reply

[-]

philguyaz@reddit

The memory bandwidth is going to make me cry; it’s the same as the M2

Reply

[-]

bullerwins@reddit

819GB/s memory bandwidth for the M3 Ultra

546 GB/s memory bandwidth for the M4 Max

Reply

[-]

Abject_Radio4179@reddit

Actually, it’s up to 819 GB/s. The 60 core one will have 614 GB/s, so just 10% more than the M4 Max.

Reply

[-]

Barry_Jumps@reddit

Someone smart please explain why M4s have half the bandwidth

Reply

[-]

CtrlAltDelve@reddit

Apple's "Ultra" chips can be thought of as two "Max" chips glued together. So, an M3 Ultra is like having two M3 Max chips working together. Unless the new M4 Max was somehow magically twice as powerful as the M3 Max, the older M3 Ultra (which again is two M3 Max chips) will still be faster. The big mystery is why Apple didn't just release an M4 Ultra.

Reply

[-]

Barry_Jumps@reddit

🤜 🤛

Reply

[-]

animax00@reddit

should it be 410GB/s memory bandwidth for M4 max? [https://www.apple.com/mac-studio/specs/](https://www.apple.com/mac-studio/specs/)

Reply

[-]

Competitive-Bake4602@reddit

48GB and higher models are 546 GB/s

Reply

[-]

TrashPandaSavior@reddit

That page shows that it's "Configurable to" 546 GB/s. So basically the non-binned chip has that speed. For LLMs, that's a $300 upgrade I'd take.

Reply

[-]

animealt46@reddit

M4 Max is looking mighty mighty good.

Reply

[-]

mxforest@reddit

It's not great but it is ok for MoE with low number of active params.

Reply

[-]

philguyaz@reddit

Truuuuu! Also for finetuning, which I use my Ultra for, it’s more than good, cause I have more time than RAM.

Reply

[-]

FullOf_Bad_Ideas@reddit

how fast (t/s) can you finetune bigger models on your Ultra? Are you doing QLoRA/LoRA or is full finetuning possible too?

Reply

[-]

philguyaz@reddit

I don’t know that I keep track of T/s in training; I care more about time per iteration. We do LoRA MLX fine-tuning and get the impact we want. I don’t get the obsession with full-sized models and training; I run a SaaS product that wouldn’t make sense without quantization in both training and inference.

Reply

[-]

indicava@reddit

Have you experimented with full fine-tuning and found you get similar results to LoRA? For my fine tuning use cases (coding) I’ve found a significant difference in performance when doing a full fine-tune rather than going the LoRA/QLoRA route. Also, I don’t know what size models you’re fine tuning, but the biggest one I use (commercially) is a 32b parameter model and the various serverless services out there make it pretty reasonable to fine tune and serve inference (in production) for my customers.

Reply

[-]

philguyaz@reddit

I’m doing a 70b, and I’m not finetuning code; I’m finetuning language style, format and word choice preference, which probably makes a big difference. I have not tried this approach vs full fine-tune.

Reply

[-]

FullOf_Bad_Ideas@reddit

What kind of models are you finetuning? T/s in training is just batch size × sample length / time per iteration. Same thing in the end, but it's more fundamental since you can change sample length in your dataset and batch size, so time per iteration isn't really meaningful, unless we're talking about diffusion image models.
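A minimal sketch of that conversion, reading the relationship as tokens processed per iteration divided by seconds per iteration. The batch size, sample length, and iteration time are hypothetical placeholders, not numbers from anyone's actual run.

```python
def train_tokens_per_second(
    batch_size: int, sample_length: int, seconds_per_iteration: float
) -> float:
    """Training throughput: tokens seen in one iteration over its duration."""
    return batch_size * sample_length / seconds_per_iteration


# Hypothetical run: 4 samples of 2048 tokens, 16 seconds per iteration
throughput = train_tokens_per_second(batch_size=4, sample_length=2048,
                                     seconds_per_iteration=16.0)
print(f"{throughput:.0f} tokens/s")
```

Because batch size and sample length appear in the numerator, two runs with the same time per iteration can have very different throughput, which is the point being made about comparing setups.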

Reply

[-]

piggledy@reddit

Isn't memory bandwidth becoming the limiting factor here rather than memory size? The M3 Ultra has a memory bandwidth of 800GB/s. Local R1 in Q4 is about 400GB. Wouldn't that make for a terrible experience at roughly 2 tokens per second? Is that good value for money at a minimum $9,499.00?

Reply

[-]

mxforest@reddit

It only has like 37B active params at a time. So you divide 800 by 37 and not 400.
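A back-of-the-envelope sketch of this argument: during generation every active weight is read about once per token, so bandwidth divided by bytes read per token gives a ceiling on tokens per second. The example uses 37B active parameters, DeepSeek's published figure for R1, and assumes uniform 4-bit weights. This is an upper bound only; it ignores KV cache reads, compute time, and routing overhead, so real throughput will be lower.

```python
def decode_tps_upper_bound(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Bandwidth-limited ceiling on generated tokens per second."""
    return bandwidth_gbs / gb_read_per_token


BANDWIDTH_GBS = 800  # M3 Ultra figure used in the thread

# Dense reading: all ~400 GB of Q4 weights touched for every token
dense_bound = decode_tps_upper_bound(BANDWIDTH_GBS, 400)

# MoE reading: only the active parameters are touched (37B at 4 bits, assumed)
active_gb = 37 * 4 / 8
moe_bound = decode_tps_upper_bound(BANDWIDTH_GBS, active_gb)

print(f"dense ceiling: {dense_bound:.1f} tok/s")
print(f"MoE ceiling:   {moe_bound:.1f} tok/s")
```

The dense reading reproduces the 2 tokens per second estimate above, while the mixture-of-experts reading gives a ceiling of roughly 43 tokens per second.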

Reply

[-]

piggledy@reddit

So good for Deepseek, but terrible for something like Llama 3.1 405B?

Reply

[-]

Longjumping_Kale3013@reddit

But isn’t llama 3.3 70B comparable to the 405B?

Reply

[-]

animealt46@reddit

Probably great for something like a 70~150B model running in the background while having plenty of RAM available to do other tasks I guess?

Reply

[-]

piggledy@reddit

And good for long context too

Reply

[-]

dinerburgeryum@reddit

MLX has solid KV cache quant options to boot. 6-8 feels near lossless. I’m not familiar enough with their backing algos to recommend 4-bit yet but at higher quants it’s great.

Reply

[-]

dinerburgeryum@reddit

405B monolithic was always hubristic. Silly that we even considered it for hosted inference. MoE was in the wild when it dropped. Just Meta being silly and throwing compute at problems instead of brains.

Reply

[-]

mxforest@reddit

True! Given the rumors that LLAMA team scrambled after R1 release, I think MoE is the way to go. Especially when thinking tokens need much higher tps to be usable.

Reply

[-]

Kind-Log4159@reddit

The zuck is definitely still getting flashbacks of r1 release. Llama 4 was canceled because of it

Reply

[-]

SomeOddCodeGuy@reddit

I will note that MoEs process prompts a little differently than the active param size would imply, and you definitely feel it on Mac. I have an M2 Ultra and one of my favorite models used to be WizardLM2 8x22b. The prompt processing time was definitely longer than what I'd expect a 40 something b model to process at; it felt like it was closer to a 70b in prompt processing speed, and the full size of it was around 141b if I remember right. Once it started writing, things sped up a lot.

Reply

[-]

Mrleibniz@reddit

> WizardLM2

I completely forgot about that model, whatever happened to that? They took it down and the buzz around it sort of died.

Reply

[-]

SomeOddCodeGuy@reddit

It's still available, just not from the original repo. It was dropped under an open source license, some folks forked the repo while it was up, and those repositories continued to exist and GGUFs kept going up. You could still find it on huggingface if you were so inclined, but otherwise there wasn't a lot of buzz because without the official repo up, not many benchmarks wanted to run the numbers. Eventually, by the time they did, new models had come out that beat it pretty easily, so it wasn't worth the chatter anymore.

Reply

[-]

fullouterjoin@reddit

You still have a copy? How does it compare to Qwen?

Reply

[-]

SomeOddCodeGuy@reddit

I do still have it, but I haven't done a hard benchmark of real numbers to compare. However, as much as I've used both, I can tell you that I feel that knowledge wise and coherence wise Qwen is better. From my experience:

* Wizard 8x22b was absolute magic in terms of coding ability for its time, but it's been a while since then; Qwen2.5 32b Coder is better.
* Wizard sounded amazing in terms of speech quality and general understanding; it was exceptionally clever in terms of contextual reading between the lines. If you gave it requirements, it did a great job of really digging in to find what you actually wanted. It beats Qwen2.5 72b for me in that regard.
* Qwen2.5 72b is far better at RAG/summarization for me. Wizard hallucinated more than I liked with in-context learning.

Reply

[-]

Yes_but_I_think@reddit

Every token the selected expert changes. I thought 2 tokens/s is right

Reply

[-]

mxforest@reddit

But the other expert will also be loaded up, it's not like it has to spend time loading it first. It is available for use right away.

Reply

[-]

Low-Opening25@reddit

even then R1 is unlikely to hit more than 10

Reply

[-]

revotfel@reddit

I just bought the m2 ultra and I'm past the 14 day window, shit, I wasn't looking at release dates T_T

Reply

[-]

thrownawaymane@reddit

Go to the store, be very nice and be willing to try multiple locations. Someone might take pity and make a call.

Reply

[-]

revotfel@reddit

I f****** love you, thanks for encouraging me to go. They did a straight swap and now I have 96 GB for the same price I bought the 64 at

Reply

[-]

thrownawaymane@reddit

Just make sure you pay it forward. I’ve given similar advice over the years and people are usually too scared to try. This is just how Apple works. My work machine has 64gb of RAM but the way it’s looking I won’t ever be able to afford that much in a personal machine, much less something like 512gb.

Reply

[-]

revotfel@reddit

You think so? I'll be at exactly a month on release date from when I bought it (the 12th)

Reply

[-]

anythingisavictory@reddit

You're fine, either call or go to the store and ask politely.

Reply

[-]

revotfel@reddit

I f****** love you, thanks for encouraging me to go. They did a straight swap and now I have 96 GB for the same price I bought the 64 at

Reply

[-]

anythingisavictory@reddit

So glad it worked out! Just promise to name your 8th child after me.

Reply

[-]

revotfel@reddit

Well, I'll give it a go! Can't hurt to try

Reply

[-]

Turbulent_Pin7635@reddit

Ok, dumb newbie question here. Will the M3 Ultra be enough to run the 671b Deepseek? Also, I work in bioinformatics and have never used a Mac; is it hard to use it with Ubuntu? Almost all my pipelines are built for Linux.

Reply

[-]

Daniel_H212@reddit

Cheaper than getting 512 GB of VRAM using discrete GPUs I guess.

Reply

[-]

dinerburgeryum@reddit

It really can’t be overstated that we now have access to 256GB of unified RAM at 800GB/s and you don’t need to have an electrician fix your house up with 240V drops.

Reply

[-]

ResolveSea9089@reddit

Sorry, can you explain this to me? Is this regular RAM? I thought for AI applications we need VRAM. Is the GPU able to access the "regular" RAM because it's "unified", so we're all good there?

Reply

[-]

dinerburgeryum@reddit

Yes, "unified" in this case means the GPU and CPU have equal access to the same RAM at the same speed (800GB/s). In a "standard" setup (3090 and Intel i7 for example), the 3090 will have 900GB/s access to a small pool of 24GB of RAM. The Intel chip will have access to say a pool of 32GB of RAM at an anemic 70GB/s. (The GPU can technically access the Intel's RAM pool, but through the PCI-E lanes then through the sluggish DDR5 interface.) This means you realistically have access to only the 24GB near the GPU for "fast" inference. Compare this with the M3 Ultra: the CPU and GPU are on the same chip, and share the same high-speed memory controllers. They both have access to the full 800GB/s at all times, with no PCI-E or NUMA traversal. I hope this all makes sense haha.

Reply

[-]

ResolveSea9089@reddit

Yes it does! I'm learning more about how computers work, so it's very cool to see some of these terms in your answer.

> In a "standard" setup (3090 and Intel i7 for example), the 3090 will have 900GB/s access to a small pool of 24GB of RAM. The Intel chip will have access to say a pool of 32GB of RAM at an anemic 70GB/s. (The GPU can technically access the Intel's RAM pool, but through the PCI-E lanes then through the sluggish DDR5 interface.) This means you realistically have access to only the 24GB near the GPU for "fast" inference.

This explanation you gave is very clear! I'm sure there are design reasons for it. Is this something only Apple is able to do because they have full control of their chip? Like, I try to run AI applications on my crummy 8GB VRAM computer and I'm so limited, but what's stopping others from imitating this and getting us super high VRAM/unified memory?

This seems really exciting, because I mean even the high end NVDA chips are like what, 24GB for consumer models like you said? But now you're saying I could potentially run something with like 512GB of VRAM?? I'm hoping more folks take the leap here; it seems the cost for high VRAM could come down quite quickly?

Reply

[-]

dinerburgeryum@reddit

So, it's not really Apple specific; AMD does this in their APU line. It's more about having the CPU and the GPU on the same die (or processor, if that's easier to visualize). Once they're stuck together, they can utilize the same circuits to access the RAM (the memory controllers), and share the small, fast internal buffers all processors have (cache).

What makes Apple's approach somewhat interesting here is both that the CPU and GPU are on the same die, _and_ they have simply thrown a crazy number of memory controllers at the problem. RAM, as it happens, can only send so much data at once. You need one memory controller per RAM "lane." You might hear "dual-channel" memory; that literally means there are 2 memory controllers that can access data from two different RAM banks at once[1]. The M3 Ultra chip packs 48 memory controllers, which can each access data from different RAM chips in parallel. This is closer to a GPU's architecture than a standard CPU architecture.

What is stopping us from getting the perfect balance between these two approaches is, generally, that this kind of insane memory bandwidth simply isn't needed for most computing workloads. So inexpensive computers without these requirements simply won't drive up costs to meet them. (AMD has historically positioned their APU line as budget hardware, which is why we're seeing them lag on this front.)

Also, as you can imagine, 48 memory slots on a motherboard would look silly, take up too much space, and come with its own set of problems with traces running hither and yon. (DIMMs actually pack several RAM chips on one small board; GPU and Apple memory is soldered and accessed individually to support wider controller access.) Server boards can ship with 24 in dual processor configurations, but they're generally large, expensive and have bonkers power requirements to boot.

So yeah, I hope that clarifies some of this anyway. Let me know if there's anything else I can help clarify.

[1] This is reaching the edge of my knowledge of Intel memory layouts. It may be that a single controller can do dual issue, but the concept remains the same.

Reply

[-]

ResolveSea9089@reddit

I never responded here, I'm sorry. This is absolutely tremendous, thank you very much, I've learned a lot reading this post and keep referencing it as I read more about this stuff the last few days. Thank you!

Reply

[-]

dinerburgeryum@reddit

Cool I’m glad it helped. 😊

Reply

[-]

mxforest@reddit

Also fits in a backpack instead of taking up a room and tripping the circuit breaker.

Reply

[-]

xXprayerwarrior69Xx@reddit

it's kinda wild when you think about it

Reply

[-]

Daniel_H212@reddit

It doesn't double as a whole-house space heater, surely that's a downside!

Reply

[-]

phata-phat@reddit

Quiet, consumes less power and unobtrusive.

Reply

[-]

Innomen@reddit

Still don't understand why we can't just make sata connected bricks of ram or just pci cards of ram. It's all just sand and plastic.

Reply

[-]

hishnash@reddit

CXL is used in some servers to let you have memory over PCIe, but it is way way way slower than the memory in these systems. The difficulty here is that the longer the trace and the more connectors, the higher the resistance and interference on the signal. These signals are very very very fast, and it is easy for RF interference to screw them up so that you can't read the correct value. Even internal reflections on the wires themselves are an issue: when you're switching signals at these speeds, the copper trace stops behaving the way it would for a constant current flow.

Reply

[-]

Innomen@reddit

Oh, I thought the interface was fast enough. Thank you for the educational reply. I assumed PCI was as fast as the GPU slot. I guess in my brain "slots are slots" and it's more about where, and size, than what kind. Like, we can plug in RAM, why can't we plug in more RAM, know what I mean? Simplistic I guess, ignorant XD Thanks again, I'll RTFM some more X)

Reply

[-]

hishnash@reddit

PCI is as fast as a GPU slot, but a GPU does not access its VRAM over the PCI bus. The VRAM is typically on the GPU card and has much much faster (and lower latency) connections. The issues here all come down to trace length and trace quality; with the speeds we are dealing with these days for memory, the speed of light (electricity is light) in copper becomes a huge factor.

Reply

[-]

I_EAT_THE_RICH@reddit

I’m not overly impressed with the speed of models I can run on my Apple silicon. I’m wondering if I’ll ever run locally at this point. Considering my monthly expense for ai is now over $40, it’s still cheap compared to one of these bad boys.

Reply

[-]

hishnash@reddit

The main use case for these is if you're doing any personal model customization or dealing with data that you can't legally send to a third party. Private company data, legal data, medical, mil, gov data etc.

Reply

[-]

I_EAT_THE_RICH@reddit

I actually do model customization for clients in my industry, but we use cloud services. For these exact reasons.

Reply

[-]

hishnash@reddit

It depends on the industry, but for many industries the paperwork and compliance needed to send the data off site, or to any server that it is not already approved on, is a f-ing nightmare (and ends up costing a lot in pointless hours of legal compliance contracts). There are some cloud providers that have certification, but if your company has not yet gone through the steps to validate and approve that provider it is often not worth the effort. On-prem deployments are returning to companies all over the world.

For example, I used to work for a SW company that built software for the mining industry. Sometimes clients would have issues and would share their projects (real world locations of high-value deposits) with us. This data was considered extremely valuable, as the company may have spent millions if not billions surveying the location to collect the data. When they provided it to us, it was explicitly provided to a named engineer and to that engineer's (air gapped) machines only. The paperwork and insurance we would have had to go through if we wanted to, say, upload that to a cloud service was a no go, so we had to have on-prem compute HW for doing the needed processing of this data (HW that would commonly be fully wiped between each customer's data being loaded).

Reply

[-]

I_EAT_THE_RICH@reddit

Lemme get a copy of those surveys

Reply

[-]

Spanky2k@reddit

Very disappointed that it’s not an M4 Ultra although 512GB instead of 256GB is very cool. Will have to wait for benchmarks to make any kind of decision though. If it can handle R1 at good speeds then it’ll make a great in house LLM host. I have a feeling that smaller dynamic quants of R1 might end up working better though in which case the 512 one might be overkill.

Reply

[-]

hishnash@reddit

The main user of this chip is Apple itself within their ML data centers. I expect the reason they want this volume of mem is to have multiple separate models loaded ready to go.

Reply

[-]

power97992@reddit

What, just over 800 GB/s? They didn't upgrade the bandwidth? It should be at least 1092GB/s!

Reply

[-]

hishnash@reddit

It is built on M3 chip IP, not M4, so it is using the M3 memory controllers and thus slower memory.

Reply

[-]

-6h0st-@reddit

Don’t think Ultra is worth it. M4 max with 64GB is probably best choice. But still getting two 5090s would be best choice for local usage - big context window. Macs can’t handle that.

Reply

[-]

Glebun@reddit

Why wouldn't a Mac with 512GB memory handle that?

Reply

[-]

Xyzzymoon@reddit

Very low token/s. Just because it fits in the memory doesn't mean it runs fast. The AI processing speed on the processor is subpar.

Reply

[-]

Glebun@reddit

It's comparable to 4090 speed

Reply

[-]

Xyzzymoon@reddit

Where are you seeing the benchmark showing both models fitting into VRAM where the speed is comparable? Mac only wins when offloading is included, from what I see. Outside of that 4090 wins by a factor of 4.

Reply

[-]

Glebun@reddit

I'm just looking at the memory bandwidth.

Reply

[-]

dkaminsk@reddit

Memory bandwidth matters for text processing speed, and at 819GB/s it’s close to Nvidia, but prompt processing speed relies on GPU AI capabilities, and here the M3 Max was 10x slower than multiple 3090s. With the M3 Ultra that gap might halve. So the bigger the context window, the more you will feel it.

Reply

[-]

Glebun@reddit

I thought that for inference memory bandwidth is the bottleneck, not the GPU itself.

Reply

[-]

dkaminsk@reddit

Inference yes, prompt processing no. Inference happens after prompt is processed.

Reply

[-]

Glebun@reddit

Inference means the entire process of using the LLM, as opposed to training.

Reply

[-]

dkaminsk@reddit

Ok, then text processing vs prompt processing: TP is about bandwidth, PP is not

Reply

[-]

Glebun@reddit

I don't understand the distinction - prompt is text

Reply

[-]

dkaminsk@reddit

Dunno how to explain it - See here under llama 3.0 you have a table with text processing speed in relation to context window which is directly linked to GPU bandwidth and then you have prompt processing speed which on Mac is sometimes 10x slower than nvidia GPU https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
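One way to see the distinction is to treat a response as two phases with separate speeds: prompt processing (reading the input, mostly compute-bound) and text generation (writing the output, mostly bandwidth-bound). The sketch below uses made-up speeds purely for illustration; they are not taken from the linked benchmark.

```python
def response_seconds(
    prompt_tokens: int, pp_tps: float, output_tokens: int, tg_tps: float
) -> float:
    """Total wait: prompt-processing phase plus token-generation phase."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical machines with equal generation speed but a 10x gap
# in prompt processing speed
PROMPT_TOKENS, OUTPUT_TOKENS = 8_000, 500

slow_pp = response_seconds(PROMPT_TOKENS, pp_tps=100.0,
                           output_tokens=OUTPUT_TOKENS, tg_tps=20.0)
fast_pp = response_seconds(PROMPT_TOKENS, pp_tps=1000.0,
                           output_tokens=OUTPUT_TOKENS, tg_tps=20.0)

print(f"slow prompt processing: {slow_pp:.0f} s total")
print(f"fast prompt processing: {fast_pp:.0f} s total")
```

With these placeholder numbers the generation phase takes 25 seconds on both machines, while the prompt phase takes 80 seconds on one and 8 seconds on the other, so the difference grows with the size of the context.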

Reply

[-]

Glebun@reddit

Oh, got it, so it's basically input vs output

Reply

[-]

Xyzzymoon@reddit

That is not the only thing determining LLM speed; see here for comparisons: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference You can see that several Mac SKUs tested here have the same memory speed, but different processors impact the speed.

Reply

[-]

swagonflyyyy@reddit

800GB/s Mo-Mother of Mercy!

Reply

[-]

Zyj@reddit

I was hoping for more. There is a M4 Max chip with 546 GB/s. So something with 1092GB/s would have been logical.

Reply

[-]

swagonflyyyy@reddit

I was expecting less. Granted, 800GB/s isn't going to do much for colossal models, but it should run 70B models considerably faster than 512GB/s.

Reply

[-]

Zyj@reddit

Yes. It will be interesting to see two Project Digits (128GB $3000 each) connected with their high speed networking compete with a single $5600 Mac Studio M3 Ultra with 256GB RAM.

Reply

[-]

Glebun@reddit

Not a fair contest - this has 3x the memory bandwidth (not even considering that networking is what, 1GB/s (10gbit)?)

Reply

[-]

TheElectroPrince@reddit

Thunderbolt 5 is there, and that's 80Gbit/s bidirectionally.

Reply

[-]

Glebun@reddit

That's 10 times less than the memory bandwidth.

Reply

[-]

TheElectroPrince@reddit

You can probably aggregate the three ports together between two Mac Studios for 240Gbit/s bidirectional bandwidth. Or you can connect multiple Mac Studios with about 1-2 connections between each-other for up to 160Gbit/s bandwidth.

Reply

[-]

Glebun@reddit

> You can probably aggregate the three ports together between two Mac Studios for 240Gbit/s bidirectional bandwidth.

You cannot.

> Or you can connect multiple Mac Studios with about 1-2 connections between each-other for up to 160Gbit/s bandwidth.

The 80 Gbps will still be the bottleneck. I misspoke earlier, btw, the memory bandwidth isn't 10 times faster, since it's 800 GB/s, not 800 Gbit/s. It's actually 80 times faster.
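For anyone tripping over the bits-vs-bytes thing, here is a minimal sketch of the conversion, using only the two figures quoted in this thread (800 GB/s memory bandwidth, 80 Gbit/s Thunderbolt 5). The function name is made up for illustration.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


# Figures as quoted in the thread, not measured values.
memory_bw_gbyte = 800.0   # GB/s, unified memory bandwidth
link_bw_gbit = 80.0       # Gbit/s, Thunderbolt 5

link_bw_gbyte = gbit_to_gbyte(link_bw_gbit)
ratio = memory_bw_gbyte / link_bw_gbyte

print(f"Thunderbolt 5 link: {link_bw_gbyte:.0f} GB/s")
print(f"Memory is {ratio:.0f}x faster than the link")
```

This only compares raw peak numbers; real link throughput will be lower once protocol overhead is included.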

Reply

[-]

petuman@reddit

800GB/s is (only) for M3 Ultra, not M4 Max. Given M1 Ultra existed for almost 3 years with same bandwidth, there's really no reason to expect less.

Reply

[-]

SeymourBits@reddit

What you’re speculating about should be the (currently unreleased) M4 Ultra.

Reply

[-]

indicava@reddit

Probably won’t be released. They (Apple) specifically stated (for the first time publicly) that “not all CPU generations will get the Ultra variant” = No M4 Ultra, that’s why we’re getting an M3 Ultra so deep into the M4 rollout.

Reply

[-]

SeymourBits@reddit

That’s probably the point I’d float if the M4 Ultra wasn’t scheduled for at least another year or so. Otherwise, knowledge of superior specs would hurt M3 Ultra sales, which is pure kryptonite to Apple. Notice how they didn’t *specifically* say that there will be no M4 Ultra.

Reply

[-]

JBsthirdleg@reddit

Not a computer guy here, but is there someone who could help me translate the computing power of the M4 Max to the equivalent of what I have in my PC, so I know which Apple chip will be the correct one for my wife to edit photos for her business and also play her guilty pleasure... WoW?

My PC:

Intel i7-12700

Nvidia GeForce RTX 3070 Ti

1TB storage

32GB DDR4 RAM

Thank you.

Reply

[-]

iCruiser7@reddit (OP)

It's roughly equivalent to an Ultra 9 285K + RTX 4070-4080 desktop, depending on your specific software and its optimization.

Reply

[-]

JBsthirdleg@reddit

Thank you! I'm still an idiot. With the new Mac Studio and M4 Max, will it easily meet these?

Reply

[-]

iCruiser7@reddit (OP)

I don't play WoW but feel free to buy one from Apple and try for yourself. If it doesn't work just return it within 14 days.

Reply

[-]

JBsthirdleg@reddit

https://preview.redd.it/xua2xmpeydne1.jpeg?width=1080&format=pjpg&auto=webp&s=aab50ae2b9f5ea866d5b67d27a9230bf439f03c9

Reply

[-]

synn89@reddit

The RAM speed is disappointing. I'm not sure how practical the 512GB of RAM will be outside of niche MOE models that use smaller experts. It sounds great for a local Deepseek at a decent quant, but I'd really like to see what the landscape of new 200B+ models is, architecture-wise, before wanting to invest in this device. Will Llama4 405B be a MOE, or is Meta going to stick with monolithic models?

Reply

[-]

WhereIsYourMind@reddit

I'm particularly interested in the use case of extended context length. With enough context length, I can feed entire repositories into context and the model has to make fewer assumptions about how to use it.

Reply

[-]

Zyj@reddit

OK, so the Max is an M4 Max but the Ultra is an M3 Ultra. 819GB/s for the RAM for the M3 Ultra.

German prices:

11874€ for the 512GB model

6999€ for the 256GB model (with the smaller CPU model)

Reply

[-]

Abject_Radio4179@reddit

Up to 819 GB/s for the M3 Ultra. The binned part will have just slightly more bandwidth than the M4 Max.

Reply

[-]

Synyster328@reddit

Dumb question but would this work similarly as a 4090 for, say, training diffusion model LoRAs? Or do those require CUDA specifically?

Reply

[-]

noiserr@reddit

> It's interesting to compare this to a RTX 4090 with 96GB VRAM for $6000 (with around 1TB/s mem bandwidth).

96GB 5090 (L50? or A6000?) with like 1.7TB/s.

Reply

[-]

AbominableMayo@reddit

So basically get a much better amount of RAM, similar but materially slower speeds and a full MacOS front end for the same price? Is my interpretation there off base at all?

Reply

[-]

Zyj@reddit

It just shows how overpriced these RTX 4090 96GB are. The Mac memory is also overpriced i'm sure but it's hard to get 819GB/s of memory bandwidth for unified memory anywhere...

Reply

[-]

AbominableMayo@reddit

Right, memory bandwidth is the only knock against the ultra vs the 4090. I’m sure the power draw difference isn’t going to be insignificant either

Reply

[-]

poli-cya@reddit

Wouldn't processing speed differences also be a big difference between the two? I thought the 4090 was substantially faster.

Reply

[-]

AnotherSoftEng@reddit

Based on how the previous silicon Macs have been scaling, the power draw of an M3 Ultra should be *much* less by a significant factor.

Reply

[-]

-6h0st-@reddit

Mac 48 GB doesn’t equal 4090 48GB in speeds. Prompt processing and context window matter a lot for any serious use, and the Mac is simply orders of magnitude worse than Nvidia.

Reply

[-]

Final-Rush759@reddit

4090 is much faster in training models.

Reply

[-]

dinerburgeryum@reddit

I don’t believe many people are proposing training, though MLX has support for it. I believe most use cases here are focused on inference.

Reply

[-]

Such_Advantage_6949@reddit

Mac is still overpriced like usual. However, when you put them next to Nvidia, suddenly it doesn't look that overpriced, when the price of this 512GB Mac Studio is the same as 1 A6000 48GB Ada.

Reply

[-]

dinerburgeryum@reddit

That really puts it in perspective…

Reply

[-]

-6h0st-@reddit

Linked to GPU cores not exactly RAM. Ram is running at much lower speeds - ram for GPU usage only reaches those speeds

Reply

[-]

Glebun@reddit

What do you mean? It's unified memory.

Reply

[-]

-6h0st-@reddit

If you search Google you will find it - search for the Alex Ziskind YouTube channel - memory bandwidth for the system runs at lower speeds and only RAM for GPU usage can access those speeds, therefore you can see there is a correlation between bandwidth speed and number of GPU cores/RAM size (both go up) - hence the 60 GPU core version will have around 25% lower bandwidth. Haven’t seen if anyone tested whether this correlation exists for GPU core count or memory size - in other words, whether the 32 GPU core Max with 48 and 64GB will have the same VRAM bandwidth or different. I believe it’s bound to GPUs, since only the GPU accesses that bandwidth - and otherwise only the 512GB version of M3 Ultra would be getting 819GB/s bandwidth, which I don’t think would be the case (since the 192GB M2 Ultra was).

Reply

[-]

Enough-Meringue4745@reddit

You can also network macs together for networked inferencing

Reply

[-]

siegevjorn@reddit

Wait how is M4 different in MBW depending on GPU count? Memory bandwidth depends on RAM speed and I suppose RAM speed would be the same in both cases.

Reply

[-]

-6h0st-@reddit

And I would add:

819GB/s for 80 core GPU

Probably 614GB/s for 60 core GPU

Reply

[-]

ReginaldBundy@reddit

German price includes 19% VAT. Most buyers will be businesses who won't have to pay VAT. However, it's just 1TB SSD. 2TB: +EUR 500, 4TB: + EUR 1200

Reply

[-]

noxtare@reddit

very strange that they are using M3 Ultra and no M4 Ultra.

Reply

[-]

PongRaider@reddit

They're probably reserving it for the future Mac Pro series.

Reply

[-]

SeymourBits@reddit

Pretty sure they have to wait for yield to catch up as fabbing 2 perfect and adjacent M4 Max chips is relatively rare.

Reply

[-]

xrvz@reddit

The M3 Ultra isn't made up of two M3 Max anymore.

Reply

[-]

SeymourBits@reddit

Looks like it is exactly 2 M3 Max chips, connected: “Apple says the M3 Ultra chip is essentially two M3 Max chips fused together with its "UltraFusion" technology, so the chip's specs are all doubled compared to the M3 Max. There was speculation last year about the M3 Max chip lacking UltraFusion technology, but Apple's announcement today has proven that rumor was false.” https://www.macrumors.com/2025/03/05/apple-introduces-m3-ultra-chip/

Reply

[-]

fallingdowndizzyvr@reddit

The Ultra lags the Max by about a year.

Reply

[-]

indicava@reddit

https://www.reddit.com/r/LocalLLaMA/s/LocAC0c6iV

Reply

[-]

Dax_Thrushbane@reddit

My thoughts also.

Reply

[-]

mxforest@reddit

It takes time to glue 2 Max chips together. They didn't use a hairdryer so the process took over a year.

Reply

[-]

BaysQuorv@reddit

I wish they released a chip which had like 100x the neural engine size. Like an ultra chip but all that extra space and compute goes only to a gigantic neural engine. On my m4 running the same language model purely on the neural engine takes 1.7W, on the GPU it takes 8W. And that 8W is already much more efficient than running on a "normal" GPU. Now imagine scaling up that neural engine 100x to work at the same power draw as an nvidia gpu. It would be like having your own groq chips at home.

Reply

[-]

TheElectroPrince@reddit

They're definitely doing this for their own cloud server chips that run Apple Intelligence. No way they're giving these out to consumers.

Reply

[-]

SteveRD1@reddit

I think something like that is coming from Apple, but nowhere near ready. Hence they drop this dud to update the Studio.

Reply

[-]

Aaaaaaaaaeeeee@reddit

From this announcement, didn't see any increases to the neural engine cores, so we can assume that they just did nothing. Hopefully I'm wrong. Made the chart based on previous info.

| Specs | Peak M2 Ultra | Peak M3 Ultra | Increase (%) |
|---------------|---------------|---------------|--------------|
| **CPU Cores** | 24 | 32 | +33.3% |
| **GPU Cores** | 60 | 80 | +33.3% |
| **NPU Cores** | 32 | 32 | 0% |
| **NPU TOPS** | 31.6 | **31.6(?)** | 0% |

Reply

[-]

BaysQuorv@reddit

Same NPU = I sleep. ANE is the future, not more inefficient gpu/cpu cores

Reply

[-]

AngleFun1664@reddit

How are you running models directly on the neural engine? I’d like to try that on my M1

Reply

[-]

Master-Meal-77@reddit

Probably ANEMLL

Reply

[-]

dinerburgeryum@reddit

[ANEMLL](https://github.com/Anemll/Anemll) is the only solution I know of, and you take a massive hit on context size and it’s Llama only right now.

Reply

[-]

dissemblers@reddit

You need M3 Ultra to get > 128GB unified memory, and M3 Ultra w/80 core GPU to get 512GB

$14099 for top spec (m3 ultra, 32 core cpu, 80 core gpu, 512GB unified memory, 16 TB SSD)

$9500 if you go with 1 TB SSD instead (cheapest config with 512GB memory)

$3500 for M4 Max w/40 core GPU, 512GB SSD, 128 GB unified memory (cheapest 128GB)

Reply

[-]

joninco@reddit

It has thunderbolt 5 -- so no need to buy the much larger storage. Just get an external enclosure.

Reply

[-]

TheElectroPrince@reddit

If the insides don't change as much, I presume someone will reverse-engineer the NAND flash carrier PCB, and we'll get replaceable storage again like what happened with the last Mac Studios.

Reply

[-]

Magnus919@reddit

This is the way. I’ve got an external NVMe RAID.

Reply

[-]

Cool-Cicada9228@reddit

I’ve been paying hundreds of dollars per week for Claude credits using Cline/RooCode. I’m considering getting an M3 Ultra maxed out except for SSD (so around the $9500 price point). Can someone explain to me what I can expect to see? I’ve read that I could run R1 Q4, but I don’t know what kind of experience it is. Would I be disappointed compared with Claude? Open to any other model suggestions and expectations. I’ve also heard that you can connect 3 together. If anyone has more information about doing that, I’d consider investing in that if it means I could run R1 or something similar fully. What I don’t want to have happen is make a big purchase and still need to use Claude for most of my coding. I’m not very experienced with hardware, so if anyone can explain how big of a jump it will be to M4 Ultra I’d appreciate it, because I don’t know if I should wait for a Mac Pro. If it’s only marginally better or a faster architecture then I’d rather buy a Mac Studio now.

Reply

[-]

RikuDesu@reddit

try the 671b q4 model on OpenRouter

Reply

[-]

Cool-Cicada9228@reddit

I looked but didn’t find it there. I’ll try running it in the cloud though before I make the hardware purchase. That’s a good idea I can experience the capabilities of the model but I won’t know how the speed compares with local hardware.

Reply

[-]

canyonkeeper@reddit

This is gonna be once again very slow for any PyTorch code, and MLX is not mature at all

Reply

[-]

lordmord319@reddit

Doesn't look that appealing to be honest. For that price you could build a nice dual socket EPYC server.

Reply

[-]

Glebun@reddit

It won't be this small and efficient, and will have much slower memory

Reply

[-]

lordmord319@reddit

Sure, it won't be as small or efficient, but with dual sockets we would have a theoretical bandwidth of **921.6 GB/s**, which is more than the M3 Ultra. And obviously you get the flexibility of adding more RAM. Obviously one isn't clearly better than the other, but for me I would prefer the EPYC over the Apple
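For reference, a rough sketch of where a number like 921.6 GB/s can come from. The comment doesn't spell out the configuration, so the inputs here are assumptions: 12 memory channels per socket, DDR5-4800, and a 64-bit (8-byte) data path per channel. Those are the values that happen to reproduce the quoted figure.

```python
def ddr_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels * transfers per second * bytes per transfer."""
    return channels * mt_per_s * 1e6 * bus_bytes / 1e9


# Assumed configuration (not stated in the comment): 12 channels of DDR5-4800 per socket.
per_socket = ddr_bandwidth_gb_s(channels=12, mt_per_s=4800)
dual_socket = 2 * per_socket

print(f"Per socket:  {per_socket:.1f} GB/s")
print(f"Dual socket: {dual_socket:.1f} GB/s")
```

This is a paper number only. On a dual socket box each CPU reaches the other socket's memory over the inter-socket link, so sustained bandwidth for a single inference process is likely to land well below the sum.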

Reply

[-]

Glebun@reddit

Oh interesting, I didn't think it'd be that fast. How much could that cost, though?

Reply

[-]

Kind-Log4159@reddit

Yeah, for around 6k you can get 6-8t/s with a dual socket build. I’m conflicted whether to pull the trigger or not, but I’ll hold off because they will announce the M4 Ultra soon. It has less bandwidth than a 4090, which isn’t promising.

Reply

[-]

SporksInjected@reddit

Are those new prices or used? Wouldn’t you pay $6k for just the ram if it was new?

Reply

[-]

Kind-Log4159@reddit

New. Ram would be 4k or so for ddr5

Reply

[-]

indicava@reddit

M4 Ultra ain’t coming

Reply

[-]

chaddone@reddit

I am considering buying the maxed out new Mac Studio with M3 Ultra and 512GB of unified memory as a CAPEX investment for a startup that will be offering a local LLM interfaced with a custom database of information for a specific application. The hardware requirements appear feasible to me with a ~15k investment, and open source models seem built to be tailored for detailed use cases. Of course this would be just to build an MVP, I don't expect this hardware to be able to sustain intensive usage by multiple users. How feasible is that?

Reply

[-]

RikuDesu@reddit

yeah you'd have to process one prompt at a time, there are ways to queue them but if you have a lot of people hitting the server it would be hard. Also not all models are optimized for MLX, most are cuda optimized so you may be limited when trying out some specific fine tunes

Reply

[-]

blacPanther55@reddit

Use the "education" discount and save yourself 3k.

Reply

[-]

tnnnn@reddit

Waiting for M4 Ultra with 1TB ram! /s

Reply

[-]

Soft_Constant_7355@reddit

For an amazing price of $29,999! \*sarcasm, but maybe not sarcasm\* :(

Reply

[-]

Sudden-Lingonberry-8@reddit

finally a computer that can run deepseek

Reply

[-]

gripntear@reddit

The question is how fast can it process my prompts when allocating 32k context for LLMs up to 123B, like Mistral Large. Given all that, if it can output 250 tokens at decent speeds, regardless of context size, I would fucking get one right away because, holy shit, this is what I have been waiting for.

Reply

[-]

Ok_Warning2146@reddit

Prompt processing should be around 80% of a 3090. If you are happy running Mistral Large with multiple 3090s, then it should be good enough for you.

Reply

[-]

ortegaalfredo@reddit

These specs are good. I would like to know how they compare to the equivalent GPU. The advantage of GPUs is that you can batch requests. While a single individual prompt can run at 15 tokens per second in a GPU, you can run 20 prompts in parallel to achieve an effective throughput of hundreds of tokens per second. Can this be done on a Mac?

Reply

[-]

Ok_Warning2146@reddit

It is a slightly weakened 3090 with 512GB at max config as it gets 114.688TFLOPS FP16 vs 142.32TFLOPS FP16 for 3090 and memory bandwidth of 819.2GB/s vs 936GB/s.
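To make the "slightly weakened 3090" comparison concrete, here is a small sketch that turns the four figures quoted above into ratios. The numbers are taken from the comment as-is and are not independently verified; the dictionary names are just for illustration.

```python
# Figures as quoted in the comment above (FP16 TFLOPS and memory bandwidth in GB/s).
m3_ultra = {"fp16_tflops": 114.688, "bandwidth_gb_s": 819.2}
rtx_3090 = {"fp16_tflops": 142.32, "bandwidth_gb_s": 936.0}

# Ratio of M3 Ultra to 3090 for each metric.
ratios = {key: m3_ultra[key] / rtx_3090[key] for key in m3_ultra}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.0%} of a 3090")
```

Peak spec ratios like these say nothing about software maturity, so real prompt processing numbers could differ in either direction.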

Reply

[-]

DinoAmino@reddit

I understand these things do quite well with simple prompts and little to no context. Is this device going to perform well when using 16k to 32k context or will performance plummet?

Reply

[-]

ortegaalfredo@reddit

Yes, that's what I want to know before dropping 10k in hardware. GPUs still have massively more compute than a mac, not only memory bandwidth.

Reply

[-]

Glebun@reddit

What's an equivalent GPU with 512GB VRAM?

Reply

[-]

AutomaticDriver5882@reddit

Are there benchmarks on this? I find it hard to believe this performs like a 4090.

Reply

[-]

AbheekG@reddit

Yaaaaaaayyyyy!!!!

Reply

[-]

Spirited_Eggplant_98@reddit

This seems like a decent deal. I just saw a YouTube video where NetworkChuck tested 5 M2 Max Studios with 64GB of RAM running deepseek r2 with exo - that’s at least $10k in hardware and only gets you 320GB of RAM, and Thunderbolt 4 is nowhere near 800GB/s.

Reply

[-]

Mediocre-Ad9008@reddit

Wow, wasn't expecting the M3 Ultra at all at this point. Everyone said the M3 line was dead.

Reply

[-]

thrownawaymane@reddit

The TSMC process node it’s based on was supposedly a dud so this is a surprise.

Reply

[-]

extopico@reddit

I will not buy it but this is cheap. I’ve only recently started using a MacBook Pro and it’s a beast. For anyone used to Linux getting a super powerful Mac is hugely appealing.

Reply

[-]

Chelono@reddit

> Up to 16.9x faster token generation using an LLM with hundreds of billions of parameters in LM Studio when compared to Mac Studio with M1 Ultra, thanks to its massive amounts of unified memory.

Yeah, cause it fits and doesn't use disk (swap)... Can't wait for actual numbers

Reply

[-]

Mochila-Mochila@reddit

Strix Halo-tier marketing bragging.

Reply

[-]

MoffKalast@reddit

Given how Apple prices SSDs, it's gonna be really funny when people have less disk than RAM.

Reply

[-]

v00d00_@reddit

Time for RAMdisk to make a comeback

Reply

[-]

sluuuurp@reddit

Yeah, that’s what they said, “thanks to its massive amounts of unified memory”

Reply

[-]

Remote_Cap_@reddit

They're desperate.

Reply

[-]

Doublespeo@reddit

At 512GB… is it possible to just put the whole system in memory??? like a F1 SSD?

Reply

[-]

NNN_Throwaway2@reddit

Pretty tempting as it sounds like the chances of an M4 Ultra chip are remote to none.

Reply

[-]

Mochilongo@reddit

M3 Ultra instead of M4 Ultra 😭

Reply

[-]

Xyzzymoon@reddit

> biggest bottleneck for Macs is the memory bandwidth

Not in the context of LLM. A 4090, for example, only has 1008 GB/s. Slightly more than an M2 Ultra, but as long as the model fits, 4090 is around 4 times faster. Even underclocking the memory speed on the 4090 doesn't yield a significant drawback. This suggests that the bottleneck on the M2 Ultra is most likely processing.

Reply

[-]

Mochilongo@reddit

2 different architectures, maybe I didn’t express myself correctly. In the Mac ecosystem the bottleneck right now is the memory bandwidth.

Reply

[-]

SteveRD1@reddit

Wondering now if M4 Ultra (M5 Ultra?) will be reserved for the Pro to give it some distinction from the Studio line. I am disappointed too.

Reply

[-]

maddogawl@reddit

I'm out of kidneys to sell for computers and computer accessories!

Reply

[-]

davewolfs@reddit

The Apple Chips are great machines but I am not convinced the hardware will be capable of running a model that large at adequate speed. I hope to be proven wrong but the M3 and M4 max chips don’t do that well with anything beyond 32B. The so called thinking models output way too many tokens for them not to be running at least 40-60 tokens per second if you want an adequate experience.

Reply

[-]

fallingdowndizzyvr@reddit

Shit. I didn't think they would go 512GB. But it's great that they are holding price line with the 256GB model. That's the same price as the M2 Ultra with 192GB.

Reply

[-]

Turbulent-Week1136@reddit

Sorry for the noob question but how does this compare for training or fine tuning? Do these specs still only make it better for inference or does it make training easier/faster?

Reply

[-]

BumbleSlob@reddit

So you can run Unsloth DeepSeek R1 on the m3 ultra / 256GB ram at home for $7k (it needs 160Gb (V)RAM), while still having room for smaller models to use in speculative decoding. Very interested to see what real world tokens per second you could get out of this. To be clear this is still super expensive but it’s getting DeepSeek R1 closer to hobbyist households. I’d probably be willing to throw $5k at a solution that can run it at home at a reasonable throughput.

Reply

[-]

SubstantialSock8002@reddit

On my M1 Ultra Mac Studio I get 13.8 t/s with Llama 3.3 70B Q4 mlx. M1 Max to M4 Max inference speed seems to roughly double, so let's assume the same for M1 Ultra to M3 Ultra. Accounting for 2x faster performance, \~9.5x more parameters, Q2 vs Q4, it seems like you'd get closer to 5.8 t/s for R1 Q2 on M3 Ultra? It's definitely awesome that you can run this at home for <$8k, but I feel like using cloud infrastructure becomes more attractive at this point.
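Spelling that back-of-the-envelope estimate out as code, using only the numbers in the comment. One caveat that is an assumption on my part: this scaling treats R1 as if every parameter were read for every token, so it ignores any benefit from MoE sparsity.

```python
baseline_tps = 13.8    # t/s, Llama 3.3 70B Q4 (MLX) on M1 Ultra, from the comment
speedup = 2.0          # assumed M1 Ultra -> M3 Ultra gain (the comment's own assumption)
param_ratio = 9.5      # ~9.5x more parameters than the 70B baseline, from the comment
bits_ratio = 4 / 2     # going from Q4 to Q2 halves the bytes per weight

# Faster chip and smaller weights help; more parameters hurt.
estimate = baseline_tps * speedup * bits_ratio / param_ratio

print(f"Estimated R1 Q2 speed: {estimate:.1f} t/s")
```

Changing `speedup` to whatever real M3 Ultra benchmarks end up showing is the easiest way to revisit the estimate later.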

Reply

[-]

Individual_Holiday_9@reddit

Wild how spendy these things are. I’m sitting here plodding along with my m4 / 24gb

Reply

[-]

OverCategory6046@reddit

Fully maxxed out one comes to 18k USD in the UK lmao

Reply

[-]

DirectAd1674@reddit

Sure, but you can't carry a server to a friend's house. You can stuff a Mac Studio in a backpack.

Reply

[-]

MagicZhang@reddit

Just curious, what activities at a friend’s house would require 512GB of RAM?

Reply

[-]

Sudden-Lingonberry-8@reddit

run half deepseek

Reply

[-]

darth_chewbacca@reddit

I do this to show my poverty stricken friends just how fucking rich I am.

Reply

[-]

OverCategory6046@reddit

I already talk about AI enough as is, I don't want to put that burden on my poor friends.

Reply

[-]

animealt46@reddit

That's a very solid machine lol. Comparison is the thief of joy I guess. You can rock models that bring Nvidia 12~16GB users to their knees.

Reply

[-]

nonsoil2@reddit

In italy, 11k€ for the 512gb ram, 1tb ssd(minimum), m3 ultra.

Reply

[-]

robertotomas@reddit

In the us you would pay taxes on top of the numbers you see, in europe the VAT is built in

Reply

[-]

Cergorach@reddit

Sales tax (VAT) depends on the state and tends to be a lot lower than in the EU. I'm from the Netherlands and we pay 21% VAT, and that's not even the highest in the EU. The US version is \~$9500, the Dutch one is almost €12k. $1.00 is worth €0.93, so when we do the conversion and add the VAT, our prices are around 10% higher than in the US.

Reply

[-]

robertotomas@reddit

Yes, you guys do a lot more funding via VAT. President Trump mentioned he was interested in doing that too

Reply

[-]

Cergorach@reddit

You have services? Free homeless on your lawn? ;)

Reply

[-]

42nd_loop@reddit

It would literally be cheaper to get on a flight to Oregon or another state without a sales tax just to buy it lmao

Reply

[-]

robertotomas@reddit

HEH. But VATs serve a purpose that Oregon never will

Reply

[-]

Glebun@reddit

Same price as the US (plus tax)

Reply

[-]

nonsoil2@reddit

That’s a first

Reply

[-]

Glebun@reddit

Not really

Reply

[-]

tzujan@reddit

This makes me wonder what NVIDIA will do with Project DIGITS. I know that Nvidia limits their consumer GPUs so that they can charge a fortune for their enterprise GPUs. There's also quite a bit of buzz, at least in my little curated feed, for systems that can run and even train larger local models such as EXO Labs. It seems like NVIDIA could really crush it in the Local Model market if they wanted to.

Reply

[-]

NootropicDiary@reddit

My body is ready

Reply

[-]

SeymourBits@reddit

But, is your wallet?

Reply

[-]

The_Hardcard@reddit

This will crush with reasoning MoE’s like Deepseek R1. The bandwidth for generating hundreds if not thousands of reasoning tokens at 37 billion active parameters I think will put both the M4 Max and M3 Ultra ahead of Strix Halo and Project Digits.

Reply

[-]

-6h0st-@reddit

Let’s wait for tests from first owners. But I’m doubtful it will be any good for any serious usage. Macs do suck with fine tuning or big context window or bigger prompts.

Reply

[-]

Puzzleheaded-Dust268@reddit

128Gb M4 Max MacBook Pro vs same spec Mac Studio 🤔. Any views? I am going for a high spec machine for an MSc project using transformers, etc.

Reply

[-]

gintrux@reddit

Imagine that this is gonna be like 1000$ laptop someday in the future

Reply

[-]

Regrets_397@reddit

The 512GB option is a bigger deal on the ultra than having only 2xM3 Max instead of 2xM4 Max. Looking forward to getting my hands on one of these refurbished in a few years lol

Reply

[-]

Only-Letterhead-3411@reddit

512 gb RAM is amazing for huge MoE models like R1. Not so good for huge dense models like 405B. Price is terrible. I don't think it's worth that price tbh. But it's Apple. I guess no one is surprised. I'm sad that affordable 64/96 gb mac studio is no longer an option like it used to be for M2 Max one.

Reply

[-]

TaloSi_II@reddit

how are you getting 500 gb of vram for less?

Reply

[-]

Only-Letterhead-3411@reddit

[EPYC 9334 CPU + Motherboard](https://www.ebay.com/itm/186024089736?_skw=epyc%20motherboard%20cpu%20combo%20sp5&itmmeta=01JNKK0S8CYSFPETQ4KPPPNNVC&itmprp=enc%3AAQAKAAABAFkggFvd1GGDu0w3yXCmi1cv6BJxhVmKioCpkwhXSOagZn3aap%2F2ZO6q8rZK%2BMtaHiWtbiV3LzoQdWQgLwk8FSJf%2BwuLnXrbbLYKlm9N%2FxOPXHWLNE%2F2M3g%2FkyvGvutipUDcZxoStxIfcjJ4jFd5%2FcAwdSPewTE%2F3BdiJbDo7W97BsZ28pGGpwXuj82XSmOzDea%2FmiXCfsjyE%2BgK5Wfbp4Wkur%2BxXfxCYAo%2BR5O5oyHo7JLdUwgJMd0eGzC1PDwRoWEdjzWEQxFKv7SQyE4o0QemR1XcuDCyuvU%2FzDdW6w0dyT%2BJ1bGdEHQpTvBMx9rkQun0fQ%2FzRdePSN0F0mPzcGo%3D%7Ctkp%3ABFBMrpSD86xl) \- $1.500

12x [32gb DDR5 RAM](https://www.ebay.com/itm/116454950032?_skw=ddr5%20ram%2032gb&itmmeta=01JNKK95NRR7J2NMVSP62X9QHN&itmprp=enc%3AAQAKAAAA0FkggFvd1GGDu0w3yXCmi1erSEmmMgbYq7x2QOAq52alOd2HX9EHxvdjyh27g5emCVyGaMhBYcCrZsITCQLGkkJR1ExvsJEMcsbx6FB%2F%2BIp9eu15Y2RE%2B2SvZjJhRnIFR3YQLFBb%2Bbl8V5eMP1AzyEcKtzP9A9RaJVkoxPRnQL916GT5oVfGVGC9w7YyljtSmCE9wz6BrhD%2Bn5cOtcTnlHuLOoVxNm4KmBkGUPhU5TElKipxahhd5IVRc0BWyu4RW27mgOmi0nZ1r5hKabnafTQ%3D%7Ctkp%3ABk9SR4LbpPOsZQ) \- $1.000

Other stuff like cooler, 1 TB nvme etc. - $250

= 2.750$ - 384gb RAM, 460gb/s bandwidth, 1 TB ssd - **384gb RAM should be enough for running MoE models like R1, can get 64gb sticks instead if need more RAM, can upgrade RAM and storage yourself, can use linux instead of macos**

M4 Max = 3.700$ - 128gb RAM, 409gb/s bandwidth, 1 TB ssd - **costs more for less bandwidth**

M3 Ultra = 9.500$ - 512gb RAM, 800gb/s bandwidth, 1 TB ssd - **costs 3.5x more, theoretically only 50% faster at inference**
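A small sketch for comparing the three options side by side, using only the prices, RAM sizes and bandwidth figures listed above. The `Build` class and its field names are made up for illustration, and treating bandwidth as a stand-in for inference speed is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Build:
    name: str
    price_usd: float
    ram_gb: int
    bandwidth_gb_s: float


# Figures copied from the list above; none of them are measured here.
builds = [
    Build("EPYC 9334 build", 2750, 384, 460),
    Build("M4 Max", 3700, 128, 409),
    Build("M3 Ultra", 9500, 512, 800),
]

base = builds[0]  # compare everything against the EPYC build
for b in builds:
    price_ratio = b.price_usd / base.price_usd
    bw_ratio = b.bandwidth_gb_s / base.bandwidth_gb_s
    usd_per_gb = b.price_usd / b.ram_gb
    print(f"{b.name}: {price_ratio:.2f}x price, {bw_ratio:.2f}x bandwidth, "
          f"${usd_per_gb:.2f} per GB of RAM")
```

With these inputs the M3 Ultra comes out at roughly 3.5x the price and about 1.7x the paper bandwidth of the EPYC build. How much of that bandwidth either machine actually delivers to an LLM is a separate question that only benchmarks can answer.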

Reply

[-]

codingworkflow@reddit

Models this big will still be damn slow, and you are not close to running R1 Q8. I feel the unified memory is quite hyped. Best is an 80-90 GB VRAM GPU setup.

Reply

[-]

roshanpr@reddit

Should I return my 5090 and buy one of these?

Reply

[-]

mxforest@reddit

Depends on the kind of models you want to run.

Reply

[-]

roshanpr@reddit

I do a lot of crap that uses CUDA; for LLMs I have aniniboc with 96GB RAM and an Nvidia GPU

Reply

[-]

Dax_Thrushbane@reddit

Not sure how i feel about this. 512Gb ram was definitely on the cards, but only M3? Le sigh.

Reply

[-]

pseudonerv@reddit

M3 Ultra is two M3 Max soldered together, right? We need M4 Ultra, it should be more than 1TB/s.

Reply

[-]

davewolfs@reddit

10K so I can run R1 on some local system instead of Firebase. Interesting but probably not worth it. I wonder how fast it will run.

Reply

[-]

undefinex@reddit

I'm assuming the limiting factor for running a full model like R1 will be its memory bandwidth at this point? How many toks/s can the maxed out memory config expect?

Reply

[-]

Feisty-Pineapple7879@reddit

Now this is a proper AI Inference Hardware.

Reply

[-]

mxforest@reddit

Tim Cooked with this one. Based on RAM configs and their examples. It seems to be aimed directly as an R1 machine without saying it out loud to avoid Backlash from 🥭 for supporting China.

Reply

[-]

AaronFeng47@reddit

I know I don't really need this, but I want this...

Reply

[-]

Solaranvr@reddit

Mama Lisa Su, whatever you do with the Strix Halo sequel, please release a competing SKU to this.

Reply

[-]

bmo333@reddit

Heysus Christ!!!

Reply

[-]

tibbon@reddit

Consider that if you have a bona fide business _need_ for that much memory, then this is probably well within a reasonable budget. If this is a _want_ then the price probably seems absurd and that's ok.

Reply

[-]

AaronFeng47@reddit

Disappointing, the memory bandwidth is basically the same as the M1 Ultra's

Reply

[-]

AaronFeng47@reddit

But the 512gb one should be perfect for local DS V3&R1 + Qwen2.5 Max, since these models are MoE

Reply

[-]

sluuuurp@reddit

546 GB/sec memory bandwidth. So just over one token per second if you run the largest model that fits in the unified memory (with no mixture of experts or speculative decoding).
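A minimal sketch of the arithmetic behind that kind of estimate, assuming decoding is purely bandwidth-bound, meaning every generated token has to stream all the weights it uses from memory once. The function name is hypothetical. The MoE line is only a contrast case: it uses the 37B active parameter figure mentioned elsewhere in this thread plus an assumed 4-bit quant (0.5 bytes per weight).

```python
def max_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token streams its weights from memory once."""
    return bandwidth_gb_s / gb_read_per_token


# Dense model filling all 512 GB, at the 546 GB/s figure used in the comment.
dense = max_tokens_per_s(546, 512)

# Contrast case (assumptions): MoE with 37B active parameters at 0.5 bytes per weight.
active_gb = 37e9 * 0.5 / 1e9
moe = max_tokens_per_s(546, active_gb)

print(f"Dense, 512 GB of weights: {dense:.2f} t/s upper bound")
print(f"MoE, {active_gb:.1f} GB active:  {moe:.1f} t/s upper bound")
```

These are ceilings, not predictions. Compute limits, KV cache reads and software overhead all pull real numbers down, and plugging in a different bandwidth figure (the thread also quotes 819 GB/s for the M3 Ultra) changes the result proportionally.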

Reply

[-]

Krazie00@reddit

I’ll wait for reviews, looking forward to them.

Reply

[-]

Least_Expert840@reddit

Must. Resist. Clicking.

Reply

[-]

AaronFeng47@reddit

No m4 ultra?

Reply

[-]

NeedsMoreMinerals@reddit

Damn with 512gb of unified memory you could run some serious AI models

Reply

[-]

MannowLawn@reddit

512 isn’t cutting it anymore, next year 1024 I’ll have another look.

Reply

[-]

albus_the_white@reddit

Can't wait for reviews.

Reply

[-]

nrkishere@reddit

> offers an up to 80-core GPU, more than any Apple silicon chip; a powerful 32-core Neural Engine for on-device AI and machine learning (ML)

Holy hell!! MLX go brrrrrrr

Reply

[-]

bullerwins@reddit

Really looking forward to the benchmarks. Let's hope someone reviews the 512GB variant with R1, you can probably fit Q6 in there. It's definitely more power efficient than the cpumax or gpumax way. But not sure about the performance. Realistically you can probably fit 8? 3090s in a rack, but that's less than half the VRAM, and it will cost around 9K for a setup like that.

Reply