TheaterFire

Apple releases new Mac Studio with M4 Max and M3 Ultra, and up to 512GB unified memory

Posted by iCruiser7@reddit | LocalLLaMA | View on Reddit | 478 comments

anonynousasdfg@reddit

Mac Mini M4 Pro (12/16) 48GB vs Mac Studio M4 Max 36GB (14/32): which one would you choose, assuming you will use <=32B 4-bit quantized (MLX) LLMs with a max 16K context size? According to my experiments with, say, QwQ, one will output approximately 11-13 t/s, while the other will do either 17-19 or 22-24 t/s. Because of the memory bandwidth question, I'm not sure whether this entry-level M4 Max Mac Studio has a memory bandwidth of 410 or 546 GB/s.
View on Reddit #50579977
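
As a rough sanity check on those figures, single-stream decode speed is bounded above by memory bandwidth divided by the bytes read per token. A minimal sketch, assuming a dense 32B model at about 4.5 bits per weight (4-bit weights plus quantization scales) and a 273 GB/s M4 Pro; both of those numbers are assumptions not stated in the thread:

```python
def decode_ceiling_tps(bandwidth_gb_s: float, params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every weight is read once per generated token."""
    model_gb = params_b * bits_per_weight / 8  # billions of params -> GB of weights
    return bandwidth_gb_s / model_gb

# Bandwidth figures: 410 and 546 GB/s come from the comment above;
# 273 GB/s for the M4 Pro is an assumption added for comparison.
for name, bw in [("M4 Pro (assumed)", 273), ("M4 Max 410 GB/s", 410), ("M4 Max 546 GB/s", 546)]:
    print(f"{name}: ~{decode_ceiling_tps(bw, 32, 4.5):.1f} t/s ceiling")
```

Real throughput lands below this ceiling because of compute limits, KV cache reads, and runtime overhead, so the numbers are only useful for comparing configurations.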

MusingsOfASoul@reddit

isn't that the difference between binned and not binned?
View on Reddit #53975292

iCruiser7@reddit (OP)

https://preview.redd.it/agq4au1sqvme1.jpeg?width=1290&format=pjpg&auto=webp&s=3b02abc558a7fe519500d1303b37fac24f7992ff "Testing conducted by Apple in January and February 2025 using preproduction Mac Studio systems with Apple M3 Ultra, 32-core CPU, 80-core GPU, and 512GB of RAM, production Mac Studio systems with Apple M2 Ultra, 24-core CPU, 76-core GPU, and 192GB of RAM, and production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, and 128GB of RAM, each configured with 8TB SSD. LM Studio v0.3.9 tested by measuring token rate using a 174.63GB model. Mac Studio systems tested with an attached 5K display. Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac Studio."
View on Reddit #50292019

Chelono@reddit

Don't forget that without setting `iogpu.wired_limit_mb` the M2 Ultra only has about 144GB available to the GPU by default, meaning it doesn't fully run a 174GB model on the GPU but rather uses the CPU for the rest, even if it doesn't have to use swap like the M1 Ultra with 128GB. These results are skewed; wait for reviews...
View on Reddit #50293109
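
The point about the default limit can be checked with simple arithmetic. A rough sketch, assuming the default GPU wired limit is about 75% of unified memory (this matches the "about 144GB" figure for a 192GB machine; the exact default may differ by macOS version and RAM size):

```python
MODEL_GB = 174.63  # model size quoted in Apple's footnote

def default_gpu_limit_gb(ram_gb: float, fraction: float = 0.75) -> float:
    """Assumed default share of unified memory the GPU is allowed to wire."""
    return ram_gb * fraction

for chip, ram in [("M1 Ultra", 128), ("M2 Ultra", 192), ("M3 Ultra", 512)]:
    limit = default_gpu_limit_gb(ram)
    fits_gpu = MODEL_GB <= limit   # whole model can sit in GPU-wired memory
    fits_ram = MODEL_GB <= ram     # model fits in RAM at all (no swap)
    print(f"{chip}: limit ~{limit:.0f} GB, fits on GPU: {fits_gpu}, fits in RAM: {fits_ram}")
```

Under these assumptions only the 512GB machine holds the whole model in GPU-addressable memory without changing the limit.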

pkmxtw@reddit

I thought Apple would be better at not doing this kind of misleading benchmarks, and yet here we are.
View on Reddit #50293585

b0tbuilder@reddit

From what past experience with Apple do you conclude that they would not produce a misleading benchmark?
View on Reddit #51858454

Chelono@reddit

I can't fault them. Everyone is doing it. At least Apple compares against itself. I disliked AMD marketing comparing Strix Halo to Nvidia GPUs even more. Also it works. Screenshots like this are always shared massively on social media and news pages. Besides some nerds, no one is gonna bother to fact-check things, and if enough people see it some will believe it. Probably also has to do with investors; the same thing applies there.
View on Reddit #50294049

Cergorach@reddit

Yep, and some people preorder a $10k computer because of it... I'll wait for the reviews and the independent benchmarks with details about how they tested.
View on Reddit #50311175

smith7018@reddit

The worst was NVIDIA's 5090 marketing that compared an FP8 on 4090 against an FP4 on 5090. That was unbelievably disingenuous.
View on Reddit #50299996

fullouterjoin@reddit

Wait till they get FP0 support!
View on Reddit #50306347

Yes_but_I_think@reddit

You totally nailed it. If they test with an 80GB model it will be no different from the M2 Ultra. Why are these idiots comparing memory overflow with within-memory cases? As if we want to test the usability of higher RAM.
View on Reddit #50295710

TastesLikeOwlbear@reddit

> Why are these idiots comparing memory overflow with within memory cases?

Marketing.
View on Reddit #50494605

fallingdowndizzyvr@reddit

> If they test with a 80GB model it will be a no different from M2 Ultra.

I wouldn't say that, since the M2 Ultra is faster than the M1 Ultra even though they have the same memory bandwidth. Until now, there's been more memory bandwidth than the M1 can use. Time will tell if it's the same with the M2. So the M3 can be faster.
View on Reddit #50320942

dinerburgeryum@reddit

Can the wired limit be adjusted at boot? Seems like an easy problem to conquer if true.
View on Reddit #50305179

fallingdowndizzyvr@reddit

Yes.
View on Reddit #50320772

dinerburgeryum@reddit

Oh nice. So a non-problem if your general use case is inference. That’s a good note thank you.
View on Reddit #50321430

fallingdowndizzyvr@reddit

There's no reason to not set it there even if your general use case isn't inference. It's not like on an AMD system where it's a hard limit. It's not reserved on a Mac. That just sets the limit that the GPU can use if it needs to. If it doesn't, that memory is available for the CPU to use for anything. On a Mac, it's dynamic. It's not static like it is on an AMD system.
View on Reddit #50323738
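
For reference, a small helper that builds the command to raise the limit. This is only a sketch: `iogpu.wired_limit_mb` is the setting named earlier in the thread, while the 8 GB of headroom left for the OS is an arbitrary assumption. Running the command by hand applies until the next reboot; making it apply at boot is a separate step not shown here.

```python
def wired_limit_command(total_ram_gb: int, headroom_gb: int = 8) -> str:
    """Build a sysctl command letting the GPU wire all but `headroom_gb` of RAM."""
    if headroom_gb >= total_ram_gb:
        raise ValueError("headroom must be smaller than total RAM")
    limit_mb = (total_ram_gb - headroom_gb) * 1024  # the setting is expressed in MB
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"

print(wired_limit_command(192))
```

Because the limit is a ceiling rather than a reservation, as described above, a generous value should not take memory away from the CPU when the GPU is idle.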

siegevjorn@reddit

You're right. This benchmark result is just garbage.
View on Reddit #50309356

2str8_njag@reddit

32 core CPU? All I can see in store is 28 core, maybe it's regional thing?
View on Reddit #50313158

SubstantialSock8002@reddit

Since we're given such a specific model size (174.63GB), can anyone figure out which one?
View on Reddit #50307463

Careless_Garlic1438@reddit

well the one M2 Ultra did 14 tokens with 1.58bit dynamic quant or 2 Ultra’s with EXO did 4 bit also at around 14 tokens … so if this holds true of 2x between M2 and M3 then brrrrr 30 tokens/s are in reach 🤯
View on Reddit #50293430

GreatBigJerk@reddit

Now accepting pre-orders using first born children as payment.
View on Reddit #50291685

Remote_Cap_@reddit

Why first born specifically?
View on Reddit #50292244

darth_chewbacca@reddit

They have less time until they can be shoved down in the mines. A child can reasonably be utilized in mining operations once they hit the age of 8, so if you take a first-born at 6 years old vs a second-born at 4, it's an extra 2 years before Tim Apple can see an increase to his coal mining investment. First borns also tend to be more compliant than subsequent children. The middle children are especially difficult to manage, often wanting higher portions of food, and slacking on the job to "play with friends." Apple has found that second borns cost an average of 18% more on disciplinary actions. Overall, first borns just make more financial sense.
View on Reddit #50293810

b0tbuilder@reddit

Small children fit in the mines better
View on Reddit #51857982

FreezeS@reddit

Completely false, this is not the real reason.  The first born is first in line for succession so he will inherit it and they could sell it again after 20.. 50 years. 
View on Reddit #50301772

darth_chewbacca@reddit

Disagree strongly. Having the line of succession is a "nice to have," but the idea that it's the primary motivator is a complete fake news conspiracy theory. You see, the mortality rate is 86% by the time the mine worker reaches the age of 12, and 94% by the time the mine worker reaches 18; so inheritance usually isn't collected. Add to this that the family selling the firstborn is doing this because they are poor (and ugly, but that's beside the point), and the 6% inheritance collection isn't the primary motivator of Tim Apple. It is an important aspect, just not the primary motivator. Tim is honest when he says "I want to send your rat children down into the mines! You filthy ugly beasts. Buy my Apples bitches!"
View on Reddit #50311332

Everlier@reddit

Imagine all the LLMs that will see replies to your message in their training data
View on Reddit #50369131

Cergorach@reddit

Parents will do better with the second... ;)
View on Reddit #50301142

GreatBigJerk@reddit

Subsequent children will imitate their older siblings, and thus will no longer "think different".
View on Reddit #50300957

-oshino_shinobu-@reddit

Everything after the original are cheap replicas.
View on Reddit #50293680

bfume@reddit

They’re worth more than the subsequent “accidents”
View on Reddit #50293186

catgirl_liker@reddit

They taste better
View on Reddit #50293056

SecuredStealth@reddit

Jerk
View on Reddit #50303582

geekgodOG@reddit

Pricing: 256GB = 5.6K 512GB = 9.5K
View on Reddit #50291649

Tadpole5050@reddit

128 GB = 3.5k M4 Max, 546 GB/s memory bandwidth. Probably a Nvidia Digits competitor at that price point and bandwidth.
View on Reddit #50292762
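
To compare the configurations quoted so far on a per-gigabyte basis, a quick sketch using only the prices from the two comments above (rounded figures, so treat the output as approximate):

```python
# Memory size in GB -> quoted price in USD (128 GB is the M4 Max, the others M3 Ultra)
configs = {128: 3500, 256: 5600, 512: 9500}

def usd_per_gb(price_usd: float, memory_gb: float) -> float:
    """Whole-machine price divided by unified memory capacity."""
    return price_usd / memory_gb

for gb, price in configs.items():
    print(f"{gb} GB: ${usd_per_gb(price, gb):.2f} per GB")
```

This divides the whole machine price by memory size, so it also folds in the chip and storage differences between configurations.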

pkmxtw@reddit

Damn, now if digits come with less than 500 GB/s it would be pretty much DOA.
View on Reddit #50293098

mxforest@reddit

Rumored to be 256 GBps. RIP if true.
View on Reddit #50293404

b0tbuilder@reddit

This is already confirmed
View on Reddit #51857774

Paganator@reddit

Rumors from where? My guess is that Nvidia hasn't communicated the bandwidth because they wanted to see what they could get away with. Now that AMD and Apple are releasing directly competing products, Nvidia will feel more pressure to offer more bandwidth.
View on Reddit #50319008

emprahsFury@reddit

The rumor is just that what's advertised matches closely (but not perfectly) with a Blackwell config that has 256GB/s.
View on Reddit #50332627

ReginaldBundy@reddit

Yeah, it would need either higher bandwidth or lower price. At the current setting (as far as we know) it's dead.
View on Reddit #50307279

perelmanych@reddit

It is DOA for those who just want to use models, but not for anyone who is doing training, as it has full CUDA support in such an incredibly small form factor.
View on Reddit #50297542

FullOf_Bad_Ideas@reddit

Which framework supports training with ARM CPU like what GH200 has? Compute wise, it's gonna be at single 3090 level. That's not as powerful as you might think.
View on Reddit #50303763

bigmanbananas@reddit

My 3090s have gone up in value 50% in the last few months. I'd better sell them before the Digits arrive.
View on Reddit #50304240

FullOf_Bad_Ideas@reddit

To buy them back later when Digits is unavailable?
View on Reddit #50308414

fallingdowndizzyvr@reddit

And lower compute.
View on Reddit #50320291

FullOf_Bad_Ideas@reddit

lower compute than single 3090? I think it should be around equal. It's almost 1000 FP4 sparse TOPS, once you convert it to real FP16 non-sparse you get 125 FP16 TFLOPS. 3090 has 142 FP16 TFLOPS as per [GA102 Whitepaper](https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf) that's 3090-level compute I mentioned earlier. It's a bit lower, so your statement is true, and who knows if it will throttle, but it's very similar.
View on Reddit #50321890
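
The conversion in the comment above can be written out step by step. A sketch under the same assumptions the commenter appears to use: the headline figure includes a 2x sparsity factor, and throughput halves with each doubling of precision (neither is confirmed for this specific chip):

```python
def fp4_sparse_to_fp16_dense(fp4_sparse_tops: float) -> float:
    """Convert a sparse FP4 marketing figure to an estimated dense FP16 figure."""
    dense_fp4 = fp4_sparse_tops / 2   # remove the assumed 2x structured-sparsity factor
    dense_fp8 = dense_fp4 / 2         # FP4 -> FP8 assumed to halve throughput
    dense_fp16 = dense_fp8 / 2        # FP8 -> FP16 assumed to halve it again
    return dense_fp16

digits_fp16 = fp4_sparse_to_fp16_dense(1000)
rtx3090_fp16 = 142  # TFLOPS, figure cited from the GA102 whitepaper in the comment
print(f"DIGITS estimate: {digits_fp16:.0f} TFLOPS, {digits_fp16 / rtx3090_fp16:.0%} of a 3090")
```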

fallingdowndizzyvr@reddit

That's "up to" 1000 FP4. Not exactly reassuring. Like an "up to" 90% off sale. Nvidia has been very light on real specs.
View on Reddit #50323539

FullOf_Bad_Ideas@reddit

I think they have it listed as just 1 PFLOP in specs.

> AI Performance 1 PFLOP FP4

https://www.nvidia.com/en-us/project-digits/

Probably not more than a few percent off. On their marketing slide it's also without the "up to" thingy.
View on Reddit #50336158

fallingdowndizzyvr@reddit

It absolutely says "up to". From your link. "Experience **up to** 1 petaflop of AI performance at FP4 precision with the Grace Blackwell architecture."
View on Reddit #50352336

FullOf_Bad_Ideas@reddit

In one place yes, but in two others it just says 1 PF AI TOPS. Marketing can't decide on exact words it seems like.
View on Reddit #50360932

florinandrei@reddit

Many things are "possible". Only one thing is real - and we don't have that yet.
View on Reddit #50310455

bigmanbananas@reddit

But. So much more ram than the 48GB vram
View on Reddit #50308547

FullOf_Bad_Ideas@reddit

For sure. In some usecases, you want more VRAM. Sometimes you want more compute. I've been in both. I hope DIGITS will be good, but I think I'll be sticking with normal GPUs. Or if I make a switch, it will be to PCI-E NPUs. Something like what Tenstorrent is doing.
View on Reddit #50310044

perelmanych@reddit

I am not the most qualified person to answer that, but I believe almost any framework that comes as source files you can compile for any platform with a recent enough C++ implementation.
View on Reddit #50305644

noiserr@reddit

You're not going to be doing training on Digits, unless you're talking about super small models or just fine tunes on small models.
View on Reddit #50341085

perelmanych@reddit

The idea is to use Digits for proof of concept on lets say 1% of dataset before opening a wallet.
View on Reddit #50355359

Cergorach@reddit

Right tool for the job. I think it's great news for everyone if that's true. If DIGITS is worse at inference than the new Mac stuff (even the M4 Pro has more bandwidth), people can buy Macs, which have better availability than Nvidia stuff anyway. For the folks doing training, that means there is less of a run on the DIGITS product and they might actually get it at a normal price...
View on Reddit #50299922

amhotw@reddit

Do we know anything about the training speed on DIGITS? I haven't seen any benchmarks but I remember that the expectation was that it would be slower than 5090.
View on Reddit #50345690

fallingdowndizzyvr@reddit

Training with 256GB/s of memory bandwidth and corresponding low compute. I guess if you have all the time in the world to wait.
View on Reddit #50320236

perelmanych@reddit

Not proper training. Running training on 1% of dataset, to see if the code works correctly and how gradient behaves.
View on Reddit #50321141

Enough-Meringue4745@reddit

Too slow of bandwidth for training
View on Reddit #50305379

perelmanych@reddit

I think it is used more as a proof of concept, before renting monster GPUs on the cloud.
View on Reddit #50306034

indicava@reddit

POC’ing on rented GPUs isn’t that bad either, I regularly rent out 4x4090 machines for about a $1.20 an hour. I do my experiments locally on my MBP M2 or on my gaming rig (my precious) that has a 3070 and then POC in the cloud, usually no more than 12 hours for a test train. (Then for the full training I dip into the 2xH200 $7-$8 an hour machines)
View on Reddit #50308755

Ok_Warning2146@reddit

Even at 500GB/s, it still can't compete. 2xDIGITS is 6k. But M3 Ultra 256GB is 5.6k.
View on Reddit #50344100

WhiteHorseTito@reddit

I’m waiting to see how DIGITS benchmarks against this Studio lineup, and I’ll probably pull the trigger on one of them by end of May.
View on Reddit #50318885

DirectAd1674@reddit

https://preview.redd.it/2xp30jeptvme1.jpeg?width=1284&format=pjpg&auto=webp&s=9a49db03d927828d1870406827a8e367c6aeb2e8 500? It says 800+
View on Reddit #50293407

Tadpole5050@reddit

M3 Ultra is 800+, M4 Max is up to 546 
View on Reddit #50293761

Abject_Radio4179@reddit

Not all M4 will be 800+ though.
View on Reddit #50420954

fallingdowndizzyvr@reddit

> M3 Ultra is 800+

There is no M3 Ultra. The last Ultra is M2.
View on Reddit #50320330

DirectAd1674@reddit

You're right, I overlooked that you mentioned M4 not M3. Either way, I'm excited to see how the live test results turn out. Thanks for clarifying!
View on Reddit #50310693

-6h0st-@reddit

M3 Ultra with 80GPUs, 60GPU model would have lower bandwidth
View on Reddit #50307066

Yes_but_I_think@reddit

So no difference from m2 ultra in bandwidth?
View on Reddit #50295300

animealt46@reddit

19GB/s more. Might have fewer memory controllers IDK.
View on Reddit #50297698

power97992@reddit

819GB/s
View on Reddit #50357600

sage-longhorn@reddit

Don't trust AI
View on Reddit #50296235

fallingdowndizzyvr@reddit

Better than Digits, since you can use a Mac as a general-purpose computer.
View on Reddit #50320154

TheElectroPrince@reddit

DIGITS also runs a custom version of Ubuntu, and it has HDMI and USB ports, so you can definitely use it like a normal computer.
View on Reddit #50375703

fallingdowndizzyvr@reddit

How are you running Microsoft Office on it? How are you running Davinci resolve? How are you running the vast library of software on both Windows and Mac OS that people use for general computing?
View on Reddit #50406026

Caffeine_Monster@reddit

You can get near that bandwidth with any modern ddr5 compatible server - can cost a lot less too.
View on Reddit #50297773

AXYZE8@reddit

Single stick of DDR5-6000 is 48GB/s buddy, you're far off with your calculations, especially with costs.
View on Reddit #50299175

Desm0nt@reddit

12-channel epyc = 48\*12=576GB/s. He is not far off. He was talking about SERVER pc, not consumer one.
View on Reddit #50300740

AXYZE8@reddit

Intel offerings are 8 channel, AMD Genoa drops to 4800MHz if you use 12 channels. 80% of "DDR5 compatible servers" are out of the question to start with. If you want to have 12channels on DDR5-6000MHz you need to use AMD Turin. Single CCD read memory bandwidth on Turin is 106GB/s https://preview.redd.it/bdaahicwawme1.png?width=1055&format=png&auto=webp&s=96c31ad8f461c3c3e85b53eed99d7cca14c5469a You need 5x CCD to go above 500GB/s. Cheapest one that has that is EPYC 9355P, it costs $2998. :) So there you go with "cost a lot less too".
View on Reddit #50301969
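
The bandwidth figures traded in this exchange follow from one formula: transfers per second times bytes per transfer times channel count. A minimal sketch of that arithmetic, assuming a standard 64-bit (8-byte) DDR5 channel and using the 106 GB/s per-CCD figure quoted above:

```python
import math

def ddr5_bandwidth_gb_s(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak: mega-transfers/s * bytes per transfer * channels, in GB/s."""
    return mt_per_s * bus_bytes * channels / 1000

print(ddr5_bandwidth_gb_s(6000, 1))    # one DDR5-6000 channel
print(ddr5_bandwidth_gb_s(6000, 12))   # 12 channels at 6000 MT/s
print(ddr5_bandwidth_gb_s(4800, 12))   # 12 channels at 4800 MT/s

# CCDs needed to pull more than 500 GB/s at 106 GB/s per CCD
ccds_needed = math.ceil(500 / 106)
print(ccds_needed)
```

These are theoretical peaks; measured bandwidth is lower, which is part of the disagreement in the replies.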

noiserr@reddit

You can get dual P boards with 24 channels with Genoa or better.
View on Reddit #50341190

AXYZE8@reddit

Of course you can! Now instead of $2998 for CPU you need to get two 9275F, two of these cost $7k. If you use ktransformers (no-brainer for CPU inference) you also need to load weights twice, therefore instead of 512GB RAM you'll need 1TB. Go ahead! :) 
View on Reddit #50365243

noiserr@reddit

why do you need to load weights twice?
View on Reddit #50373207

AXYZE8@reddit

Because thats how ktransformers work. "copy model into RAM *twice* for big dual socket systems (as cross NUMA nodes is bottleneck)" [https://github.com/ubergarm/r1-ktransformers-guide/blob/main/README.md](https://github.com/ubergarm/r1-ktransformers-guide/blob/main/README.md)
View on Reddit #50377401

noiserr@reddit

> # ONLY IF you have Intel dual socket and >1TB RAM to hold 2x copies of entire model in RAM (one copy per socket)
> # Dual socket AMD EPYC NPS0 probably makes this not needed?
> # $ export USE_NUMA=1

Also this is an MoE model, not a dense model. You shouldn't need to load it twice. Even if you had to manually designate experts, surely you could split the model instead of loading it twice.
View on Reddit #50377669

AXYZE8@reddit

Do you have couple of minutes? [https://github.com/ggml-org/llama.cpp/discussions/11733](https://github.com/ggml-org/llama.cpp/discussions/11733) Go here and teach them. One person gets 102.2% performance where other has better result "only 105% compared to a single CPU benchmark run".
View on Reddit #50378856

noiserr@reddit

Wish I had time :(
View on Reddit #50379818

Caffeine_Monster@reddit

> "cost a lot less too". About 2/3 the price of the $10k Mac m4 - or at least was. I will admit there now seems to be a big shortage of server memory. The Mac will be a bit faster, but you will get ecosystem locked, so it is kind of swings and roundabouts.
View on Reddit #50321007

AXYZE8@reddit

That $10k Mac Studio has **M3 Ultra** chip that does more than **800GB/s**. It's not "a bit faster", Mac is **40% faster**. This is in best case scenario for server, as Mac has beefy GPU that will massively speed up prompt eval, effectively making responses like **2x faster** if you have longer context. All that while eating like 1/4 of the power and sitting in random place on your desk barely making any sound. I don't know why you underestimate the Mac so much, table has flipped and now Apple is the value king for performance across a lot of workloads. I'm not commenting just about high end, this goes all way to bottom end, where on PC you're 2 generations behind on CPU (Ryzen 5 5500), 2 generation behind on GPU (RTX3050 6GB) and that 7nm+8nm+DDR4 combo is supposed to compete with 3nm M4 Mac Mini that costs $529 right now at Amazon.
View on Reddit #50329543

FullOf_Bad_Ideas@reddit

A lot of moving parts need to go right to get this kind of speed. More than dual-channel is rare in the consumer world at consumer prices. Even with quad-channel and higher, silly things like some internal CPU die specs on AMD CPUs matter and drop your bandwidth below what you should get. Also, AMD CPUs seem to get less bandwidth out of the theoretical maximum for some reason. Reasonably priced DIY computers have up to around 200GB/s of bandwidth; beyond that, costs are similar to what you'd be paying for Macs or Digits.
View on Reddit #50303952

rorowhat@reddit

Digits all the way
View on Reddit #50348030

-6h0st-@reddit

I think still nvidia will win with much higher TOPS.
View on Reddit #50306973

Zyj@reddit

It states the memory bandwidth of the M4 Max at 410 GB/s. Where did you get 546GB/s?
View on Reddit #50294257

Playful_Accident8990@reddit

I was tiring of all the precious gems in my house anyways!
View on Reddit #50292472

mxforest@reddit

Those precious gems are dead weight. Why buy shiny stuff when you can actually buy intelligence. Make the wise choice.
View on Reddit #50293976

Individual_Aside7554@reddit

Except in two years the m4 Max could be dead weight (given the pace of tech progress) and the gems' value will appreciate :)
View on Reddit #50302275

tothatl@reddit

It depends on what you want it for. If it's for being your DeepSeek R1/R2 backend and making it work and produce income, it can be totally justifiable economically regardless of whether it will become obsolete in a few years. That's why people keep buying work machines and computers. But if it's just for fun, jewels and the Mac with M4 Max are just a matter of taste.
View on Reddit #50308594

poli-cya@reddit

Produce income? Unless you're talking about a programmer using it for work, I can't imagine what that'd be. And even then, it'd be so glacially slow compared to API, I just can't see it. If you were trying to run it to serve an actual service to customers, you're not going to get the Studio IMO... so this purchase comes down to interest in LLMs and whether you can justify using the Mac for something else also.
View on Reddit #50311511

Forgot_Password_Dude@reddit

It's slow? I heard the memory is fast? I guess it's not as fast as Nvidia?
View on Reddit #50319474

poli-cya@reddit

It's gonna be insanely slow compared to online services, and extremely cost ineffective. This is for if you're doing something you absolutely don't want sent to any outsider or as a hobby. Still cool it exists, but anyone with space for a server and this kind of money might be best served to go that route with older GPUs or GPUs mixed with RAM- at least to my understanding.
View on Reddit #50320835

zxyzyxz@reddit

Yep, imagine how many cloud AI tokens you can get for ~10k USD, I know this is /r/LocalLLaMA but the economics don't necessarily make sense.
View on Reddit #50348832

b0tbuilder@reddit

They make sense when you need to run a large context model doing RAG on highly sensitive information
View on Reddit #51857740

tothatl@reddit

I doubt DeepSeek is going to go the Western route of just adding moar layers and parameters. Maybe they do, maybe they don't. What I expect is they will find better algorithms and optimizations to run rationalizing multimodal models, probably with *fewer* parameters or less execution overhead.
View on Reddit #50326610

WhyIsSocialMedia@reddit

DS1/2 will be like the brick cellphone to a smartphone at this rate.
View on Reddit #50312912

llamabott@reddit

Same goes for one's redundant liver!
View on Reddit #50331633

xor_2@reddit

And the best thing is that when you start 'buying intelligence' by buying Apple products then you probably need that artificial intelligence :D
View on Reddit #50310383

SkyFeistyLlama8@reddit

Gems? How many kidneys and lungs do you think you actually need?
View on Reddit #50342175

Playful_Accident8990@reddit

All of them! 🫴
View on Reddit #50345394

SomeOddCodeGuy@reddit

I hope that some streamer or another shows us what running a larger model looks like on this machine. $10k for a q5\_K\_M of Deepseek R1 may be, from my perspective, not a particularly terrible deal as long as it runs at any form of an acceptable speed.
View on Reddit #50292727

joninco@reddit

I'm taking the plunge, will let you know.
View on Reddit #50296060

Lyuseefur@reddit

RemindMe! 7 days “Polar Mac Plunge”
View on Reddit #50300569

joninco@reddit

Mar 17 delivery date.
View on Reddit #50300961

DeSibyl@reddit

How’d it go?
View on Reddit #51579840

joninco@reddit

Oh, it went the same as all the reviews out there. Basically 18t/s on deepseek r1. One thing I was impressed with that I didn't see mentioned is the speed of the nvme storage. Deepseek was loading at over 10GB/sec and loads fairly quickly, which surprised me.
View on Reddit #51614870

joninco@reddit

I’m actually going to return it. The 900GB/s bandwidth is nice, but the RTX 6000 Pro with 96GB GDDR7 is more what I’m looking for: the ability to run smaller, denser models at speed, rather than R1 at 18 tk/sec.
View on Reddit #51726646

ComingInSideways@reddit

RemindMe! 20 days “Mac for AI #2”
View on Reddit #50347965

DeSibyl@reddit

RemindMe! 14 days
View on Reddit #50329252

thrownawaymane@reddit

When’s the one you ordered for me getting here? (I look forward to your tests)
View on Reddit #50304142

joninco@reddit

Sorry, I ran out of babies to sell.
View on Reddit #50305536

man_and_a_symbol@reddit

Fr plz post tests I’m really curious as to how it performs
View on Reddit #50327816

joninco@reddit

What should I test? Just load up LM Studio and R1 and see what it do
View on Reddit #50328143

man_and_a_symbol@reddit

Yeah I think R1 is probably the best one to test; also try it with big context & different quants. If it gives a decent speed with big context and some decent quants, I bet people will be really interested. Also try some big non MoEs, maybe the Llama's just to see how they perform although I assume a dense 405B will be extremely slow
View on Reddit #50328377

Lyuseefur@reddit

RemindMe! 14 days “Polar Mac Plunge 2: The Plunganing”
View on Reddit #50302039

RemindMeBot@reddit

I will be messaging you in 7 days on [**2025-03-12 15:59:51 UTC**](http://www.wolframalpha.com/input/?i=2025-03-12%2015:59:51%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1j43us5/apple_releases_new_mac_studio_with_m4_max_and_m3/mg5xyuw/?context=3) [**CLICK THIS LINK**](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=Reminder&message=%5Bhttps%3A%2F%2Fwww.reddit.com%2Fr%2FLocalLLaMA%2Fcomments%2F1j43us5%2Fapple_releases_new_mac_studio_with_m4_max_and_m3%2Fmg5xyuw%2F%5D%0A%0ARemindMe%21%202025-03-12%2015%3A59%3A51%20UTC) to send a PM to also be reminded and to reduce spam. ^(Parent commenter can ) [^(delete this message to hide from others.)](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=Delete%20Comment&message=Delete%21%201j43us5) ***** |[^(Info)](https://www.reddit.com/r/RemindMeBot/comments/e1bko7/remindmebot_info_v21/)|[^(Custom)](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=Reminder&message=%5BLink%20or%20message%20inside%20square%20brackets%5D%0A%0ARemindMe%21%20Time%20period%20here)|[^(Your Reminders)](https://www.reddit.com/message/compose/?to=RemindMeBot&subject=List%20Of%20Reminders&message=MyReminders%21)|[^(Feedback)](https://www.reddit.com/message/compose/?to=Watchful1&subject=RemindMeBot%20Feedback)| |-|-|-|-|
View on Reddit #50300649

joninco@reddit

Ran the 4bit MLX deepseek r1. Long story short, 18t/s like everyone else found out. But the longer the context, the longer the TTFT. That prompt processing is slow. How can I get an exact benchmark for that besides arbitrary contexts?
View on Reddit #51009769
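
One way to get repeatable numbers is to time the first token over a fixed sweep of prompt lengths instead of ad hoc contexts. The sketch below measures time to first token (TTFT) and decode rate for any token stream; `fake_stream` is a hypothetical stand-in for a real streaming call to whatever runtime is being tested and is there only so the example runs on its own:

```python
import time
from typing import Iterable, Iterator

def time_to_first_token(stream: Iterable[str]) -> tuple[float, float, int]:
    """Return (ttft_seconds, decode_tokens_per_s, token_count) for one streamed reply."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter()  # moment the first token arrived
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # Decode rate counts only tokens after the first one
    decode_tps = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return ttft, decode_tps, count

def fake_stream(prompt_tokens: int, n_out: int = 5) -> Iterator[str]:
    """Hypothetical model: prefill delay grows linearly with prompt length."""
    time.sleep(prompt_tokens * 1e-5)  # pretend prompt-processing cost
    for i in range(n_out):
        time.sleep(0.001)             # pretend per-token decode cost
        yield f"tok{i}"

for n in (1024, 4096, 16384):
    ttft, tps, count = time_to_first_token(fake_stream(n))
    print(f"{n} prompt tokens: TTFT {ttft:.3f}s, decode {tps:.0f} t/s over {count} tokens")
```

Dividing the prompt length by TTFT gives an approximate prompt-processing rate that can be compared across machines, provided the same prompts and quantization are used each time.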

joninco@reddit

What is the ideal deepseek R1 quantization to run with 512GB ram? a Q4\_K\_M or Q5?
View on Reddit #50473534

AstroZombie138@reddit

What config did you get? I'm wondering if the +$1500 for more cores is worth it. Otherwise I will go with the 256GB memory and 4TB storage (which hurts, but I think I'd eventually need that much storage). Remember you can get \~$600 off with a student discount if you qualify.
View on Reddit #50318617

joninco@reddit

Apple M3 Ultra chip with 32-core CPU, 80‑core GPU, 32-core Neural Engine 512GB unified memory 2TB SSD storage
View on Reddit #50324634

EternalOptimister@reddit

RemindMe! Blood diamonds
View on Reddit #50320197

EvilPencil@reddit

My money would be on Alex Ziskind being first to market on that...
View on Reddit #50488362

power97992@reddit

Two maxed-out M2 Ultras run Q4 at 17 tokens/s; expect it to be less than that for Q5 or Q4 on one Mac. Maybe 8-10 tokens/s due to less memory bandwidth, but with a faster interconnect and higher FLOPS.
View on Reddit #50357730

Careless_Garlic1438@reddit

2 M2 Ultra’s with EXO runs it at 14 tokens / s so if the 2x holds then we are looking at 30 🤞
View on Reddit #50293583

Hoodfu@reddit

This new m3 ultra isn't twice as fast as an M2 Ultra. It's roughly the same.
View on Reddit #50311369

Ok_Warning2146@reddit

GPU compute is 2x faster than M2 Ultra and 2.6x faster than M1 Ultra per the press release. I also have doubt on this. But we just have to wait for more tests to confirm.
View on Reddit #50344405

Hoodfu@reddit

with LLMs it's really just about that memory speed. losing 20% compared to the m4 ultra that it was supposed to be was a big letdown when I saw the news.
View on Reddit #50346547

fallingdowndizzyvr@reddit

> with LLMs it's really just about that memory speed.

That's completely not true. If that were the case then a RX580 would be competitive with a 4060. It's not. It's only about memory speed if you have enough compute to use it. On Mac Ultras that hasn't been the case. They have more memory bandwidth than they have compute to use it. That's why the M2 Ultra is faster than the M1 Ultra even though they have the same memory bandwidth. There's no reason to believe that the M2 Ultra is using all available memory bandwidth. If not, then the M3 Ultra will be faster.
View on Reddit #50352177

-6h0st-@reddit

No, it won’t run at an acceptable speed. You also need GPU processing power; on Macs, the bigger the model, the slower the prompt processing, and they can’t handle a big context window. So pointless for local usage.
View on Reddit #50307204

chespirito2@reddit

Why do you say it can't handle big context window?
View on Reddit #50329563

bullerwins@reddit

819GB/s memory bandwidth for the M3 Ultra. Do you think llama.cpp on mac would run faster than ktransformers on a server with 512GB ram and 1-4 GPUs?
View on Reddit #50293209

Its_Powerful_Bonus@reddit

Both m3 ultra variants have 800gb/s? Also 28cpu/60gpu? M3 max 14/30 afair had 300gb/s …
View on Reddit #50328910

Yes_but_I_think@reddit

So 2 tokens/s on 400GB sized model (R1 Q6K)
View on Reddit #50295417

perelmanych@reddit

No, since R1 is not a monolithic model but an MoE, and only 37B parameters are activated. Should be closer to 10 t/s.
View on Reddit #50297918
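
The dense versus MoE difference can be made concrete with a bandwidth-only estimate. A rough sketch: the 400 GB size, 819 GB/s bandwidth, and 37B active parameters come from the thread, while the 671B total parameter count is an assumption added here, and the result is a theoretical ceiling rather than a prediction:

```python
def moe_decode_ceiling_tps(bandwidth_gb_s: float, model_gb: float,
                           active_fraction: float = 1.0) -> float:
    """Tokens/s upper bound if only the active share of weights is read per token."""
    gb_read_per_token = model_gb * active_fraction
    return bandwidth_gb_s / gb_read_per_token

dense = moe_decode_ceiling_tps(819, 400)            # every weight read for each token
moe = moe_decode_ceiling_tps(819, 400, 37 / 671)    # only the active experts read
print(f"dense ceiling: {dense:.1f} t/s, MoE ceiling: {moe:.1f} t/s")
```

The estimate ignores compute limits, KV cache traffic, and routing overhead, which is why practical figures quoted elsewhere in the thread sit well under the MoE ceiling.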

Healthy-Nebula-3603@reddit

10? Lol. No, rather 20 t/s if memory has 500+ GB/s.
View on Reddit #50302285

perelmanych@reddit

Yes, yes I know you learnt how to use division)) And now look here [https://github.com/ggml-org/llama.cpp/discussions/4167](https://github.com/ggml-org/llama.cpp/discussions/4167)
View on Reddit #50305784

joninco@reddit

It's MoE -- isn't it just a matter of being able to keep R1 in vram and then it runs whichever 36B model relatively quickly or am I missing something for decent speeds?
View on Reddit #50297578

teachersecret@reddit

Someone ran R1 on 2x 192gb macs at about 17 tokens/second at a decent quant, so yeah, should be possible to get good usable speed with one of these 512gb rigs.
View on Reddit #50298767

joninco@reddit

This should be fun, maybe the biggest bonus will be having full context in ram.
View on Reddit #50299411

martinerous@reddit

And it's important to make sure they try it with large text. One thing is when you ask about QStarberries and another - if you want to work with code or text summaries.
View on Reddit #50304329

SomeOddCodeGuy@reddit

Agreed. With that said, I do want to make a post not long from now discussing this in a bit more detail. I've always been hard on Macs for the prompt processing speed; I don't mind waiting, but other people do, and if you look at my profile I've made sure to pin a post showing the real numbers of what large context looks like on an M2 Ultra. With that said, I decided to test out ChatGPT's Deep Research by asking it to find me the ms per token numbers of inferencing 70b models on an A6000 (not the Ada), and interestingly, it came back with results showing several posts putting the inference around 5ms per token in prompt processing. I recently got my hands on the more powerful M2 Ultra, the 76-core GPU version, and it processes prompts on Qwen2.5 72b at 10ms per token. It's 2x slower on the Mac, but that's not as bad as what I think a lot of folks were imagining. And with speculative decoding it was a much smaller gap for prompt writing, so I want to try to do a bit more research and get a conversation going about just how big of a difference there is in response times on BIG models at big contexts between certain CUDA cards and a Mac.
View on Reddit #50307007

TyraVex@reddit

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/37 IQ3_M/IQ4_XS is all you need for V3/R1. I believe that a $3k RAM server with a 3090 and ktransformers would perform equally, at around 13-15 tok/s. I may be wrong.
View on Reddit #50297609

IlIllIlllIlllIllllI@reddit

I could probably get $10k for my car, why would I need transportation when I can have a Mac?
View on Reddit #50376109

ykoech@reddit

Time to sell that car 🚗
View on Reddit #50293331

Harvard_Med_USMLE267@reddit

Haha, my car is worth less than my RTX 4090, it’s certainly not going to pay for this!
View on Reddit #50365722

thunk_stuff@reddit

On the positive side, the overpriced SSD upgrades start to feel like rounding errors when you hit $10k.
View on Reddit #50313144

fotiro@reddit

But would you like a stand for only $1,500?
View on Reddit #50344300

Ok_Warning2146@reddit

tb5 is fast enough that using cheap external SSD is just as good.
View on Reddit #50343998

cafedude@reddit

Cars are trouble anyway. And it's not like you need to go anywhere anymore.
View on Reddit #50309524

Rich_Repeat_22@reddit

And the actual cost of that RAM is barely $300.
View on Reddit #50303216

Karyo_Ten@reddit

What? 512GB/s bandwidth RAM is not exactly cheap, be it on GPU or 12-channel ECC RAM.
View on Reddit #50303912

Rich_Repeat_22@reddit

This thing is using LPDDR5 (not X) 6400, whose street price is $1.8 per GB, let's say $2 per GB (on 12/16/18GB modules). So $921-$1024 for 512GB of LPDDR5 6400. Apple is selling the 256GB for $5000.
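
The arithmetic behind that range, as a minimal sketch. The per-GB prices are the ones claimed in this comment and are not verified; the figure covers raw modules only, not packaging, the memory controllers, or the rest of the machine.

```python
def module_cost_usd(capacity_gb: int, usd_per_gb: float) -> float:
    """Raw memory-module cost at a flat street price per GB."""
    return capacity_gb * usd_per_gb


# $1.8 and $2.0 per GB are the street prices quoted in the comment, not verified figures.
for usd_per_gb in (1.8, 2.0):
    cost = module_cost_usd(512, usd_per_gb)
    print(f"512GB at ${usd_per_gb}/GB: ${cost:,.0f}")
```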
View on Reddit #50305236

ResolveSea9089@reddit

Can I ask a stupid question, why aren't there others offering that kind of RAM for that price point? Do you suspect there will be some going forward? Just no demand until now? Also does this apply to VRAM? I'm a bit of a tech novice and trying to understand, for these LLMs we want VRAM, right?
View on Reddit #50342920

Rich_Repeat_22@reddit

Yes. Because there wasn't much demand up to now. And corporations are kinda slow to make decisions; they have always been a minimum of 6-12 months behind. Also some manufacturers will be extremely reluctant to do that, like AMD & NVIDIA, because it would cut into their professional and accelerator markets.
View on Reddit #50354534

Karyo_Ten@reddit

> Apple is selling $5000 the 256GB.

The CPU+GPU aren't $5.6k - $5000 = $600. 512GB ECC RAM @6000MHz is $1.7k on newegg: https://www.newegg.com/p/1X5-0009-00A03
View on Reddit #50305539

Rich_Repeat_22@reddit

You confuse ECC DDR5 for servers and workstations with LPDDR5.
View on Reddit #50322181

Karyo_Ten@reddit

I'm not, if you want to build a machine with 256GB DDR5 yourself, either you stack GPUs or you stack ECC DDR5, you can't do LPDDR5 yourself.
View on Reddit #50322910

Rich_Repeat_22@reddit

The Apple is using LPDDR5.
View on Reddit #50323547

Karyo_Ten@reddit

And you can't build a 256GB machine with LPDDR5 yourself from parts. You should compare with what you pay if you build it yourself.
View on Reddit #50323695

Rich_Repeat_22@reddit

Yes we can. If we can get our hands on the 512GB BIOS of the M3 128GB we can replace the VRAM if the PCB is the same for less than $600.
View on Reddit #50326199

Karyo_Ten@reddit

> If we can get our hands on

So is this BIOS in the room with us right now?
View on Reddit #50331016

Rich_Repeat_22@reddit

🤦‍♂️
View on Reddit #50332796

fullouterjoin@reddit

If I had spent every dollar on apple stock as I spent on apple products, could probably afford a plane or a nice cabin.
View on Reddit #50305831

Rich_Repeat_22@reddit

256GB don't cost $5000.
View on Reddit #50304513

Karyo_Ten@reddit

Are you arguing that the CPU is $5.6k - $5000 = $600?
View on Reddit #50304849

Rich_Repeat_22@reddit

I am arguing that 512GB costs 10K and 256GB 5.6K. So Apple is selling 256GB for 5K while they cost barely $512 in LPDDR5 6400 modules.
View on Reddit #50322342

fullouterjoin@reddit

It does when put into that machine.
View on Reddit #50305742

Kavor@reddit

That's the price you pay to have ~~a cold soulless robot put it in and solder it in place~~ an Apple engineering expert handcraft and personally sign the ram stick in a process that takes days and the uttermost love and then carefully place it in the slot.
View on Reddit #50312962

jabblack@reddit

Cost about as much as a 5090
View on Reddit #50352161

rorowhat@reddit

Apple pricing is the worst!
View on Reddit #50347972

angry_queef_master@reddit

Damn, that is how much I paid for my 2010 honda civic
View on Reddit #50335994

the_Luik@reddit

How much is that in liver
View on Reddit #50332999

YearnMar10@reddit

12.5k in € for 512gb
View on Reddit #50303025

zoe934@reddit

But you can't find a 12.5k GPU with 512gb VRAM~
View on Reddit #50320706

YearnMar10@reddit

No one said that, just wanted to inform
View on Reddit #50321963

zoe934@reddit

SAME!
View on Reddit #50330259

Rich_Repeat_22@reddit

I wonder if we can get our hands on the 512GB BIOS and the PCB is the same with the cheapest version (64/32GB), if we can replace the LPDDR5 6400 modules with 512GB ones. It would cost less than $1000 to buy 512GB worth of modules, replace the cheapest version and flash the bios 🤔
View on Reddit #50329258

Forgot_Password_Dude@reddit

Oh hell yes 512, but damn almost double the price
View on Reddit #50319224

wen_mars@reddit

9.5k is getting close to dual socket epyc territory. Nice that Apple gives people the option.
View on Reddit #50313493

StoneyCalzoney@reddit

It's worth noting that the edu discount drops the 512GB price down to ~$8.6k
View on Reddit #50312095

BigMagnut@reddit

It's not worth 10k.
View on Reddit #50311488

candre23@reddit

Might as well just buy real GPUs for that money.
View on Reddit #50303412

ASYMT0TIC@reddit

Which GPU with 512GB of VRAM are you buying for under $10k?
View on Reddit #50306463

candre23@reddit

That will buy more than a dozen 3090s, which would run rings around the mac. Like order-of-magnitude faster. 512GB in a "unified memory" machine like this with laughable GPU cores is objectively pointless. Even with 123b models at moderate bpw you're only looking at about 96GB memory needed, and the mac would already be horrifically bandwidth and compute bound at that point. You load up a model that actually needs 512GB of memory and that mac will be lucky to produce more than a dozen tokens *per minute*.
View on Reddit #50308333

indicava@reddit

Yea, cause the costs of building and running a 12x3090 rig end with the GPUs, right?
View on Reddit #50310081

candre23@reddit

You wouldn't buy 12 3090s. You'd buy a reasonable number like 4. The point here is that it's factually impossible to actually take advantage of the "512GB" of memory in the mac. It's too slow in several metrics to run models that large at anything approaching usable speeds.
View on Reddit #50311329

mxforest@reddit

It's not bad for what it is. A small 10-20 person startup can host local R1. That comes out to about $20 per person per month over 2 yrs.
View on Reddit #50292890

Wildcard355@reddit

Very good point. It's likely that we'll get cheaper local options in the coming years (months 🙏), but given that most startups are about this size, their budgets could probably accommodate one local LLM this way.
View on Reddit #50308560

-6h0st-@reddit

5.8k I see here in the UK for the 256GB 28/60 core version, which doesn’t have the full bandwidth of 800GB/s (25% lower?)
View on Reddit #50306888

Healthy-Nebula-3603@reddit

Still very worth it! !!
View on Reddit #50301840

tangoshukudai@reddit

cheap.
View on Reddit #50300635

roshanpr@reddit

FUCK MY LIFE
View on Reddit #50295840

mxforest@reddit

512 GB holy hell. Great machine for local R1.
View on Reddit #50291300

DirectAd1674@reddit

https://preview.redd.it/es871ccxsvme1.jpeg?width=1284&format=pjpg&auto=webp&s=d73170646c8281250a3c4219264efc2faad8d9d5 The wording here certainly aims to suggest that
View on Reddit #50293058

half_a_pony@reddit

it's funny to mention apple intelligence here because apple intelligence models are tiny. going to be a drop in a bucket in all of that memory
View on Reddit #50315694

2016YamR6@reddit

Deepseek R1 Distill Siri
View on Reddit #50330756

nexusprime2015@reddit

DeepSiri
View on Reddit #50458462

getmevodka@reddit

💀🤭
View on Reddit #51427986

ready-eddy@reddit

Really curious about the performance of diffusion models. Stable Diffusion is running much better than I thought it would on my 24gb mac mini... 512GB sounds… tasty
View on Reddit #50312909

Background-Hour1153@reddit

If I'm not mistaken, diffusion models are compute bound, so as long as the diffusion model fits in the RAM/VRAM (most image diffusion models fit in 24 gb of RAM), you shouldn't get faster generation if it's the same exact GPU.
View on Reddit #50326410

tomz17@reddit

Is it tho? For $10k you can buy a proper 12-channel DDR5 system with similar memory BW, expandability (i.e. an nvidia card for prompt processing, more than 512GB RAM), and far more CPU compute power. -or- you can just rent $10k of actual cloud on a proper hopper, blackwell, etc. system and get orders of magnitude the throughput. I mean it's priced competitively to that once you factor in the apple tax, but it's not exactly a game changer in that price range.
View on Reddit #50293557

Zyj@reddit

A 12-channel DDR5-6000 system provides a mere 576GB/s
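
For anyone wondering where 576GB/s comes from, a small sketch of the peak-bandwidth arithmetic. It assumes each DDR5 channel moves 8 bytes (64 data bits) per transfer; this is a theoretical ceiling, not a measured number.

```python
def ddr_peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x transfer rate x bytes per transfer."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# 12-channel DDR5-6000 server board vs. a typical dual-channel desktop at the same speed.
print(ddr_peak_bandwidth_gbs(12, 6000))  # server, one socket
print(ddr_peak_bandwidth_gbs(2, 6000))   # desktop
```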
View on Reddit #50294451

mxforest@reddit

That's theoretical though. The more kits you have, the higher the chance that they will run at lower clocks. I won't be surprised if 12 modules result in it barely managing 5000-5200.
View on Reddit #50295775

tomz17@reddit

Not "theoretical". DDR5 6000 is the spec for 5th gen Epyc parts, you WILL get exactly that speed.
View on Reddit #50309590

Zyj@reddit

Well, DDR5-6000 past 32GB are still pretty rare. There's Kingston https://www.kingston.com/unitedkingdom/de/memory/search/?partid=KVR64A52BD8-64 but i'm not sure if UDIMMs are officially supported
View on Reddit #50453728

tomz17@reddit

It's not a gaming PC. If you are buying a workstation or server class CPU, you just look at the QVL for that CPU + motherboard and buy one of the Samsung or Hynix part numbers they actually tested in that config. Everything else is YMMV.
View on Reddit #50570805

Zyj@reddit

Right. The QVL often doesn't list DDR5 6000 at higher capacities
View on Reddit #50607071

tomz17@reddit

Huh? Anyone selling you a Turin-compatible board / system certainly does. How else would they sell the thing? e.g. H13-SSL-N lists

* MEM-DR516MB-ER64 **16GB** DDR5-6400 1Rx8 LP (16Gb) ECC RDIMM
* MEM-DR532MD-ER64 **32GB** DDR5-6400 2Rx8 (16Gb) ECC RDIMM
* MEM-DR512PC-ER64 **128GB** DDR5-6400 2Rx4 LP (32Gb) ECC RDIMM, RoHS
View on Reddit #50607468

Zyj@reddit

You're right. With my ASUS Pro WS WRX80E-Sage SE WIFI mainboard, the QVL lacked any 8x32GB memories at DDR4-3200 speeds so you were on your own.
View on Reddit #50611883

Zyj@reddit

Right! I'm wondering if Dual EPYC systems provide twice the memory bandwidth in a useful way (i.e. something that increases tokens per second)?
View on Reddit #50299455

dinerburgeryum@reddit

It’s tough because you really have to watch NUMA parameters at that point. Ktransformers makes shadow copies of critical matrices at each NUMA node to prevent this, but that kind of tuning is not generally available for all models.
View on Reddit #50304694

tomz17@reddit

> A 12-channel DDR5-6000 system provides a mere 576GB/s **per socket**
View on Reddit #50309632

BumbleSlob@reddit

> you can just rent $10k of actual cloud on a proper hopper, blackwell, etc. system and get orders of magnitude the throughput.

sir this is /r/localllama
View on Reddit #50311921

calcium@reddit

Apple tax? At this point when comparing workstations to one another they remain pretty competitive.
View on Reddit #50296841

gandhi_theft@reddit

Can it fit R1? I read that it's ~700GB
View on Reddit #50525550

mxforest@reddit

It can run a quantized version. 700GB is assuming 8 bits. You can run 4 bit which is half the size.
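
A back-of-the-envelope sketch of that sizing, assuming R1's published 671B total parameters. The 4.8 bits per weight used for a "Q4"-style file is an approximation (mixed-precision quants keep some tensors at higher precision), and the result covers weights only, not KV cache or runtime overhead.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


# 671B total parameters; the bits-per-weight values are illustrative assumptions.
for label, bits in [("8-bit", 8.0), ("flat 4-bit", 4.0), ("Q4-style, ~4.8 bpw", 4.8)]:
    print(f"{label}: ~{weights_size_gb(671, bits):.0f} GB")
```

The ~4.8 bpw case lands near the "about 400GB" figure mentioned elsewhere in the thread, which is why a 512GB machine is the first Mac where this fits with room left for context.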
View on Reddit #50532971

philguyaz@reddit

The memory bandwidth is going to make me cry it’s the same as the m2
View on Reddit #50291671

bullerwins@reddit

819GB/s memory bandwidth for the M3 Ultra

546 GB/s memory bandwidth for the M4 Max
View on Reddit #50293528

Abject_Radio4179@reddit

Actually, it’s up to 819 GB/s. The 60 core one will have 614 GB/s, so just 10% more than the M4 Max.
View on Reddit #50421569

Barry_Jumps@reddit

Someone smart please explain why M4s have half the bandwidth
View on Reddit #50342441

CtrlAltDelve@reddit

Apple's "Ultra" chips can be thought of as two "Max" chips glued together. So, an M3 Ultra is like having two M3 Max chips working together. Unless the new M4 Max was somehow magically twice as powerful as the M3 Max, the older M3 Ultra (which again is two M3 Max chips) will still be faster. The big mystery is why Apple didn't just release an M4 Ultra.
View on Reddit #50347071

Barry_Jumps@reddit

🤜 🤛
View on Reddit #50366040

animax00@reddit

should it be 410GB/s memory bandwidth for M4 max? [https://www.apple.com/mac-studio/specs/](https://www.apple.com/mac-studio/specs/)
View on Reddit #50312983

Competitive-Bake4602@reddit

48GB and higher models are 546 GB/s
View on Reddit #50341073

TrashPandaSavior@reddit

That page shows that it's "Configurable to" 546 GB/s. So basically the non-binned chip has that speed. For LLMs, that's a $300 upgrade I'd take.
View on Reddit #50322386

animealt46@reddit

M4 Max is looking mighty mighty good.
View on Reddit #50297882

mxforest@reddit

It's not great but it is ok for MoE with low number of active params.
View on Reddit #50291740

philguyaz@reddit

Truuuuu! Also for finetuning, which I use my Ultra for, it’s more than good, cause I have more time than RAM.
View on Reddit #50291811

FullOf_Bad_Ideas@reddit

how fast (t/s) can you finetune bigger models on your Ultra? Are you doing QLoRA/LoRA or is full finetuning possible too?
View on Reddit #50304193

philguyaz@reddit

I don’t really keep track of t/s in training; I care more about time per iteration. We do LoRA fine-tuning with MLX and get the impact we want. I don’t get the obsession with full-sized models and training; I run a SaaS product that wouldn’t make sense without quantization in both training and inference.
View on Reddit #50306618

indicava@reddit

Have you experimented with full fine-tuning and found you get similar results to LoRA? For my fine tuning use cases (coding) I’ve found a significant difference in performance when doing a full fine-tune rather than going the LoRA/QLoRA route. Also, I don’t know what size models you’re fine tuning, but the biggest one I use (commercially) is a 32b parameter model and the various serverless services out there make it pretty reasonable to fine tune and serve inference (in production) for my customers.
View on Reddit #50310910

philguyaz@reddit

I’m doing a 70b, and I’m not finetuning code; I’m finetuning language style, format and word choice preference, which probably makes a big difference. I have not tried this approach vs a full fine-tune.
View on Reddit #50328689

FullOf_Bad_Ideas@reddit

What kind of models are you finetuning? T/s in training is just batch size × sample length / time per iteration. Same thing in the end, but it's more fundamental since you can change sample length in your dataset and batch size, so time per iteration isn't really meaningful, unless we're talking about diffusion image models.
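
A minimal sketch of that conversion, reading it as tokens processed per iteration divided by seconds per iteration. The batch sizes, sample lengths, and timings below are made-up illustrative values, not measurements from any machine.

```python
def training_tokens_per_second(batch_size: int, sample_length: int, seconds_per_iteration: float) -> float:
    """Training throughput: tokens in one iteration divided by the time it took."""
    return batch_size * sample_length / seconds_per_iteration


# Two hypothetical runs with very different time per iteration but identical throughput.
run_a = training_tokens_per_second(batch_size=4, sample_length=2048, seconds_per_iteration=8.0)
run_b = training_tokens_per_second(batch_size=1, sample_length=512, seconds_per_iteration=0.5)
print(run_a, run_b)
```

Both hypothetical runs work out to the same tokens per second even though one iteration takes 16x longer, which is the reason time per iteration alone is hard to compare across setups.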
View on Reddit #50308121

piggledy@reddit

Isn't memory bandwidth becoming the limiting factor here rather than memory size? The M3 Ultra has a memory Bandwidth of 800GB/s. Local R1 in Q4 is about 400GB. Wouldn't that make for a terrible experience at roughly 2 tokens per second? Is that good value for money at a minimum $9,499.00?
View on Reddit #50292369

mxforest@reddit

It only has like 37B active params at a time. So you divide 800 by the size of those active params (around 20GB at Q4) and not by 400.
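
Roughly, the estimate being argued about looks like the sketch below. It assumes decoding is purely memory-bandwidth bound, that each generated token reads every active weight once, and that R1 activates about 37B parameters per token (the figure on its model card); 4 bits per weight is an illustrative assumption. Treat the result as a ceiling, since KV cache reads, compute, and overhead all push real speeds lower.

```python
def active_weights_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """GB of weights that must be read from memory to produce one token."""
    return active_params_billion * bits_per_weight / 8.0


def max_decode_tokens_per_second(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper bound on generation speed when memory bandwidth is the only limit."""
    return bandwidth_gbs / gb_read_per_token


# Dense reading: the whole ~400GB file is touched for every token.
print(max_decode_tokens_per_second(800, 400))
# MoE reading: only the active experts (~37B params at an assumed 4 bits) are touched.
print(max_decode_tokens_per_second(800, active_weights_gb(37, 4.0)))
```

The dense case reproduces the 2 tokens per second worry, while the MoE case gives a ceiling in the low 40s; observed speeds reported in this thread sit well below that ceiling.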
View on Reddit #50292435

piggledy@reddit

So good for Deepseek, but terrible for something like Llama 3.1 405B?
View on Reddit #50292790

Longjumping_Kale3013@reddit

But isn’t llama 3.3 70B comparable to the 405B?
View on Reddit #50319160

animealt46@reddit

Probably great for something like a 70~150B model running in the background while having plenty of RAM available to do other tasks I guess?
View on Reddit #50298051

piggledy@reddit

And good for long context too
View on Reddit #50298690

dinerburgeryum@reddit

MLX has solid KV cache quant options to boot. 6-8 feels near lossless. I’m not familiar enough with their backing algos to recommend 4-bit yet but at higher quants it’s great.
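
To give a feel for why KV cache quantization matters at long context, here is a rough size estimate. It is plain arithmetic, not MLX code; the model shape (80 layers, 8 KV heads, head dimension 128) is an assumption chosen to resemble a 70B-class model with grouped-query attention, and the small overhead of quantization scales is ignored.

```python
def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int, head_dim: int, bits_per_value: float) -> float:
    """Approximate KV cache size in GB: keys plus values for every layer and token."""
    values_stored = 2 * layers * kv_heads * head_dim * context_tokens  # 2 = keys + values
    return values_stored * bits_per_value / 8 / 1e9


# Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128, 32k-token context.
for bits in (16, 8, 4):
    size = kv_cache_gb(32_768, layers=80, kv_heads=8, head_dim=128, bits_per_value=bits)
    print(f"{bits}-bit KV cache at 32k context: ~{size:.1f} GB")
```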
View on Reddit #50304603

dinerburgeryum@reddit

405B monolithic was always hubristic. Silly that we even considered it for hosted inference. MoE was in the wild when it dropped. Just Meta being silly and throwing compute at problems instead of brains.
View on Reddit #50304535

mxforest@reddit

True! Given the rumors that the Llama team scrambled after the R1 release, I think MoE is the way to go. Especially when thinking tokens need much higher tps to be usable.
View on Reddit #50293498

Kind-Log4159@reddit

The zuck is definitely still getting flashbacks of r1 release. Llama 4 was canceled because of it
View on Reddit #50301946

SomeOddCodeGuy@reddit

I will note that MoEs process prompts a little differently than the active param size would imply, and you definitely feel it on Mac. I have an M2 Ultra and one of my favorite models used to be WizardLM2 8x22b. The prompt processing time was definitely longer than what I'd expect a 40 something b model to process at; it felt like it was closer to a 70b in prompt processing speed, and the full size of it was around 141b if I remember right. Once it started writing, things sped up a lot.
View on Reddit #50293376

Mrleibniz@reddit

> WizardLM2

I completely forgot about that model, whatever happened to that? They took it down and the buzz around it sort of died.
View on Reddit #50302075

SomeOddCodeGuy@reddit

It's still available, just not from the original repo. It was dropped under open source license, some folks forked the repo while it was up, and those repositories continued to exist and gguf kept going up. You could still find it on huggingface if you were so inclined, but otherwise there wasn't a lot of buzz because without the official repo up, not many benchmarks wanted to run the numbers. Eventually, by the time they did, new models had come out that beat it pretty easily, so it wasn't worth the chatter anymore.
View on Reddit #50307182

fullouterjoin@reddit

You still have a copy? How does it compare to Qwen?
View on Reddit #50306114

SomeOddCodeGuy@reddit

I do still have it, but I haven't done a hard benchmark of real numbers to compare. However, as much as I've used both, I can tell you that I feel that knowledge wise and coherence wise Qwen is better. From my experience:

* Wizard 8x22b was absolute magic in terms of coding ability for its time, but it's been a while since then; Qwen2.5 32b Coder is better.
* Wizard sounded amazing in terms of speech quality and general understanding; it was exceptionally clever in terms of contextual reading between the lines. If you gave it requirements, it did a great job of really digging in to find what you actually wanted. It beats Qwen2.5 72b for me in that regard.
* Qwen2.5 72b is far better at RAG/summarization for me. Wizard hallucinated more than I liked with in-context learning.
View on Reddit #50306546

Yes_but_I_think@reddit

Every token the selected expert changes. I thought 2 tokens/s is right
View on Reddit #50296036

mxforest@reddit

But the other expert will also be loaded up, it's not like it has to spend time loading it first. It is available for use right away.
View on Reddit #50296142

Low-Opening25@reddit

even then R1 is unlikely to hit more than 10
View on Reddit #50294045

revotfel@reddit

I just bought the m2 ultra and I'm past the 14 day window, shit, I wasn't looking at release dates T_T
View on Reddit #50323080

thrownawaymane@reddit

Go to the store, be very nice and be willing to try multiple locations. Someone might take pity and make a call.
View on Reddit #50335539

revotfel@reddit

I f****** love you, thanks for encouraging me to go. They did a straight swap and now I have 96 GB for the same price I bought the 64 at.
View on Reddit #50915101

thrownawaymane@reddit

Just make sure you pay it forward. I’ve given similar advice over the years and people are usually too scared to try. This is just how Apple works. My work machine has 64gb of RAM but the way it’s looking I won’t ever be able to afford that much in a personal machine, much less something like 512gb.
View on Reddit #50950931

revotfel@reddit

You think so? I'll be at exactly a month on release date from when I bought it (the 12th)
View on Reddit #50338907

anythingisavictory@reddit

You're fine, either call or go to the store and ask politely.
View on Reddit #50342576

revotfel@reddit

I f****** love you, thanks for encouraging me to go. They did a straight swap and now I have 96 GB for the same price I bought the 64 at.
View on Reddit #50915081

anythingisavictory@reddit

So glad it worked out! Just promise to name your 8th child after me.
View on Reddit #50917316

revotfel@reddit

Well, I'll give it a go! Can't hurt to try
View on Reddit #50342612

Turbulent_Pin7635@reddit

Ok, dumb newbie question here. Will the M3 Ultra be enough to run the 671b Deepseek? Also, I work in bioinformatics and have never used a Mac; is it hard to use with Ubuntu? Almost all my pipelines are built for Linux.
View on Reddit #50915062

Daniel_H212@reddit

Cheaper than getting 512 GB of VRAM using discrete GPUs I guess.
View on Reddit #50294299

dinerburgeryum@reddit

It really can’t be overstated that we now have access to 256GB of unified RAM at 800GB/s and you don’t need to have an electrician fix your house up with 240V drops.
View on Reddit #50305423

ResolveSea9089@reddit

Sorry can you explain this to me. Is this regular RAM? I thought for AI applications we need VRAM. Is the GPU able to access to the "regular" ram because it's "unified" so we're all good there?
View on Reddit #50343012

dinerburgeryum@reddit

Yes, "unified" in this case means the GPU and CPU have equal access to the same RAM at the same speed (800GB/s). In a "standard" setup (3090 and Intel i7 for example), the 3090 will have 900GB/s access to a small pool of 24GB of RAM. The Intel chip will have access to say a pool of 32GB of RAM at an anemic 70GB/s. (The GPU can technically access the Intel's RAM pool, but through the PCI-E lanes then through the sluggish DDR5 interface.) This means you realistically have access to only the 24GB near the GPU for "fast" inference. Compare this with the M3 Ultra: the CPU and GPU are on the same chip, and share the same high-speed memory controllers. They both have access to the full 800GB/s at all times, with no PCI-E or NUMA traversal. I hope this all makes sense haha.
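
A toy model of why the split pools hurt, under simplifying assumptions: a dense model whose weights are all read once per token, round bandwidth figures of about 900GB/s for the 3090's 24GB (its spec sheet lists 936GB/s), 70GB/s for system RAM, and 800GB/s for unified memory. The 40GB model size is an illustrative value, and PCI-E transfer costs are ignored entirely.

```python
def seconds_per_token(model_gb: float, fast_pool_gb: float, fast_gbs: float, slow_gbs: float) -> float:
    """Time to read all weights once when they spill from a fast pool into a slow one."""
    in_fast = min(model_gb, fast_pool_gb)
    in_slow = model_gb - in_fast
    return in_fast / fast_gbs + in_slow / slow_gbs


# 40GB model: 24GB fits in VRAM, the remaining 16GB is read from system RAM.
discrete = seconds_per_token(40, fast_pool_gb=24, fast_gbs=900, slow_gbs=70)
# Same model on unified memory: everything sits in one 800GB/s pool.
unified = seconds_per_token(40, fast_pool_gb=512, fast_gbs=800, slow_gbs=70)
print(f"discrete GPU + system RAM: ~{1 / discrete:.1f} tok/s ceiling")
print(f"unified memory: ~{1 / unified:.1f} tok/s ceiling")
```

In this sketch the 16GB that spills into slow RAM dominates the per-token time, even though the GPU's own memory is slightly faster than the unified pool.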
View on Reddit #50343763

ResolveSea9089@reddit

Yes it does! I'm learning more about how computers work, so it's very cool to see some of these terms in your answer.

> In a "standard" setup (3090 and Intel i7 for example), the 3090 will have 900GB/s access to a small pool of 24GB of RAM. The Intel chip will have access to say a pool of 32GB of RAM at an anemic 70GB/s. (The GPU can technically access the Intel's RAM pool, but through the PCI-E lanes then through the sluggish DDR5 interface.) This means you realistically have access to only the 24GB near the GPU for "fast" inference.

This explanation you gave is very clear! I'm sure there are design reasons for it. Is this something only Apple is able to do because they have full control of their chip? Like, I try to run AI applications on my crummy 8GB VRAM computer and I'm so limited, but what's stopping others from imitating this and getting us super high VRAM/unified memory?

This seems really exciting, because I mean even the high end NVDA chips are like what, 24GB for consumer models like you said? But now you're saying I could potentially run something with like 512GB of VRAM?? I'm hoping more folks take the leap here; it seems the cost for high VRAM could come down quite quickly?
View on Reddit #50344333

dinerburgeryum@reddit

So, it's not really Apple specific; AMD does this in their APU line. It's more about having the CPU and the GPU on the same die (or processor, if that's easier to visualize). Once they're stuck together, they can utilize the same circuits to access the RAM (the memory controllers), and share the small, fast internal buffers all processors have (cache).

What makes Apple's approach somewhat interesting here is both that the CPU and GPU are on the same die, _and_ they have simply thrown a crazy number of memory controllers at the problem. RAM, as it happens, can only send so much data at once. You need one memory controller per RAM "lane." You might hear "dual-channel" memory; that literally means there are 2 memory controllers that can access data from two different RAM banks at once[1]. The M3 Ultra chip packs 48 memory controllers, which can each access data from different RAM chips in parallel. This is closer to a GPU's architecture than a standard CPU architecture.

What is stopping us from getting the perfect balance between these two approaches is, generally, that this kind of insane memory bandwidth simply isn't needed for most computing workloads. So inexpensive computers without these requirements simply won't drive up costs to meet them. (AMD has historically positioned their APU line as budget hardware, which is why we're seeing them lag on this front.)

Also, as you can imagine, 48 memory slots on a motherboard would look silly, take up too much space, and come with its own set of problems with traces running hither and yon. (DIMMs actually pack several RAM chips on one small board; GPU and Apple memory is soldered and accessed individually to support wider controller access.) Server boards can ship with 24 in dual processor configurations, but they're generally large, expensive and have bonkers power requirements to boot.

So yeah, I hope that clarifies some of this anyway. Let me know if there's anything else I can help clarify.
[1] This is reaching the edge of my knowledge of Intel memory layouts. It may be that a single controller can do dual issue, but the concept remains the same.
View on Reddit #50347223

ResolveSea9089@reddit

I never responded here, I'm sorry. This is absolutely tremendous, thank you very much, I've learned a lot reading this post and keep referencing it as I read more about this stuff the last few days. Thank you!
View on Reddit #50684374

dinerburgeryum@reddit

Cool I’m glad it helped. 😊
View on Reddit #50684771

mxforest@reddit

Also fits in a backpack instead of taking up a room and tripping the circuit breaker.
View on Reddit #50297217

xXprayerwarrior69Xx@reddit

it's kinda wild when you think about it
View on Reddit #50303692

Daniel_H212@reddit

It doesn't double as a whole-house space heater, surely that's a downside!
View on Reddit #50298946

phata-phat@reddit

Quiet, consumes less power and unobtrusive.
View on Reddit #50296182

Innomen@reddit

Still don't understand why we can't just make sata connected bricks of ram or just pci cards of ram. It's all just sand and plastic.
View on Reddit #50328839

hishnash@reddit

CXL is used in some servers to enable you to have memory over PCIe, but it is way way way slower than the memory in these systems. The difficulty here is that the longer the trace and the more connectors, the higher the resistance and interference on the signal. These signals are very very very fast, and it is easy for RF interference to screw them up so that you can't read the correct value. Even internal reflections on the wires themselves are issues; when you're switching signals at these speeds the copper trace stops behaving the same as it would for a constant current flow.
View on Reddit #50587084

Innomen@reddit

Oh, I thought the interface was fast enough. Thank you for the educational reply. I assumed pci was as fast as the gpu slot. I guess in my brain "slots are slots" and it's more about where, and size, than what kind. Like, we can plugin ram, why can't we plugin more ram, know what I mean? Simplistic I guess, ignorant XD Thanks again I'll RTFM some more X)
View on Reddit #50638512

hishnash@reddit

PCI is as fast as a GPU slot, but a GPU does not access its VRAM over the PCI bus. The VRAM is typically on the GPU card and has much much faster (and lower latency) connections. The issues here are all down to the trace length and trace quality; with the speeds we are dealing with these days for memory, the speed of light (electricity is light) in copper becomes a huge factor.
View on Reddit #50682311

I_EAT_THE_RICH@reddit

I’m not overly impressed with the speed of models I can run on my Apple silicon. I’m wondering if I’ll ever run locally at this point. Considering my monthly expense for ai is now over $40, it’s still cheap compared to one of these bad boys.
View on Reddit #50355433

hishnash@reddit

The main use case for these is if you're doing any personal model customization or dealing with data that you can't legally send to a third party: private company data, legal data, medical, mil, gov data etc.
View on Reddit #50586705

I_EAT_THE_RICH@reddit

I actually do model customization for clients in my industry, but we use cloud services. For these exact reasons.
View on Reddit #50598721

hishnash@reddit

It depends on the industry, but for many industries the paperwork and compliance needed to send the data off site, or to any server that it is not already approved on, is a f-ing nightmare (and ends up costing a lot in pointless hours of legal compliance contracts). There are some cloud providers that have certification, but if your company has not yet gone through the steps to validate and approve that provider, it is often not worth the effort. On-prem deployments are returning to companies all over the world.

For example, I used to work for a SW company that built software for the mining industry. Sometimes clients would have issues and would share their projects (real-world locations of high-value deposits) with us. This data was considered extremely valuable, as the company may have spent millions if not billions surveying the location to collect the data. When they provided it to us, it was explicitly provided to a named engineer and to that engineer's (air-gapped) machines only. The paperwork and insurance we would have had to go through if we wanted to, say, upload that to a cloud service was a no-go, so we had to have on-prem compute HW for doing the needed processing of this data (HW that would commonly be fully wiped between each customer's data being loaded).
View on Reddit #50623775

I_EAT_THE_RICH@reddit

Lemme get a copy of those surveys
View on Reddit #50640804

Spanky2k@reddit

Very disappointed that it’s not an M4 Ultra although 512GB instead of 256GB is very cool. Will have to wait for benchmarks to make any kind of decision though. If it can handle R1 at good speeds then it’ll make a great in house LLM host. I have a feeling that smaller dynamic quants of R1 might end up working better though in which case the 512 one might be overkill.
View on Reddit #50312062

hishnash@reddit

The main user of this chip is Apple itself within their ML data centers. I expect the reason they want this volume of mem is to have multiple separate models loaded ready to go.
View on Reddit #50587245

power97992@reddit

What, just over 800 GB/s? They didn't upgrade the bandwidth? It should be at least 1092GB/s!
View on Reddit #50357460

hishnash@reddit

It is built on M3 chip IP, not M4, so it is using the M3 memory controllers and thus slower memory.
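
Where the 819, 546 and 1092 numbers come from, as a small sketch. The bus widths and memory speeds here are assumptions based on commonly reported specs (a 1024-bit bus at 6400 MT/s for the Ultra, a 512-bit bus at 8533 MT/s for the full M4 Max), and the "M4 Ultra" line is purely hypothetical since no such chip exists in this lineup.

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s for a memory bus of the given width and speed."""
    return bus_width_bits / 8 * mega_transfers_per_s / 1000.0


# Bus widths and transfer rates below are assumptions, not figures from Apple's spec page.
configs = [
    ("M3 Ultra (1024-bit, 6400 MT/s)", 1024, 6400),
    ("M4 Max (512-bit, 8533 MT/s)", 512, 8533),
    ("M4 Max binned (384-bit, 8533 MT/s)", 384, 8533),
    ("hypothetical M4 Ultra (1024-bit, 8533 MT/s)", 1024, 8533),
]
for name, width, rate in configs:
    print(f"{name}: ~{peak_bandwidth_gbs(width, rate):.0f} GB/s")
```

Under these assumptions, doubling the M4 Max bus at its faster memory speed is what would give the roughly 1092GB/s figure, while the M3-generation memory speed caps the Ultra near 819GB/s.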
View on Reddit #50586623

-6h0st-@reddit

Don’t think Ultra is worth it. M4 max with 64GB is probably best choice. But still getting two 5090s would be best choice for local usage - big context window. Macs can’t handle that.
View on Reddit #50307452

Glebun@reddit

Why wouldn't a Mac with 512GB memory handle that?
View on Reddit #50309326

Xyzzymoon@reddit

Very low tokens/s. Just because it fits in memory doesn't mean it runs fast. The AI processing speed on the processor is subpar.
View on Reddit #50318371

Glebun@reddit

It's comparable to 4090 speed
View on Reddit #50324555

Xyzzymoon@reddit

Where are you seeing the benchmark showing both models fitting into VRAM where the speed is comparable? Mac only wins when offloading is included, from what I see. Outside of that 4090 wins by a factor of 4.
View on Reddit #50325083

Glebun@reddit

I'm just looking at the memory bandwidth.
View on Reddit #50327745

dkaminsk@reddit

Memory bandwidth matters for text processing speed - and it’s close to nvidia with 819GB/s but prompt processing speed relies on GPU AI capabilities and here M3 Max was 10x slower than multiple 3090. With M3 Ultra it might halve. So bigger the context window the more you will feel it.
View on Reddit #50366275

Glebun@reddit

I thought that for inference memory bandwidth is the bottleneck, not the GPU itself.
View on Reddit #50367160

dkaminsk@reddit

Inference yes, prompt processing no. Inference happens after prompt is processed.
View on Reddit #50369207

Glebun@reddit

Inference means the entire process of using the LLM, as opposed to training.
View on Reddit #50395028

dkaminsk@reddit

Ok, then text processing vs prompt processing: TP is about bandwidth, PP is not.
View on Reddit #50399079

Glebun@reddit

I don't understand the distinction - prompt is text
View on Reddit #50399562

dkaminsk@reddit

Dunno how to explain it - See here under llama 3.0 you have a table with text processing speed in relation to context window which is directly linked to GPU bandwidth and then you have prompt processing speed which on Mac is sometimes 10x slower than nvidia GPU https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
View on Reddit #50401091

Glebun@reddit

Oh, got it, so it's basically input vs output
View on Reddit #50563802

Xyzzymoon@reddit

That is not the only thing determining LLM speed, see here for comparisons: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference You can see that several Mac SKUs tested there have the same memory speed, but the different processors impact the speed.
View on Reddit #50329977

swagonflyyyy@reddit

800GB/s Mo-Mother of Mercy!
View on Reddit #50293138

Zyj@reddit

I was hoping for more. There is a M4 Max chip with 546 GB/s. So something with 1092GB/s would have been logical.
View on Reddit #50295063

swagonflyyyy@reddit

I was expecting less. Granted, 800GB/s isn't going to do much for colossal models, but it should run 70B models considerably faster than 512GB/s.
View on Reddit #50295423

Zyj@reddit

Yes. It will be interesting to see two Project Digits (128GB $3000 each) connected with their high speed networking compete with a single $5600 Mac Studio M4 Max with 256GB RAM.
View on Reddit #50299670

Glebun@reddit

Not a fair contest - this has 3x the memory bandwidth (not even considering that networking is what, 1GB/s (10gbit)?)
View on Reddit #50309163

TheElectroPrince@reddit

Thunderbolt 5 is there, and that's 80Gbit/s bidirectionally.
View on Reddit #50376462

Glebun@reddit

That's 10 times less than the memory bandwidth.
View on Reddit #50394979

TheElectroPrince@reddit

You can probably aggregate the three ports together between two Mac Studios for 240Gbit/s bidirectional bandwidth. Or you can connect multiple Mac Studios with about 1-2 connections between each-other for up to 160Gbit/s bandwidth.
View on Reddit #50559741

Glebun@reddit

> You can probably aggregate the three ports together between two Mac Studios for 240Gbit/s bidirectional bandwidth.

You cannot.

> Or you can connect multiple Mac Studios with about 1-2 connections between each-other for up to 160Gbit/s bandwidth.

The 80 Gbps will still be the bottleneck. I misspoke earlier, btw, the memory bandwidth isn't 10 times faster, since it's 800 GB/s, not 800 Gbit/s. It's actually 80 times faster
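
For anyone who wants to check the bits-vs-bytes conversion, here is a minimal sketch. The 80 Gbit/s and 800 GB/s figures are the ones quoted in this thread; the function and constant names are just illustrative.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a rate in gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


THUNDERBOLT5_GBIT_S = 80.0   # link speed quoted in the thread
MEMORY_GBYTE_S = 800.0       # rounded M3 Ultra memory bandwidth quoted in the thread

link_gbyte_s = gbit_to_gbyte(THUNDERBOLT5_GBIT_S)
ratio = MEMORY_GBYTE_S / link_gbyte_s

print(f"Thunderbolt 5 link: {link_gbyte_s} GB/s")
print(f"Memory bandwidth is {ratio:.0f}x the link speed")
```

Mixing up Gbit/s and GB/s is exactly where the earlier "10 times" figure came from.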
View on Reddit #50560992

petuman@reddit

800GB/s is (only) for M3 Ultra, not M4 Max. Given M1 Ultra existed for almost 3 years with same bandwidth, there's really no reason to expect less.
View on Reddit #50299999

SeymourBits@reddit

What you’re speculating about should be the (currently unreleased) M4 Ultra.
View on Reddit #50302008

indicava@reddit

Probably won’t be released. They (Apple) specifically stated (for the first time publicly) that “not all CPU generations will get the Ultra variant” = No M4 Ultra, that’s why we’re getting an M3 Ultra so deep into the M4 rollout.
View on Reddit #50312132

SeymourBits@reddit

That’s probably the point I’d float if the M4 Ultra wasn’t scheduled for at least another year or so. Otherwise, knowledge of superior specs would hurt M3 Ultra sales, which is pure kryptonite to Apple. Notice how they didn’t *specifically* say that there will be no M4 Ultra.
View on Reddit #50331291

JBsthirdleg@reddit

Not a computer guy here, but is there someone who could help me translate the computing power of the M4 Max to the equivalent of what I have in my PC, so I know which Apple chip will be the correct one for my wife to edit photos for her business and also play her guilty pleasure... WoW?

My PC:

- Intel i7-12700
- Nvidia GeForce RTX 3070 Ti
- 1TB storage
- 32GB DDR4 RAM

Thank you.
View on Reddit #50508228

iCruiser7@reddit (OP)

It's roughly equivalent to an Ultra 9 285K + RTX 4070-4080 desktop, depending on your specific software and its optimization
View on Reddit #50528490

JBsthirdleg@reddit

Thank you! I'm still an idiot. With the new mac studio and m4 max Will it easily meet these?
View on Reddit #50531657

iCruiser7@reddit (OP)

I don't play WoW but feel free to buy one from Apple and try for yourself. If it doesn't work just return it within 14 days.
View on Reddit #50541293

JBsthirdleg@reddit

https://preview.redd.it/xua2xmpeydne1.jpeg?width=1080&format=pjpg&auto=webp&s=aab50ae2b9f5ea866d5b67d27a9230bf439f03c9
View on Reddit #50531675

synn89@reddit

The RAM speed is disappointing. I'm not sure how practical the 512GB of RAM will be outside of niche MOE models that use smaller experts. It sounds great for a local Deepseek at a decent quant, but I'd really like to see what the landscape of new 200B+ models are, architecture-wise, before wanting to invest in this device. Will Llama4 405B be a MOE, or is Meta going to stick with monolithic models?
View on Reddit #50307048

WhereIsYourMind@reddit

I'm particularly interested in the use case of extended context length. With enough context length, I can feed entire repositories into context and the model has to make fewer assumptions about how to use it.
View on Reddit #50541106

Zyj@reddit

OK, so the Max is an M4 Max but the Ultra is an M3 Ultra. 819GB/s for the RAM for the M3 Ultra. German prices:

- 11874€ for the 512GB model
- 6999€ for the 256GB model (with the smaller CPU model)
View on Reddit #50291758

Abject_Radio4179@reddit

Up to 819 GB/s for the M3 Ultra. The binned part will have just slightly more bandwidth than the M4 Max.
View on Reddit #50421944

Synyster328@reddit

Dumb question but would this work similarly as a 4090 for, say, training diffusion model LoRAs? Or do those require CUDA specifically?
View on Reddit #50344283

noiserr@reddit

> It's interesting to compare this to a RTX 4090 with 96GB VRAM for $6000 (with around 1TB/s mem bandwidth).

96GB 5090 (L50? or A6000?) with like 1.7TB/s.
View on Reddit #50341435

AbominableMayo@reddit

So basically get a much better amount of RAM, similar but materially slower speeds and a full MacOS front end for the same price? Is my interpretation there off base at all?
View on Reddit #50293230

Zyj@reddit

It just shows how overpriced these RTX 4090 96GB are. The Mac memory is also overpriced i'm sure but it's hard to get 819GB/s of memory bandwidth for unified memory anywhere...
View on Reddit #50294735

AbominableMayo@reddit

Right, memory bandwidth is the only knock against the ultra vs the 4090. I’m sure the power draw difference isn’t going to be insignificant either
View on Reddit #50295521

poli-cya@reddit

Wouldn't processing speed differences also be a big difference between the two? I thought the 4090 was substantially faster.
View on Reddit #50311835

AnotherSoftEng@reddit

Based on how the previous silicon Macs have been scaling, the power draw of an M3 Ultra should be *much* less by a significant factor.
View on Reddit #50297583

-6h0st-@reddit

Mac 48 GB doesn’t equal 4090 48GB in speeds. Prompt processing and context window matters a lot for any serious use and Mac simply is orders of magnitude worse than Nvidia
View on Reddit #50307827

Final-Rush759@reddit

4090 is much faster in training models.
View on Reddit #50300804

dinerburgeryum@reddit

I don’t believe many people are proposing training, though MLX has support for it. I believe most use cases here are focused on inference.
View on Reddit #50305018

Such_Advantage_6949@reddit

Mac is still overpriced like usual. However, when you put it next to Nvidia, suddenly it doesn't look that overpriced, when the price of this 512GB Mac Studio is the same as one A6000 48GB Ada.
View on Reddit #50296800

dinerburgeryum@reddit

That really puts it in perspective…
View on Reddit #50304367

-6h0st-@reddit

Linked to GPU cores not exactly RAM. Ram is running at much lower speeds - ram for GPU usage only reaches those speeds
View on Reddit #50307703

Glebun@reddit

What do you mean? It's unified memory.
View on Reddit #50308518

-6h0st-@reddit

If you search Google you will find it - search for the Alex Ziskind YouTube channel. Memory bandwidth for the system runs at lower speeds and only RAM for GPU usage can access those speeds, therefore you can see there is a correlation between bandwidth and the number of GPU cores/RAM size (both go up). Hence the 60 GPU core version will have around 25% lower bandwidth. I haven't seen anyone test whether this correlation is with GPU core count or memory size, in other words whether a 32 GPU core Max with 48GB and with 64GB will have the same VRAM bandwidth or a different one. I believe it's bound to the GPU, since only the GPU accesses that bandwidth. Otherwise only the 512GB version of the M3 Ultra would be getting 819GB/s bandwidth, which I don't think would be the case (since the 192GB M2 Ultra was).
View on Reddit #50310894

Enough-Meringue4745@reddit

You can also network macs together for networked inferencing
View on Reddit #50305425

siegevjorn@reddit

Wait how is M4 different in MBW depending on GPU count? Memory bandwidth depends on RAM speed and I suppose RAM speed would be the same in both cases.
View on Reddit #50308682

-6h0st-@reddit

And I would add: 819GB/s for the 80 core GPU, probably 614GB/s for the 60 core GPU
View on Reddit #50307627

ReginaldBundy@reddit

German price includes 19% VAT. Most buyers will be businesses who won't have to pay VAT. However, it's just 1TB SSD. 2TB: +EUR 500, 4TB: + EUR 1200
View on Reddit #50307599

noxtare@reddit

very strange that they are using  M3 Ultra and no M4 Ultra.
View on Reddit #50293770

PongRaider@reddit

They're probably reserving it for a future Mac Pro series
View on Reddit #50388413

SeymourBits@reddit

Pretty sure they have to wait for yield to catch up as fabbing 2 perfect and adjacent M4 Max chips is relatively rare.
View on Reddit #50302467

xrvz@reddit

The M3 Ultra isn't made up of two M3 Max anymore.
View on Reddit #50325809

SeymourBits@reddit

Looks like it is exactly 2 M3 Max chips, connected: “Apple says the M3 Ultra chip is essentially two M3 Max chips fused together with its "UltraFusion" technology, so the chip's specs are all doubled compared to the M3 Max. There was speculation last year about the M3 Max chip lacking UltraFusion technology, but Apple's announcement today has proven that rumor was false.”  https://www.macrumors.com/2025/03/05/apple-introduces-m3-ultra-chip/
View on Reddit #50330397

fallingdowndizzyvr@reddit

The Ultra lags the Max by about a year.
View on Reddit #50320972

indicava@reddit

https://www.reddit.com/r/LocalLLaMA/s/LocAC0c6iV
View on Reddit #50312361

Dax_Thrushbane@reddit

My thoughts also.
View on Reddit #50298418

mxforest@reddit

It takes time to glue 2 Max chips together. They didn't use a hairdryer so the process took over a year.
View on Reddit #50297095

BaysQuorv@reddit

I wish they released a chip which had like 100x the neural engine size. Like an ultra chip but all that extra space and compute goes only to a gigantic neural engine. On my m4 running the same language model purely on the neural engine takes 1.7W, on the GPU it takes 8W. And that 8W is already much more efficient than running on a "normal" GPU. Now imagine scaling up that neural engine 100x to work at the same power draw as an nvidia gpu. It would be like having your own groq chips at home.
View on Reddit #50297177

TheElectroPrince@reddit

They're definitely doing this for their own cloud server chips that run Apple Intelligence. No way they're giving these out to consumers.
View on Reddit #50376575

SteveRD1@reddit

I think something like that is coming from Apple, but nowhere near ready. Hence they drop this dud to update the Studio.
View on Reddit #50314295

Aaaaaaaaaeeeee@reddit

From this announcement, I didn't see any increase in neural engine cores, so we can assume that they just did nothing. Hopefully I'm wrong. Made the chart based on previous info.

| Specs | Peak M2 Ultra | Peak M3 Ultra | Increase (%) |
|---------------|---------------|---------------|--------------|
| **CPU Cores** | 24 | 32 | +33.3% |
| **GPU Cores** | 76 | 80 | +5.3% |
| **NPU Cores** | 32 | 32 | 0% |
| **NPU TOPS** | 31.6 | **31.6(?)** | 0% |
View on Reddit #50306212

BaysQuorv@reddit

Same NPU = I sleep. ANE is the future, not more inefficient gpu/cpu cores
View on Reddit #50307340

AngleFun1664@reddit

How are you running models directly on the neural engine? I’d like to try that on my M1
View on Reddit #50303886

Master-Meal-77@reddit

Probably ANEMLL
View on Reddit #50305942

dinerburgeryum@reddit

[ANEMLL](https://github.com/Anemll/Anemll) is the only solution I know of, and you take a massive hit on context size and it’s Llama only right now.
View on Reddit #50305608

dissemblers@reddit

You need M3 Ultra to get > 128GB unified memory, and M3 Ultra w/80 core GPU to get 512GB

- $14099 for top spec (M3 Ultra, 32 core CPU, 80 core GPU, 512GB unified memory, 16 TB SSD)
- $9500 if you go with 1 TB SSD instead (cheapest config with 512GB memory)
- $3500 for M4 Max w/40 core GPU, 512GB SSD, 128 GB unified memory (cheapest 128GB)
View on Reddit #50292414

joninco@reddit

It has thunderbolt 5 -- so no need to buy the much larger storage. Just get an external enclosure.
View on Reddit #50300568

TheElectroPrince@reddit

If the insides don't change as much, I presume someone will reverse-engineer the NAND flash carrier PCB, and we'll get replaceable storage again like what happened with the last Mac Studios.
View on Reddit #50376030

Magnus919@reddit

This is the way. I’ve got an external NVMe RAID.
View on Reddit #50337308

Cool-Cicada9228@reddit

I’ve been paying hundreds of dollars per week for Claude credits using Cline/RooCode. I’m considering getting an M3 Ultra maxed out except for SSD (so around the $9500 price point). Can someone explain to me what I can expect to see? I’ve read that I could run R1 Q4 but I don’t know what kind of experience it is. Would I be disappointed compared with Claude? Open to any other model suggestions and expectations. I’ve also heard that you can connect 3 together. If anyone has more information about doing that, I’d consider investing in it if it means I could run R1 or something similar fully. What I don’t want to have happen is to make a big purchase and still need to use Claude for most of my coding. I’m not very experienced with hardware, so if anyone can explain how big of a jump it will be to M4 Ultra I’d appreciate it, because I don’t know if I should wait for a Mac Pro. If it’s only a marginally better or faster architecture then I’d rather buy a Mac Studio now.
View on Reddit #50338545

RikuDesu@reddit

try the 671b q4 model on OpenRouter
View on Reddit #50352569

Cool-Cicada9228@reddit

I looked but didn’t find it there. I’ll try running it in the cloud though before I make the hardware purchase. That’s a good idea I can experience the capabilities of the model but I won’t know how the speed compares with local hardware.
View on Reddit #50370334

canyonkeeper@reddit

This is gonna be once again very slow for any PyTorch code, and MLX is not mature at all
View on Reddit #50363155

lordmord319@reddit

Doesn't look that appealing to be honest. For that price you could build a nice dual socket EPYC server.
View on Reddit #50293251

Glebun@reddit

It won't be this small and efficient, and will have much slower memory
View on Reddit #50309060

lordmord319@reddit

Sure, it won't be as small or efficient, but with dual sockets we would have a theoretical bandwidth of **921.6 GB/s**, which is more than the M3 Ultra. And obviously you get the flexibility of adding more RAM. One isn't clearly better than the other, but for me I would prefer the EPYC over the Apple
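
Here is a rough sketch of where a number like 921.6 GB/s comes from. It assumes DDR5-4800, 12 memory channels per socket, and 8 bytes (64 bits) per transfer per channel; those are my assumptions for illustration, not something stated above. This is a theoretical peak and says nothing about what an inference workload actually achieves across two sockets.

```python
def ddr_peak_bandwidth_gb_s(mega_transfers_per_s: int, channels: int,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s for one socket."""
    # MT/s * bytes per transfer * channels gives MB/s; divide by 1000 for GB/s.
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000


# Assumed configuration: DDR5-4800, 12 channels per socket.
per_socket = ddr_peak_bandwidth_gb_s(4800, 12)
dual_socket = 2 * per_socket

print(f"Per socket:  {per_socket} GB/s")
print(f"Dual socket: {dual_socket} GB/s")
```

Under those assumptions each socket peaks at 460.8 GB/s, and doubling that gives the 921.6 GB/s figure.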
View on Reddit #50357347

Glebun@reddit

Oh interesting, I didn't think it'd be that fast. How much could that cost, though?
View on Reddit #50357491

Kind-Log4159@reddit

Yeah, for around 6k you can get 6-8t/s with a dual socket build. I’m conflicted whether to pull the trigger or not, but I’ll hold off because they will announce the M4 Ultra soon. It has less bandwidth than a 4090, which isn’t promising.
View on Reddit #50302305

SporksInjected@reddit

Are those new prices or used? Wouldn’t you pay $6k for just the ram if it was new?
View on Reddit #50307979

Kind-Log4159@reddit

New. Ram would be 4k or so for ddr5
View on Reddit #50334739

indicava@reddit

M4 Ultra ain’t coming
View on Reddit #50312286

chaddone@reddit

I am considering buying the maxed out new Mac Studio with M3 Ultra and 512GB of unified memory as a CAPEX investment for a startup that will be offering a local LLM interfaced with a custom database of information for a specific application. The hardware requirements appear feasible to me with a ~15k investment, and open source models seem built to be tailored for detailed use cases. Of course this would be just to build an MVP; I don't expect this hardware to be able to sustain intensive usage by multiple users. How feasible is that?
View on Reddit #50323271

RikuDesu@reddit

yeah you'd have to process one prompt at a time, there are ways to queue them but if you have a lot of people hitting the server it would be hard. Also not all models are optimized for MLX, most are cuda optimized so you may be limited when trying out some specific fine tunes
View on Reddit #50352660

blacPanther55@reddit

Use the "education" discount and save yourself 3k.
View on Reddit #50326665

tnnnn@reddit

Waiting for M4 Ultra with 1TB ram! /s
View on Reddit #50296831

Soft_Constant_7355@reddit

For an amazing price of $29,999! \*sarcasm, but maybe not sarcasm\* :(
View on Reddit #50346789

Sudden-Lingonberry-8@reddit

finally a computer that can run deepseek
View on Reddit #50315858

gripntear@reddit

The question is how fast can it process my prompts when allocating 32k context for LLMs up to 123B, like Mistral Large. Given all that, if it can output 250 tokens at decent speeds, regardless of context size, I would fucking get one right away because, holy shit, this is what I have been waiting for.
View on Reddit #50329933

Ok_Warning2146@reddit

Prompt processing should be around 80% of a 3090. If you are happy running Mistral Large with multiple 3090s, then it should be good enough for you.
View on Reddit #50345167

ortegaalfredo@reddit

These specs are good. I would like to know how they compare to the equivalent GPU. The advantage of GPUs is that you can batch requests. While a single individual prompt can run at 15 tokens per second in a GPU, you can run 20 prompts in parallel to achieve an effective throughput of hundreds of tokens per second. Can this be done on a Mac?
View on Reddit #50304376

Ok_Warning2146@reddit

It is a slightly weakened 3090 with 512GB at max config as it gets 114.688TFLOPS FP16 vs 142.32TFLOPS FP16 for 3090 and memory bandwidth of 819.2GB/s vs 936GB/s.
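
To put those figures side by side, here is a quick ratio check. The TFLOPS and bandwidth numbers are the ones quoted above; the dictionary names are just illustrative.

```python
# Figures as quoted in this comment.
m3_ultra = {"fp16_tflops": 114.688, "bandwidth_gb_s": 819.2}
rtx_3090 = {"fp16_tflops": 142.32, "bandwidth_gb_s": 936.0}

compute_ratio = m3_ultra["fp16_tflops"] / rtx_3090["fp16_tflops"]
bandwidth_ratio = m3_ultra["bandwidth_gb_s"] / rtx_3090["bandwidth_gb_s"]

print(f"FP16 compute: {compute_ratio:.1%} of a 3090")
print(f"Bandwidth:    {bandwidth_ratio:.1%} of a 3090")
```

On paper that works out to roughly 81% of the compute and about 88% of the bandwidth of a single 3090.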
View on Reddit #50345015

DinoAmino@reddit

I understand these things do quite well with simple prompts and no to little context. Is this device going to perform well when using 16k to 32k context or will performance plummet?
View on Reddit #50309857

ortegaalfredo@reddit

Yes, that's what I want to know before dropping 10k in hardware. GPUs still have massively more compute than a mac, not only memory bandwidth.
View on Reddit #50316812

Glebun@reddit

What's an equivalent GPU with 512GB VRAM?
View on Reddit #50309478

AutomaticDriver5882@reddit

Are there benchmarks on this? I find it hard to believe this performs like a 4090.
View on Reddit #50340237

AbheekG@reddit

Yaaaaaaayyyyy!!!!
View on Reddit #50337514

Spirited_Eggplant_98@reddit

This seems like a decent deal. I just saw a YouTube video where NetworkChuck tested 5 M2 Max Studios with 64GB of RAM each running deepseek r2 with exo - that’s at least $10k in hardware and only gets you 320GB of RAM, and Thunderbolt 4 is nowhere near 800GB/s.
View on Reddit #50337001

Mediocre-Ad9008@reddit

Wow, wasn't expecting the M3 Ultra at all at this point. Everyone said the M3 line was dead.
View on Reddit #50301369

thrownawaymane@reddit

The TSMC process node it’s based on was supposedly a dud so this is a surprise.
View on Reddit #50335725

extopico@reddit

I will not buy it but this is cheap. I’ve only recently started using a MacBook Pro and it’s a beast. For anyone used to Linux getting a super powerful Mac is hugely appealing.
View on Reddit #50330473

Chelono@reddit

>Up to 16.9x faster token generation using an LLM with hundreds of billions of parameters in LM Studio when compared to Mac Studio with M1 Ultra, thanks to its massive amounts of unified memory.

Yeah, cause it fits and doesn't use disk (swap)... Can't wait for actual numbers
View on Reddit #50291632

Mochila-Mochila@reddit

Strix Halo-tier marketing bragging.
View on Reddit #50328712

MoffKalast@reddit

Given how Apple prices SSDs, it's gonna be really funny when people have less disk than RAM.
View on Reddit #50299436

v00d00_@reddit

Time for RAMdisk to make a comeback
View on Reddit #50316495

sluuuurp@reddit

Yeah, that’s what they said, “thanks to its massive amounts of unified memory”
View on Reddit #50293594

Remote_Cap_@reddit

They're desperate.
View on Reddit #50292436

Doublespeo@reddit

At 512GB… is it possible to just put the whole system in memory??? like a F1 SSD?
View on Reddit #50325472

NNN_Throwaway2@reddit

Pretty tempting as it sounds like the chances of an M4 Ultra chip are remote to none.
View on Reddit #50323214

Mochilongo@reddit

M3 Ultra instead of M4 Ultra 😭
View on Reddit #50312656

Xyzzymoon@reddit

> biggest bottleneck for Macs is the memory bandwidth

Not in the context of LLM. A 4090, for example, only has 1008 GB/s. Slightly more than an M2 Ultra, but as long as the model fits, 4090 is around 4 times faster. Even underclocking the memory speed on the 4090 doesn't yield a significant drawback. This suggests that the bottleneck on the M2 Ultra is most likely Processing.
View on Reddit #50318644

Mochilongo@reddit

2 different architectures, maybe I didn’t express myself correctly. In the Mac ecosystem the bottleneck right now is the memory bandwidth.
View on Reddit #50322735

SteveRD1@reddit

Wondering now if M4 Ultra (M5 Ultra?) will be reserved for the Pro to give it some distinction from the Studio line. I am disappointed too.
View on Reddit #50314161

maddogawl@reddit

I'm out of kidneys to sell for computers and computer accessories!
View on Reddit #50322566

davewolfs@reddit

The Apple Chips are great machines but I am not convinced the hardware will be capable of running a model that large at adequate speed. I hope to be proven wrong but the M3 and M4 max chips don’t do that well with anything beyond 32B. The so called thinking models output way too many tokens for them not to be running at least 40-60 tokens per second if you want an adequate experience.
View on Reddit #50322328

fallingdowndizzyvr@reddit

Shit. I didn't think they would go 512GB. But it's great that they are holding price line with the 256GB model. That's the same price as the M2 Ultra with 192GB.
View on Reddit #50320064

Turbulent-Week1136@reddit

Sorry for the noob question but how does this compare for training or fine tuning? Do these specs still only make it better for inference or does it make training easier/faster?
View on Reddit #50316644

BumbleSlob@reddit

So you can run Unsloth DeepSeek R1 on the M3 Ultra / 256GB RAM at home for $7k (it needs 160GB (V)RAM), while still having room for smaller models to use in speculative decoding. Very interested to see what real world tokens per second you could get out of this. To be clear this is still super expensive but it’s getting DeepSeek R1 closer to hobbyist households. I’d probably be willing to throw $5k at a solution that can run it at home at a reasonable throughput.
View on Reddit #50293049

SubstantialSock8002@reddit

On my M1 Ultra Mac Studio I get 13.8 t/s with Llama 3.3 70B Q4 mlx. M1 Max to M4 Max inference speed seems to roughly double, so let's assume the same for M1 Ultra to M3 Ultra. Accounting for 2x faster performance, \~9.5x more parameters, Q2 vs Q4, it seems like you'd get closer to 5.8 t/s for R1 Q2 on M3 Ultra? It's definitely awesome that you can run this at home for <$8k, but I feel like using cloud infrastructure becomes more attractive at this point.
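
Spelling out that back-of-the-envelope estimate as a sketch: it assumes tokens per second scale linearly with hardware speedup, inversely with parameter count, and inversely with bits per weight (so Q2 reads half the data of Q4). The function name and the linear scaling model are illustrative assumptions, and the estimate treats R1 like a dense model.

```python
def scale_tokens_per_s(base_tps: float, hw_speedup: float, param_ratio: float,
                       base_bits: int, target_bits: int) -> float:
    """Scale a measured tokens/s figure to a different machine, model size, and quant."""
    return base_tps * hw_speedup / param_ratio * (base_bits / target_bits)


estimate = scale_tokens_per_s(
    base_tps=13.8,    # measured: Llama 3.3 70B Q4 on M1 Ultra
    hw_speedup=2.0,   # assumed M1 Ultra -> M3 Ultra speedup
    param_ratio=9.5,  # R1 has roughly 9.5x the parameters of a 70B model
    base_bits=4,      # Q4
    target_bits=2,    # Q2
)

print(f"Estimated R1 Q2 speed on M3 Ultra: {estimate:.1f} t/s")
```

Since R1 is an MoE and only part of the weights is read per token, this dense-style scaling may well be pessimistic.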
View on Reddit #50316099

Individual_Holiday_9@reddit

Wild how spendy these things are. I’m sitting here plodding along with my m4 / 24gb
View on Reddit #50291969

OverCategory6046@reddit

Fully maxxed out one comes to 18k USD in the UK lmao
View on Reddit #50293431

DirectAd1674@reddit

Sure, but you can't carry a server to a friend's house. You can stuff a Mac Studio in a backpack.
View on Reddit #50294120

MagicZhang@reddit

Just curious, what activities at a friend’s house would require 512GB of RAM?
View on Reddit #50297024

Sudden-Lingonberry-8@reddit

run half deepseek
View on Reddit #50315819

darth_chewbacca@reddit

I do this to show my poverty stricken friends just how fucking rich I am.
View on Reddit #50298154

OverCategory6046@reddit

I already talk about AI enough as is, I don't want to put that burden on my poor friends.
View on Reddit #50298519

animealt46@reddit

That's a very solid machine lol. Comparison is the thief of joy I guess. You can rock models that bring Nvidia 12~16GB users to their knees.
View on Reddit #50298253

nonsoil2@reddit

In italy, 11k€ for the 512gb ram, 1tb ssd(minimum), m3 ultra.
View on Reddit #50291486

robertotomas@reddit

In the us you would pay taxes on top of the numbers you see, in europe the VAT is built in
View on Reddit #50294590

Cergorach@reddit

Sales taxes (VAT) depend on the state and they tend to be a lot lower than in the EU. I'm from the Netherlands and we pay 21% VAT, and that's not even the highest in the EU. The US version is \~$9500, the Dutch one is almost €12k. $1.00 is worth about €0.93, so when we do the conversion and add the VAT, our prices are around 10% higher than in the US.
View on Reddit #50312026

robertotomas@reddit

Yes, you guys do a lot more funding via VAT. President Trump mentioned he was interested in doing that too
View on Reddit #50312573

Cergorach@reddit

You have services? Free homeless on your lawn? ;)
View on Reddit #50315479

42nd_loop@reddit

It would literally be cheaper to get on a flight to Oregon or another state without a sales tax just to buy it lmao
View on Reddit #50310048

robertotomas@reddit

HEH. But VATs serve a purpose that Oregon never will
View on Reddit #50311044

Glebun@reddit

Same price as the US (plus tax)
View on Reddit #50308608

nonsoil2@reddit

That’s a first
View on Reddit #50308889

Glebun@reddit

Not really
View on Reddit #50309186

tzujan@reddit

This makes me wonder what NVIDIA will do with Project DIGITS. I know that Nvidia limits their consumer GPUs so that they can charge a fortune for their enterprise GPUs. There's also quite a bit of buzz, at least in my little curated feed, for systems that can run and even train larger local models such as EXO Labs. It seems like NVIDIA could really crush it in the Local Model market if they wanted to.
View on Reddit #50315439

NootropicDiary@reddit

My body is ready
View on Reddit #50306304

SeymourBits@reddit

But, is your wallet?
View on Reddit #50314040

The_Hardcard@reddit

This will crush with reasoning MoE’s like Deepseek R1. The bandwidth for generating hundreds if not thousands of reasoning tokens at 37 billion active parameters I think will put both the M4 Max and M3 Ultra ahead of Strix Halo and Project Digits.
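
A rough way to put a ceiling on generation speed, as a sketch: assume decoding is purely memory-bandwidth bound and every active weight is read once per token. This ignores KV cache reads, compute limits, and runtime overhead, so real numbers will be lower. The 4-bit quantization is my assumption for illustration; the bandwidth figures and the 37 billion active parameters are the ones mentioned in this thread.

```python
def max_tokens_per_s(bandwidth_gb_s: float, active_params_billions: float,
                     bits_per_weight: float) -> float:
    """Upper bound on decode speed if each token reads all active weights once."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_billions * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


# Assumed: 37B active parameters at 4 bits per weight.
m3_ultra_ceiling = max_tokens_per_s(819.0, 37.0, 4.0)
m4_max_ceiling = max_tokens_per_s(546.0, 37.0, 4.0)

print(f"M3 Ultra ceiling: {m3_ultra_ceiling:.1f} t/s")
print(f"M4 Max ceiling:   {m4_max_ceiling:.1f} t/s")
```

These are ceilings, not predictions. The whole model still has to fit in memory, which the bandwidth figure alone does not tell you.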
View on Reddit #50312853

-6h0st-@reddit

Let’s wait for tests from first owners. But I’m doubtful it will be any good for any serious usage. Macs do suck with fine tuning or big context window or bigger prompts.
View on Reddit #50308136

Puzzleheaded-Dust268@reddit

128Gb M4 Max MacBook Pro vs same spec Mac Studio 🤔. Any views? I am going for a high spec machine for an MSc project using transformers, etc.
View on Reddit #50308051

gintrux@reddit

Imagine that this is gonna be like 1000$ laptop someday in the future
View on Reddit #50307121

Regrets_397@reddit

The 512GB option is a bigger deal on the ultra than having only 2xM3 Max instead of 2xM4 Max. Looking forward to getting my hands on one of these refurbished in a few years lol
View on Reddit #50306666

Only-Letterhead-3411@reddit

512 gb RAM is amazing for huge MoE models like R1. Not so good for huge dense models like 405B. Price is terrible. I don't think it's worth that price tbh. But it's Apple. I guess no one is surprised. I'm sad that affordable 64/96 gb mac studio is no longer an option like it used to be for M2 Max one.
View on Reddit #50292675

TaloSi_II@reddit

how are you getting 500 gb of vram for less?
View on Reddit #50298076

Only-Letterhead-3411@reddit

[EPYC 9334 CPU + Motherboard](https://www.ebay.com/itm/186024089736?_skw=epyc%20motherboard%20cpu%20combo%20sp5&itmmeta=01JNKK0S8CYSFPETQ4KPPPNNVC&itmprp=enc%3AAQAKAAABAFkggFvd1GGDu0w3yXCmi1cv6BJxhVmKioCpkwhXSOagZn3aap%2F2ZO6q8rZK%2BMtaHiWtbiV3LzoQdWQgLwk8FSJf%2BwuLnXrbbLYKlm9N%2FxOPXHWLNE%2F2M3g%2FkyvGvutipUDcZxoStxIfcjJ4jFd5%2FcAwdSPewTE%2F3BdiJbDo7W97BsZ28pGGpwXuj82XSmOzDea%2FmiXCfsjyE%2BgK5Wfbp4Wkur%2BxXfxCYAo%2BR5O5oyHo7JLdUwgJMd0eGzC1PDwRoWEdjzWEQxFKv7SQyE4o0QemR1XcuDCyuvU%2FzDdW6w0dyT%2BJ1bGdEHQpTvBMx9rkQun0fQ%2FzRdePSN0F0mPzcGo%3D%7Ctkp%3ABFBMrpSD86xl) \- $1.500

12x [32gb DDR5 RAM](https://www.ebay.com/itm/116454950032?_skw=ddr5%20ram%2032gb&itmmeta=01JNKK95NRR7J2NMVSP62X9QHN&itmprp=enc%3AAQAKAAAA0FkggFvd1GGDu0w3yXCmi1erSEmmMgbYq7x2QOAq52alOd2HX9EHxvdjyh27g5emCVyGaMhBYcCrZsITCQLGkkJR1ExvsJEMcsbx6FB%2F%2BIp9eu15Y2RE%2B2SvZjJhRnIFR3YQLFBb%2Bbl8V5eMP1AzyEcKtzP9A9RaJVkoxPRnQL916GT5oVfGVGC9w7YyljtSmCE9wz6BrhD%2Bn5cOtcTnlHuLOoVxNm4KmBkGUPhU5TElKipxahhd5IVRc0BWyu4RW27mgOmi0nZ1r5hKabnafTQ%3D%7Ctkp%3ABk9SR4LbpPOsZQ) \- $1.000

Other stuff like cooler, 1 TB nvme etc. - $250

= 2.750$ - 384gb RAM, 460gb/s bandwidth, 1 TB ssd - **384gb RAM should be enough for running MoE models like R1, can get 64gb sticks instead if need more RAM, can upgrade RAM and storage yourself, can use linux instead of macos**

M4 Max = 3.700$ - 128gb RAM, 409gb/s bandwidth, 1 TB ssd - **costs more for less bandwidth**

M3 Ultra = 9.500$ - 512gb RAM, 800gb/s bandwidth, 1 TB ssd - **costs 3.5x more, theoretically only 50% faster at inference**
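
One way to compare the three options is dollars per GB/s of memory bandwidth. This is a minimal sketch using the prices and bandwidth figures listed above; the metric itself is just an illustrative choice and ignores RAM capacity, power draw, and real-world inference speed.

```python
# (price in USD, memory bandwidth in GB/s), as listed in this comment.
builds = {
    "EPYC 9334 build": (2750, 460),
    "M4 Max": (3700, 409),
    "M3 Ultra": (9500, 800),
}

cost_per_gb_s = {name: price / bandwidth for name, (price, bandwidth) in builds.items()}

for name, cost in cost_per_gb_s.items():
    print(f"{name}: ${cost:.2f} per GB/s")
```

By that metric the EPYC build comes in at about $6 per GB/s, versus roughly $9 for the M4 Max and $12 for the M3 Ultra.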
View on Reddit #50305325

codingworkflow@reddit

Models so big will be still damn slow and you are not close to run R1 Q8. I feel the unified is quite hyped. Best 80-90 GB Vram Gpu setup.
View on Reddit #50302188

roshanpr@reddit

Should I return my 5090 and buy one of these?
View on Reddit #50295994

mxforest@reddit

Depends on the kind of models you want to run.
View on Reddit #50297002

roshanpr@reddit

I do a lot of crap that uses CUDA; for LLMs I have a minibox with 96GB RAM and an Nvidia GPU
View on Reddit #50297828

Dax_Thrushbane@reddit

Not sure how I feel about this. 512GB RAM was definitely on the cards, but only M3? Le sigh.
View on Reddit #50297659

pseudonerv@reddit

M3 Ultra is two M3 Max soldered together, right? We need M4 Ultra, it should be more than 1TB/s.
View on Reddit #50297516

davewolfs@reddit

10K so I can run R1 on some local system instead of Firebase. Interesting but probably not worth it. I wonder how fast it will run.
View on Reddit #50296628

undefinex@reddit

I'm assuming the limiting factor for running a full model like R1 will be its memory bandwidth at this point? How many toks/s can the maxed out memory config expect?
View on Reddit #50296555

Feisty-Pineapple7879@reddit

Now this is a proper AI Inference Hardware.
View on Reddit #50291526

mxforest@reddit

Tim Cooked with this one. Based on the RAM configs and their examples, it seems to be aimed directly at being an R1 machine without saying it out loud, to avoid backlash from 🥭 for supporting China.
View on Reddit #50296003

AaronFeng47@reddit

I know I don't really need this, but I want this...
View on Reddit #50295656

Solaranvr@reddit

Mama Lisa Su, whatever you do with the Strix Halo sequel, please release a competing SKU to this.
View on Reddit #50295399

bmo333@reddit

Heysus Christ!!!
View on Reddit #50295371

tibbon@reddit

Consider that if you have a bona fide business _need_ for that much memory, then this is probably well within a reasonable budget. If this is a _want_ then the price probably seems absurd and that's ok.
View on Reddit #50294614

AaronFeng47@reddit

Disappointing, the memory bandwidth is basically the same as the M1 Ultra
View on Reddit #50293788

AaronFeng47@reddit

But the 512gb one should be perfect for local DS V3&R1 + Qwen2.5 Max, since these models are MoE
View on Reddit #50294510

sluuuurp@reddit

546 GB/sec memory bandwidth. So just over one token per second if you run the largest model that fits in the unified memory (with no mixture of experts or speculative decoding).
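
The arithmetic behind that estimate can be sketched roughly as follows, assuming decoding is purely memory-bandwidth bound (each generated token streams the active weights once) and ignoring KV cache, compute, and runtime overhead. The MoE line is an illustrative assumption, not something from the comment: about 37B active parameters per token (the figure DeepSeek reports for R1) at a hypothetical 4 bits per weight. The function names are made up for this sketch.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a parameter count and quantization."""
    return params_billion * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Bandwidth-bound ceiling: each token streams the active weights once."""
    return bandwidth_gb_s / gb_read_per_token


# Dense model filling all 512 GB, using the 546 GB/s figure from the comment.
dense_ceiling = max_tokens_per_second(546, 512)

# Assumed MoE case: ~37B active parameters at 4 bits per weight.
moe_active_gb = weights_gb(37, 4)
moe_ceiling = max_tokens_per_second(546, moe_active_gb)

print(f"dense, 512 GB of weights: {dense_ceiling:.2f} tok/s ceiling")
print(f"MoE, {moe_active_gb:.1f} GB active per token: {moe_ceiling:.1f} tok/s ceiling")
```

These are upper bounds, so real throughput will be lower. Swapping in a different bandwidth figure only changes the first argument.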
View on Reddit #50293816

Krazie00@reddit

I’ll wait for reviews, looking forward to them.
View on Reddit #50293794

Least_Expert840@reddit

Must. Resist. Clicking.
View on Reddit #50293477

AaronFeng47@reddit

No m4 ultra?
View on Reddit #50293442

NeedsMoreMinerals@reddit

Damn with 512gb of unified memory you could run some serious AI models
View on Reddit #50292758

MannowLawn@reddit

512 isn’t cutting it anymore, next year 1024 I’ll have another look.
View on Reddit #50292733

albus_the_white@reddit

Can't wait for reviews.
View on Reddit #50292285

nrkishere@reddit

> offers an up to 80-core GPU, more than any Apple silicon chip; a powerful 32-core Neural Engine for on-device AI and machine learning (ML)

Holy hell!! MLX go brrrrrrr
View on Reddit #50292186

bullerwins@reddit

Really looking forward to the benchmarks. Let's hope someone reviews the 512GB variant with R1; you can probably fit Q6 in there. It's definitely more power efficient than the cpumax or gpumax way, but I'm not sure about the performance. Realistically you can probably fit 8 3090s in a rack, but that's less than half the VRAM, and it will cost around 9K for a setup like that.
View on Reddit #50291907