I'm strong enough to admit that this bugs the hell out of me

SadEntertainer9808@reddit

It's almost as if a device that's designed end-to-end by the same firm, largely stamped out of a single die, and built with a robotic degree of precision that hand-assembly could never achieve, using parts too finicky to be user-serviceable, might actually just be better, and that the guys assembling a "perfect" workstation are maximizing an aesthetic instead of maximizing performance. If you saw a guy gluing together a car in his garage and actually thinking it might beat a Porsche you wouldn't even think he was insane. You'd just think he was stupid.

ForsookComparison@reddit (OP)

Thread is older than I am and my knees hurt

Cergorach@reddit

If this is the case, someone sucks at assembling a 'perfect' workstation. ;) Sidenote: Owner of a Mac Mini M4 Pro 64GB.

o5mfiHTNsH748KVq@reddit

I'm pretty happy with my 512GB M3 Ultra compared to what I’d need to do for the same amount of VRAM with 3090s. Spent a lot of money for it, but it sits on my desk as a little box instead of whirring like a jet engine and heating my office. I wish I could do a CUDA setup though. I feel like I’m constantly working around the limitations of my architecture instead of being productive building things.

bigh-aus@reddit

I really think the jumps up the stack are:

- 512GB M3: $10k
- 4x RTX 6000 Pro Max-Q: $36k
- 4x H200 NVL: $130k (or at this point you're into second-hand DGX stations)

I think it depends on your use case and what you want to do. $10k is a bit much to spend on privacy (now). I'm hopeful that IF the public models ramp up their cost, there will be more hardware options to run locally by then.

Suitable-Program-181@reddit

Cuda is trash, asahi unlocks M chip better than cuda will with any rtx.

Cergorach@reddit

I agree, your M3 Ultra 512GB is a LOT more energy efficient and cheaper than 21x 3090... But it's not faster than that 3090 card. Which is what the meme is hinting at.

o5mfiHTNsH748KVq@reddit

Right, yeah, it's definitely not faster.

ArtfulGenie69@reddit

If mac wasn't so slow I'd have got one too. All the hardware is weirdly only taking care of one aspect of the ai. Like Mac can run a big model slowly but is expensive. Nvidia has speed and good drivers and many projects take advantage of cuda but the price per GB of vram is very high. AMD sucks at drivers and almost none of the new gits work out of box with it and it is slow but it's like half the price of Nvidia and you don't need the big bucks a Mac investment would take.

Gallagger@reddit

"AMD sucks at drivers and almost none of the new gits work out of box with it" The new gits work with Mac though? Thought they don't work because they're cuda based = needs Nvidia.

ArtfulGenie69@reddit

Maybe it's about the same for amd/mac? Like with both you are buying only inference. I think when you stay in the realm of comfyui and the toys that have had some time to mature they work ok after you get past the rocm hurdles but they are much slower on the amd and Mac. I'm kinda thinking more about when the new models drop or using some extra fast inference like vllm or exl3 or all the tts models that come out. They are all only going to work through Nvidia hardware from the jump as they were all developed on it, at least that's what it looks like when I'm installing them all the time. Like I was saying I don't think you would want to train wan/qwen/flux or any of the tts models on an amd card or a mac. It sounds like hell just getting amd to infer but maybe it's just rocm. It's like every one of their cards has some different reason why this or that version of rocm doesn't work. It sounds so fucking tedious.

blazze@reddit

The M5 Ultra is going to match the RTX 3090.

CryptoCryst828282@reddit

Not even close. I have an M3 Ultra and t/s isn't bad, but once you load up on context the prompt processing (PP) time is just stupid slow, and no one really talks about that part. I don't know what makes it so bad, but it's garbage at higher context.

blazze@reddit

All M series chips pre-M5 are also "stupid slow" on Flux and Stable Diffusion. Those NPU cores make a difference.

CryptoCryst828282@reddit

I will say I haven't tried the newest stuff, but it always felt different to me. They were great for chatbots and stuff like that, but if you want to do anything agentic, they seemed to be a bit lacking.

Cergorach@reddit

Possibly, but by the time that comes out the 3090 will be 6 years old and a 5090 will still have 2x the memory bandwidth. And a 6090 is not that far away (or already out)... Neither is inherently a better solution than the other, each has its use. The point here was 'faster'... The Mac solution is a lot of things, but faster isn't one of them.

OrneryMinimum8801@reddit

Isn't the M3 Ultra around 819 GB/s of bandwidth? That's less than a 3090 or 4090, but you should make up for the bandwidth gap with the monster amount of RAM, which allows a huge context window to be held in memory (dollar for dollar). I mean, what does a system cost for 15x 4090 networked together?

blazze@reddit

M5 is a generational shift that will come close to challenging Nvidia's GPU dominance, similar to the competition from Google's TPU.

ErisLethe@reddit

Your 3090 costs over $1,000. The performance per dollar favors Metal.

Nepherpitu@reddit

3090 costs around $500, not even close to 1000. Otherwise yes, it's impossible to assemble 20+ gpus cheaper than single mac.

polikles@reddit

Depends on where you live. In my location a 3090 can be between $700 and $1200, provided you can find any in working condition.

ArtfulGenie69@reddit

I'm guessing they are underselling the price. For a used one it's been riding between $700 and $1000.

polikles@reddit

Yeah, I think so. There is a subtle difference between how much a thing costs and how much someone feels it 'should' cost. I just checked, and in my location most of the 3090s are $800+ (after conversion) right now. Mind that there aren't a lot of them available, as they're all used cards. I got my 3090 for slightly less than $700 last year, as it had a broken fan and a caked cooler. I've replaced the fans and thermal pads and it's as good as new.

ArtfulGenie69@reddit

I did the exact same thing, just with a less lucky buy on some of them. One had a fan issue that the seller was trying to hide, but they had dropped the price to like $800. A new full set of fans is under $20, and because the fans were bad I did thermal pads too. Boy does it shed a lot more heat with the new pads. Put them on the back plate too. I was lucky with the other 3 3090s and didn't have to put in that effort though.

Cergorach@reddit

No, a 3090 is a lot better for performance per dollar, Macs are expensive and the 3090 is very fast! No, Apple is king on memory per dollar vs VRAM. And neither is what the meme was saying.

Individual-Source618@reddit

No, it is not. If that were the case, you would have Mac Ultras in datacenters.

Cergorach@reddit

The number of companies actually running 3090 cards in datacenters is *extremely* limited; there are probably some, but nothing I would call 'professional' or 'Enterprise'. An H200 server with 8 cards costs half a million, has 1128GB of VRAM, and uses about 7+kW, but is utilized in most cases almost 100%. Those servers are great for multi-user loads. An M3 Ultra isn't great at multi-user LLM loads; it's mostly used for either lab work or single-user work, and even then the load is a fraction of what a dedicated H200 server handles. That H200 server at idle draws more power than the M3 Ultra under load. 21x 3090 cards (+ hardware) are even more power hungry than a single H200 server, and way slower and less versatile. Not surprising for 5 year old hardware. Hence I talked about energy efficiency, not per token, but over the whole time those machines are on and used by a single user (workstation). A 3090/4090/5090 or even a 6000 Pro can be a great choice in certain scenarios where the model fits in the VRAM and produces good enough output. But in most cases, it doesn't. Thus you're generally better off with a solution that has more memory, but is slightly slower. Unless of course you have the money for H200 servers, then the equation becomes totally different. We're also not talking about datacenter servers, we're talking about home workstations specifically.

The_Hardcard@reddit

Is there a workstation setup that can hold, power, and orchestrate enough 3090s for 512 GB RAM? I can see getting 6 6000 Pros in a rig for significantly more money than an M3 Ultra.

BumbleSlob@reddit

Don’t discount how much power it takes for the Apple chip vs the 22 3090s it would take to get equivalent VRAM. Back-of-the-napkin math: 22 3090s at 350 watts apiece comes to 7,700 watts, versus the M3 Ultra, which I think maxes out around 300 itself.
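For anyone who wants to redo the napkin math, here is a minimal sketch in Python. It assumes 24 GB of VRAM and 350 W per 3090 and counts GPU power only, ignoring the host system. The function names are made up for illustration.

```python
import math

def cards_needed(target_gb, vram_per_card_gb):
    """Smallest number of cards whose combined VRAM reaches target_gb."""
    return math.ceil(target_gb / vram_per_card_gb)

def total_gpu_watts(cards, watts_per_card):
    """GPU-only power draw; CPU, fans, and PSU losses are not counted."""
    return cards * watts_per_card

cards = cards_needed(512, 24)        # 512 GB target, 24 GB per 3090
watts = total_gpu_watts(cards, 350)  # assumed 350 W per card

print(f"{cards} cards, {watts} W GPU-only")
```

Lowering the per-card power limit changes the wattage but not the card count.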

cyberdork@reddit

Running 3090 at 350W is plain stupid. Needs no more than 275W with very little performance hit.

CryptoCryst828282@reddit

200 isn't even that bad of a hit.

TokenRingAI@reddit

Yes, but with 24x the memory bandwidth and compute.

Ill_Barber8709@reddit

Memory bandwidth doesn't scale like that... Single card compute is useless already for inference. Imagine 22 times more compute. 22 times more useless.

CheatCodesOfLife@reddit

Was this logic generated by an LLM that fits on a 1060 3GB?

MitsotakiShogun@reddit

> instead of whirring like a jet engine and heating my office

ngl, summers are tough... but I haven't turned on the apartment's heating for the last 2 winters, and I'm getting refunds on the utility bill at the end of the year.

Sufficient-Past-9722@reddit

I solved this with... putting that beast in the basement and running a single Ethernet cable to it.

MitsotakiShogun@reddit

No basement, just a single, small apartment that I vastly overpay for. Like a proper European 🇪🇺

Sufficient-Past-9722@reddit

Haha I just moved out of Europe where I had a good basement, to Asia where I'm hoping to find a 40m2 place for 4 people. Mayyybe I'll get a balcony for the server to live on.

polikles@reddit

balcony for server? Aren't you afraid of dust?

Sufficient-Past-9722@reddit

Yeah it complicates things for sure. Air cooling would require intake filters and probably a custom case, and condensation could be an issue if I'm not using it 24/7. That said the balconies here are enclosed, so the exposure is a lot lower. Currently I'm leaning towards a split setup like an air conditioner uses, with radiator (big ass Mora IV) outside and water hoses going through the wall.

Medium_Ordinary_2727@reddit

Wouldn’t a GPU be a less-efficient heater for your apartment, than a heater? Your utility bills should go up if kept at the same temperature.

Soggy_Audience_6706@reddit

the meme is about a REAL PC computer not overpriced Mac joke shit

Novel-Mechanic3448@reddit

You don't have tensor cores. M5 does and is 3-5x faster.

apVoyocpt@reddit

There is quite a difference in speed:

- M4 (base): 120 GB/s
- M4 Pro: 273 GB/s
- M4 Max: 546 GB/s
- An Ultra would be around 900 GB/s

The higher the throughput, the faster inference is.
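A rough way to see why bandwidth matters: during token generation, a dense model has to read roughly all of its weights from memory for every token, so bandwidth divided by the bytes read per token gives a ceiling on tokens per second. The sketch below computes only that ceiling, assuming a hypothetical 20 GB dense model. Real numbers land lower, and MoE models read only part of their weights per token.

```python
def decode_ceiling_tps(bandwidth_gb_s, weights_read_gb):
    """Upper bound on tokens/s if decoding is purely memory-bandwidth bound."""
    return bandwidth_gb_s / weights_read_gb

# Bandwidth figures as quoted in the comment above.
chips = {"M4 (base)": 120, "M4 Pro": 273, "M4 Max": 546}

MODEL_GB = 20  # assumption: hypothetical dense model, ~20 GB of weights

for name, bandwidth in chips.items():
    ceiling = decode_ceiling_tps(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

This says nothing about prompt processing, which depends on compute rather than bandwidth.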

Cergorach@reddit

The M3 Ultra, released this year, runs at 819.3 GB/s; the 5 year old RTX 3090 runs at 936 GB/s. This year's 5090 runs at 1.8 TB/s...

apVoyocpt@reddit

Yes I know, but try running the 120b OpenAI model on it. Or linking two together to get more ram.

Cergorach@reddit

Maybe, but you have an even bigger issue with clustering multiple Mac Studios together over Thunderbolt 5 (or worse, over 10Gbit networking) and trying to run a Deepseek full model on there (without quantization). It's always about the right tool for the job. And using a model that doesn't fit in the VRAM or unified memory is not using the right tool.

apVoyocpt@reddit

https://www.youtube.com/watch?v=4l4UWZGxvoc

mxforest@reddit

24 GB available at 936 vs 512 GB available at 819.3. The cliff after the small memory of 1 or many 3090s fills up is pretty sharp. And the models that do fit are not smart enough for professional workload other than some very basic stuff. I run 8bit GLM 4.6 at 200k context and still have 62 GB left for everything else. It's a beast.

Rabo_McDongleberry@reddit

I own the basic M4 mini. And on that machine I do basic hobby stuff and peeing my niece and nephew learn AI (under admin supervision). For that kind of stuff it's great. But I wouldn't push it beyond that...or can't.

holchansg@reddit

Yeah... M4 Mini bandwidth is 120 GB/s. The only Macs that are worth it are the Max and Ultra. The AMD AI 395 is cheaper and has the same bandwidth as the Pro, without the con of being ARM, dedicated TPU...

pastelfemby@reddit

Con of being ARM? It hasn't been a con in some time unless you're a Windows user.

CheatCodesOfLife@reddit

Easier for the vendor to lock it down.

zipzag@reddit

An Apple user is going to choose a Mac, and the Pro version at a minimum. Even the 800 GB/s in my M3 Ultra isn't fast. 120 GB/s for chat is rough. I expect a lot of people are disappointed. There's no point in buying a shared memory machine and running an 8B because it's the size that feels fast enough. Just buy the video card.

calcium@reddit

> I expect a lot of people are disappointed.

I doubt many people are running an LLM locally? Even when I run them on my M1 Pro MBP I get around 17 tokens/s which is sufficient for my needs - it's able to generate text faster than I can read it.

recoverygarde@reddit

Depends. gpt-oss flies, as do the Qwen VL models.

holchansg@reddit

Yeah, it's not only the model size but the context size too. At huge context sizes it's painfully slow.

recoverygarde@reddit

AI 365 is slower than M4 Pro and even the base M3 is decent depending on what you’re using it for

holchansg@reddit

Yeah, since M4 they bumped the Pro Bandwidth.

Ill_Barber8709@reddit

> AMD AI 395 is cheaper

Cheaper than what? How much VRAM? What memory bandwidth? An M4 Max Mac Studio with 128GB of 546GB/s memory is $3499.

holchansg@reddit

That's why I stated Mini and Pro... you only get more bandwidth with the Max and Ultra... and then a rig with RTX XX90s blows it out of the water.

Ill_Barber8709@reddit

> and then a rig with RTX XX90's blow it out of the water.

LOL, no.

holchansg@reddit

Are you crazy? Just use the testimonials on this very own topic... RTX XX90 is a 500w+ specialized chip with way more memory bandwidth...

Ill_Barber8709@reddit

> blow it out of the water.

As I said, my own 32GB M2 Max runs Qwen3-30B-A3B 4Bit MLX at 65tps, while the desktop 5090 runs the same model Q4 at 135-200tps. Not exactly blown out of the water.

holchansg@reddit

That's only a slice of the whole picture. It's not only about TPS, it's also about TTFT, prefill, E2E...

RoomyRoots@reddit

They would probably learn better if you stopped peeing on them.

DerFreudster@reddit

He mean flush it beyond that.

Rabo_McDongleberry@reddit

Fixed the typo. Lol

SpicyWangz@reddit

Please fix your typo

Rabo_McDongleberry@reddit

Lol. Fixed!

fredandlunchbox@reddit

That’s not even close to the best performing mac.

Cergorach@reddit

Even the M3 Ultra 512GB is not faster than even a consumer 3090. And even a MacBook only fits an M4 Max, which is only faster if you're building your LLM 'workstation' with RTX 5060 cards...

Ill_Barber8709@reddit

LOL. First, go run 600B+ parameters models on 3090s... Which you can on a single M3 Ultra 512GB. Second, 3090 **TI** is 1000GB/s, 3090 is 900GB/s. M3 Ultra is 800GB/s but MLX is 20% faster than GGUF. Third, M4 Max is a laptop chip. Show me a laptop with 128GB of 540GB/s memory... You're just saying shit.

Automatic-Arm8153@reddit

You’re missing the point entirely. I own both Mac and NVIDIA. NVIDIA wins, no competition. Mac is a good generalist/beginner device though, no one can hold that against them. To sum it up simply:

- Better value proposition: Mac
- Better performance: NVIDIA

Once you get to serious AI use, though, NVIDIA wins both.

j_osb@reddit

Even the M3 ultra doesn't hold a candle to proper accelerators.

fredandlunchbox@reddit

M4 Max also considerably outperforms the pro.

Cergorach@reddit

Yes, it does, but that's not the point. I'm a 'normy' with a Mac and even I say that if you think a Mac is faster than a properly built LLM workstation, you *really* suck at building LLM workstations. A 5 year old GPU (3090) is still faster than any Apple silicon in the LLM space. The advantage that an Apple brings is a LOT of memory for a reasonable price compared to VRAM, with 'decent' performance and a very low power footprint (high efficiency).

wdsoul96@reddit

Why take months? Is he mining iron from the ground? right?

Academic-Lead-5771@reddit

lmfao what

african-stud@reddit

Try processing a 16k prompt

ForsookComparison@reddit (OP)

Can anyone with an M4 Max give some perspective on how long this usually takes with certain models?

JockY@reddit

Macbook M4 Max 128GB, LM Studio, 14,000 tokens (not bytes) prompt, measuring time to first token ("ttft"):

- GLM 4.5 Air 6-bit MLX: 117 seconds.
- Qwen3 32b 8-bit MLX: 106 seconds.
- gpt-oss-120b native MXFP4: 21 seconds.
- Qwen3 30B A3B 2507 8-bit MLX: 17 seconds.
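To put those timings on the same footing as the t/s figures people usually quote, here is a minimal sketch that turns each time to first token into an approximate prompt processing rate. It assumes the whole TTFT is spent on prefill, which slightly understates the real rate, and the helper name is made up.

```python
PROMPT_TOKENS = 14_000

# Time to first token in seconds, as reported above.
ttft_seconds = {
    "GLM 4.5 Air 6-bit MLX": 117,
    "Qwen3 32b 8-bit MLX": 106,
    "gpt-oss-120b native MXFP4": 21,
    "Qwen3 30B A3B 2507 8-bit MLX": 17,
}

def prefill_rate(prompt_tokens, ttft_s):
    """Approximate prompt tokens processed per second."""
    return prompt_tokens / ttft_s

for model, seconds in ttft_seconds.items():
    rate = prefill_rate(PROMPT_TOKENS, seconds)
    print(f"{model}: ~{rate:.0f} t/s prompt processing")
```

The dense 32B model and the larger MoE land in the low hundreds of tokens per second, while the small-active-parameter MoE models are several times faster.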

bigh-aus@reddit

It'd be great if you could give some comparisons between this and your RTX rig. :)

iMrParker@reddit

On the bright side, you can go fill up your coffee in between prompts

JockY@reddit

Yeah, trying to work under those conditions would be painful. Wow. Luckily I also have a quad RTX 6000 PRO rig, which does not suffer any such slow nonsense... and it also heats my coffee for me.

Beneficial-Shame-483@reddit

Quad RTX 6000 PRO ?? With that gpu power you don't even need a cpu anymore. Just run stable diffusion on machine code to run it

abnormal_human@reddit

I don't know how to say this but we might be the same person.

JockY@reddit

Paw-paw?

comfyui_user_999@reddit

Jesus, you two. Save some VRAM and some watts for the rest of us.

10minOfNamingMyAcc@reddit

* gpt-oss-120b native MXFP4: 21 seconds.

I'm jealous, and not even a little bit. (64 GB VRAM here)

JockY@reddit

It might just fit. Seriously. It comes quantized with MXFP4 from OpenAI and needs ~ 60GB. I dunno for sure, but it might just work with tiny contexts!

vertical_computer@reddit

It’s shared memory though, you need to leave some room for the OS. Between that and context… gonna be real tough (borderline impossible).

Its_Powerful_Bonus@reddit

And you can run minimax m2 q3 dwq MLX, which is beast! My favorite lately. gpt-oss-120b 2nd place, since it is blazing fast.

Bozhark@reddit

Shit I’ve been on 20b, is 120b worth the extra seconds?

JockY@reddit

I never tried the 20b, too much like a toy ;-)

Bozhark@reddit

Hmmpsired

Sufficient_Prune3897@reddit

2 minutes is crazy

waescher@reddit

M4 Max 128GB

https://preview.redd.it/85d5l0p9dk7g1.png?width=2719&format=png&auto=webp&s=300b2f325dcc6de322238bac9c81bf57d9da8b6a

Sauce: https://www.reddit.com/r/LocalLLaMA/comments/1o08igx/improved_time_to_first_token_in_lm_studio/#lightbox

koffieschotel@reddit

...you don't know? then why did you create this post?

ForsookComparison@reddit (OP)

It's Monday and Jira is bugging me

koffieschotel@reddit

lol if that isn't a valid reason I don't know what is!

SpicyWangz@reddit

I would just wait another month or two to see how M5 pro/max perform with PP

ForsookComparison@reddit (OP)

I'm not in the market for any hardware right now, just curious on how things have changed.

SpicyWangz@reddit

Standard M5 chips have added matmul acceleration, which significantly speeds up the prompt processing. You'd have to look for posts actually benchmarking M4 vs M5, but it was pretty impressive. Actual token generation should be sped up as well, but prompt processing will be multiple times more efficient now

Ill_Barber8709@reddit

M5 is 4 times faster than M4 at prompt processing.

Aggressive_Dream_294@reddit

Someone had posted about this before: https://www.reddit.com/r/LocalLLaMA/comments/1mt3epi/m4_max_generation_speed_vs_context_size/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

twisted_nematic57@reddit

I do it all the time with Qwen3 32B on my i5-1334U on a single stick of 48GB DDR5-5200. Takes like an hour to start responding and another hour to craft enough response for me to do something with it but it works alright. <1 tok/s.

LocoMod@reddit

gpt-oss-120b will sneeze at that and happily do it. Go higher.

apifree@reddit

Same here

popsumbong@reddit

Honestly the real winners are the cloud users

TrumanCompote@reddit

s/users/providers

JoanofArc0531@reddit

So just go with a Mac instead?

ForsookComparison@reddit (OP)

Was a joke from a while ago. Thread is dead

oodelay@reddit

You guys spend too much time looking at other guy's dicks to compare. My system works great and does what I ask it to. It's petty and sad.

Soggy_Audience_6706@reddit

there because all girls have them now

ForsookComparison@reddit (OP)

Meme

Noiselexer@reddit

I'd rather take a model that fits in my 5090 and see who's faster then...

Soggy_Audience_6706@reddit

there because all girls have them now

ForsookComparison@reddit (OP)

If I had ~2TB/s VRAM I would do backflips to make my pipeline fit in 32GB.

No-Refrigerator-1672@reddit

If by "perfect workstation" you mean no CPU offload, then Macs aren't anywhere near what a full GPU setup can do.

egomarker@reddit

And nowhere near those power consumption figures either.

Super_Sierra@reddit

'my 3090 setup is much faster and only cost a little more than the 512gb macbook!'

>didn't mention that they had to rewire their house

Lissanro@reddit

I did not have to rewire my house, but for my 4x3090 workstation I had to get a 6 kW online UPS, since the previous one was only 900W. And a 5 kW diesel generator as a backup, but I already had it. The rig itself consumes about 1.2 kW during text generation with K2 or DeepSeek, and under full load (especially during image generation on all GPUs) it can be about 2 kW. But the important part is that I built my rig gradually... for example, at the beginning of this year I got 1 TB of RAM for $1600, and when I was upgrading to EPYC, I already had PSUs and 4x3090, which I bought one by one. I also highly prefer Linux, and need my rig for other things besides LLMs, including Blender and 3D modeling/rendering that can take advantage of 4x3090 very well, and some tasks that benefit from a large disk cache in RAM or require high amounts of memory. So, I wouldn't exchange my rig for a pair of 512 GB Macs with similar total memory. Besides, my workstation's total cost is still less than even a single one. Of course, a lot depends on use cases, personal preferences, and local electricity costs. In my case, electricity costs are small enough to not matter much, but in some countries they are so high that using less energy-efficient hardware may not be an option.

egomarker@reddit

So how fast is GLM4.6 on your rig?

YouKilledApollo@reddit

No one is running GLM4.6 on 4x3090. My estimate places GLM4.7 at around 300-400GB of memory needed. GLM Air would run fine though, even with just an RTX Pro 6000; then you'll get ~2872 t/s for prompt processing and ~112 t/s for decoding.

egomarker@reddit

And it was exactly my point. But you can run it on 512Gb mac.

YouKilledApollo@reddit

But again, prompt processing will be so terribly slow, that it won't matter! That is my exact point. Save the money, don't buy Apple hardware based on decoding speeds when prompt processing is what slows that hardware down, so much it's not usable for day-to-day work. Instead, get real hardware that is meant for ML proper, and with support of the community at large, and run smaller models but run them multiple times, or with a harness with tools that makes it better. Don't go by online public benchmarks, make your own benchmarks, create the tooling, and smaller models won't matter. Not sure how people aren't getting this yet.

egomarker@reddit

"Slow prompt processing" vs. "can't run at all", hmmmmm, let me think.

YouKilledApollo@reddit

What's the point of being able to run something when it'll take 5 minutes to get a very simple reply? Anyways, I'm using my hardware for paid work for clients, I can't be half-assing things or waiting hours for stuff to finish. If you want to go for Apple hardware, do it, I'm not your boss :) But if you're aiming to do serious ML development or otherwise actually need performance from what you're able to run, you won't be using Apple hardware, at least today. Maybe in the future.

egomarker@reddit

"5mins" is a lie, it will run at 11-16tks and will be more than enough for chit-chat.

YouKilledApollo@reddit

Again, please do share the prompt processing speed, which is exactly the part that is bog slow on Apple hardware. But every time you ask an Apple fanboi about the prompt processing speed, they tend to stop responding. Hoping you won't disappoint me like all the others before you.

egomarker@reddit

You are simply deflecting from the fact you can't run it at all. As I've said, prompt processing speed is enough for chat use.

YouKilledApollo@reddit

> As I've said, prompt processing speed is enough for chat use

Yeah? What are the concrete numbers? As is typical for Apple fanbois, everyone says it's "fast enough" yet no one is providing concrete numbers. Let's compare, if you have the hardware in front of you. For `GLM-4.5-Air-Q4_K_M` as a slightly related example, I'm getting ~2836 t/s for pp512 on llama-bench (`build: 7f8ef50cc (7209)`) and 110 t/s for tg128, on an RTX Pro 6000. Please do provide actual concrete numbers so we can actually compare data instead of your vibes, otherwise please kindly stop trying to propagate the myth that Apple hardware makes sense for inference with the bigger models. But of course, it's highly unlikely you'll actually come back with concrete numbers, and you probably have some excuse for it that makes sense. Which is exactly the problem: Apple fanbois keep saying "It's fast enough!" yet not a single individual has been able to give me some concrete comparable numbers. Please be one of the reasonable people who can provide actual data!

egomarker@reddit

You just can't stop deflecting from the fact you can't run something at all, right? And yeah, you know you can drop the numbers yourself, if you think "apple fanboiz are covering up". Personally I simply don't give a single fuck about pp speed if it's a "can run - can't run" situation. Enjoy your imaginary prompt processing speed numbers on your imaginary nvidia rig lol. They are high as fuck.

YouKilledApollo@reddit

Haha, still no numbers? Shocking I tell you, never could have expected that.

egomarker@reddit

>you know you can drop the numbers yourself

Did you miss this part of my message? And don't forget to compare numbers you post with zero tks you get on your rig.

YouKilledApollo@reddit

> you know you can drop the numbers yourself

I don't know what that's supposed to even mean. You mean I intentionally lowered them? That wouldn't make sense, so surely you meant something else. But anyways, what's the point? You clearly don't have the hardware yourself, or are embarrassed over your purchase and refuse to actually provide any data, so yeah. Not sure what you want from me? Just stop engaging if you don't want to have a good-faith conversation by providing data to back up your statements. Otherwise this is all just a waste of time for everyone involved.

egomarker@reddit

And this is how far a stupid deflection can take a man. It's 140tks. As I've said, enough for chat use. Now gtfo, you are wasting my time.

YouKilledApollo@reddit

> 140tks

Hah, suddenly everyone understands why you have to pry any sort of benchmarking data out of the Apple fanbois who spent $10K on a computer that does ~140 t/s in prompt processing :| But you know what, you do you. I'm not saying everyone should get an RTX Pro 6000, just that if you're aiming to do professional ML development, you really can't be on Apple hardware, you need at the very least something CUDA compatible. But it requires you to understand software engineering to grok this, so I'd understand if it's easier to go with Apple hardware.

> Now gtfo, you are wasting my time.

You can stop responding whenever you feel like. Not sure if you've understood how this whole "internet" thing works or not, but just in case: you don't have to respond, no one is forcing you.
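To make the two prompt processing figures in this exchange concrete, here is a small sketch of how long a prompt would take to ingest at each rate. It assumes the rate stays constant as context grows (it usually drops), and the two figures come from different models on different hardware (140 t/s quoted for the Mac, ~2836 t/s quoted for GLM-4.5-Air Q4 on an RTX Pro 6000), so treat it as illustration rather than a like-for-like benchmark.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a constant prompt processing rate."""
    return prompt_tokens / tokens_per_second

# Rates as quoted in the thread; not measured on the same model.
rates = {"Mac (quoted)": 140, "RTX Pro 6000 (quoted)": 2836}

for prompt_tokens in (512, 4_000, 16_000):
    for label, rate in rates.items():
        secs = prefill_seconds(prompt_tokens, rate)
        print(f"{prompt_tokens:>6} tokens on {label}: ~{secs:.1f} s")
```

At chat-sized prompts both are a matter of seconds. The gap only becomes painful once prompts reach many thousands of tokens.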

egomarker@reddit

>Hah, suddenly everyone understands why you have to pry out any sort of benchmarking data out of the Apple fanbois who spent $10K on a computer that does ~140tks t/s in prompt processing :|

This "prying" happens only in your head. No one is "prying" anything, I was just having fun watching you deflecting and trying to pretend there's a CONSPIRACY against you and everyone is hiding numbers that are available on a first google search. It was hilarious, made my day, thank you. So, what is your prompt processing speed for glm 4.6? Still zero? As expected? Lmao. Bye.

cyberdork@reddit

4x3090 takes less power than your electric kettle.

Ragerist@reddit

Must be an American thing, I'm too European to understand. Well, actually I'm a former industrial electrician, so I fully understand that most houses here have a 3x230V 20-35A supply, which is then often divided into 10-13A sub-groups and 16A for appliances like the dryer and washer. So not really an issue. The electrical bill, on the other hand, is a completely different issue.

Super_Sierra@reddit

To replace 512gb of system ram with vram, it is nearly 16 GPUs. You cannot run that on a normal line in a house.

CryptoCryst828282@reddit

Imagine saying I can afford an M3 Ultra for no real reason other than I just want to, but can't afford to buy a NEMA 14-50 plug. If you can afford 16 GPUs you can easily pay to get a 50A 240V circuit. My servers all run off 240V anyway because they are actually way more efficient.

Ragerist@reddit

A normal line in the US, you mean.

Gudeldar@reddit

There's plenty of power going to the *house* in the US. Modern homes have a 200A supply at 240V. However, most branch circuits are only 15A at 120V.
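As a quick sanity check on those figures, power is just volts times amps. The sketch below works out the nominal capacity of each circuit and how many GPUs that would cover, assuming 350 W per card and ignoring the rest of the system and any derating for continuous loads. The function names are made up for illustration.

```python
import math

def circuit_watts(volts, amps):
    """Nominal circuit capacity in watts (P = V * I)."""
    return volts * amps

def cards_per_circuit(volts, amps, watts_per_card):
    """Whole cards that fit within the nominal capacity, GPUs only."""
    return math.floor(circuit_watts(volts, amps) / watts_per_card)

CARD_WATTS = 350  # assumption: stock 3090 power limit

for label, volts, amps in [
    ("US branch circuit", 120, 15),
    ("US whole-house supply", 240, 200),
]:
    watts = circuit_watts(volts, amps)
    cards = cards_per_circuit(volts, amps, CARD_WATTS)
    print(f"{label}: {watts} W nominal, room for {cards} cards")
```

So a single 15A branch circuit covers only a handful of cards, even though the house as a whole has plenty of headroom.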

mi_throwaway3@reddit

What a stupid arse cope response. I find this response hilarious. Mac people say this like it matters. Like, who cares? Seriously. I want to get things done, don't Mac folks want to get things done? "Oh no, not if it means I'm using 40 extra watts, gee, I'd rather sit on my thumbs" Stop.

Reply

[-]

YouKilledApollo@reddit

For some people electricity costs are important. Not for me personally, but I know some people already have really high bills, so it makes sense that they try to optimize for it if that's their situation. And no, Apple hardware is excellent for some specific AI models that they have specifically worked to make run somewhat OK on their hardware, since the community isn't really focused on Apple hardware. If you want stuff that just runs, go for nvidia hardware, simple as that. And no limitations except your wallet in that case, which if you go for Apple hardware, probably already isn't an issue. Apple hardware will be dog slow and you'll regret getting it.

Reply

[-]

egomarker@reddit

Imagine having a context window of 25 tokens and completely missing the fact conversation is about full gpu offload just to write another toxic comment from your throwaway account. "Extra 40 watts", lol.

Reply

[-]

zipzag@reddit

True, but different tools. My Mac is always on, frequently working and holds multiple LLMs in memory. 8 watts idle, 300+ watts working, never makes a sound. Big MOE models are particularly suited for shared memory machines, including MOE. I do expect I will also have a CUDA machine in the next few years. But for me, a high end Mac was a good choice for learning and fun.

Reply

[-]

-dysangel-@reddit

Also Deepseek 3.2 is out now, demonstrating that you can make SOTA models with close to linear prompt processing. Mac and EPYC machines with a lot of RAM are only going to become more useful over the next couple of years IMO. Especially now that you can cluster Macs effectively.

Reply

[-]

Ill_Barber8709@reddit

Show me a laptop with 128GB of 546GB/s memory. Price a desktop with 128GB of 546GB/s memory compared to Mac Studio M4 Max. I won’t even talk about power efficiency. Sure, they’re not meant for training. But most of us here only use inference anyway.

Reply

[-]

No-Refrigerator-1672@reddit

>Show me a laptop with 128GB of 546GB/s memory.

Laptop is not a workstation.

>Price a desktop with 128GB of 546GB/s memory

6x 3090 if you can get them at $500, or modded 3080 if you can't - $3000. Mobo, CPU and DDR - $100. Power supply and case - up to $500. Total: $4600.

>I won’t even talk about power efficiency.

If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.
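
The power-efficiency claim is just energy = power × time, which a few lines make concrete. The wattages and durations below are hypothetical, chosen only to illustrate the argument, and idle power between tasks is ignored.

```python
def energy_wh(power_watts: float, seconds: float) -> float:
    """Energy used for one task, in watt-hours."""
    return power_watts * seconds / 3600


# Hypothetical machines: a low-power box and one drawing 10x the power.
slow = energy_wh(power_watts=150, seconds=600)      # 10 minutes at 150 W
fast = energy_wh(power_watts=1500, seconds=60)      # 10x power, 10x faster
partial = energy_wh(power_watts=1500, seconds=200)  # 10x power, only 3x faster

print(f"low-power box:           {slow:.1f} Wh")
print(f"10x power, 10x faster:   {fast:.1f} Wh")
print(f"10x power, 3x faster:    {partial:.1f} Wh")
```

So the claim holds exactly when the speedup matches the power ratio; if the real speedup is smaller, the higher-power system uses more energy per task.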

Reply

[-]

Ill_Barber8709@reddit

> Laptop is not a workstation.

For inference? LOL

> 6x 3090 if you can get them at $500, or modded 3080 if you can't - $3000. Mobo, CPU and DDR - $1000. Power supply, cpu cooler, fans and case - up to $500. Total: $4500.

The M4 Max Studio 128GB costs $3,499.00

> If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.

I run Qwen3-30B-A3B 4Bit MLX at 65tps on a 32GB M2 Max MacBook Pro. Best benchmarks of the desktop 5090 running this model Q4 were between 135 and 200tps. You're funny, but completely delusional. See comments here https://www.reddit.com/r/LocalLLaMA/comments/1p7wjx9/rtx_5090_qwen_30b_moe_135_toks_in_nvfp4_full/

Reply

[-]

CheatCodesOfLife@reddit

> Qwen3-30B-A3B 4Bit

189t/s on my single RTX3090 just now. That's running the Q4_K_M gguf. (59t/s for a 0 context 'hi' prompt with the cpu-only build) Adding more cards won't really help for a batch of 1 since the model is so light, memory bandwidth is the constraint (58 t/s with just my CPU). Macbook is obviously the better form factor though :)

Reply

[-]

Ill_Barber8709@reddit

If you follow the link I put, you’ll see that’s exactly what I’m saying. OP was flexing about their results and I pointed out that their numbers were nothing to flex about. TBH I’m genuinely surprised by the poor performance of this model on the 5090 considering the compute power and memory bandwidth. You should explain to him how you get much better results with a much less powerful GPU ^_^

Reply

[-]

CheatCodesOfLife@reddit

Yeah I clicked the link, just wanted to let you know that GPUs aren't usually that... slow. I didn't do anything special for that ^ just ran the model in llama.cpp lol

Reply

[-]

No-Refrigerator-1672@reddit

>I run Qwen3-30B-A3B 4Bit MLX at 65tps on a 32GB M2 Max MacBook Pro. Best benchmarks of the desktop 5090 running this model Q4 were between 135 and 200tps. You're funny, but completely delusional.

Ah, if I had a dollar every time a person judges performance by a 0-length prompt, I would have an RTX 6000 Pro by now. IRL you're not working with short context, especially not if you're paying for Max/Ultra chips; and their prompt processing is terrible. With Qwen3 30B, a very light model, and a 30-40k long prompt, [M3 Ultra only gets](https://www.reddit.com/r/LocalLLaMA/comments/1kvd0jr/m3_ultra_mac_studio_benchmarks_96gb_vram_60_gpu/) \~400 tok/s PP, while dual 3080 [will get](https://www.reddit.com/r/LocalLLaMA/comments/1p0bbrl/rtx_3080_20gb_a_comprehensive_review_of_chinese/) 4000 tok/s PP at the same depth. This is exactly 10x faster.

Reply

[-]

Ill_Barber8709@reddit

Dude, I'm a developer. I spend my time processing big context.

> prompt processing is terrible

M4 and M3 generation yes. M5 architecture brings tensor cores to every GPU and prompt processing is now 4 times faster than M4.

> This is exactly 10x faster.

Ah, if I had a euro every time a person judges performance by one metric only, I would own a house in France by now.

Reply

[-]

No-Refrigerator-1672@reddit

>M5 architecture brings tensor cores to every GPU and prompt processing is now 4 times faster than M4.

Can I buy M5 with 128GB of memory? No? Come back when it becomes available, I will happily compare it to equivalently-priced Blackwell.

>Ah, if I had a euro every time a person judges performance by one metric only, I would own a house in France by now.

Surely, if I'm wrong, you would easily provide numbers that prove it.

Reply

[-]

Ill_Barber8709@reddit

> Surely, if I'm wrong, you would easily provide numbers that prove it.

You told me yourself that Nvidia were 10 times faster at **prompt processing**. I've shown you that a 5090 is barely 2 to 3 times faster than an M2 Max. Hence, **one metric only**.

Reply

[-]

No-Refrigerator-1672@reddit

If you look into your usage stats, you'll see that generation length is typically multiple times shorter than prompts, basically for all usecases except essays or other creative writing. Thus it is the metric that makes the most impact. Besides, TG also drops at long prompts, and on exactly the same links you can see it's 35 for M3 Ultra and 70 for 2x3080 at depth \~30k. The difference is immense, and I'm not even talking about 5090.

Reply

[-]

JockY@reddit

Those might be *your* usage stats, but a lot of us do batched requests with massive prompts and very small output from the LLM. We need fast PP and don't care about TG.

Reply

[-]

No-Refrigerator-1672@reddit

Dude, I have literally been arguing the whole conversation that RTX has massively faster PP and that it is what matters the most. You are arguing against the wrong person.

Reply

[-]

egomarker@reddit

>If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.

Will it really be 10x faster at concurrency 1?

Reply

[-]

CheatCodesOfLife@reddit

>Will it really be 10x faster at concurrency 1. Depends what you're doing. For diffusion, VibeVoice, training then yeah, 10x faster. For a single user with sparse MoE models, maybe 3x or 4x faster.

Reply

[-]

MitsotakiShogun@reddit

Depends on the model and architecture, but yes, 2^(n) 3090s with TP (and less so with other odd/even numbers), especially vLLM/sglang, can be plenty faster, even at x4 and stock drivers. Here are some numbers on a 50-60K prompt with 3 different models: https://preview.redd.it/gbim61iu6f7g1.png?width=1473&format=png&auto=webp&s=00a5b6bb35a4a832008eef27d9ae016e28af8e7b

Reply

[-]

egomarker@reddit

These numbers are very exaggerated in favor of the prompt size tho. It's like "what color is the sky?" and "here's 50K personality prompt" or something. Most of the time, especially in agentic use with reasoning models, ratio is 5:1 or higher in favor of generation size. And I'm looking at generation outputs... They are around mac level, give or take.

Reply

[-]

MitsotakiShogun@reddit

Fair enough. This was just an example. Obviously, nobody should take single-prompt speeds as a benchmark anyway.

Reply

[-]

No-Refrigerator-1672@reddit

By the numbers that I have seen for M3 Ultra - yes, it will be more than 10x.

Reply

[-]

Ill_Barber8709@reddit

That level of cope...

Reply

[-]

No-Refrigerator-1672@reddit

[Learn yourself.](https://www.reddit.com/r/LocalLLaMA/comments/1pnfaqo/comment/nu7h889/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)

Reply

[-]

Ill_Barber8709@reddit

Sure buddy. 10 times faster at prompt processing only. You said it yourself.

Reply

[-]

No-Refrigerator-1672@reddit

Yeah. And prompt processing makes the most impact, because in most usecases generation length is multiple times shorter than prompt.

Reply

[-]

egomarker@reddit

No it isn't, especially with reasoning models. So what is the final speed increase if prompt size is equal to output, or prompt size is 3 times less than output.
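
The question can be answered with a back-of-the-envelope model: request time is prompt tokens divided by PP speed plus output tokens divided by TG speed. The sketch below plugs in the figures quoted earlier in this thread (about 400 tok/s PP and 35 tok/s TG for the M3 Ultra, 4000 and 70 for 2x 3080 at roughly 30k depth). Treating both speeds as constant regardless of context length is a simplifying assumption, and the token mixes are hypothetical.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Time for one request: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


def speedup(prompt_tokens: int, output_tokens: int,
            slow: tuple[float, float], fast: tuple[float, float]) -> float:
    """How many times faster the `fast` system finishes the same request."""
    return (request_seconds(prompt_tokens, output_tokens, *slow)
            / request_seconds(prompt_tokens, output_tokens, *fast))


# (PP tok/s, TG tok/s), as quoted in the thread.
M3_ULTRA = (400, 35)
DUAL_3080 = (4000, 70)

# Hypothetical workloads: prompt-heavy, balanced, generation-heavy.
for prompt, output in [(30_000, 1_000), (10_000, 10_000), (3_000, 9_000)]:
    s = speedup(prompt, output, M3_ULTRA, DUAL_3080)
    print(f"prompt={prompt:>6} output={output:>6} -> {s:.2f}x")
```

With these inputs the end-to-end speedup always lands between the TG ratio (2x) and the PP ratio (10x), sliding toward 2x as output grows relative to the prompt, which is why both sides of this argument can cite true numbers.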

Reply

[-]

Ill_Barber8709@reddit

Your usage, not mine. I don't throw 30K tokens of context each time I make a prompt. So most of the time the prompt is already processed.

Reply

[-]

egomarker@reddit

But have you seen numbers from 6x3090.

Reply

[-]

WitAndWonder@reddit

DDR on its own right now would be $1000

Reply

[-]

No-Refrigerator-1672@reddit

Nope. For all-gpu setup you don't need much DDR, 16GB will run fine. You only care about RAM prices when you are doing CPU offloading.

Reply

[-]

WitAndWonder@reddit

That's a valid point as long as you're not running MoE models. But those would want to load the full model weights in the RAM while the experts are active in the VRAM. At least for highest efficiency in $ / performance. Though anyone just interested in running a single LLM might as well run a dense model with 6 3090s to work off of (but someone like me running multiple LLMs alongside other agents benefits from having some number of those being MoEs, as opposed to one large model.)

Reply

[-]

No-Refrigerator-1672@reddit

We are in the comment thread that discusses loading models completely into GPUs. CPU offloading is irrelevant in this particular case.

Reply

[-]

LocoMod@reddit

Macbook Pro's are workstation class machines that outperform almost any consumer PC hardware unless you are willing to part with a kidney.

Reply

[-]

PraxisOG@reddit

The numbers I'm seeing for 32b models are \~18t/s on m4 max(546GB/s) vs 55t/s on 3090(936GB/s) with 2x tensor parallel. So about 3x faster, maybe 4x faster with 4xTP.
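
One way to sanity-check figures like these is the usual bandwidth-bound estimate: at batch size 1, each generated token has to read every active weight once, so tokens/s cannot exceed memory bandwidth divided by the size of the active weights. The sketch below uses the bandwidths from the comment; the 4-bit weight size is an assumption, and KV-cache reads, compute limits and tensor parallelism are ignored, so real numbers will be lower (or, with TP across cards, the bound scales with the number of cards).

```python
def max_decode_tps(bandwidth_gb_s: float, active_params_b: float,
                   bits_per_weight: float) -> float:
    """Upper bound on single-stream tokens/s for a bandwidth-bound decoder."""
    weight_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weight_gb


# Dense 32B model at an assumed 4 bits per weight.
print(f"M4 Max (546 GB/s): {max_decode_tps(546, 32, 4):.1f} t/s ceiling")
print(f"3090   (936 GB/s): {max_decode_tps(936, 32, 4):.1f} t/s ceiling")

# A MoE with ~3B active parameters reads far less per token.
print(f"M4 Max, 3B active: {max_decode_tps(546, 3, 4):.1f} t/s ceiling")
```

The same formula shows why a 30B MoE with 3B active parameters generates so much faster than a dense 32B model on identical hardware.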

Reply

[-]

Ill_Barber8709@reddit

Your numbers are wrong. 30B is an MoE, not a dense model. Qwen3-32B MLX 4Bit runs at 23tps on the same Mac.

Reply

[-]

PraxisOG@reddit

32b is a dense model. It’s interesting our numbers are different, I used performance numbers with qwq so maybe that’s it?

Reply

[-]

Ill_Barber8709@reddit

I'm using MLX version, which is known to be 20% faster than GGUF. I have 65tps on the 30B version, which is an MoE, not a dense model. 23tps on 32B Qwen3

Reply

[-]

No-Refrigerator-1672@reddit

I've expanded my take with real numbers in [this reply](https://www.reddit.com/r/LocalLLaMA/comments/1pnfaqo/comment/nu7h889/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).

Reply

[-]

CheatCodesOfLife@reddit

>Price a desktop with 128GB of 546GB/s memory compared to Mac Studio M4 Max. 4xMI50's (1TB/s each)

Reply

[-]

Ill_Barber8709@reddit

Price?

Reply

[-]

CheatCodesOfLife@reddit

Depends on the area, like $300 each now I think? They're on Aliexpress. Don't buy 'em unless you're happy to spend a day setting them up, fucking around with rocm setup, etc. They're not plug and play like Nvidia or a Mac. They are much faster than a Mac for big dense models.

Reply

[-]

mi_throwaway3@reddit

That M4 Mac with 128gb is like 5k. The 5090 is going to eat up 2500 and the memory another $1350 (yikes to the ram market). You've still got enough money to round out the rest of the system. It might be slightly more, but it will be easily 2x as fast. Nobody cares about power efficiency in a workstation.

Reply

[-]

Successful_Tap_3655@reddit

The m4 Mac does better in most use case over 5090

Reply

[-]

Ill_Barber8709@reddit

> That M4 Mac with 128gb is like 5k

M4 Max Mac Studio with 128GB is $3500.

> You've still got enough money to round out the rest of the system. It might be slightly more, but it will be easily 2x as fast.

How many RTX 5090 do you need to get 128GB of VRAM? You're saying shit.

> Nobody cares about power efficiency in a workstation.

Speak for yourself.

> You're fighting physics and computation. There's no magic formula that gets apple free matrix operations.

I don't even know what this is supposed to mean.

Reply

[-]

PraxisOG@reddit

You could probably throw together 4x AMD V620(32gb@512GB/s) on an EATX x299 board for $3000 off of ebay. It won't have drivers for nearly as long, suck back way more power, and would sound like a jet engine with the blower fans on those server cards, but would train faster. Maybe I'm biased, my rig is basically that but half the price cause I got a crazy deal on the gpus :P

Reply

[-]

LocoMod@reddit

Yea but fitting gpt-oss120b in a loaded Macbook is better than not running it at all in my RTX5090.

Reply

[-]

No-Refrigerator-1672@reddit

A "loaded macbook" that can run 120B at all will cost you like $3000. For that money you can assemble a PC that not only will load the same 120B model completely into GPUs (4-bit quantized, of course), but will also run it multiple times faster for a single request, and orders of magnitude faster if you have an agentic workload that can take advantage of parallel processing.

Reply

[-]

egomarker@reddit

So he is talking about a laptop and you suggest to get a wardrobe of GPUs instead.

Reply

[-]

No-Refrigerator-1672@reddit

So the whole post is about "macbook is better than a workstation". A "wardrobe of GPUs" is, in fact, a workstation, and fits this particular conversation perfectly.

Reply

[-]

egomarker@reddit

You are taking a meme too seriously.

Reply

[-]

LocoMod@reddit

Really? Drop the parts list. Let’s validate that.

Reply

[-]

No-Refrigerator-1672@reddit

4x RTX 3080 20GB will cost $2000 including tax and shipping. $1000 for a pc case, motherboard, cpu, ram and psu is totally enough, you don't need any top-of-the-line components for that. The performance of this setup will be 10x of M3 Ultra in PP, and some 2x or more for TG for single request.
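
Whether a build like this can hold the model at all comes down to simple arithmetic: weight size is roughly parameter count times bits per weight divided by 8. The sketch below checks that for the 4x 20GB build; the 10 GB overhead for KV cache and runtime buffers is a hypothetical placeholder, not a measured figure.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         gpus: int, gb_per_gpu: float, overhead_gb: float = 0.0) -> bool:
    """True if weights plus an assumed overhead fit in the combined VRAM."""
    needed = weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= gpus * gb_per_gpu


# 120B parameters at 4 bits per weight, with a hypothetical 10 GB of overhead.
print(weights_gb(120, 4))                      # GB of weights
print(fits(120, 4, gpus=4, gb_per_gpu=20, overhead_gb=10))  # 4x 3080 20GB
print(fits(120, 4, gpus=1, gb_per_gpu=32, overhead_gb=10))  # single 32GB card
```

This only checks capacity. Long contexts grow the KV cache well past a fixed allowance, so the margin on an 80 GB build is thinner than the weight size alone suggests.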

Reply

[-]

LocoMod@reddit

Nah. I’ll take the MacBook Pro rather than some jank modded 20GB 3080’s.

Reply

[-]

No-Refrigerator-1672@reddit

Sure, if that's your preference. Just don't go around the internet claiming that a macbook is faster, or that an equivalently-priced PC can't fit same-sized models.

Reply

[-]

LocoMod@reddit

No one claimed a MacBook is faster. I use both. My RTX5090 gets used quite often for embeddings models cause that’s about the only useful thing one can run with such a small pool of memory. There’s no usable LLM for any serious work that will fit in one. My point was that it’s better to run larger more capable models than not at all. I can fit gpt-oss-120b in my M3 and it’s almost useable as a daily driver. Almost.

Reply

[-]

No-Refrigerator-1672@reddit

The original meme OP posted is claiming that macs are faster, I'm referring to that.

Reply

[-]

LocoMod@reddit

OP is referring to people building PC frankenbuilds and doing CPU/RAM offloading to squeeze large models into the shared pool of resources. At that point you’re not leveraging the speed of CUDA. Once you leave the GPU it’s a different ballgame. In these scenarios a high end Mac will outperform a PC with offloading.

Reply

[-]

Turbulent_Pin7635@reddit

M3 Ultra owner here. The only downside I see on Mac is video generation. Being able to get full models running on it is amazing! The speed and prompt loading times are not truly crazy slow. It is OK, especially when it is running with a fraction of the power, with NOT A SINGLE NOISE or heat issue. Also, it is important to say that even without CUDA (a major downgrade, I know) things are getting better for Metal. My doubt now is whether I buy a second one to get to the sweet spot of 1TB of RAM, wait for the next Ultra, or invest in a minimum machine with a single 6000 Pro to generate videos + images (I accept configuration suggestions for the last one).

Reply

[-]

YouKilledApollo@reddit

> NOT A SINGLE NOISE or heat issue

Is this really an issue for people? I have my workstation with an RTX Pro 6000 right next to me, I just ran a benchmark with lmstudio-community/GLM-4.5-Air-GGUF (glm4moe 106B.A12B Q4\_K) and even as the GPU is hitting max temperatures of \~72°C, I can barely hear it. For configuration suggestions, you meant like hardware? It pairs nicely with Threadripper 9970X and fast I/O :)

Reply

[-]

Turbulent_Pin7635@reddit

Thx. I have mental issues, cacophony can drive me crazy. I know that the 6000 is more silent than the 3090/4090/5090. Thx for Threadripper. What I mean is that I want a minimum (€€€) body to run the 6000 without spending that much money.

Reply

[-]

YouKilledApollo@reddit

> What I mean is that I want a minimum (€€€) body to run the 6000 without spending that much money.

To be honest, if you're just doing LLM inference, the other hardware won't matter much. You could be on PCIe 4, on AM4 with DDR4 and you won't notice much difference compared with everything on Gen 5 or even moving to sTR5 (I would know, literally just did this move some weeks ago). Just make sure the PSU is good, and you have sufficient cooling, because this beast generates heat like no other GPU I've had before.

Reply

[-]

Turbulent_Pin7635@reddit

Great advice, thx my friend! I want the 6000 for videos/images. For text inference I'm using the M3 Ultra.

Reply

[-]

ayu-ya@reddit

How bad is the video gen speed? Something like the 14B WAN, 720p 5s? I'm planning to buy a Mac Studio in the future mostly to run LLMs and I heard it's horrible for videos, but is it 'takes an hour' bad or 'will overheat, explode and not gen anything in the end' bad?

Reply

[-]

CryptoCryst828282@reddit

Prompt times not slow? Are you kidding me? I tried running GLM 4.5 Air on my M3 and it was taking 2-3 min per prompt. Sure, if you are using it as a chat bot it's not bad, but if you want to use it for anything major they are useless. I have a set of 5060 Tis that smoke it in every way and cost me 1/2 as much.

Reply

[-]

Turbulent_Pin7635@reddit

It will take 15 minutes for things a 4090 would take 2-3 minutes to do. I never see my Mac Studio emit a single noise or heat. Lol

Reply

[-]

ayu-ya@reddit

Some people told me it would really take over an hour for one animation, but if they can just keep mulling it over with no issues... I can start the gen, walk the dogs, come back and it's done haha I used to run dense models with heavy CPU offloading when getting into locals, so time doesn't scare me as long as the hardware doesn't suffer too much 🥹

Reply

[-]

Turbulent_Pin7635@reddit

I hate Apple smartphones. This is my first Apple. But, I need to say, the thing is a tank. I would say military-grade quality. It is built to last. In my lab there is one 20 years old, still working, and another one 11 years old that looks like it just popped out of the box. Mine is always under 480GB+ RAM or 100% CPU use (bioinformatics) and it barely sweats. Don't evaluate Apple PCs by Apple's disposable gimmicks, they aren't the same. Bonus note, you have full control over it, not a single headache over drivers.

Reply

[-]

wittlewayne@reddit

I sold 3 of my old MBPs to buy 1 MBP with an M1 chip..... I feel your pain also

Reply

[-]

Whispering-Depths@reddit

suck my rtx pro 6k 96gb and 192gb ram lol tell me a fucking apple product is better off

Reply

[-]

waescher@reddit

OK, run Qwen3 Coder 480b Q8

Reply

[-]

Whispering-Depths@reddit

lemme know when you can do that on a mac.

Reply

[-]

waescher@reddit

nearly a year since Apple released the M3 Ultra

Reply

[-]

Whispering-Depths@reddit

TIL you can buy a home mac with 512GB of basically vram(?) - it's half the VRAM speed of an rtx pro 6k but the fuck lol that's still insane? for $10k to $16k? Not bad at all. I wonder if you can get one of those and run linux on it.

Reply

[-]

waescher@reddit

Might not be the perfect LLM device but a memory rich one. And one that idles at under 10 and maxes out under 270 Watts while staying silent. There are those freaks (in the best possible way) at Asahi Linux that reverse engineered the Mac drivers and rebuilt them for Linux. I actually run a MacBook M1 Max on Asahi Fedora and it runs great. Unfortunately they only cover the M1 and M2 family so far. But then I guess you’ll lose MLX support, which is a boost in model performance on Mac.

Reply

[-]

Whispering-Depths@reddit

my pro-6k can go at 450 watts and it's still silent :D I wonder if mac has stuff that automatically quantizes the model during runtime? That would suck

Reply

[-]

waescher@reddit

I know, these are amazing. Would absolutely love to stack some.

Reply

[-]

waescher@reddit

Oh, two things here: Apple added support for stacking multiple Macs with "RDMA over Thunderbolt" lately, so you could multiply these 512GB. https://www.jeffgeerling.com/blog/2025/15-tb-vram-on-mac-studio-rdma-over-thunderbolt-5 And the next chip generation M5 is expected to bring extra neural accelerators within the GPUs https://9to5mac.com/2025/11/20/apple-shows-how-much-faster-the-m5-runs-local-llms-compared-to-the-m4/

Reply

[-]

ForsookComparison@reddit (OP)

Easy tiger

Reply

[-]

FullOf_Bad_Ideas@reddit

I have 7t/s TG and 140 t/s at 60k ctx with Devstral 2 123B 2.5bpw exl3 (it seems like quality is reasonable thanks to EXL3 quantization but I am not 100% sure yet). Can Mac do that?

Reply

[-]

-dysangel-@reddit

iirc my M3 Ultra was getting 6-7tps at q4, but the output was garbage compared to GLM, so I just deleted it.

Reply

[-]

FullOf_Bad_Ideas@reddit

Nice, this 6-7 tps is once you loaded it up with context already, right?

Reply

[-]

-dysangel-@reddit

no probably not, I deleted it very quickly

Reply

[-]

Gringe8@reddit

It really depends on what you're trying to do. MacBooks work ok on MOE models, but dense models not so much. My 5090+4080 pc is much faster with 70b models than what you can do with macs.

Reply

[-]

-dysangel-@reddit

What 70b model is worth using? I've never found one that was good at coding.

Reply

[-]

Gringe8@reddit

Thats why it depends on what youre trying to do. I use it for roleplaying and haven't found any of the moe models better than the 70b model i use. https://huggingface.co/Steelskull/L3.3-Shakudo-70b Id imagine the newer moe models are better for coding.

Reply

[-]

Successful_Tap_3655@reddit

Except you can’t do high context but sure

Reply

[-]

Longjumping-Boot1886@reddit

and for that we have M5.

Reply

[-]

getmevodka@reddit

Yes, i can run a qwen3 235b moe in q6_xl and its really nice for the expense i made. For comfy with qwen image it still performs but my old 3090 runs laps around it while already being downvolted to 245watts xD

Reply

[-]

Artistic_Unit_5570@reddit

I have a MacBook Pro M4 Pro; the MLX models are ridiculously optimized, using the GPU at absolute 100%.

Reply

[-]

getmevodka@reddit

I have a m3 ultra 🤣🤣🤣 its nice... Lets stay at that. Hehehehehehe

Reply

[-]

Pale_Reputation_511@reddit

I have M4 Max and yeah, MLX its the way

Reply

[-]

getmevodka@reddit

I hope they offer a m5 max with 256gb so that i can travel with my fav local model. Would be a dream come true

Reply

[-]

-dysangel-@reddit

yeah the M5 era is going to get juicy. Especially with linear attention models

Reply

[-]

__no_author__@reddit

Just remember how much more money they paid than you.

Reply

[-]

shokuninstudio@reddit

You just need to download RAM Doubler. Install two copies of it and your RAM will quadruple. https://preview.redd.it/86like0z1f7g1.jpeg?width=200&format=pjpg&auto=webp&s=7592a47f02d8b2a025e37f1cad502be8604245d4

Reply

[-]

aaronsb@reddit

Ran out of disk space installing more than four copies of ram doubler. Can I use Disk Doubler? https://preview.redd.it/3mra854h4f7g1.png?width=1136&format=png&auto=webp&s=397e3889c6435e80fcf8de301ea7013f6f1821a1

Reply

[-]

TokenRingAI@reddit

Hey, joke all you want, but Stacker was legit, I would have never survived the 90s without stacker and the plethora of Adaptec controllers and bad sector disk drives I pulled out of the dumpsters of silicon valley.

Reply

[-]

Real-Technician831@reddit

Stacker actually made your computer run faster as HDD was so slow that reading the compressed file an unpacking it was way faster than reading same file from raw disk.

Reply

[-]

Korenchkin12@reddit

Same here, DoubleSpace and QEMM if I remember the name correctly... today's noobs would never understand why you need those drivers in high memory

Reply

[-]

Pishnagambo@reddit

Yeah stacker was great 😃

Reply

[-]

fuzzy-thoughts345@reddit

It was great. I had a 20MB Seagate with mostly text, so it really compressed well.

Reply

[-]

_bones__@reddit

Are you me? Of course, any time you got a compressed file it took up twice the size.

Reply

[-]

shokuninstudio@reddit

The doubleception has arrived before GTA VI.

Reply

[-]

ThomasterXXL@reddit

If you want more RAM, all you need is EWE.

Reply

[-]

mikael110@reddit

Fun fact: unlike the whole "Download more RAM" meme, RAM Doubler software was a real thing back in those days, and it did actually increase how much stuff you could fit in RAM. The way it worked was by dynamically compressing and decompressing data as it came in and out of RAM. Nowadays RAM compression is built into basically all modern operating systems so it would no longer do anything, but back then it made a real difference.

Reply

[-]

TokenRingAI@reddit

Some people reminisce about Woodstock, I reminisce about waiting in line at Fry's electronics to get Windows 95 at 12:01AM The kids will never understand.

Reply

[-]

joelasmussen@reddit

That's rad. I wonder if we'll ever get a modern equivalent. Lowly consumers getting something exciting.

Reply

[-]

jaysedai@reddit

I remember, I stood outside with a sign with a Mac icon and the words "Windows 95, welcome to 1984".

Reply

[-]

TokenRingAI@reddit

Which Fry's?

Reply

[-]

tehfrod@reddit

When I got engaged we were trying to set a date and August 24th came up. I said, "Perfect! I'll never forget our anniversary. It's the day Windows 95 was released!" We're divorced now.

Reply

[-]

shokuninstudio@reddit

I was waiting in line at midnight for the release of Mac OS X. Club goers passed by and asked what we were doing. When we explained why they laughed at us like we were all dorks.

Reply

[-]

mehum@reddit

Sorry to break the news mate, but they weren’t wrong!

Reply

[-]

shokuninstudio@reddit

"No I'm here waiting for the choppa" https://i.redd.it/rjzk88xl1g7g1.gif

Reply

[-]

Alternative-Sea-1095@reddit

It really didn't do any ram compression, windows 95 did that. Yes, windows 95 did ram compression and those "ram doubler" just used placebo and doubling the page size by 2x. That's it...

Reply

[-]

mikael110@reddit

The original [RAM Doubler](https://apple.fandom.com/wiki/RAM_Doubler) wasn't for Windows 95 though, it was for classic Mac OS and Windows 3.1. Neither of which had RAM compression built in. You might be confusing RAM Doubler with [SoftRAM](https://en.wikipedia.org/wiki/SoftRAM), which was indeed just a scam. That was developed by an entirely different company though.

Reply

[-]

pixel_of_moral_decay@reddit

Yup. Ram Doubler was the real deal. Came at the cost of a little cpu, but that was a point in time most systems were more memory bound than cpu bound. 4-16mb of memory but 66-200mhz CPU. Taking a couple percent to add memory was a huge win, compared to virtual memory on slow 5200 rpm ide hard drives.

Reply

[-]

Coldaine@reddit

Oh yeah, having 4mb of ram was a ridiculous constraint bro. You don't understand, I NEEDED to be able to run SCURK. https://www.mobygames.com/game/2422/simcity-2000-urban-renewal-kit/

Reply

[-]

Alternative-Sea-1095@reddit

I was! Thank you.

Reply

[-]

SilentLennie@reddit

Linux has ZRAM, which provides a compressed RAM disk, and you can put swap on it, thus compressing RAM.

Reply

[-]

FurrySkeleton@reddit

Disk doubler, too! It really did work.

Reply

[-]

audioen@reddit

Yeah. Doing both disk and ram compression today, routinely. Even on servers.

Reply

[-]

pscoutou@reddit

It's 2025, we have the cloud now. https://downloadmoreram.com

Reply

[-]

YoloSwag4Jesus420fgt@reddit

Didn't this actually work by compressing the ram or something? I know it wasn't 2x but it was better if I recall than nothing? I swear I saw a yt vid on this once

Reply

[-]

shokuninstudio@reddit

It slowed the computer down anyway because the CPU ran at less than 50Mhz and only had one core. It had to do on the fly compression and decompression while running your apps and OS.

Reply

[-]

astrange@reddit

Memory compression is a lot more than 2x effective but mostly because memory is mostly zeroes.
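
A quick way to see why zero-heavy memory compresses so well. This is only an illustration: `zlib` stands in for whatever compressor an OS actually uses, the 4096-byte page size is an assumption, and the "mixed" page is a made-up example.

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


def ratio(page: bytes) -> float:
    """Compression ratio: original size divided by compressed size."""
    return len(page) / len(zlib.compress(page))


zero_page = bytes(PAGE_SIZE)                          # all zeroes
random_page = os.urandom(PAGE_SIZE)                   # incompressible data
mixed_page = bytes(PAGE_SIZE - 256) + os.urandom(256) # mostly zeroes

for name, page in [("zero", zero_page), ("mixed", mixed_page), ("random", random_page)]:
    print(f"{name:>6}: {ratio(page):.1f}x")
```

An all-zero page shrinks to a few dozen bytes, random data does not shrink at all, and real memory sits somewhere in between depending on how much of it is empty or repetitive.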

Reply

[-]

The_frozen_one@reddit

For real though I can run the deepest of seeks [since I downloaded more RAM to my CPU.](https://downloadmoreram.com/)

Reply

[-]

grimjim@reddit

Unironically, we do have a parallel for LLMs; the bitsandbytes library can perform 4-bit quantization while loading a model.

Reply

[-]

AuspiciousApple@reddit

I already have it. Just send me your RAM and I'll send double back

Reply

[-]

shokuninstudio@reddit

That sounds like a free buttcoin scam.

Reply

[-]

Trick-Force11@reddit

can i install 50 copies for 1125899906842624 times more ram or is there a limit

Reply

[-]

shokuninstudio@reddit

The AI powered activation server will detect your pirate serial numbers and ask Copilot to delete all your files and System32 directory.

Reply

[-]

Trick-Force11@reddit

worth a shot

Reply

[-]

Background_Essay6429@reddit

What Mac configuration gets you the best tokens/sec?

Reply

[-]

Little-Put6364@reddit

I'm always more concerned about quality over speed. Sure speed is nice, but throwing more compute at the model won't make it magically better at answering

Reply

[-]

Southern_Sun_2106@reddit

The future of local home AI is a small box on the table.

Reply

[-]

CryptoCryst828282@reddit

Future of home ai will be cloud, we will resist but it will be cloud. The hardware will price us all out eventually.

Reply

[-]

Southern_Sun_2106@reddit

I specifically said, the future of \*local\* home AI... will be a small box on the table. Sure, there will be lots of things provided from the cloud. However... history has shown that people still prefer to actually 'own' their stuff. There was a time in the 70s when people thought personal computing would be done on terminals connected to the huge machines of the time stationed elsewhere, and look what happened (spoiler: a personal computer in all of our pockets plus portable laptops, etc). We have a natural drive towards distributed processing, just like we don't have a hive mind, etc. It's who we are. We go the cloud way only if that's the only option, but when other options are available, we go for distributed processing.

Reply

[-]

CryptoCryst828282@reddit

I know you did... I am saying the future of ALL AI will be cloud. History has shown that it really doesn't matter what you want, cost dictates everything. At some point home GPUs will not run the AI models, and if you think you will be able to afford a datacenter TPU or whatever they run on in 10 years, I have a bridge to sell you. In the 70s not EVERYTHING was going to a service, now it all is. You will own nothing, it will all be a license, and sadly, not a single person will do

Reply

[-]

Southern_Sun_2106@reddit

History also showed that we are not running Excel and Word on supercomputers via terminals. We opt not for the most powerful solutions, but for the more convenient ones. Anyway, this discussion is pointless, because I won't change your mind, and you won't change mine; and none of our theories are provable at the moment, just speculation. But thank you for engaging, anyway.

Reply

[-]

Ytijhdoz54@reddit

The Mac minis are a hell of a value starting out, but the lack of CUDA, at least for me, makes them useless for anything serious.

Reply

[-]

iMrParker@reddit

I'm willing to bet most people on this sub haven't ventured past inference so posts like this are r/iamverysmart

Reply

[-]

egomarker@reddit

You can't train anything serious without a wardrobe of gpus anyway. Might as well just rent.

Reply

[-]

RedParaglider@reddit

EXACTLY.. that's why, bang for the buck, a 128GB Strix Halo was my go-to even though I could have afforded a Spark or whatever. I'm just going to use this for inference, local testing, and enrichment processes. If I get really serious about training or whatever, renting for a short span is a much better option.

Reply

[-]

Fi3nd7@reddit

100%. Buying your own hardware while these companies are literally burning money is insane. Renting is so insanely subsidized right now. It's not worth buying. One could argue that now is the time to buy before hardware gets insanely expensive and everyone pulls out of consumer GPUs. But honestly, if I was really, really serious about running intense stuff locally, I'd probably drop 10-15k on a real AI machine.

Reply

[-]

RedParaglider@reddit

Valid. Honestly built this project, so I had a good reason for a local inference box. For anything other than enrichment type stuff I use big models. [https://github.com/vmlinuzx/llmc/](https://github.com/vmlinuzx/llmc/) Local stuff is a shit ton of fun to me to do learning on, and also build systems which require engineering imagination to work perfectly under constraints.

Reply

[-]

SamWest98@reddit

Yeah. People over here comparing $10k macs to $300k nvda rigs bc they heard about cuda on twitter

Reply

[-]

FullOf_Bad_Ideas@reddit

I got my finetune featured in a LLM safety paper from Stanford/Berkeley, it was trained on single local 3090 Ti and it was actually in the top 3 for open weight models in their eval - I think my dataset was simply well fit for their benchmark. >However, on larger base models the best fine-tuning methods are able to improve rule-following, such as Qwen1.5 72B Chat, Yi-34B-200K-AEZAKMI-v2 *(that's my finetune)*, and Tulu-2 70B (fine-tuned from Llama-2 70B), among others as shown in Appendix B. https://arxiv.org/abs/2311.04235

Reply

[-]

iMrParker@reddit

If you're doing base model training then yes. But if you're fine tuning 7b, 12b models you can get away with most consumer Nvidia GPUs. The same fine tuning probably takes 5 or 10 times longer with MLX-lm

Reply

[-]

LocoMod@reddit

Congratulations to the 3 people on here training models from scratch that no one will ever use. For everyone else, MLX can do everything, including fine tuning.

Reply

[-]

iMrParker@reddit

Who said people are training models for mass users? Most model training and fine-tuning is done for personal, college, or internal enterprise reasons. MLX-LM can do \*some\* of the things that CUDA-accelerated libraries like Unsloth/PEFT/torchtune/TensorFlow can do, but WAY slower. It's disingenuous for you to pretend that no one does this and that MLX is just as capable or performant

Reply

[-]

BumbleSlob@reddit

lol why does everyone have to participate in fine tuning or training exactly? What a dumb ass gatekeeping hot take. This would be like a carpentry sub trying to pretend that only REAL carpenters build their own saws and tools from scratch. In other words, you sound like an idiot.

Reply

[-]

iMrParker@reddit

Point to me where I made any gatekeeping statements. My point is that people like OP don't consider the full range of this industry / hobby when they make blanket statements about which hardware is best

Reply

[-]

Monkey_1505@reddit

There is SO much you cannot do without CUDA.

Reply

[-]

LocoMod@reddit

Do tell. What can't you do without CUDA? We can run inference, fine tuning, diffusion models, tts, stt, embeddings, etc without CUDA. I suppose for the 0.000000001% of the world that is training models from scratch then CUDA matters.

Reply

[-]

Monkey_1505@reddit

The full version of automatic1111 with all the extensions, a whole host of txt to image software clients, many different LLM clients, txt to image merging, LLM lora training...off the top of my head. I'm sure the list is like a page long. It's a lot of stuff.

Reply

[-]

LocoMod@reddit

That's an implementation detail of automatic1111 and has nothing to do with what CUDA is or isn't capable of vs other platforms. I can use the same image gen models you're using on Apple hardware. Sure, the generation times are slower, but it can still be done without CUDA. You're still using automatic1111? The platform dinosaurs were using back when SD1 was popular?

Reply

[-]

Monkey_1505@reddit

The entire advantage of using CUDA is that it makes it easier for developers, hence why yes, it's used in a lot of software. So, what software do you use to run z-image on mac?

Reply

[-]

_VirtualCosmos_@reddit

And not just CUDA. Blackwell hardware is pretty much needed for full FP8 training, at least for now. But I have put my hopes in ROCm; it's open source and promising.

Reply

[-]

ai-christianson@reddit

MLX?

Reply

[-]

Bogaigh@reddit

Right…“my loud, hot, expensive Linux box is faster in benchmarks and images, therefore your quiet unified-memory machine that lets you think deeply without friction is bad.”

Reply

[-]

ForsookComparison@reddit (OP)

Reread meme for better results

Reply

[-]

Specific-Goose4285@reddit

I wouldn't call myself a normie. Thing is even before the RAM shortage 128GB vRAM is crazy expensive and is attached to power hungry devices. The unified memory has the advantage that it fits and the speeds are just about good enough for certain tasks.

Reply

[-]

Novel-Mechanic3448@reddit

M5 has tensor cores, it's only a matter of time

Reply

[-]

vdiallonort@reddit

I would be really happy to be a "normie" if I had the money to be ;-) I have a MacBook Pro M3 with 24 GB from work; you need to spend way more than that (which was already expensive for my taste) and the speed is disappointing. In my dream there is a cheap M5 Ultra, in my dream....

Reply

[-]

egomarker@reddit

https://preview.redd.it/6ay76woq2f7g1.jpeg?width=1102&format=pjpg&auto=webp&s=bb85be5ff527e800efb201de0a94e997c86ce4f6 Hop in kids

Reply

[-]

msc1@reddit

I'd get in that van!

Reply

[-]

FaceDeer@reddit

I think that van would backfire on kidnappers, they'd find themselves instantly surrounded by a mob of ravenous savages tearing the van apart to get at the RAM in there.

Reply

[-]

Latter_Virus7510@reddit

Absolutely! 😅

Reply

[-]

ashirviskas@reddit

With screwdrivers!

Reply

[-]

CasualtyOfCausality@reddit

Yeah, this is akin to laying down in an anthill and ant-whispering that you are actually covered in delicious honey.

Reply

[-]

Important-Novel1546@reddit

oh, without a second doubt

Reply

[-]

Chrono978@reddit

There is always a chance they’re telling the truth and with those prices, it’s a worthy risk.

Reply

[-]

ThisWillPass@reddit

You're not my daddy…

Reply

[-]

Kirigaya_Mitsuru@reddit

I can't afford a PC to play the latest games, even EU5 is laggy sometimes, let alone a local LLM. \*sigh\* I'm pro AI, but I think the future is local and not cloud, so I'll hop into the van too!

Reply

[-]

nachoaverageplayer@reddit

I upgraded my M1 pro with 16GB ram to an M4 max with 48GB for this very reason. It’s just so performant at anything I throw at it and portable that it’s worth the apple tax imo.

Reply

[-]

clduab11@reddit

People finish assembling their perfect workstation? 😬

Reply

[-]

juggarjew@reddit

Normies with Macs don't have lots of RAM, they have an 8-32GB Mac lol

Reply

[-]

qwen_next_gguf_when@reddit

Currently, you just need a few 3090s and as much RAM as possible.

Reply

[-]

10minOfNamingMyAcc@reddit

I assume you're talking about DDR5? I'm struggling with 3600MHz DDR4... (64 GB VRAM but still, can barely run a 70B model at Q4\_K\_M gguf at decent speed for fast inference \~30-60 seconds below 16k even... Is koboldcpp bottlenecking me?) I should've upgraded earlier, but went for a new monitor...

Reply

[-]

Lissanro@reddit

A 70B dense model is quite slow if it cannot fully fit in VRAM... For example, Kimi K2 is a 1T model but has just 32B active parameters, so in the case of CPU-only inference it will be faster than a 70B model. And based on the 3600 MHz speed, you likely have dual-channel RAM, which is almost four times slower than 8-channel DDR4 3200MHz RAM. In any case, to run a model efficiently its context cache needs to be entirely in VRAM. Then prompt processing will be done on the GPUs and text generation will be much faster too.
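The "almost four times" figure can be checked with back-of-the-envelope arithmetic. This is only a sketch under stated assumptions: theoretical peak bandwidth (transfer rate x 8 bytes per 64-bit channel x number of channels), real-world throughput will be lower, and `peak_ddr_bandwidth_gbs` is a made-up helper name. The second part assumes both models use the same quantization, so the weights read per token scale with the active parameter count.

```python
def peak_ddr_bandwidth_gbs(mega_transfers_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak: transfers/s x 8 bytes per 64-bit channel x channel count."""
    return mega_transfers_per_s * bus_bytes * channels / 1000  # MB/s -> GB/s


dual_3600 = peak_ddr_bandwidth_gbs(3600, channels=2)  # dual-channel DDR4-3600
octa_3200 = peak_ddr_bandwidth_gbs(3200, channels=8)  # 8-channel DDR4-3200
print(dual_3600, octa_3200, round(octa_3200 / dual_3600, 2))  # 57.6 204.8 3.56

# Weights touched per generated token scale with the *active* parameters.
dense_active_b = 70  # 70B dense: every weight is read for each token
moe_active_b = 32    # Kimi K2: 32B active parameters (figure from the comment)
print(round(dense_active_b / moe_active_b, 2))  # 2.19
```

So on the same RAM the MoE reads roughly half as many weights per token, even though the whole 1T model still has to fit somewhere.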

Reply

[-]

Lissanro@reddit

Well, I seem to satisfy the requirements. I have four 3090s, they are sufficient to hold 160K context at Q8 with four full layers of Kimi K2 IQ4 quant (or alternatively could hold 256K context without full layers), and 1 TB of RAM. Seems to be sufficient for now. Good thing I purchased it at the beginning of this year while prices were good... otherwise at current RAM prices upgrading would be tough.

Reply

[-]

Wrong-Historian@reddit

> a few 3090s Okay, cool > and as much RAM as possible. Whaaaaaaaaaaa

Reply

[-]

_VirtualCosmos_@reddit

Nice, that would be +1300 euros per used 3090 and +1000 euros per 64 GB RAM lol

Reply

[-]

ShameDecent@reddit

A few 3090s won't be cool at all, they would substitute for a room heater.

Reply

[-]

RedParaglider@reddit

It's not enough to be able to drive a phat ass girl around town and show her off, you gotta be able to lift her into the truck. AKA ram :D.

Reply

[-]

LocoMod@reddit

Who needs two kidneys amirite?

Reply

[-]

jwr@reddit

Can relate, I am the normie. I own a M4 Max (64GB) laptop and I kept wondering why people have to go to such lengths and expenses to run those 30B models, until I realized the reasons.

Reply

[-]

Tenkinn@reddit

Not sure about the significantly better speeds, I guess for the same price you have a faster setup with Nvidia GPUs, but for sure it's WAY EASIER to buy, install, and set up, costs less to run, doesn't consume a billion watts or replace your heater, makes way less noise, and takes way less space

Reply

[-]

PMvE_NL@reddit

What? You can do research in a day and assemble it in one day. I would say... Skill issue

Reply

[-]

One_of_Won@reddit

This is so misleading. My dual 3090 setup blows my Mac mini out of the water

Reply

[-]

1Soundwave3@reddit

Okay, it's good that you have both because I have some questions. How much vram do you get out of your dual 3090 setup? Also, do you really need that? Because from what I've seen gpt oss 20b is the first model that I can call decent and I can run it on my gaming PC no problem. And it's a MOE one. So I'm just thinking: MOE sounds like the biggest bang for the buck. Mac mini sounds like the biggest bang for the buck as well. If you combine them and hope that there will be better MOE models - it seems like a good choice for a small local setup, that does pretty much anything you need locally if you can't use a cloud model for some reason.

Reply

[-]

BusRevolutionary9893@reddit

It makes no sense. If it said something about being able to run larger models and left out normies, that might work. Normies don't have 512 GB of unified memory.

Reply

[-]

20ol@reddit

In price or energy use?

Reply

[-]

PerfectReflection155@reddit

People are affording the latest MacBooks? On credit card, right?

Reply

[-]

txdv@reddit

Normies are not dropping 10k on a mac with 512gb of ram

Reply

[-]

RabbitEater2@reddit

The only thing worse than slow generation is slow prompt processing. And at least windows can run way more AI/ML stuff if you're into that. Can't say I'm jealous tbh.

Reply

[-]

ai-christianson@reddit

This is the main reason I got a MBP 128GB... well, that & mobile video editing. I say this as a long-time Linux user. I still miss Linux as a daily driver, but can't argue with the local model capability of this laptop.

Reply

[-]

TechnoByte_@reddit

Why not use Asahi Linux?

Reply

[-]

noiserr@reddit

> I still miss Linux as a daily driver Strix Halo was an option.

Reply

[-]

AmpEater@reddit

Same!

Reply

[-]

riceinmybelly@reddit

A second hand Mac Studio M2 96GB is super affordable and is hard to beat. The pricier beelink GTR9 Pro 128 GB is left in the dust

Reply

[-]

ElephantWithBlueEyes@reddit

i gave up on local LLMs

Reply

[-]

Ok-Future4532@reddit

This can't be serious, right? This can't be true. Is it because of the bottlenecks related to using multiple GPUs? Is there something else I'm missing? GDDR6/7 VRAM is so much higher speed than unified memory, so how can MacBooks be faster than custom multi-GPU setups?

Reply

[-]

VegetableSense@reddit

https://i.redd.it/znvx4acbci7g1.gif

Reply

[-]

apetersson@reddit

i have yet to decide between a \~10k mac ultra (m5/m3/m1) ? and a custom build. my impression is that "small" models could be a bit faster on a custom build but any "larger" model will quickly fall behind because a 10k GPU based build just won't be able to hold it proerly. educate me.

Reply

[-]

StaysAwakeAllWeek@reddit

If you're looking at 10K you're close to affording an RTX Pro 6000, which will demolish any Mac by about 10x for any model that fits into 96GB VRAM. But if you overflow that 96GB it can fall down as far as 1/4 as fast, limited by the PCIe bandwidth. If you're into gaming, the Pro 6000 is also the fastest gaming GPU on earth, so there's that

Reply

[-]

apetersson@reddit

thanks for the input - ok, so why should it be "10x" faster for smaller models? i'm thinking RTX Pro 6000 1.6TiB/sec mem bandwidth vs 0.8 TiB/sec on a Mac Studio Ultra should be about 2x. what am i missing?
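The 2x intuition can be written down as a rough upper bound for token generation: each new token has to stream the weights through memory once, so tokens/s is at most bandwidth divided by the bytes read per token. A minimal sketch, assuming a hypothetical model whose weights take 40 GB (that size and the helper name `decode_upper_bound_tps` are made up for illustration; the two bandwidth figures are the ones quoted above):

```python
def decode_upper_bound_tps(bandwidth_bytes_per_s: float, weight_bytes_per_token: float) -> float:
    """Bandwidth-only ceiling on generated tokens per second."""
    return bandwidth_bytes_per_s / weight_bytes_per_token


TIB = 2**40
model_bytes = 40e9  # hypothetical: weights that occupy 40 GB

rtx = decode_upper_bound_tps(1.6 * TIB, model_bytes)  # RTX Pro 6000 figure from the comment
mac = decode_upper_bound_tps(0.8 * TIB, model_bytes)  # Mac Studio Ultra figure from the comment
print(round(rtx, 1), round(mac, 1), round(rtx / mac, 2))
```

With identical weights the ratio is just the bandwidth ratio, i.e. 2x. This bound only covers generation; prompt processing is generally limited by compute rather than memory bandwidth, so it is not captured here and may be where a bigger gap shows up.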

Reply

[-]

mi_throwaway3@reddit

it's not just about memory bandwidth, loading shit into memory is just one part of the equation

Reply

[-]

StaysAwakeAllWeek@reddit

Mac Studio does not support data types smaller than 16 bit. By running 8 bit quants you can double your effective bandwidth and memory capacity on nvidia cards, and if you're OK with losing some output quality a 4 bit quant increases it another 2x

Reply

[-]

RandomCSThrowaway01@reddit

It depends on what you consider to be a larger model. Because yes, the 9.5k Mac Ultra M3 has 512GB shared memory and nothing comes close to it at this price point. It's arguably the cheapest way to actually load stuff like Qwen3 480B, Deepseek and the likes. But the problem is that the larger the model and the more context you put in, the slower it goes. M3 Ultra has 800GB/s bandwidth, which is decent, but you are also loading a giant model. So, for instance, I probably wouldn't use it for live coding assistance. On the other hand, at a 10k budget there's the 72GB RTX 5500, or you are around 1,000 off from a complete PC with a 96GB RTX Pro 6000. The latter is 1.8TB/s and also processes tokens much faster. It won't fit the largest models but it will let you use 80-120B models with a large context window at a very good speed. So it depends on your use case. If it's more of an "ask a question and wait for the response" then Mac Studio makes a lot of sense as it does let you load the best model. But if you want live interactions (e.g. code assistance, autocomplete etc.) then I would prefer to go for a GeForce and a smaller model but at higher speed. Imho if you really want a Mac Studio with this kind of hardware I would wait until M5 Ultra is out too. Because it should have like 1.2-1.3TB/s memory bandwidth (based on the fact base M5 beats base M4 by like 30% and Max/Ultra is just a scaled up version) and at that point you just might have both capacity and speeds to take advantage of it.

Reply

[-]

StaysAwakeAllWeek@reddit

>It's arguably the cheapest way to actually load stuff like Qwen3 480B, Deepseek and the likes. It's the cheapest *reasonable* way to do it. The actual cheapest way to do it is to pick up a used Xeon Scalable server (eg Dell R740) and stick 768GB of DDR4 in it. You get 6 memory channels for ~130GB/s bandwidth per cpu, and up to 4 CPUs per node, for an all out cost of barely $2000 (most of that being for the RAM, the cpus are less than $50). You can even put GPUs in them to run small high speed subagent models in parallel, or upgrade to as much as 6TB of RAM. The primary downside is it will sound like 10 vacuum cleaners having an argument with 6 hairdryers. They are super cheap right now because they are right around the age where the hyperscalers liquidate them to upgrade. Pretty soon they will probably start rising again if the AI frenzy keeps going

Reply

[-]

JockY@reddit

> any "larger" model will quickly fall behind because a 10k GPU based build just won't be able to hold it proerly [sic] Based on this sentence alone I recommend *not* trying to understand screwdrivers and instead just buy the nice shiny Apple box. Plug in. Go brrr.

Reply

[-]

holchansg@reddit

RTX XX90, is not even close.

Reply

[-]

_hypochonder_@reddit

My 4x AMD MI50s 32GB work fine for me for LLM inference stuff. How much does an Apple product with 128GB of usable VRAM cost again?

Reply

[-]

mi_throwaway3@reddit

it's literally 5k

Reply

[-]

TokenRingAI@reddit

It's worse than that, the new iPhone has roughly the same memory bandwidth as a top-end Ryzen desktop.

Reply

[-]

mi_throwaway3@reddit

you'd think apple was in here astroturfing that memory bandwidth and power consumption were the two leading concerns with LLM usage

Reply

[-]

Calamero@reddit

What are they doing with all that power though? Siri can’t be it. Probably just listening and giving out social scores…

Reply

[-]

ForsookComparison@reddit (OP)

Server racks would look much neater if they were just iPhone slabs and type-C cables

Reply

[-]

TokenRingAI@reddit

One day OpenAI will do a public tour of their datacenter and we'll realize it's been super-intelligent monkeys doing math problems on iPhones all along

Reply

[-]

ImJacksLackOfBeetus@reddit

Only thing I learned from this thread is that nobody knows what they're talking about according to somebody else, and that the old Mac vs. PC (or in this case, GPU) wars are still very much alive and kicking. lol

Reply

[-]

ForsookComparison@reddit (OP)

Let us have some fun

Reply

[-]

ImJacksLackOfBeetus@reddit

Don't mind me, I'm sitting here popcorn at the ready. Have at it! lol

Reply

[-]

crazymonezyy@reddit

It's similar to doing a month of research to find the best android camera only for people around you to prefer their iphones for photos because they're more Instagram friendly.

Reply

[-]

wh33t@reddit

Dollar for dollar + token for token ... nah Plus ... how do you upgrade a mac?

Reply

[-]

a_beautiful_rhind@reddit

Just wait till you find out what you can get in 2-3 years. Their macbook is gonna look like shit, womp womp. Such is life, hardware advances.

Reply

[-]

El_Danger_Badger@reddit

Honestly, I don't see the issue with running local on Mac at all. The machines happen to be almost purpose-built to run inference. Everyone started at zero two years ago with this stuff, and really, AI is the only true expert at AI. Have the biggest rig on the block, or a Camry running locally on a Mini, the end result is local first, local only. Privacy, sovereignty, some form of digital dignity, and some semblance of control in a disturbingly surveilled world. Five years from now, they will just sell boxes to deal with it all on our behalf. But however you slice it, hosting your own isn't easy and isn't cheap. So if anyone can make it work, more power to them. To quote the immortal words of, well, both east and west coast rappers, "we're all in the same gang".

Reply

[-]

CheatCodesOfLife@reddit

Yeah if you're only doing sparse MoEs with a single user, get a mac.

Reply

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*

Reply

[-]

aeroumbria@reddit

Really depends on your use case. Macs still cannot do PyTorch development or ComfyUI well enough. And if you wanna do some gaming on the side, it is the golden age for dual GPU builds right now.

Reply

[-]

jeffwadsworth@reddit

Money talks baby

Reply

[-]

H0vis@reddit

Normies don't have the latest Macbook in this economy.

Reply

[-]

Rockclimber88@reddit

It's because of NVIDIA's gatekeeping of VRAM and charging obscene amounts for relevant GPUs like RTX 6000 PRO with barely 96GB

Reply

[-]

the-mehsigher@reddit

So it makes sense now why there are so many cool new “Free” open source models.

Reply

[-]

esamueb32@reddit

Not for Stable Diffusion lol. MacBooks are so much slower

Reply

[-]

holchansg@reddit

https://preview.redd.it/mpgrwbod2f7g1.png?width=1159&format=png&auto=webp&s=3bdeca312d3e4126d2628fc2d3894d7a862925b5 This is the normie one... can't get better than this... only the MX Ultra and Max have more bandwidth, and they don't have nearly as many TOPS in the NPU.

Reply

[-]

Ill_Barber8709@reddit

M4 Pro has 270GB/s memory bandwidth. As far as I know, AI Max is 250GB/s

Reply

[-]

getmevodka@reddit

M3 Ultra has 819GB/s, what's the point of arguing here? I don't get it

Reply

[-]

holchansg@reddit

The Max/Ultra ones are fast, but a dedicated GPU is still better.

Reply

[-]

getmevodka@reddit

Im thinking pro 6000 max q for christmas rn

Reply

[-]

holchansg@reddit

And I'm covered in jealousy.

Reply

[-]

getmevodka@reddit

😅 I live next to a university with an AI and robotics lab. The wants are very strong cause of that xD. My two 3090s will do for some time still, I guess

Reply

[-]

Ill_Barber8709@reddit

M3 Ultra has barely less memory bandwidth than the RTX 3090, but MLX being 20% faster than GGUF, it performs better.

Reply

[-]

holchansg@reddit

In some setups they are tied, in MoEs or multi-modal it's not even close... also the price difference is massive, the 3090 is dirt cheap.

Reply

[-]

holchansg@reddit

But has less TOP/s, far less...

Reply

[-]

Ill_Barber8709@reddit

And? Would you train a model on that thing? I guess not. Most people here are using local LLMs for inference. Not training. And most people doing actual model training won't use a home workstation anyway...

Reply

[-]

holchansg@reddit

You do know they state TOPS for INT8, right? Meant for INFERENCE.

Reply

[-]

Ill_Barber8709@reddit

Don't care. Compute power is not the bottleneck for inference. M4 Max has 546GB/s memory bandwidth.

Reply

[-]

holchansg@reddit

Yes it is, not so much, this is true, but only for some cases. Try to run some multi-modal models and see for yourself.

Reply

[-]

onetimeiateaburrito@reddit

I'm just sitting here in my box-home made of poverty with an RTX 3070 and 8gb. Run some cool prompts for me, boys. Lol

Reply

[-]

Bozhark@reddit

I am in this picture, twice

Reply

[-]

CMDR-Bugsbunny@reddit

Yeah, I just sold my 2nd RTX A6000 from my Threadripper LLM Server. My stupid $2k refurbished MacBook Pro M2 Max with 96GB RAM was fast enough. While 100+ T/s was cool - 30-40 T/s is still plenty fast and a LOT cheaper.

Reply

[-]

PotentialFunny7143@reddit

https://preview.redd.it/5l0hetujmf7g1.png?width=800&format=png&auto=webp&s=4d5d922c732c5dd6b4d510d8573a2b6e99a81c36 The situation is out of control!

Reply

[-]

Expensive-Paint-9490@reddit

Who cares, I am not installing a closed source OS on my personal machine.

Reply

[-]

ForsookComparison@reddit (OP)

right on

Reply

[-]

SamuelL421@reddit

(*Shhh, don't tell the normies, but half the fun of LocalLLaMA is getting an excuse to spend months assembling a workstation.*)

Reply

[-]

ForsookComparison@reddit (OP)

*(I know)*

Reply

[-]

ThatCrankyGuy@reddit

Man, fuck macs.. but also.. M-Chips.. How the hell has no one caught up to the caliber of these chips?

Reply

[-]

ForsookComparison@reddit (OP)

I'm sure it's more complicated than that, but my feeble consumer understanding is that Windows-on-ARM is souring the experience and any mass-appeal that Qualcomm PC's could have - and so we keep getting these ludicrously expensive low-volume laptops that make no sense and a half-assed effort from everyone involved.

Reply

[-]

CSharpSauce@reddit

My computer has been unstable as hell since I put the second 3090 in, but I don't think I could live without it now... :(

Reply

[-]

tarruda@reddit

I'm far from a "normie" and had never once bought a single Apple product. But it is a fact that Apple Silicon is simply the most cost-effective way to run LLMs at home, so last year I bit the bullet and got a used Mac Studio M1 Ultra with 128GB on eBay for $2500. One of the best purchases I have ever made: This thing uses less than 100W and runs a 123B dense 6-bit LLM at 5 toks/second (measured 80W peak with asitop). Just to give an idea of how far Apple is ahead of the competition: M1 Ultra was released in March 2022 and still provides faster LLM inference than the Ryzen AI MAX 395+, which was released in 2025. And Ryzen is the only real competition for the "LLM in a small box" hardware; I don't consider these monster machines with 4 RTX 3090s to be competing, as they use many times the amount of power. I truly hope AMD or Intel can catch up so I can use Linux as my main LLM machine. But it is not looking like it will happen anytime soon, so I will just keep my M1 Ultra for the foreseeable future.
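Those numbers hang together if you run them through the usual bandwidth-bound estimate. A rough sketch, assuming the M1 Ultra's 800 GB/s memory bandwidth, exactly 6 bits per weight with no quantization overhead, and no KV-cache traffic (all simplifications of mine; the 123B, 5 toks/second and 80W figures are from the comment):

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


size_gb = weights_gb(123e9, 6)     # 123B dense model at 6-bit
bandwidth_gbs = 800                # assumption: M1 Ultra peak memory bandwidth
bound_tps = bandwidth_gbs / size_gb  # ceiling if every token streams all weights once

measured_tps = 5                   # from the comment
fraction = measured_tps / bound_tps
joules_per_token = 80 / measured_tps  # 80 W peak, from the comment

print(round(size_gb, 2), round(bound_tps, 2), round(fraction, 2), joules_per_token)
# 92.25 8.67 0.58 16.0
```

So the weights fit in 128GB with room to spare, 5 toks/second is a bit under 60% of the theoretical ceiling, and it works out to about 16 joules per token at peak draw.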

Reply

[-]

InspirationSrc@reddit

Is it? Maybe I'm wrong (please tell me if I am, so I can go and buy a Mac), but everywhere I look people say a MacBook isn't that fast for inference with 30B+ models and you'd better use two or more 3090s. And it's not going to work for tuning at all. And you can't even connect a GPU via Thunderbolt, it only works on Intel and AMD.

Reply

[-]

DataGOGO@reddit

lol… no.

Reply

[-]

Apprehensive-End7926@reddit

Ooooh you’ve really triggered some folks with this one

Reply

[-]

ForsookComparison@reddit (OP)

Gotta get that engagement.

Reply

[-]

tgwombat@reddit

Oh boy, wait until this guy hears about the steady march of progress!

Reply

[-]

Denny_Pilot@reddit

That's probably because the vram gets overflown and the CPU starts doing the work? In that case mac would really give a better speed just because for the price you can't get as much vram. Otherwise idk, the dedicated gpus are faster

Reply

[-]