TheaterFire

I'm strong enough to admit that this bugs the hell out of me

Posted by ForsookComparison@reddit | LocalLLaMA | View on Reddit | 420 comments

SadEntertainer9808@reddit

It's almost as if a device that's designed end-to-end by the same firm, largely stamped out of a single die, and built with a robotic degree of precision no hand-assembly could achieve, using parts too finicky to be user-serviceable, might actually just be better, and that the guys assembling a "perfect" workstation are maximizing an aesthetic instead of maximizing performance. If you saw a guy gluing together a car in his garage and actually thinking it might beat a Porsche you wouldn't even think he was insane. You'd just think he was stupid.
View on Reddit #81215537

ForsookComparison@reddit (OP)

Thread is older than I am and my knees hurt
View on Reddit #81234716

Cergorach@reddit

If this is the case, someone sucks at assembling a 'perfect' workstation. ;) Sidenote: Owner of a Mac Mini M4 Pro 64GB.
View on Reddit #73580654

o5mfiHTNsH748KVq@reddit

I'm pretty happy with my 512GB M3 Ultra compared to what I’d need to do for the same amount of VRAM with 3090s. Spent a lot of money for it, but it sits on my desk as a little box instead of whirring like a jet engine and heating my office. I wish I could do a CUDA setup though. I feel like I’m constantly working around the limitations of my architecture instead of being productive building things.
View on Reddit #73582170

bigh-aus@reddit

I really think the jumps up the stack are:
- 512GB M3: $10k
- 4x RTX 6000 Pro Max-Q: $36k
- 4x H200 NVL: $130k (or at this point you're into second-hand DGX stations)

I think it depends on your use case and what you want to do. $10k is a bit to spend on privacy (now). I'm hopeful that IF the public models ramp their cost, there will be more hardware options to run locally then.
View on Reddit #76043690

Suitable-Program-181@reddit

Cuda is trash, asahi unlocks M chip better than cuda will with any rtx.
View on Reddit #75683483

Cergorach@reddit

I agree, your M3 Ultra 512GB is a LOT more energy efficient and cheaper than 21x 3090... But it's not faster than that 3090 card. Which is what the meme is hinting at.
View on Reddit #73585950

o5mfiHTNsH748KVq@reddit

Right, yeah, it's definitely not faster.
View on Reddit #73588021

ArtfulGenie69@reddit

If mac wasn't so slow I'd have got one too. All the hardware is weirdly only taking care of one aspect of the ai. Like Mac can run a big model slowly but is expensive. Nvidia has speed and good drivers and many projects take advantage of cuda but the price per GB of vram is very high. AMD sucks at drivers and almost none of the new gits work out of box with it and it is slow but it's like half the price of Nvidia and you don't need the big bucks a Mac investment would take. 
View on Reddit #73668849

Gallagger@reddit

"AMD sucks at drivers and almost none of the new gits work out of box with it" The new gits work with Mac though? Thought they don't work because they're cuda based = needs Nvidia.
View on Reddit #73865545

ArtfulGenie69@reddit

Maybe it's about the same for amd/mac? Like with both you are buying only inference. I think when you stay in the realm of comfyui and the toys that have had some time to mature they work ok after you get past the rocm hurdles but they are much slower on the amd and Mac. I'm kinda thinking more about when the new models drop or using some extra fast inference like vllm or exl3 or all the tts models that come out. They are all only going to work through Nvidia hardware from the jump as they were all developed on it, at least that's what it looks like when I'm installing them all the time.  Like I was saying I don't think you would want to train wan/qwen/flux or any of the tts models on an amd card or a mac. It sounds like hell just getting amd to infer but maybe it's just rocm. It's like every one of their cards has some different reason why this or that version of rocm doesn't work. It sounds so fucking tedious. 
View on Reddit #73904171

blazze@reddit

The M5 Ultra is going to match the RTX 3090.
View on Reddit #73654734

CryptoCryst828282@reddit

Not even close. I have an M3 Ultra and t/s isn't bad, but once you load up on context the PP (prompt processing) time is just stupid slow and no one really talks about that part. I don't know what makes it so bad but it's garbage at higher context.
View on Reddit #73804053

blazze@reddit

Also, all M series chips before the M5 are "stupid slow" on Flux and Stable Diffusion. Those NPU cores make a difference.
View on Reddit #73806913

CryptoCryst828282@reddit

I will say I haven't tried the newest stuff, but it always felt different to me. They were great for chatbots and stuff like that, but if you want to do anything agentic, they seemed to be a bit lacking.
View on Reddit #73820355

Cergorach@reddit

Possibly, but by the time that comes out the 3090 is 6 years old and a 5090 will still have 2x the memory bandwidth. And a 6090 is not that far away (or already out)... Neither is inherently a better solution than the other, each has its use. The point here was 'faster'... The Mac solution is a lot of things, but faster isn't one of them.
View on Reddit #73657670

OrneryMinimum8801@reddit

Isn't the M3 Ultra around 815 GB/s of bandwidth? That's less than a 3090 or 4090, but you should make up for the bandwidth issues by the monster amount of RAM allowing a huge context window to be held in memory (dollar for dollar). I mean, what does a system cost for 15x 4090 networked together?
View on Reddit #73761933

blazze@reddit

M5 is a generational shift that will come close to challenging Nvidia's GPU dominance. Similar competition to Google's TPU.
View on Reddit #73658788

ErisLethe@reddit

Your 3090 costs over $1,000. The performance per dollar favors Metal.
View on Reddit #73598357

Nepherpitu@reddit

3090 costs around $500, not even close to 1000. Otherwise yes, it's impossible to assemble 20+ gpus cheaper than single mac.
View on Reddit #73616066

polikles@reddit

Depends on where you live. In my location a 3090 can be between $700 and $1200, provided you find any in working condition.
View on Reddit #73623534

ArtfulGenie69@reddit

I'm guessing they are underselling the price. For a used one it's been riding between $700 and $1000.
View on Reddit #73668983

polikles@reddit

Yeah, I think so. There is a subtle difference between how much a thing costs and how much someone feels it 'should' cost. I had just checked, and in my location most of the 3090s are $800+ (after conversion) right now. Mind that there aren't a lot of them available, as they're all used cards. I got my 3090 for slightly less than $700 last year, as it had a broken fan and caked cooler. I've replaced the fans and thermal pads and it's as good as new.
View on Reddit #73697722

ArtfulGenie69@reddit

I did the exact same thing, just a less lucky buy on some of them. One had a fan issue that the seller was trying to hide, but they had dropped the price to like $800. A new full set of fans is under $20, and because the fans were bad I did thermal pads too. Boy does it drop a lot more heat with the new pads. Put them on the back plate too. I was lucky with the other 3 3090s and didn't have to put in that effort though.
View on Reddit #73739536

Cergorach@reddit

No, a 3090 is a lot better for performance per dollar, Macs are expensive and the 3090 is very fast! No, Apple is king on memory per dollar vs VRAM. And neither is what the meme was saying.
View on Reddit #73627098
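
A rough sketch of the memory-per-dollar comparison being argued here, using prices quoted elsewhere in this thread (a used 3090 at roughly $500 to $1,200, the 512GB M3 Ultra at about $10k). The function name is made up for illustration, and the GPU figure deliberately ignores the motherboard, PSUs, and risers a multi-GPU rig would also need, so treat it as a lower bound for the 3090 side.

```python
def dollars_per_gb(price_usd: float, memory_gb: float) -> float:
    """Price per gigabyte of model-usable memory (VRAM or unified memory)."""
    return price_usd / memory_gb


# Prices are the ones quoted in the thread, not verified market data.
rtx_3090_cheap = dollars_per_gb(500, 24)      # most optimistic used price
rtx_3090_typical = dollars_per_gb(700, 24)    # a commonly quoted used price
m3_ultra_512 = dollars_per_gb(10_000, 512)    # whole machine, not just memory

print(f"3090 @ $500:  ${rtx_3090_cheap:.2f}/GB")
print(f"3090 @ $700:  ${rtx_3090_typical:.2f}/GB")
print(f"M3 Ultra:     ${m3_ultra_512:.2f}/GB")
```

This only covers capacity per dollar. It says nothing about speed per dollar, which is the other half of the disagreement.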

Individual-Source618@reddit

No, it is not. If that were the case you would have Mac Ultras in datacenters.
View on Reddit #73701556

Cergorach@reddit

The number of companies actually running 3090 cards in datacenters is *extremely* limited. There are probably some, but nothing I would call 'professional' or 'Enterprise'. An H200 server with 8 cards costs half a million, has 1128GB of VRAM, uses about 7+kW, but is utilized in most cases almost 100%. Those servers are great for multi-user loads. An M3 Ultra isn't great at multi-user LLM loads; it's mostly used for either lab work or single-user work, and even then the load is a fraction of what a dedicated H200 server does. That H200 server at idle draws more power than the M3 Ultra under load. 21x 3090 cards (+ hardware) are even more power hungry than a single H200 server. Way slower and less versatile. Not surprising for 5 year old hardware. Hence I talked about energy efficiency, not per token, but over the whole time those machines are on and used by a single user (workstation). A 3090/4090/5090 or even a 6000 Pro can be a great choice in certain scenarios where the model fits in the VRAM and produces good enough output. But in most cases, it doesn't. Thus you're generally better off with a solution that has more memory, but is slightly slower. Unless of course you have the money for H200 servers, then the equation becomes totally different. We're also not talking about datacenter servers, we're talking about home workstations specifically.
View on Reddit #73707706

The_Hardcard@reddit

Is there a workstation setup that can hold, power, and orchestrate enough 3090s for 512 GB RAM? I can see getting 6 6000 Pros in a rig for significantly more money than an M3 Ultra.
View on Reddit #73607239

BumbleSlob@reddit

Don’t discount how much power it takes for the Apple chip vs the 22 3090s it would take to get equivalent VRAM. Back of the napkin math: it would take 22 3090s at 350 watts apiece, so 7,700 watts. Versus I think the M3 Ultra maxes out around 300 itself.
View on Reddit #73583831
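
The napkin math is easy to check. This is a minimal sketch, assuming 24 GB per 3090 and counting GPU board power only (no CPU, fans, or PSU losses); the helper names are made up for illustration. The 350 W and 275 W figures are the stock and power-limited numbers mentioned in this thread.

```python
import math


def gpus_for_memory(target_gb: int, per_gpu_gb: int) -> int:
    """Smallest number of GPUs whose combined VRAM reaches target_gb."""
    return math.ceil(target_gb / per_gpu_gb)


def total_board_power_w(gpu_count: int, watts_each: int) -> int:
    """Sum of GPU board power only; the rest of the system is ignored."""
    return gpu_count * watts_each


count = gpus_for_memory(512, 24)
print(count)                              # GPUs needed to reach 512 GB
print(total_board_power_w(count, 350))    # at stock power limit
print(total_board_power_w(count, 275))    # with a 275 W power limit
```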

cyberdork@reddit

Running 3090 at 350W is plain stupid. Needs no more than 275W with very little performance hit.
View on Reddit #73647140

CryptoCryst828282@reddit

200W isn't even that bad of a hit.
View on Reddit #73804350

TokenRingAI@reddit

Yes, but with 24x the memory bandwidth and compute.
View on Reddit #73586713

Ill_Barber8709@reddit

Memory bandwidth doesn't scale like that... Single card compute is useless already for inference. Imagine 22 times more compute. 22 times more useless.
View on Reddit #73591678

CheatCodesOfLife@reddit

Was this logic generated by an LLM that fits on a 1060 3GB?
View on Reddit #73632760

MitsotakiShogun@reddit

> instead of whirring like a jet engine and heating my office ngl, summers are tough... but I haven't turned on the apartment's heating for the last 2 winters, and I'm getting refunds on the utility bill at the end of the year.
View on Reddit #73583357

Sufficient-Past-9722@reddit

I solved this with... putting that beast in the basement and running a single Ethernet cable to it. 
View on Reddit #73600945

MitsotakiShogun@reddit

No basement, just a single, small apartment that I vastly overpay for. Like a proper European 🇪🇺
View on Reddit #73602706

Sufficient-Past-9722@reddit

Haha I just moved out of Europe where I had a good basement, to Asia where I'm hoping to find a 40m2 place for 4 people. Mayyybe I'll get a balcony for the server to live on.
View on Reddit #73603392

polikles@reddit

balcony for server? Aren't you afraid of dust?
View on Reddit #73623498

Sufficient-Past-9722@reddit

Yeah it complicates things for sure. Air cooling would require intake filters and probably a custom case, and condensation could be an issue if I'm not using it 24/7.  That said the balconies here are enclosed, so the exposure is a lot lower. Currently I'm leaning towards a split setup like an air conditioner uses, with radiator (big ass Mora IV) outside and water hoses going through the wall.
View on Reddit #73625257

Medium_Ordinary_2727@reddit

Wouldn’t a GPU be a less-efficient heater for your apartment, than a heater? Your utility bills should go up if kept at the same temperature.
View on Reddit #73596674

Soggy_Audience_6706@reddit

the meme is about a REAL PC computer not overpriced Mac joke shit
View on Reddit #74388240

Novel-Mechanic3448@reddit

You don't have tensor cores. The M5 does and is 3-5x faster.
View on Reddit #74109563

apVoyocpt@reddit

There is quite a difference in speed:
- M4 (base): 120 GB/s
- M4 Pro: 273 GB/s
- M4 Max: 546 GB/s
- An Ultra would be around 900 GB/s

And the faster the throughput, the faster the inference.
View on Reddit #73591634

Cergorach@reddit

The M3 Ultra, released this year, runs at 819.3 GB/s; the 5 year old RTX 3090 runs at 936 GB/s. This year's 5090 runs at 1.8 TB/s...
View on Reddit #73594032
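
Why bandwidth keeps coming up: a common rule of thumb (an assumption here, not something measured in this thread) is that token generation is memory-bound, so each generated token has to stream the active weights from memory once. That gives a rough ceiling of bandwidth divided by the size of the weights read per token. The 20 GB model size below is a hypothetical picked so it fits on every device; real throughput lands below this ceiling and the estimate says nothing about prompt processing.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on tokens/s when decoding is memory-bandwidth bound."""
    return bandwidth_gb_s / active_weights_gb


# Bandwidth figures as quoted in this thread.
devices = {
    "M4 (base)": 120.0,
    "M4 Pro": 273.0,
    "M4 Max": 546.0,
    "M3 Ultra": 819.3,
    "RTX 3090": 936.0,
    "RTX 5090": 1800.0,
}

MODEL_GB = 20.0  # hypothetical quantized dense model

for name, bandwidth in devices.items():
    print(f"{name:10s} ~{decode_ceiling_tps(bandwidth, MODEL_GB):.1f} tok/s ceiling")
```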

apVoyocpt@reddit

Yes I know, but try running the 120b OpenAI model on it. Or linking two together to get more ram. 
View on Reddit #73594918

Cergorach@reddit

Maybe, but you have an even bigger issue with clustering multiple Mac Studios together over Thunderbolt 5 (or worse, over 10Gbit networking) and trying to run a Deepseek full model on there (without quantization). It's always about the right tool for the job. And using a model that doesn't fit in the VRAM or unified memory is not using the right tool.
View on Reddit #73626999

apVoyocpt@reddit

https://www.youtube.com/watch?v=4l4UWZGxvoc
View on Reddit #73846584

mxforest@reddit

24 GB available at 936 vs 512 GB available at 819.3. The cliff after the small memory of 1 or many 3090s fills up is pretty sharp. And the models that do fit are not smart enough for professional workload other than some very basic stuff. I run 8bit GLM 4.6 at 200k context and still have 62 GB left for everything else. It's a beast.
View on Reddit #73638221

Rabo_McDongleberry@reddit

I own the basic M4 mini. And on that machine I do basic hobby stuff and peeing my niece and nephew learn AI (under admin supervision). For that kind of stuff it's great. But I wouldn't push it beyond that...or can't.
View on Reddit #73581448

holchansg@reddit

Yeah... M4 Mini bandwidth is 120 GB/s. The only Macs that are worth it are the Max and Ultra. AMD AI 395 is cheaper and has the same bandwidth as the Pro, without the con of being ARM, dedicated TPU...
View on Reddit #73582136

pastelfemby@reddit

Con of being ARM? It hasn't been a con in some time unless you're a Windows user.
View on Reddit #73601800

CheatCodesOfLife@reddit

Easier for the vendor to lock it down.
View on Reddit #73632837

zipzag@reddit

An Apple user is going to choose a Mac, and the Pro version at a minimum. Even the 800 GB/s in my M3 Ultra isn't fast. 120 GB/s for chat is rough. I expect a lot of people are disappointed. There's no point in buying a shared memory machine and running an 8B because it's the size that feels fast enough. Just buy the video card.
View on Reddit #73585659

calcium@reddit

> I expect a lot of people are disappointed. I doubt many people are running an LLM locally? Even when I run them on my M1 Pro MBP I get around 17 tokens/s which is sufficient for my needs - it's able to generate text faster than I can read it.
View on Reddit #73598988

recoverygarde@reddit

Depends gpt oss flies as do the Qwen VL models
View on Reddit #73595251

holchansg@reddit

Yeah, not only the model size but also the context size: at huge context sizes it's painfully slow.
View on Reddit #73586281

recoverygarde@reddit

AI 365 is slower than M4 Pro and even the base M3 is decent depending on what you’re using it for
View on Reddit #73595190

holchansg@reddit

Yeah, since M4 they bumped the Pro Bandwidth.
View on Reddit #73595294

Ill_Barber8709@reddit

> AMD AI 395 is cheaper

Cheaper than what? How much VRAM? What memory bandwidth? An M4 Max Mac Studio with 128GB of 546GB/s memory is $3499.
View on Reddit #73583464

holchansg@reddit

That's why I stated Mini and Pro... you only have more bandwidth with the Max and Ultra... and then a rig with RTX XX90's blows it out of the water.
View on Reddit #73583587

Ill_Barber8709@reddit

> and then a rig with RTX XX90's blow it out of the water. LOL, no.
View on Reddit #73583866

holchansg@reddit

Are you crazy? Just look at the testimonials in this very thread... RTX XX90 is a 500W+ specialized chip with way more memory bandwidth...
View on Reddit #73584100

Ill_Barber8709@reddit

> blow it out of the water. As I said, my own 32GB M2 Max runs Qwen3-30B-A3B 4Bit MLX at 65tps, while the desktop 5090 runs the same model Q4 at 135-200tps. Not exactly blown out of the water.
View on Reddit #73584817

holchansg@reddit

That's only a slice of the whole picture. It's not only about TPS, it's also about TTFT, prefill, E2E...
View on Reddit #73585448

RoomyRoots@reddit

They would probably learn better if you stopped peeing on them.
View on Reddit #73581617

DerFreudster@reddit

He means flush it beyond that.
View on Reddit #73586248

Rabo_McDongleberry@reddit

Fixed the typo. Lol
View on Reddit #73581682

SpicyWangz@reddit

Please fix your typo
View on Reddit #73581616

Rabo_McDongleberry@reddit

Lol. Fixed! 
View on Reddit #73581655

fredandlunchbox@reddit

That’s not even close to the best performing mac. 
View on Reddit #73581437

Cergorach@reddit

Even the M3 Ultra 512GB is not faster than even a consumer 3090. And even a MacBook only fits an M4 Max, which is only faster if you're building your LLM 'workstation' with RTX 5060 cards...
View on Reddit #73585398

Ill_Barber8709@reddit

LOL. First, go run 600B+ parameters models on 3090s... Which you can on a single M3 Ultra 512GB. Second, 3090 **TI** is 1000GB/s, 3090 is 900GB/s. M3 Ultra is 800GB/s but MLX is 20% faster than GGUF. Third, M4 Max is a laptop chip. Show me a laptop with 128GB of 540GB/s memory... You're just saying shit.
View on Reddit #73592014

Automatic-Arm8153@reddit

You’re missing the point entirely. I own both Mac and NVIDIA. NVIDIA wins, no competition. The Mac is a good generalist/beginner device though, no one can hold that against it.

To sum it up simply:
Better value proposition: Mac
Better performance: NVIDIA

Until you get to serious AI use. Then NVIDIA wins both.
View on Reddit #73598631

j_osb@reddit

Even the M3 ultra doesn't hold a candle to proper accelerators.
View on Reddit #73581490

fredandlunchbox@reddit

M4 Max also considerably outperforms the pro.
View on Reddit #73582045

Cergorach@reddit

Yes, it does, but that's not the point. I'm a 'normy' with a Mac and even I say that if you think a Mac is faster than a properly built LLM workstation, you *really* suck at building LLM workstations. A 5 year old GPU (3090) is still faster than any Apple silicon in the LLM space. The advantage that an Apple brings is a LOT of memory for a reasonable price compared to VRAM with 'decent' performance and a very low power footprint (high efficiency).
View on Reddit #73585723

wdsoul96@reddit

Why take months? Is he mining iron from the ground? right?
View on Reddit #73586639

Academic-Lead-5771@reddit

lmfao what
View on Reddit #73581517

african-stud@reddit

Try processing a 16k prompt
View on Reddit #73581283

ForsookComparison@reddit (OP)

Can anyone with an M4 Max give some perspective on how long this usually takes with certain models?
View on Reddit #73581425

__JockY__@reddit

MacBook M4 Max 128GB, LM Studio, 14,000 tokens (not bytes) prompt, measuring time to first token ("ttft"):
- GLM 4.5 Air 6-bit MLX: 117 seconds.
- Qwen3 32b 8-bit MLX: 106 seconds.
- gpt-oss-120b native MXFP4: 21 seconds.
- Qwen3 30B A3B 2507 8-bit MLX: 17 seconds.
View on Reddit #73585390
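
Those TTFT numbers can be turned into an approximate prompt processing rate by dividing the prompt length by the time to first token. A small sketch, assuming TTFT is dominated by prefill (it also includes whatever fixed overhead LM Studio adds, so these are slight underestimates); the function name is made up for illustration.

```python
PROMPT_TOKENS = 14_000

# Time to first token in seconds, as reported above.
ttft_seconds = {
    "GLM 4.5 Air 6-bit MLX": 117,
    "Qwen3 32b 8-bit MLX": 106,
    "gpt-oss-120b native MXFP4": 21,
    "Qwen3 30B A3B 2507 8-bit MLX": 17,
}


def implied_prefill_tps(prompt_tokens: int, ttft_s: float) -> float:
    """Approximate prompt-processing speed implied by a TTFT measurement."""
    return prompt_tokens / ttft_s


for model, seconds in ttft_seconds.items():
    rate = implied_prefill_tps(PROMPT_TOKENS, seconds)
    print(f"{model:30s} ~{rate:.0f} tok/s prefill")
```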

bigh-aus@reddit

It'd be great if you could give some comparisons between this and your RTX rig. :)
View on Reddit #76042738

iMrParker@reddit

On the bright side, you can go fill up your coffee in between prompts
View on Reddit #73585642

__JockY__@reddit

Yeah, trying to work under those conditions would be painful. Wow. Luckily I also have a quad RTX 6000 PRO rig, which does not suffer any such slow nonsense... and it also heats my coffee for me.
View on Reddit #73586021

Beneficial-Shame-483@reddit

Quad RTX 6000 PRO ?? With that gpu power you don't even need a cpu anymore. Just run stable diffusion on machine code to run it
View on Reddit #73618429

abnormal_human@reddit

I don't know how to say this but we might be the same person.
View on Reddit #73589517

__JockY__@reddit

Paw-paw?
View on Reddit #73598374

comfyui_user_999@reddit

Jesus, you two. Save some VRAM and some watts for the rest of us.
View on Reddit #73618058

10minOfNamingMyAcc@reddit

* gpt-oss-120b native MXFP4: 21 seconds.

I'm jealous, and not even a little bit. (64 GB VRAM here)
View on Reddit #73597270

__JockY__@reddit

It might just fit. Seriously. It comes quantized with MXFP4 from OpenAI and needs ~ 60GB. I dunno for sure, but it might just work with tiny contexts!
View on Reddit #73598767

vertical_computer@reddit

It’s shared memory though, you need to leave some room for the OS. Between that and context… gonna be real tough (borderline impossible)
View on Reddit #73609175

Its_Powerful_Bonus@reddit

And you can run minimax m2 q3 dwq MLX, which is beast! My favorite lately. gpt-oss-120b 2nd place, since it is blazing fast.
View on Reddit #73600793

Bozhark@reddit

Shit I’ve been on 20b, is 120b worth the extra seconds? 
View on Reddit #73592139

__JockY__@reddit

I never tried the 20b, too much like a toy ;-)
View on Reddit #73598406

Bozhark@reddit

Hmmpsired
View on Reddit #73599799

Sufficient_Prune3897@reddit

2 minutes is crazy
View on Reddit #73596303

waescher@reddit

M4 Max 128GB
https://preview.redd.it/85d5l0p9dk7g1.png?width=2719&format=png&auto=webp&s=300b2f325dcc6de322238bac9c81bf57d9da8b6a
Sauce: https://www.reddit.com/r/LocalLLaMA/comments/1o08igx/improved_time_to_first_token_in_lm_studio/#lightbox
View on Reddit #73633593

koffieschotel@reddit

...you don't know? then why did you create this post?
View on Reddit #73581961

ForsookComparison@reddit (OP)

It's Monday and Jira is bugging me
View on Reddit #73583519

koffieschotel@reddit

lol if that isn't a valid reason I don't know what is!
View on Reddit #73584287

SpicyWangz@reddit

I would just wait another month or two to see how M5 pro/max perform with PP
View on Reddit #73581705

ForsookComparison@reddit (OP)

I'm not in the market for any hardware right now, just curious on how things have changed.
View on Reddit #73581785

SpicyWangz@reddit

Standard M5 chips have added matmul acceleration, which significantly speeds up the prompt processing. You'd have to look for posts actually benchmarking M4 vs M5, but it was pretty impressive. Actual token generation should be sped up as well, but prompt processing will be multiple times more efficient now
View on Reddit #73582261

Ill_Barber8709@reddit

M5 is 4 times faster than M4 at prompt processing.
View on Reddit #73583225

Aggressive_Dream_294@reddit

Someone had posted about this before: https://www.reddit.com/r/LocalLLaMA/comments/1mt3epi/m4_max_generation_speed_vs_context_size/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
View on Reddit #73583042

twisted_nematic57@reddit

I do it all the time with Qwen3 32B on my i5-1334U on a single stick of 48GB DDR5-5200. Takes like an hour to start responding and another hour to craft enough response for me to do something with it but it works alright. <1 tok/s.
View on Reddit #73619222

LocoMod@reddit

gpt-oss-120b will sneeze at that and happily do it. Go higher.
View on Reddit #73604702

apifree@reddit

Same here
View on Reddit #75909770

popsumbong@reddit

Honestly the real winners are the cloud users
View on Reddit #73589241

TrumanCompote@reddit

s/users/providers
View on Reddit #75768456

JoanofArc0531@reddit

So just go with a Mac instead?
View on Reddit #74521121

ForsookComparison@reddit (OP)

Was a joke from a while ago. Thread is dead
View on Reddit #74521533

oodelay@reddit

You guys spend too much time looking at other guy's dicks to compare. My system works great and does what I ask it to. It's petty and sad.
View on Reddit #73584341

Soggy_Audience_6706@reddit

there because all girls have them now
View on Reddit #74388345

ForsookComparison@reddit (OP)

Meme
View on Reddit #73584484

Noiselexer@reddit

I'd rather take a model that fits in my 5090 and see who's faster then...
View on Reddit #73587716

Soggy_Audience_6706@reddit

there because all girls have them now
View on Reddit #74388321

ForsookComparison@reddit (OP)

If I had ~2TB/s VRAM I would do backflips to make my pipeline fit in 32GB.
View on Reddit #73587758

No-Refrigerator-1672@reddit

If by "perfect workstation" you mean no CPU offload, then Macs aren't anywhere near what a full GPU setup can do.
View on Reddit #73580727

egomarker@reddit

And nowhere near those power consumption figures either.
View on Reddit #73581837

Super_Sierra@reddit

'my 3090 setup is much faster and only cost a little more than the 512gb macbook!'
>didn't mention that they had to rewire their house
View on Reddit #73584829

Lissanro@reddit

I did not have to rewire my house, but for my 4x3090 workstation I had to get a 6 kW online UPS, since the previous one was only 900W. And a 5 kW diesel generator as a backup, but I already had it. The rig itself consumes about 1.2 kW during text generation with K2 or DeepSeek; under full load (especially during image generation on all GPUs) it can be about 2 kW. But the important part is that I built my rig gradually... for example, at the beginning of this year I got 1 TB of RAM for $1600, and when I was upgrading to EPYC, I already had PSUs and 4x3090, which I bought one by one. I also highly prefer Linux, and need my rig for other things besides LLMs, including Blender and 3D modeling/rendering that can take advantage of 4x3090 very well, and some tasks that benefit from a large disk cache in RAM or require high amounts of memory. So, I wouldn't exchange my rig for a pair of 512 GB Macs with similar total memory. Besides, the total cost of my workstation is still less than even a single one. Of course, a lot depends on use cases, personal preferences, and local electricity costs. In my case, electricity costs are small enough to not matter much, but in some countries they are so high that using hardware that isn't energy efficient may not be an option.
View on Reddit #73599687

egomarker@reddit

So how fast is GLM4.6 on your rig.
View on Reddit #73704109

YouKilledApollo@reddit

No one is running GLM4.6 on 4x3090. My estimate places GLM4.7 at around 300-400GB of memory needed. GLM Air would run fine though, even with just an RTX Pro 6000; then you'll get ~2872 t/s for prompt processing, and ~112 t/s for decoding.
View on Reddit #74128556

egomarker@reddit

And it was exactly my point. But you can run it on 512Gb mac.
View on Reddit #74132522

YouKilledApollo@reddit

But again, prompt processing will be so terribly slow, that it won't matter! That is my exact point. Save the money, don't buy Apple hardware based on decoding speeds when prompt processing is what slows that hardware down, so much it's not usable for day-to-day work. Instead, get real hardware that is meant for ML proper, and with support of the community at large, and run smaller models but run them multiple times, or with a harness with tools that makes it better. Don't go by online public benchmarks, make your own benchmarks, create the tooling, and smaller models won't matter. Not sure how people aren't getting this yet.
View on Reddit #74133325

egomarker@reddit

"Slow prompt processing" vs. "can't run at all", hmmmmm, let me think.
View on Reddit #74133549

YouKilledApollo@reddit

What's the point of being able to run something when it'll take 5 minutes to get a very simple reply? Anyways, I'm using my hardware for paid work for clients, I can't be half-assing things or waiting hours for stuff to finish. If you want to go for Apple hardware, do it, I'm not your boss :) But if you're aiming to do serious ML development or otherwise actually need performance from what you're able to run, you won't be using Apple hardware, at least today. Maybe in the future.
View on Reddit #74134164

egomarker@reddit

"5mins" is a lie, it will run at 11-16tks and will be more than enough for chit-chat.
View on Reddit #74135345

YouKilledApollo@reddit

Again, please do share the prompt processing speed, which is exactly the part that is bog slow on Apple hardware. But every time you ask an Apple fanboi about the prompt processing speed, they tend to stop responding. Hoping you won't disappoint me like all the others before you.
View on Reddit #74136785

egomarker@reddit

You are simply deflecting from the fact you can't run it at all. As I've said, prompt processing speed is enough for chat use.
View on Reddit #74137642

YouKilledApollo@reddit

> As I've said, prompt processing speed is enough for chat use

Yeah? What are the concrete numbers? As is typical for Apple fanbois, everyone says it's "fast enough" yet no one is providing concrete numbers. Let's compare, if you have the hardware in front of you. For `GLM-4.5-Air-Q4_K_M` as a slightly related example, I'm getting ~2836 t/s for pp512 on llama-bench (`build: 7f8ef50cc (7209)`) and 110 t/s for tg128, on an RTX Pro 6000. Please do provide actual concrete numbers so we can actually compare data instead of your vibes, otherwise please kindly stop trying to propagate the myth that Apple hardware makes sense for inference with the bigger models. But of course, it's highly unlikely you'll actually come back with concrete numbers, and you probably have some excuse for it that makes sense. Which is exactly the problem: Apple fanbois keep saying "It's fast enough!" yet not a single individual has been able to give me some concrete comparable numbers. Please be one of the reasonable people who can provide actual data!
View on Reddit #74138431

egomarker@reddit

You just can't stop deflecting from the fact you can't run something at all, right? And yeah, you know you can drop the numbers yourself, if you think "apple fanboiz are covering up". Personally I simply don't give a single fuck about pp speed if it's a "can run - can't run" situation. Enjoy your imaginary prompt processing speed numbers on your imaginary nvidia rig lol. They are high as fuck.
View on Reddit #74139078

YouKilledApollo@reddit

Haha, still no numbers? Shocking I tell you, never could have expected that.
View on Reddit #74139244

egomarker@reddit

>you know you can drop the numbers yourself Did you miss this part of my message. And don't forget to compare numbers you post with zero tks you get on your rig.
View on Reddit #74139342

YouKilledApollo@reddit

> you know you can drop the numbers yourself I don't know what that's supposed to even mean, you mean I intentionally lowered them? That wouldn't make sense, so surely you meant something else. But anyways, what's the point? You clearly don't have the hardware yourself, or are embarrassed over your purchase and refuse to actually provide any data, so yeah. Not sure what you want from me? Just stop engaging if you don't want to have a faithful conversation by providing data to backup your statements. Otherwise this is all just a waste of time for everyone involved.
View on Reddit #74139506

egomarker@reddit

And this is how far a stupid deflection can take a man. It's 140tks. As I've said, enough for chat use. Now gtfo, you are wasting my time.
View on Reddit #74139651

YouKilledApollo@reddit

> 140tks Hah, suddenly everyone understands why you have to pry out any sort of benchmarking data out of the Apple fanbois who spent $10K on a computer that does ~140tks t/s in prompt processing :| But you know what, you do you, I'm not saying everyone should get a RTX Pro 6000, just that if you're aiming to do professional ML development, you really can't be on Apple hardware, you need at the very least something CUDA compatible. But it requires you to understand software engineering to grok this, so I'd understand if it's easier to go with Apple hardware. > Now gtfo, you are wasting my time. You can stop responding whenever you feel like, not sure if you've understood how this whole "internet" thing works like or not, but just in case; you don't have to respond, no one is forcing you.
View on Reddit #74139945

egomarker@reddit

>Hah, suddenly everyone understands why you have to pry out any sort of benchmarking data out of the Apple fanbois who spent $10K on a computer that does \~140tks t/s in prompt processing :| This "prying" happens only in your head. No one is "prying" anything, I was just having fun watching you deflecting and trying to pretend there's a CONSPIRACY against you and everyone is hiding numbers that are available on a first google search. It was hilarious, made my day, thank you. So, what is your prompt processing speed for glm 4.6. Still zero? As expected? Lmao. Bye.
View on Reddit #74140256

cyberdork@reddit

4x3090 takes less power than your electric kettle.
View on Reddit #73647249

Ragerist@reddit

Must be an American thing, I'm too European to understand. Well, actually I'm a former industrial electrician. So I fully understand that most houses here have a 3x230V 20-35A supply, then often divided into 10-13A sub-groups and 16A for appliances like dryer and washer. So not really an issue. Electrical bill on the other hand is a completely different issue.
View on Reddit #73620542

Super_Sierra@reddit

To replace 512gb of system ram with vram, it is nearly 16 GPUs. You cannot run that on a normal line in a house.
View on Reddit #73648639

CryptoCryst828282@reddit

Imagine saying I can afford an M3 Ultra for no real reason other than I just want to, but can't afford to buy a NEMA 14-50 plug. If you can afford 16 GPUs you can easily pay to get a 50A 240V circuit. My servers all run off 240V anyway because they are actually way more efficient.
View on Reddit #73804826

Ragerist@reddit

A normal line in the US, you mean.
View on Reddit #73690935

Gudeldar@reddit

There's plenty of power going to the \*house\* in the US. Modern homes have 200A supply at 240V. However most branch circuits are only 15A at 120V.
View on Reddit #73680234
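The circuit arithmetic in the comments above can be sketched roughly as follows. The 350 W per-GPU figure (stock RTX 3090 board power) and the 80% continuous-load derating are assumptions for illustration, and the sketch ignores the rest of the system and PSU losses.

```python
# Rough sketch: how many GPUs fit on one circuit's power budget.
# Assumptions (not from the thread): 350 W per GPU and an 80%
# continuous-load derating on the circuit's nominal capacity.

GPU_WATTS = 350  # assumed per-GPU draw under load


def usable_watts(volts: int, amps: int, continuous_pct: int = 80) -> int:
    """Nominal V * A, derated for continuous load (integer math)."""
    return volts * amps * continuous_pct // 100


def gpus_per_circuit(volts: int, amps: int, gpu_watts: int = GPU_WATTS) -> int:
    """Whole number of GPUs whose combined draw stays under the budget."""
    return usable_watts(volts, amps) // gpu_watts


us_branch = gpus_per_circuit(120, 15)   # typical US branch circuit
nema_14_50 = gpus_per_circuit(240, 50)  # dedicated 50A 240V circuit

print(us_branch, nema_14_50)  # 4 27
```

Under these assumptions a single 15A 120V branch tops out around four cards, while a dedicated 50A 240V circuit has room for a 16-GPU rig on paper.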

mi_throwaway3@reddit

What a stupid arse cope response. I find this response hilarious. Mac people say this like it matters. Like, who cares? Seriously. I want to get things done, don't Mac folks want to get things done? "Oh no, not if it means I'm using 40 extra watts, gee, I'd rather sit on my thumbs" Stop.
View on Reddit #73614920

YouKilledApollo@reddit

For some people electricity costs are important. Not for me personally, but I know some people already have really high bills, makes sense they try to optimize for it if that's their situation. And no, Apple hardware is excellent for some specific AI models they have specifically worked to make it run somewhat OK on their hardware, since the community isn't really focused on Apple hardware. If you want stuff that just runs, go for nvidia hardware, simple as that. And no limitations except your wallet in that case, which if you go for Apple hardware, probably already isn't an issue. Apple hardware will be dog slow and you'll regret getting it.
View on Reddit #74128633

egomarker@reddit

Imagine having a context window of 25 tokens and completely missing the fact conversation is about full gpu offload just to write another toxic comment from your throwaway account. "Extra 40 watts", lol.
View on Reddit #73622106

zipzag@reddit

True, but different tools. My Mac is always on, frequently working and holds multiple LLMs in memory. 8 watts idle, 300+ watts working, never makes a sound. Big MoE models are particularly suited for shared memory machines. I do expect I will also have a CUDA machine in the next few years. But for me, a high end mac was a good choice for learning and fun.
View on Reddit #73584800

-dysangel-@reddit

Also Deepseek 3.2 is out now, demonstrating that you can make SOTA models with close to linear prompt processing. Mac and EPYC machines with a lot of RAM are only going to become more useful over the next couple of years IMO. Especially now that you can cluster Macs effectively.
View on Reddit #73968954

Ill_Barber8709@reddit

Show me a laptop with 128GB of 546GB/s memory. Price a desktop with 128GB of 546GB/s memory compared to Mac Studio M4 Max. I won’t even talk about power efficiency. Sure, they’re not meant for training. But most of us here only use inference anyway.
View on Reddit #73581808

No-Refrigerator-1672@reddit

>Show me a laptop with 128GB of 546GB/s memory. Laptop is not a workstation. >Price a desktop with 128GB of 546GB/s memory 6x 3090 if you can get them at $500, or modded 3080 if you can't - $3000. Mobo, CPU and DDR - $1000. Power supply and case - up to $500. Total: $4500. >I won’t even talk about power efficiency. If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.
View on Reddit #73582394
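The efficiency claim above is about energy per task, which is power multiplied by time. A minimal sketch with made-up wattages and durations (none of these numbers are measurements from the thread):

```python
# Energy per task = power * time. All wattages and durations below are
# hypothetical placeholders chosen only to illustrate the arithmetic.

def energy_wh(power_watts: float, seconds: float) -> float:
    """Energy consumed by one task, in watt-hours."""
    return power_watts * seconds / 3600


low_power = energy_wh(300, 100)      # 300 W box, task takes 100 s
rig_10x = energy_wh(3000, 10)        # 10x the power, 10x faster
rig_3x = energy_wh(3000, 100 / 3)    # 10x the power, only ~3x faster

print(low_power == rig_10x)          # True: same energy per task
print(rig_3x / low_power)            # about 3.3x more energy per task
```

The claim holds only when the speedup matches the power ratio; if the bigger rig is 3x faster rather than 10x, it spends roughly 3.3x the energy on the same job. Idle draw between tasks is left out of this sketch.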

Ill_Barber8709@reddit

> Laptop is not a workstation. For inference? LOL > 6x 3090 if you can get them at $500, or modded 3080 if you can't - $3000. Mobo, CPU and DDR - $1000. Power supply, cpu cooler, fans and case - up to $500. Total: $4500. The M4 Max Studio 128GB costs $3,499.00 > If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient. I run Qwen3-30B-A3B 4Bit MLX at 65tps on a 32GB M2 Max MacBook Pro. Best benchmarks of the desktop 5090 running this model Q4 were between 135 and 200tps. You're funny, but completely delusional. See comments here https://www.reddit.com/r/LocalLLaMA/comments/1p7wjx9/rtx_5090_qwen_30b_moe_135_toks_in_nvfp4_full/
View on Reddit #73583123

CheatCodesOfLife@reddit

> Qwen3-30B-A3B 4Bit 189t/s on my single RTX3090 just now. That's running the Q4_K_M gguf. (59t/s for a 0 context 'hi' prompt with the cpu-only build) Adding more cards won't really help for a batch of 1 since the model is so light, memory bandwidth is the constraint (58 t/s with just my CPU). Macbook is obviously the better form factor though :)
View on Reddit #73634168
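The "memory bandwidth is the constraint" point can be turned into a back-of-envelope ceiling: at batch size 1, every generated token has to stream the active weights from memory once. The sketch below assumes a flat 4 bits per weight, about 3B active parameters for the A3B MoE, and ignores KV cache reads and compute, so real speeds land well below it.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound model at batch 1.
# Assumptions: flat 4 bits per weight, ~3B active parameters for a
# "30B-A3B" MoE, no KV cache traffic, no compute limits.

def decode_tps_ceiling(bandwidth_gb_s: float,
                       active_params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s: bandwidth / bytes read per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# RTX 3090 (936 GB/s) running a MoE with ~3B active parameters
moe_on_3090 = decode_tps_ceiling(936, 3, 4)

# M4 Max (546 GB/s) running a dense 32B model
dense_on_m4_max = decode_tps_ceiling(546, 32, 4)

print(moe_on_3090)       # 624.0
print(dense_on_m4_max)   # 34.125
```

The measured figures quoted in the thread (189 t/s on the 3090, roughly 18 to 23 t/s for a dense 32B on Apple silicon) sit under these ceilings, which is what a bandwidth-bound workload would look like.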

Ill_Barber8709@reddit

If you follow the link I put, you’ll see that’s exactly what I’m saying. OP was flexing about their results and I pointed out that their numbers were nothing to flex about. TBH I’m genuinely surprised by the poor performance of this model on the 5090 considering the compute power and memory bandwidth. You should explain to him how you get much better results with a much less powerful GPU ^_^
View on Reddit #73640231

CheatCodesOfLife@reddit

Yeah I clicked the link, just wanted to let you know that GPUs aren't usually that... slow. I didn't do anything special for that ^ just ran the model in llama.cpp lol
View on Reddit #73674255

No-Refrigerator-1672@reddit

>I run Qwen3-30B-A3B 4Bit MLX at 65tps on a 32GB M2 Max MacBook Pro. Best benchmarks of the desktop 5090 running this model Q4 were between 135 and 200tps. You're funny, but completely delusional. Ah, if I had a dollar every time a person judges performance by a 0-length prompt, I would have an RTX Pro 6000 by now. IRL you're not working with short context, especially not if you're paying for Max/Ultra chips; and their prompt processing is terrible. With Qwen3 30B, a very light model, and a 30-40k long prompt, [M3 Ultra only gets](https://www.reddit.com/r/LocalLLaMA/comments/1kvd0jr/m3_ultra_mac_studio_benchmarks_96gb_vram_60_gpu/) \~400 tok/s PP, while dual 3080 [will get](https://www.reddit.com/r/LocalLLaMA/comments/1p0bbrl/rtx_3080_20gb_a_comprehensive_review_of_chinese/) 4000 tok/s PP at the same depth. This is exactly 10x faster.
View on Reddit #73583845

Ill_Barber8709@reddit

Dude, I'm a developer. I spend my time processing big context. > prompt processing is terrible M4 and M3 generation yes. M5 architecture brings tensor cores to every GPU and prompt processing is now 4 times faster than M4. > This is exactly 10x faster. Ah, if I had a euro every time a person judges performance by one metric only, I would own a house in France by now.
View on Reddit #73584272

No-Refrigerator-1672@reddit

>M5 architecture brings tensor cores to every GPU and prompt processing is now 4 times faster than M4. Can I buy M5 with 128GB of memory? No? Come back when it becomes available, I will happily compare it to equivalently-priced Blackwell. >Ah, if I had a euro every time a person judges performance by one metric only, I would own a house in France by now. Surely, if I'm wrong, you would easily provide numbers that prove it.
View on Reddit #73584853

Ill_Barber8709@reddit

> Surely, if I'm wrong, you would easily provide numbers that prove it. You told me yourself that Nvidia were 10 times faster at **prompt processing** I've shown you that a 5090 is barely 2 to 3 times faster than an M2 Max. Hence, **one metric only**
View on Reddit #73585663

No-Refrigerator-1672@reddit

If you look into your usage stats, you'll see that generation length is typically multiple times shorter than prompts, basically for all usecases except essays or other creative writing. Thus it is the metric that makes the most impact. Besides, TG also drops at long prompts, and on exactly the same links you can see it's 35 for M3 Ultra and 70 for 2x3080 at depth \~30k. The difference is immense, and I'm not even talking about 5090.
View on Reddit #73585997

__JockY__@reddit

Those might be *your* usage stats, but a lot of us do batched requests with massive prompts and very small output from the LLM. We need fast PP and don't care about TG.
View on Reddit #73586401

No-Refrigerator-1672@reddit

Dude, I literally am arguing the whole conversation that RTX has massively faster PP and it is what matters the most. You are arguing against the wrong person.
View on Reddit #73620683

egomarker@reddit

>If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient. Will it really be 10x faster at concurrency 1.
View on Reddit #73582898

CheatCodesOfLife@reddit

>Will it really be 10x faster at concurrency 1. Depends what you're doing. For diffusion, VibeVoice, training then yeah, 10x faster. For a single user with sparse MoE models, maybe 3x or 4x faster.
View on Reddit #73634280

MitsotakiShogun@reddit

Depends on the model and architecture, but yes, 2^(n) 3090s with TP (and less so with other odd/even numbers), especially vLLM/sglang, can be plenty faster, even at x4 and stock drivers. Here are some numbers on a 50-60K prompt with 3 different models: https://preview.redd.it/gbim61iu6f7g1.png?width=1473&format=png&auto=webp&s=00a5b6bb35a4a832008eef27d9ae016e28af8e7b
View on Reddit #73584207

egomarker@reddit

These numbers are very exaggerated in favor of the prompt size tho. It's like "what color is the sky?" and "here's 50K personality prompt" or something. Most of the time, especially in agentic use with reasoning models, ratio is 5:1 or higher in favor of generation size. And I'm looking at generation outputs... They are around mac level, give or take.
View on Reddit #73586686

MitsotakiShogun@reddit

Fair enough. This was just an example. Obviously, nobody should take single-prompt speeds as a benchmark anyway.
View on Reddit #73587068

No-Refrigerator-1672@reddit

By the numbers that I have seen for M3 Ultra - yes, it will be more than 10x.
View on Reddit #73582977

Ill_Barber8709@reddit

That level of cope...
View on Reddit #73585021

No-Refrigerator-1672@reddit

[Learn yourself.](https://www.reddit.com/r/LocalLLaMA/comments/1pnfaqo/comment/nu7h889/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
View on Reddit #73585192

Ill_Barber8709@reddit

Sure buddy. 10 times faster at prompt processing only. You said it yourself.
View on Reddit #73585343

No-Refrigerator-1672@reddit

Yeah. And prompt processing makes the most impact, because in most usecases generation length is multiple times shorter than prompt.
View on Reddit #73585592

egomarker@reddit

No it isn't, especially with reasoning models. So what is the final speed increase if prompt size is equal to output, or prompt size is 3 times less than output.
View on Reddit #73585821
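The question above (what the overall speedup is for a given prompt-to-output ratio) can be answered with a simple two-phase model: time = prompt tokens / PP rate + output tokens / TG rate. The sketch below plugs in the rates quoted earlier in this thread (about 400 PP and 35 TG for M3 Ultra, 4000 PP and 70 TG for the dual 3080 setup, both at roughly 30k depth). Treating them as constant across context lengths is a simplifying assumption, and the numbers themselves are unverified forum figures.

```python
# Two-phase latency model for a single request:
#   seconds = prompt_tokens / pp_rate + output_tokens / tg_rate
# Rates are the ones quoted in this thread, assumed constant here.

def request_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Wall time for one request: prefill phase plus decode phase."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


def speedup(prompt_tokens, output_tokens, slow, fast):
    """How many times faster `fast` finishes the same request."""
    return (request_seconds(prompt_tokens, output_tokens, *slow)
            / request_seconds(prompt_tokens, output_tokens, *fast))


MAC = (400, 35)    # (PP tok/s, TG tok/s) as quoted for M3 Ultra
GPU = (4000, 70)   # (PP tok/s, TG tok/s) as quoted for 2x 3080

equal_sizes = speedup(1000, 1000, MAC, GPU)     # prompt == output
output_heavy = speedup(1000, 3000, MAC, GPU)    # output 3x the prompt
prompt_heavy = speedup(30000, 500, MAC, GPU)    # long prompt, short answer

for name, value in [("equal", equal_sizes),
                    ("output-heavy", output_heavy),
                    ("prompt-heavy", prompt_heavy)]:
    print(f"{name}: {value:.2f}x")
```

With these inputs the overall speedup always falls between the TG ratio (2x) and the PP ratio (10x): close to 2x when output dominates, and around 6x for a 30k prompt with a short answer. Both sides of the argument are describing different ends of the same curve.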

Ill_Barber8709@reddit

Your usage, not mine. I don't throw 30K tokens of context each time I make a prompt. So most of the time the prompt is already processed.
View on Reddit #73585813

egomarker@reddit

But have you seen numbers from 6x3090.
View on Reddit #73583098

WitAndWonder@reddit

DDR on its own right now would be $1000
View on Reddit #73590760

No-Refrigerator-1672@reddit

Nope. For all-gpu setup you don't need much DDR, 16GB will run fine. You only care about RAM prices when you are doing CPU offloading.
View on Reddit #73620726

WitAndWonder@reddit

That's a valid point as long as you're not running MoE models. But those would want to load the full model weights in the RAM while the experts are active in the VRAM. At least for highest efficiency in $ / performance. Though anyone just interested in running a single LLM might as well run a dense model with 6 3090s to work off of (but someone like me running multiple LLMs alongside other agents benefits from having some number of those being MoEs, as opposed to one large model.)
View on Reddit #73621602

No-Refrigerator-1672@reddit

We are in the comment thread that discusses loading models completely into GPUs. CPU offloading is irrelevant in this particular case.
View on Reddit #73622919

LocoMod@reddit

MacBook Pros are workstation class machines that outperform almost any consumer PC hardware unless you are willing to part with a kidney.
View on Reddit #73604952

PraxisOG@reddit

The numbers I'm seeing for 32b models are \~18t/s on m4 max(546GB/s) vs 55t/s on 3090(936GB/s) with 2x tensor parallel. So about 3x faster, maybe 4x faster with 4xTP.
View on Reddit #73584645

Ill_Barber8709@reddit

Your numbers are wrong. 30B is an MoE, not a dense model. Qwen3-32B MLX 4Bit runs at 23tps on the same Mac.
View on Reddit #73585279

PraxisOG@reddit

32b is a dense model. It’s interesting our numbers are different, I used performance numbers with qwq so maybe that’s it?
View on Reddit #73586448

Ill_Barber8709@reddit

I'm using MLX version, which is known to be 20% faster than GGUF. I have 65tps on the 30B version, which is an MoE, not a dense model. 23tps on 32B Qwen3
View on Reddit #73589791

No-Refrigerator-1672@reddit

I've expanded my take with real numbers in [this reply](https://www.reddit.com/r/LocalLLaMA/comments/1pnfaqo/comment/nu7h889/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).
View on Reddit #73584992

CheatCodesOfLife@reddit

>Price a desktop with 128GB of 546GB/s memory compared to Mac Studio M4 Max. 4xMI50's (1TB/s each)
View on Reddit #73634359

Ill_Barber8709@reddit

Price?
View on Reddit #73639794

CheatCodesOfLife@reddit

Depends on the area, like $300 each now I think? They're on Aliexpress. Don't buy 'em unless you're happy to spend a day setting them up, fucking around with rocm setup, etc. They're not plug and play like Nvidia or a Mac. They are much faster than a Mac for big dense models.
View on Reddit #73640382

mi_throwaway3@reddit

That M4 Mac with 128gb is like 5k. The 5090 is going to eat up 2500 and the memory another $1350 (yikes to the ram market). You've still got enough money to round out the rest of the system. It might be slightly more, but it will be easily 2x as fast. Nobody cares about power efficiency in a workstation.
View on Reddit #73615574

Successful_Tap_3655@reddit

The m4 Mac does better in most use case over 5090 
View on Reddit #73632836

Ill_Barber8709@reddit

> That M4 Mac with 128gb is like 5k M4 Max Mac Studio with 128GB is $3500. > You've still got enough money to round out the rest of the system. It might be slightly more, but it will be easily 2x as fast. How many RTX 5090 do you need to get 128GB of VRAM? You're saying shit. > Nobody cares about power efficiency in a workstation. Speak for yourself. > You're fighting physics and computation. There's no magic formula that gets apple free matrix operations. I don't even know what this is supposed to mean.
View on Reddit #73629254

PraxisOG@reddit

You could probably throw together 4x AMD V620(32gb@512GB/s) on an EATX x299 board for $3000 off of ebay. It won't have drivers for nearly as long, suck back way more power, and would sound like a jet engine with the blower fans on those server cards, but would train faster. Maybe I'm biased, my rig is basically that but half the price cause I got a crazy deal on the gpus :P
View on Reddit #73583420

LocoMod@reddit

Yea but fitting gpt-oss120b in a loaded Macbook is better than not running it at all in my RTX5090.
View on Reddit #73604879

No-Refrigerator-1672@reddit

A "loaded macbook" that can run 120B at all will cost you like $3000. For that money you can assemble a PC that not only will load the same 120B model completely into GPUs (4-bit quantized, of course), but will also run it multiple times faster for a single request, and orders of magnitude faster if you have an agentic workload that can take advantage of parallel processing.
View on Reddit #73620897

egomarker@reddit

So he is talking about a laptop and you suggest to get a wardrobe of GPUs instead.
View on Reddit #73635249

No-Refrigerator-1672@reddit

So the whole post is about "macbook is better than a workstation". A "wardrobe of GPUs" is, in fact, a workstation, and fits this particular conversation perfectly.
View on Reddit #73637027

egomarker@reddit

You are taking a meme too seriously. 
View on Reddit #73641888

LocoMod@reddit

Really? Drop the parts list. Let’s validate that.
View on Reddit #73633575

No-Refrigerator-1672@reddit

4x RTX 3080 20GB will cost $2000 including tax and shipping. $1000 for a pc case, motherboard, cpu, ram and psu is totally enough, you don't need any top-of-the-line components for that. The performance of this setup will be 10x of M3 Ultra in PP, and some 2x or more for TG for single request.
View on Reddit #73633734
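Whether a quantized model fits in a given pool of VRAM can be estimated from parameter count and bits per weight. The sketch below counts weights only; KV cache, activations and runtime overhead are not modeled, and the flat 4 bits per weight is an assumption (real quant formats carry extra scale data).

```python
import math

# Weights-only memory estimate. KV cache, activations and framework
# overhead are deliberately left out, so treat results as lower bounds.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB."""
    return params_billion * bits_per_weight / 8


def gpus_needed(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Number of cards needed to reach a given amount of memory."""
    return math.ceil(total_gb / vram_per_gpu_gb)


model_gb = weight_gb(120, 4)        # 120B model at 4 bits per weight
pool_gb = 4 * 20                    # four 20GB cards
headroom_gb = pool_gb - model_gb    # left for KV cache and overhead

print(model_gb, headroom_gb)        # 60.0 20.0
print(gpus_needed(128, 32))         # 4  (32GB cards to reach 128GB)
print(gpus_needed(512, 24))         # 22 (24GB cards to reach 512GB)
print(gpus_needed(512, 32))         # 16 (32GB cards to reach 512GB)
```

On this estimate a 4-bit 120B model leaves about 20GB of headroom on four 20GB cards, and matching 512GB takes 16 cards at 32GB each or 22 at 24GB each.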

LocoMod@reddit

Nah. I’ll take the MacBook Pro rather than some jank modded 20GB 3080’s.
View on Reddit #73634140

No-Refrigerator-1672@reddit

Sure, if that's your preference. Just don't go around the internet claiming that a macbook is faster, or that an equivalently-priced PC can't fit same-sized models.
View on Reddit #73634300

LocoMod@reddit

No one claimed a MacBook is faster. I use both. My RTX5090 gets used quite often for embeddings models cause that’s about the only useful thing one can run with such a small pool of memory. There’s no usable LLM for any serious work that will fit in one. My point was that it’s better to run larger more capable models than not at all. I can fit gpt-oss-120b in my M3 and it’s almost useable as a daily driver. Almost.
View on Reddit #73634497

No-Refrigerator-1672@reddit

The original meme OP posted is claiming that macs are faster, I'm referring to that.
View on Reddit #73634562

LocoMod@reddit

OP is referring to people building PC frankenbuilds and doing CPU/RAM offloading to squeeze large models into the shared pool of resources. At that point you’re not leveraging the speed of CUDA. Once you leave the GPU it’s a different ballgame. In these scenarios a high end Mac will outperform a PC with offloading.
View on Reddit #73634723

Turbulent_Pin7635@reddit

M3 Ultra owner here. The only downside I see on Mac is video generation. Being capable of getting full models running on it is amazing! The speed and prompt loading times are nothing truly crazy slow. It is ok, especially when it is running with a fraction of the power, NOT A SINGLE NOISE or heat issue. Also, it is important to say that even without CUDA (a major downgrade, I know) things are getting better for Metal. My doubt now is whether I buy a second one to get to the sweet spot of 1TB of RAM, wait for the next Ultra, or invest in a minimum machine with a single 6000 Pro to generate videos + images (I accept configuration suggestions for the last one).
View on Reddit #73581741

YouKilledApollo@reddit

\> NOT A SINGLE NOISE or heat issue Is this really an issue for people? I have my workstation with a RTX Pro 6000 right next to me, I just ran a benchmark with lmstudio-community/GLM-4.5-Air-GGUF (glm4moe 106B.A12B Q4\_K) and even as the GPU is hitting max temperatures of \~72°C, I can barely hear it. For configuration suggestions, you meant like hardware? It pairs nicely with Threadripper 9970X and fast I/O :)
View on Reddit #74128752

Turbulent_Pin7635@reddit

Thx. I have mental issues, cacophony can drive me crazy. I know that the 6000 is more silent than the 3090/4090/5090. Thx for Threadripper. What I mean is that I want a minimum (€€€) body to run the 6000 without spending that much money.
View on Reddit #74136288

YouKilledApollo@reddit

> What I mean is that I want a minimum (€€€) body to run the 6000 without spending that much money. To be honest, if you're just doing LLM inference, the other hardware won't matter much. You could be on PCIe 4, on AM4 with DDR4 and you won't notice much difference compared with everything on Gen 5 or even moving to sTR5 (I would know, literally just did this move some weeks ago). Just make sure the PSU is good, and you have sufficient cooling, because this beast generates heat like no other GPU I've had before.
View on Reddit #74137146

Turbulent_Pin7635@reddit

Great advice, thx my friend! I want the 6000 for videos/images. For text inference I'm using the M3 Ultra.
View on Reddit #74139603

ayu-ya@reddit

How bad is the video gen speed? Something like the 14B WAN, 720p 5s? I'm planning to buy a Mac Studio in the future mostly to run LLMs and I heard it's horrible for videos, but is it 'takes an hour' bad or 'will overheat, explode and not gen anything in the end' bad?
View on Reddit #73582665

CryptoCryst828282@reddit

Prompt times not slow? Are you kidding me i tried running GLM 4.5 Air on my M3 and it was taking 2-3 min / prompt. Sure if you are using it as a chat bot its not bad, want to use it for anything major they are useless. I have a set of 5060ti's that smoke it in every way and cost me 1/2 as much.
View on Reddit #73805177

Turbulent_Pin7635@reddit

It will take 15 minutes for things a 4090 would take 2-3 minutes to do. I have never seen my Mac Studio emit a single noise or heat. Lol
View on Reddit #73596444

ayu-ya@reddit

Some people told me it would really take over an hour for one animation, but if they can just keep mulling it over with no issues... I can start the gen, walk the dogs, come back and it's done haha I used to run dense models with heavy CPU offloading when getting into locals, so time doesn't scare me as long as the hardware doesn't suffer too much 🥹
View on Reddit #73596874

Turbulent_Pin7635@reddit

I hate apple smartphones. This is my first apple. But, I need to say. The thing is a tank. I would say military-grade quality. It is built to last. In my lab there is one that is 20 years old, still working, and another one 11 years old that looks like it just popped out of the box. Mine is always under 480 GB+ RAM or 100% CPU use (bioinformatics) and it barely sweats. Don't evaluate apple PCs by apple disposable gimmicks, they aren't the same. Bonus note, you have full control over it, not a single headache over drivers.
View on Reddit #73597692

wittlewayne@reddit

I sold 3 of my old MBPs to buy 1 MBP with an M1 chip..... I feel your pain also
View on Reddit #74097093

Whispering-Depths@reddit

suck my rtx pro 6k 96gb and 192gb ram lol tell me a fucking apple product is better off
View on Reddit #73615670

waescher@reddit

OK, run Qwen3 Coder 480b Q8
View on Reddit #73635130

Whispering-Depths@reddit

lemme know when you can do that on a mac.
View on Reddit #73671637

waescher@reddit

nearly a year since Apple released the M3 Ultra
View on Reddit #73862080

Whispering-Depths@reddit

TIL you can buy a home mac with 512GB of basically vram(?) - it's half the VRAM speed of an rtx pro 6k but the fuck lol that's still insane? for $10k to $16k? Not bad at all. I wonder if you can get one of those and run linux on it.
View on Reddit #73883518

waescher@reddit

Might not be the perfect LLM device but a memory rich one. And one that idles at under 10 and maxes out under 270 Watts while staying silent. There are those freaks (in the best possible way) at Asahi Linux that reverse engineered the Mac drivers and rebuilt them for Linux. I actually run a MacBook M1 Max on Asahi Fedora and it runs great. Unfortunately they only cover the M1 and M2 family so far. But then I guess you’ll lose MLX support which is a boost in model performance on Mac.
View on Reddit #73915707

Whispering-Depths@reddit

my pro-6k can go at 450 watts and it's still silent :D I wonder if mac has stuff that automatically quantizes the model during runtime? That would suck
View on Reddit #73939408

waescher@reddit

I know, these are amazing. Would absolutely love to stack some.
View on Reddit #74003908

waescher@reddit

Oh, two things here: Apple added support for stacking multiple Macs with „RDMA over Thunderbolt“ lately so you could multiply these 512GB. https://www.jeffgeerling.com/blog/2025/15-tb-vram-on-mac-studio-rdma-over-thunderbolt-5 And the next chip generation M5 is expected to bring extra neural accelerators within the GPUs https://9to5mac.com/2025/11/20/apple-shows-how-much-faster-the-m5-runs-local-llms-compared-to-the-m4/
View on Reddit #73915929

ForsookComparison@reddit (OP)

Easy tiger
View on Reddit #73616226

FullOf_Bad_Ideas@reddit

I have 7t/s TG and 140 t/s at 60k ctx with Devstral 2 123B 2.5bpw exl3 (it seems like quality is reasonable thanks to EXL3 quantization but I am not 100% sure yet). Can Mac do that?
View on Reddit #73589100

-dysangel-@reddit

iirc my M3 Ultra was getting 6-7tps at q4, but the output was garbage compared to GLM, so I just deleted it.
View on Reddit #73969662

FullOf_Bad_Ideas@reddit

Nice, this 6-7 tps is once you loaded it up with context already, right?
View on Reddit #73989444

-dysangel-@reddit

no probably not, I deleted it very quickly
View on Reddit #73989779

Gringe8@reddit

It really depends on what you're trying to do. MacBooks work ok on MOE models, but dense models not so much. My 5090+4080 pc is much faster with 70b models than what you can do with macs.
View on Reddit #73584467

-dysangel-@reddit

What 70b model is worth using? I've never found one that was good at coding.
View on Reddit #73968753

Gringe8@reddit

Thats why it depends on what youre trying to do. I use it for roleplaying and haven't found any of the moe models better than the 70b model i use. https://huggingface.co/Steelskull/L3.3-Shakudo-70b Id imagine the newer moe models are better for coding.
View on Reddit #73969975

Successful_Tap_3655@reddit

Except you can’t do high context but sure 
View on Reddit #73629269

Longjumping-Boot1886@reddit

and for that we have M5.
View on Reddit #73599671

getmevodka@reddit

Yes, i can run a qwen3 235b moe in q6_xl and its really nice for the expense i made. For comfy with qwen image it still performs but my old 3090 runs laps around it while already being downvolted to 245watts xD
View on Reddit #73585933

Artistic_Unit_5570@reddit

I have a MacBook Pro M4 Pro; the MLX models are ridiculously optimized, using the GPU at absolute 100%.
View on Reddit #73585567

getmevodka@reddit

I have a m3 ultra 🤣🤣🤣 its nice... Lets stay at that. Hehehehehehe
View on Reddit #73585653

Pale_Reputation_511@reddit

I have M4 Max and yeah, MLX its the way
View on Reddit #73586078

getmevodka@reddit

I hope they offer a m5 max with 256gb so that i can travel with my fav local model. Would be a dream come true
View on Reddit #73586215

-dysangel-@reddit

yeah the M5 era is going to get juicy. Especially with linear attention models
View on Reddit #73969876

__no_author__@reddit

Just remember how much more money they paid than you.
View on Reddit #73909583

shokuninstudio@reddit

You just need to download RAM Doubler. Install two copies of it and your RAM will quadruple. https://preview.redd.it/86like0z1f7g1.jpeg?width=200&format=pjpg&auto=webp&s=7592a47f02d8b2a025e37f1cad502be8604245d4
View on Reddit #73582129

aaronsb@reddit

Ran out of disk space installing more than four copies of ram doubler. Can I use Disk Doubler? https://preview.redd.it/3mra854h4f7g1.png?width=1136&format=png&auto=webp&s=397e3889c6435e80fcf8de301ea7013f6f1821a1
View on Reddit #73583129

TokenRingAI@reddit

Hey, joke all you want, but Stacker was legit, I would have never survived the 90s without stacker and the plethora of Adaptec controllers and bad sector disk drives I pulled out of the dumpsters of silicon valley.
View on Reddit #73586545

Real-Technician831@reddit

Stacker actually made your computer run faster as HDD was so slow that reading the compressed file an unpacking it was way faster than reading same file from raw disk.
View on Reddit #73863912

Korenchkin12@reddit

Same here,doublespace and qemm if i remember the name correctly...today's noobs would never understand,why you need those drivers in high memory
View on Reddit #73683623

Pishnagambo@reddit

Yeah stacker was great 😃 
View on Reddit #73597074

fuzzy-thoughts345@reddit

It was great. I had a 20MB Seagate with mostly text, so it really compressed well.
View on Reddit #73614610

_bones__@reddit

Are you me? Of course, any time you got a compressed file it took up twice the size.
View on Reddit #73617477

shokuninstudio@reddit

The doubleception has arrived before GTA VI.
View on Reddit #73584661

ThomasterXXL@reddit

If you want more RAM, all you need is EWE.
View on Reddit #73727854

mikael110@reddit

Fun fact: unlike the whole "Download more RAM" meme, RAM Doubler software was a real thing back in those days, and it did actually increase how much stuff you could fit in RAM. The way it worked was by dynamically compressing and decompressing data as it came in and out of RAM. Nowadays RAM compression is built into basically all modern operating systems so it would no longer do anything, but back then it made a real difference.
View on Reddit #73586488
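The compress-on-the-way-in, decompress-on-the-way-out idea described above can be illustrated with a toy example. This is only a sketch using Python's `zlib` on 4 KiB buffers; it is not how RAM Doubler or any operating system's memory compressor is actually implemented, and the "pages" here are invented test data.

```python
import os
import zlib

PAGE_SIZE = 4096  # bytes; a common page size, used here for illustration


def compress_page(page: bytes) -> bytes:
    """Compress one page before parking it in memory."""
    return zlib.compress(page)


def restore_page(stored: bytes) -> bytes:
    """Decompress a parked page when it is needed again."""
    return zlib.decompress(stored)


zero_page = bytes(PAGE_SIZE)                       # untouched memory
text_page = (b"local llama " * 400)[:PAGE_SIZE]    # repetitive data
random_page = os.urandom(PAGE_SIZE)                # incompressible data

for name, page in [("zeros", zero_page),
                   ("text", text_page),
                   ("random", random_page)]:
    stored = compress_page(page)
    assert restore_page(stored) == page            # lossless round trip
    print(f"{name}: {len(page)} -> {len(stored)} bytes")
```

Zero-filled and repetitive pages shrink to a small fraction of their size, while random data does not shrink at all, so the effective gain depends entirely on what is sitting in memory. The CPU time spent on each round trip is the cost side of the trade.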

TokenRingAI@reddit

Some people reminisce about Woodstock, I reminisce about waiting in line at Fry's electronics to get Windows 95 at 12:01AM The kids will never understand.
View on Reddit #73587151

joelasmussen@reddit

That's rad. I wonder if we'll ever get a modern equivalent. Lowly consumers getting something exciting.
View on Reddit #73680953

jaysedai@reddit

I remember, I stood outside with a sign with a Mac icon and the words "Windows 95, welcome to 1984".
View on Reddit #73662654

TokenRingAI@reddit

Which Fry's?
View on Reddit #73674725

tehfrod@reddit

When I got engaged we were trying to set a date and August 24th came up. I said, "Perfect! I'll never forget our anniversary. It's the day Windows 95 was released!" We're divorced now.
View on Reddit #73610844

shokuninstudio@reddit

I was waiting in line at midnight for the release of Mac OS X. Club goers passed by and asked what we were doing. When we explained why they laughed at us like we were all dorks.
View on Reddit #73590896

mehum@reddit

Sorry to break the news mate, but they weren’t wrong!
View on Reddit #73595628

shokuninstudio@reddit

"No I'm here waiting for the choppa" https://i.redd.it/rjzk88xl1g7g1.gif
View on Reddit #73596480

Alternative-Sea-1095@reddit

It really didn't do any ram compression, windows 95 did that. Yes, windows 95 did ram compression and those "ram doubler" just used placebo and doubling the page size by 2x. That's it...
View on Reddit #73602365

mikael110@reddit

The original [Ram Doubler](https://apple.fandom.com/wiki/RAM_Doubler) wasn't for Windows 95 though, it was for classic Mac OS and Windows 3.1. Neither of which had RAM compression built in. You might be confusing Ram Doubler with [SoftRAM](https://en.wikipedia.org/wiki/SoftRAM), which was indeed just a scam. That was developed by an entirely different company though.
View on Reddit #73603229

pixel_of_moral_decay@reddit

Yup. Ram Doubler was the real deal. Came at the cost of a little cpu, but that was a point in time most systems were more memory bound than cpu bound. 4-16mb of memory but 66-200mhz CPU. Taking a couple percent to add memory was a huge win, compared to virtual memory on slow 5200 rpm ide hard drives.
View on Reddit #73618724

Coldaine@reddit

Oh yeah, having 4mb of ram was a ridiculous constraint bro. You don't understand, I NEEDED to be able to run SCURK. https://www.mobygames.com/game/2422/simcity-2000-urban-renewal-kit/
View on Reddit #73642324

Alternative-Sea-1095@reddit

I was! Thank you.
View on Reddit #73603313

SilentLennie@reddit

Linux has ZRAM, which provides a compressed RAM disk that you can put swap on, effectively compressing RAM.
View on Reddit #73631948

FurrySkeleton@reddit

Disk doubler, too! It really did work.
View on Reddit #73602823

audioen@reddit

Yeah. Doing both disk and ram compression today, routinely. Even on servers.
View on Reddit #73592209

pscoutou@reddit

It's 2025, we have the cloud now. https://downloadmoreram.com
View on Reddit #73630103

YoloSwag4Jesus420fgt@reddit

Didn't this actually work by compressing the ram or something? I know it wasn't 2x but it was better if I recall than nothing? I swear I saw a yt vid on this once
View on Reddit #73610445

shokuninstudio@reddit

It slowed the computer down anyway because the CPU ran at less than 50Mhz and only had one core. It had to do on the fly compression and decompression while running your apps and OS.
View on Reddit #73618325

astrange@reddit

Memory compression is a lot more than 2x effective but mostly because memory is mostly zeroes.
View on Reddit #73614428
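A quick way to see why zero-heavy memory compresses far better than 2x is to compress a fake page. This is only a sketch: `zlib` stands in for whatever compressor an OS actually uses, and the 4096-byte page with 256 bytes of real data is a made-up example, not a measurement of real memory.

```python
import zlib

PAGE_SIZE = 4096  # bytes; a common page size, assumed here for illustration


def compression_ratio(page: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(page) / len(zlib.compress(page))


# Hypothetical page: 256 bytes of non-zero data followed by zero padding.
payload = bytes(range(256))
sparse_page = payload + bytes(PAGE_SIZE - len(payload))

# An all-zero page, the best case for any compressor.
zero_page = bytes(PAGE_SIZE)

sparse_ratio = compression_ratio(sparse_page)
zero_ratio = compression_ratio(zero_page)
print(f"sparse page: {sparse_ratio:.1f}x, zero page: {zero_ratio:.1f}x")
```

Pages full of already-compressed or random data would land near 1x, so the overall gain depends heavily on what is actually in memory.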

The_frozen_one@reddit

For real though I can run the deepest of seeks [since I downloaded more RAM to my CPU.](https://downloadmoreram.com/)
View on Reddit #73616821

grimjim@reddit

Unironically, we do have a parallel for LLMs; the bitsandbytes library can perform 4-bit quantization while loading a model.
View on Reddit #73613736
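For anyone curious what 4-bit quantization does to the weights, here is a toy version in plain Python. It is not the bitsandbytes implementation (that library uses its own 4-bit formats such as NF4 and packs two codes per byte); this is a simplified absmax scheme with made-up weights, just to show the idea of storing small integers plus one scale per block.

```python
from typing import List, Tuple

LEVELS = 7  # signed 4-bit range used here: integers in [-7, 7]


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Map a block of floats to 4-bit signed integers plus one scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / LEVELS if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate floats from the codes and the block scale."""
    return [c * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, 0.00, -0.08]  # made-up weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, f"max error {max_error:.4f}")
```

The rounding error per weight is bounded by half the block scale, which is why smaller blocks (and smarter code books) tend to preserve quality better.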

AuspiciousApple@reddit

I already have it. Just send me your RAM and I'll send double back
View on Reddit #73604341

shokuninstudio@reddit

That sounds like a free buttcoin scam.
View on Reddit #73606065

Trick-Force11@reddit

can i install 50 copies for 1125899906842624 times more ram or is there a limit
View on Reddit #73585945

shokuninstudio@reddit

The AI powered activation server will detect your pirate serial numbers and ask Copilot to delete all your files and System32 directory.
View on Reddit #73586224

Trick-Force11@reddit

worth a shot
View on Reddit #73586434

Background_Essay6429@reddit

What Mac configuration gets you the best tokens/sec?
View on Reddit #73837583

Little-Put6364@reddit

I'm always more concerned about quality over speed. Sure speed is nice, but throwing more compute at the model won't make it magically better at answering
View on Reddit #73836711

Southern_Sun_2106@reddit

The future of local home AI is a small box on the table.
View on Reddit #73621609

CryptoCryst828282@reddit

Future of home ai will be cloud, we will resist but it will be cloud. The hardware will price us all out eventually.
View on Reddit #73805387

Southern_Sun_2106@reddit

I specifically said, the future of \*local\* home AI... will be a small box on the table. Sure, there will be lots of things provided from the cloud. However... history has shown that people still prefer to actually 'own' their stuff. There was a time in the 70's when people thought personal computing would be done on terminals connected to the huge machines of the time stationed elsewhere, and look what happened (spoiler: a personal computer in all of our pockets, plus portable laptops, etc). We have a natural drive towards distributed processing, just like we don't have a hive mind, etc. It's who we are. We go the cloud way only if that's the only option, but when other options are available, we go for distributed processing.
View on Reddit #73808839

CryptoCryst828282@reddit

I know you did... I am saying the future of ALL AI will be cloud. History has shown that it really doesn't matter what you want, cost dictates everything. At some point home GPUs will not run the AI models, and if you think you will be able to afford a datacenter TPU or whatever they run on in 10 years, I have a bridge to sell you. In the 70s not EVERYTHING was going to a service, now it all is. You will own nothing, it will all be a license, and sadly, not a single person will do
View on Reddit #73820047

Southern_Sun_2106@reddit

History also showed that we are not running excel and word on supercomputers via terminals. We opt for not the most powerful, but more convenient solutions. Anyway, this discussion is pointless, because I won't change your mind, and you won't change mine; and none of our theories are provable at the moment, just speculation. But thank you for engaging, anyway.
View on Reddit #73836294

Ytijhdoz54@reddit

The mac mini’s are a hell of a value starting out but the lack of Cuda at least for me makes it useless for anything serious.
View on Reddit #73580936

iMrParker@reddit

I'm willing to bet most people on this sub haven't ventured past inference so posts like this are r/iamverysmart
View on Reddit #73581583

egomarker@reddit

You can't train anything serious without a wardrobe of gpus anyway. Might as well just rent.
View on Reddit #73582125

RedParaglider@reddit

EXACTLY.. that's why bang for the buck a 128gb strix halo was my goto even though I could have afforded a spark or whatever. I'm just going to use this for inference, local testing, and enrichment processes. If I get really serious about training or whatever renting for a short span is a much better option.
View on Reddit #73582864

Fi3nd7@reddit

100%. Buying your own hardware while these companies are literally burning money is insane. Renting is so insanely subsidized right now. It's not worth buying. One could argue that now is the time to buy, before hardware gets insanely expensive and everyone pulls out of consumer GPUs. But honestly, if I was really, really serious about running intense stuff locally, I'd probably drop 10-15k on a real AI machine.
View on Reddit #73726628

RedParaglider@reddit

Valid. Honestly built this project, so I had a good reason for a local inference box. For anything other than enrichment type stuff I use big models. [https://github.com/vmlinuzx/llmc/](https://github.com/vmlinuzx/llmc/) Local stuff is a shit ton of fun to me to do learning on, and also build systems which require engineering imagination to work perfectly under constraints.
View on Reddit #73728263

SamWest98@reddit

Yeah. People over here comparing $10k macs to $300k nvda rigs bc they heard about cuda on twitter
View on Reddit #73620517

FullOf_Bad_Ideas@reddit

I got my finetune featured in a LLM safety paper from Stanford/Berkeley, it was trained on single local 3090 Ti and it was actually in the top 3 for open weight models in their eval - I think my dataset was simply well fit for their benchmark. >However, on larger base models the best fine-tuning methods are able to improve rule-following, such as Qwen1.5 72B Chat, Yi-34B-200K-AEZAKMI-v2 *(that's my finetune)*, and Tulu-2 70B (fine-tuned from Llama-2 70B), among others as shown in Appendix B. https://arxiv.org/abs/2311.04235
View on Reddit #73588345

iMrParker@reddit

If you're doing base model training then yes. But if you're fine tuning 7b, 12b models you can get away with most consumer Nvidia GPUs. The same fine tuning probably takes 5 or 10 times longer with MLX-lm
View on Reddit #73583064

LocoMod@reddit

Congratulations to the 3 people on here training models from scratch that no one will ever use. For everyone else, MLX can do everything, including fine tuning.
View on Reddit #73605298

iMrParker@reddit

Who said people are training models for mass users? Most model training and fine-tuning is done for personal, college, or internal enterprise reasons. MLX-LM can do \*some\* of the things that CUDA-accelerated libraries like Unsloth/PEFT/torchtune/TensorFlow can do, but WAY slower. It's disingenuous for you to pretend that no one does this and that MLX is just as capable or performant
View on Reddit #73606579

BumbleSlob@reddit

lol why does everyone have to participate in fine tuning or training exactly? What a dumb ass gatekeeping hot take.  This would be like a carpentry sub trying to pretend that only REAL carpenters build their own saws and tools from scratch. In other words, you sound like an idiot. 
View on Reddit #73583977

iMrParker@reddit

Point to me where I made any gatekeeping statements.  My point is that people like OP don't consider the full range of this industry / hobby when they make blanket statements about which hardware is best
View on Reddit #73584130

Monkey_1505@reddit

There is SO much you cannot do without CUDA.
View on Reddit #73584005

LocoMod@reddit

Do tell. What can't you do without CUDA? We can run inference, fine tuning, diffusion models, tts, stt, embeddings, etc without CUDA. I suppose for the 0.000000001% of the world that is training models from scratch then CUDA matters.
View on Reddit #73605164

Monkey_1505@reddit

The full version of automatic1111 with all the extensions, a whole host of txt to image software clients, many different LLM clients, txt to image merging, LLM lora training...off the top of my head. I'm sure the list is like a page long. It's a lot of stuff.
View on Reddit #73606500

LocoMod@reddit

That's an implementation detail of automatic1111 and nothing to do with what CUDA is or isnt capable of vs other platforms. I can use the same image gen models youre using on Apple hardware. Sure the generation times are slower, but it can still be done without CUDA. You're still using automatic1111? The platform dinosaurs were using back when SD1 was popular?
View on Reddit #73610214

Monkey_1505@reddit

The entire advantage of using CUDA is that it makes it easier for developers, hence why yes, it's used in a lot of software. So, what software do you use to run z-image on mac?
View on Reddit #73629282

_VirtualCosmos_@reddit

And not just CUDA. Blackwell hardware is pretty much required for full FP8 training, at least for now. But I have hopes for ROCm, it's open source and promising.
View on Reddit #73605659

ai-christianson@reddit

MLX?
View on Reddit #73581532

Bogaigh@reddit

Right…“my loud, hot, expensive Linux box is faster in benchmarks and images, therefore your quiet unified-memory machine that lets you think deeply without friction is bad.”
View on Reddit #73692867

ForsookComparison@reddit (OP)

Reread meme for better results
View on Reddit #73710441

Specific-Goose4285@reddit

I wouldn't call myself a normie. Thing is even before the RAM shortage 128GB vRAM is crazy expensive and is attached to power hungry devices. The unified memory has the advantage that it fits and the speeds are just about good enough for certain tasks.
View on Reddit #73701552

Novel-Mechanic3448@reddit

M5 has tensor cores, its only a matter of time
View on Reddit #73688146

vdiallonort@reddit

I would be really happy to be a "normie" if I had the money to be ;-) I have a MacBook Pro M3 with 24 GB from work. You need to spend way more than that (which was already expensive for my taste), and the speed is disappointing. In my dreams there is a cheap M5 Ultra, in my dreams....
View on Reddit #73686721

egomarker@reddit

https://preview.redd.it/6ay76woq2f7g1.jpeg?width=1102&format=pjpg&auto=webp&s=bb85be5ff527e800efb201de0a94e997c86ce4f6 Hop in kids
View on Reddit #73582416

msc1@reddit

I'd get in that van!
View on Reddit #73599110

FaceDeer@reddit

I think that van would backfire on kidnappers, they'd find themselves instantly surrounded by a mob of ravenous savages tearing the van apart to get at the RAM in there.
View on Reddit #73602281

Latter_Virus7510@reddit

Absolutely! 😅
View on Reddit #73661493

ashirviskas@reddit

With screwdrivers!
View on Reddit #73626065

CasualtyOfCausality@reddit

Yeah, this is akin to laying down in an anthill and ant-whispering that you are actually covered in delicious honey.
View on Reddit #73612735

Important-Novel1546@reddit

oh, without a second doubt
View on Reddit #73608985

Chrono978@reddit

There is always a chance they’re saying the truth and with those prices, it’s a worthy risk.
View on Reddit #73632112

ThisWillPass@reddit

You're not my daddy…
View on Reddit #73612363

Kirigaya_Mitsuru@reddit

I can't afford a PC to play the latest games that come out, even EU5 is laggy sometimes, let alone a local LLM. \*sigh\* I'm pro AI, but I think the future is local and not cloud, so I'd hop into the van too!
View on Reddit #73600993

nachoaverageplayer@reddit

I upgraded my M1 pro with 16GB ram to an M4 max with 48GB for this very reason. It’s just so performant at anything I throw at it and portable that it’s worth the apple tax imo.
View on Reddit #73658021

clduab11@reddit

People finish assembling their perfect workstation? 😬
View on Reddit #73644506

juggarjew@reddit

Normies with macs dont have lots of RAM, they have an 8-32GB MAC lol
View on Reddit #73641676

qwen_next_gguf_when@reddit

Currently , you just need a few 3090s and as much RAM as possible.
View on Reddit #73581342

10minOfNamingMyAcc@reddit

I assume you're talking about DDR5? I'm struggling with 3600MHz DDR4... (64 GB VRAM but still, can barely run a 70B model at Q4\_K\_M gguf at decent speed for fast inference \~30-60 seconds below 16k even... Is koboldcpp bottlenecking me?) I should've upgraded earlier, but went for a new monitor...
View on Reddit #73597569

Lissanro@reddit

A 70B dense model is quite slow if it cannot fully fit in VRAM... For example, Kimi K2 is a 1T model but has just 32B active parameters, so in the case of CPU-only inference it will be faster than a 70B model. And based on the 3600 MHz speed, you likely have dual channel RAM, which is almost four times slower than 8-channel DDR4 3200MHz RAM. In any case, to run a model efficiently its context cache needs to be entirely in VRAM. Then prompt processing will be done on the GPUs and text generation speed will be much faster too.
View on Reddit #73635189
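The "almost four times" figure can be sanity-checked with theoretical peak numbers. A rough sketch, assuming standard 64-bit (8-byte) DDR4 channels and treating the quoted speeds as DDR4-3600 and DDR4-3200 transfer rates; real-world throughput will be lower than these peaks.

```python
def ddr_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers * bus_bytes / 1000


dual_channel = ddr_bandwidth_gbs(channels=2, mega_transfers=3600)
eight_channel = ddr_bandwidth_gbs(channels=8, mega_transfers=3200)
ratio = eight_channel / dual_channel
print(f"{dual_channel:.1f} GB/s vs {eight_channel:.1f} GB/s ({ratio:.2f}x)")
```

Under those assumptions the gap comes out to roughly 3.5x, which is in line with the comment above.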

Lissanro@reddit

Well, I seem to satisfy the requirements. I have four 3090s, which are sufficient to hold 160K context at Q8 with four full layers of the Kimi K2 IQ4 quant (or alternatively could hold 256K context without full layers), and 1 TB of RAM. Seems to be sufficient for now. Good thing I purchased it at the beginning of this year while prices were good... otherwise at current RAM prices upgrading would be tough.
View on Reddit #73634851

Wrong-Historian@reddit

> a few 3090s Okay, cool > and as much RAM as possible. Whaaaaaaaaaaa
View on Reddit #73581431

_VirtualCosmos_@reddit

Nice, that would be +1300 euros per used 3090 and +1000 euros per 64 GB of RAM lol
View on Reddit #73605897

ShameDecent@reddit

A few 3090s won't be cool at all, they would substitute for a room heater.
View on Reddit #73596307

RedParaglider@reddit

It's not enough to be able to drive a phat ass girl around town and show her off, you gotta be able to lift her into the truck. AKA ram :D.
View on Reddit #73583154

LocoMod@reddit

Who needs two kidneys amirite?
View on Reddit #73605420

jwr@reddit

Can relate, I am the normie. I own a M4 Max (64GB) laptop and I kept wondering why people have to go to such lengths and expenses to run those 30B models, until I realized the reasons.
View on Reddit #73633902

Tenkinn@reddit

Not sure about the significantly better speeds. I guess for the same price you get a faster setup with Nvidia GPUs, but for sure it's WAY EASIER to buy, install, and set up, costs less to run, doesn't consume a billion watts or replace your heater, makes way less noise, and takes way less space
View on Reddit #73633419

PMvE_NL@reddit

What? You can do research in a day and assemble it in one day. I would say... Skill issue
View on Reddit #73631600

One_of_Won@reddit

This is so misleading. My dual 3090 setup blows my Mac mini out of the water
View on Reddit #73582188

1Soundwave3@reddit

Okay, it's good that you have both because I have some questions. How much vram do you get out of your dual 3090 setup? Also, do you really need that? Because from what I've seen gpt oss 20b is the first model that I can call decent and I can run it on my gaming PC no problem. And it's a MOE one. So I'm just thinking: MOE sounds like the biggest bang for the buck. Mac mini sounds like the biggest bang for the buck as well. If you combine them and hope that there will be better MOE models - it seems like a good choice for a small local setup, that does pretty much anything you need locally if you can't use a cloud model for some reason.
View on Reddit #73628039

BusRevolutionary9893@reddit

It makes no sense. If it said something about being able to run larger models and left out normies, that might work. Normies don't have 512 GB of unified memory. 
View on Reddit #73613872

20ol@reddit

In price or energy use?
View on Reddit #73586829

PerfectReflection155@reddit

People are affording the latest Mac books? On credit card right?
View on Reddit #73628027

txdv@reddit

Normies are not dropping 10k on a mac with 512gb of ram
View on Reddit #73627761

RabbitEater2@reddit

The only thing worse than slow generation is slow prompt processing. And at least windows can run way more AI/ML stuff if you're into that. Can't say I'm jealous tbh.
View on Reddit #73624890

ai-christianson@reddit

This is the main reason I got a MBP 128GB... well, that & mobile video editing. I say this as a long-time Linux user. I still miss Linux as a daily driver, but can't argue with the local model capability of this laptop.
View on Reddit #73581516

TechnoByte_@reddit

Why not use Asahi Linux?
View on Reddit #73624331

noiserr@reddit

> I still miss Linux as a daily driver Strix Halo was an option.
View on Reddit #73583422

AmpEater@reddit

Same! 
View on Reddit #73581760

riceinmybelly@reddit

A second hand Mac Studio M2 96GB is super affordable and is hard to beat. The pricier beelink GTR9 Pro 128 GB is left in the dust
View on Reddit #73622426

ElephantWithBlueEyes@reddit

i gave up on local LLMs
View on Reddit #73619411

Ok-Future4532@reddit

This can't be serious, right? This can't be true. Is it because of the bottlenecks related to using multiple GPUs? Is there something else I'm missing? GDDR6/7 VRAM is so much faster than unified memory, so how can MacBooks be faster than custom multi-GPU setups?
View on Reddit #73619316

VegetableSense@reddit

https://i.redd.it/znvx4acbci7g1.gif
View on Reddit #73618533

apetersson@reddit

i have yet to decide between a \~10k mac ultra (m5/m3/m1) ? and a custom build. my impression is that "small" models could be a bit faster on a custom build but any "larger" model will quickly fall behind because a 10k GPU based build just won't be able to hold it proerly. educate me.
View on Reddit #73581322

StaysAwakeAllWeek@reddit

If you're looking at 10K you're close to affording an RTX Pro 6000, which will demolish any Mac by about 10x for any model that fits into 96GB VRAM. But if you overflow that 96GB it can fall as far as 1/4 as fast, limited by the PCIe bandwidth. If you're into gaming the Pro 6000 is also the fastest gaming GPU on earth, so there's that
View on Reddit #73582095

apetersson@reddit

thanks for the input - ok, so why should it be "10x" faster for smaller models? i'm thinking RTX Pro 6000 1.6TiB/sec mem bandwidth vs 0.8 TiB/sec on a Mac Studio Ultra should be about 2x. what am i missing?
View on Reddit #73582975
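The 2x intuition holds for the bandwidth-bound part of inference. A back-of-the-envelope sketch, assuming token generation has to read every active weight once per token and nothing else matters; the 70B model size and 4-bit quant are made-up inputs, and the bandwidth values are the rough figures quoted above.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / bytes_per_token_gb


MODEL_B = 70   # hypothetical dense model, billions of parameters
BITS = 4       # hypothetical 4-bit quant

gpu = decode_ceiling_tok_s(1600, MODEL_B, BITS)  # ~1.6 TB/s figure from the thread
mac = decode_ceiling_tok_s(800, MODEL_B, BITS)   # ~0.8 TB/s figure from the thread
print(f"GPU ceiling {gpu:.1f} tok/s, Mac ceiling {mac:.1f} tok/s, ratio {gpu / mac:.1f}x")
```

This only models generation. Prompt processing tends to be limited by compute rather than bandwidth, so any gap larger than 2x would have to come from there or from software support, not from this estimate.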

mi_throwaway3@reddit

it's not just about memory bandwidth, loading shit into memory is just one part of the equation
View on Reddit #73616057

StaysAwakeAllWeek@reddit

Mac Studio does not support data types smaller than 16 bit. By running 8 bit quants you can double your effective bandwidth and memory capacity on nvidia cards, and if you're OK with losing some output quality a 4 bit quant increases it another 2x
View on Reddit #73586324
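The capacity side of the quantization argument is simple arithmetic. A minimal sketch that counts weights only (no KV cache, activations, or runtime overhead, so real usage is higher); the 120B model size is a hypothetical example and 96 GB is the VRAM figure mentioned earlier in the thread.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8


VRAM_GB = 96     # capacity mentioned in the thread
PARAMS_B = 120   # hypothetical model size, billions of parameters

sizes = {bits: weights_gb(PARAMS_B, bits) for bits in (16, 8, 4)}
fits = {bits: size <= VRAM_GB for bits, size in sizes.items()}
for bits, size in sizes.items():
    print(f"{bits}-bit: {size:.0f} GB, fits in {VRAM_GB} GB: {fits[bits]}")
```

Each halving of bits per weight halves the footprint, which is where the "double your effective capacity" claim comes from.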

RandomCSThrowaway01@reddit

It depends on what you consider to be a larger model. Because yes, 9.5k Mac Ultra M3 has 512GB shared memory and nothing comes close to it at this price point. It's arguably the cheapest way to actually load stuff like Qwen3 480B, Deepseek and the likes. But the problem is that the larger the model and the more context you put in, the slower it goes. M3 Ultra has 800GB/s bandwidth which is decent but you are also loading a giant model. So, for instance, I probably wouldn't use it for live coding assistance. On the other hand at a 10k budget there's the 72GB RTX 5500, or you are around 1,000 off from a complete PC with a 96GB RTX Pro 6000. The latter is 1.8TB/s and also processes tokens much faster. It won't fit the largest models but it will let you use 80-120B models with a large context window at a very good speed. So it depends on your use case. If it's more of an "ask a question and wait for the response" workflow then Mac Studio makes a lot of sense as it does let you load the best model. But if you want live interactions (eg. code assistance, autocomplete etc) then I would prefer to go for a GeForce and a smaller model but at higher speed. Imho if you really want a Mac Studio with this kind of hardware I would wait until M5 Ultra is out too. Because it should have like 1.2-1.3TB/s memory bandwidth (based on the fact that base M5 beats base M4 by like 30% and Max/Ultra is just a scaled up version) and at that point you just might have both capacity and speeds to take advantage of it.
View on Reddit #73582943

StaysAwakeAllWeek@reddit

>It's arguably the cheapest way to actually load stuff like Qwen3 480B, Deepseek and the likes. It's the cheapest *reasonable* way to do it. The actual cheapest way to do it is to pick up a used Xeon Scalable server (eg Dell R740) and stick 768GB of DDR4 in it. You get 6 memory channels for ~130GB/s bandwidth per CPU, and up to 4 CPUs per node, for an all-out cost of barely $2000 (most of that being for the RAM, the CPUs are less than $50). You can even put GPUs in them to run small high speed subagent models in parallel, or upgrade to as much as 6TB of RAM. The primary downside is it will sound like 10 vacuum cleaners having an argument with 6 hairdryers. They are super cheap right now because they are right around the age where the hyperscalers liquidate them to upgrade. Pretty soon prices will probably start rising again if the AI frenzy keeps going
View on Reddit #73587267

__JockY__@reddit

> any "larger" model will quickly fall behind because a 10k GPU based build just won't be able to hold it proerly [sic] Based on this sentence alone I recommend *not* trying to understand screwdrivers and instead just buy the nice shiny Apple box. Plug in. Go brrr.
View on Reddit #73585913

holchansg@reddit

RTX XX90, is not even close.
View on Reddit #73582706

_hypochonder_@reddit

My 4x AMD MI50s 32GB work fine for me for LLM inference stuff. How much does an Apple product with 128GB of usable VRAM cost again?
View on Reddit #73588154

mi_throwaway3@reddit

it's literally 5k
View on Reddit #73615977

TokenRingAI@reddit

It's worse than that, the new iPhone has roughly the same memory bandwidth as a top-end Ryzen desktop.
View on Reddit #73584700

mi_throwaway3@reddit

you'd think apple was in here astroturfing that memory bandwidth and power consumption were the two leading concerns with LLM usage
View on Reddit #73615949

Calamero@reddit

What are they doing with all that power though? Siri can’t be it. Probably just listening and giving out social scores…
View on Reddit #73599960

ForsookComparison@reddit (OP)

Server racks would look much neater if they were just iPhone slabs and type-C cables
View on Reddit #73584820

TokenRingAI@reddit

One day OpenAI will do a public tour of their datacenter and we'll realize it's been super-intelligent monkeys doing math problems on iPhones all along
View on Reddit #73585613

ImJacksLackOfBeetus@reddit

Only thing I learned from this thread is that nobody knows what they're talking about according to somebody else, and that the old Mac vs. PC (or in this case, GPU) wars are still very much alive and kicking. lol
View on Reddit #73603387

ForsookComparison@reddit (OP)

Let us have some fun
View on Reddit #73605976

ImJacksLackOfBeetus@reddit

Don't mind me, I'm sitting here popcorn at the ready. Have at it! lol
View on Reddit #73612710

crazymonezyy@reddit

It's similar to doing a month of research to find the best android camera only for people around you to prefer their iphones for photos because they're more Instagram friendly.
View on Reddit #73608485

wh33t@reddit

Dollar for dollar + token for token ... nah Plus ... how do you upgrade a mac?
View on Reddit #73606721

a_beautiful_rhind@reddit

Just wait till you find out what you can get in 2-3 years. Their macbook is gonna look like shit, womp womp. Such is life, hardware advances.
View on Reddit #73605507

El_Danger_Badger@reddit

Honestly, I don't see the issue with running local on Mac at all. The machines happen to be almost purpose-built to run inference. Everyone started at zero two years ago with this stuff, and really, AI is the only true expert at AI. Have the biggest rig on the block, or a Camry running locally on a Mini, the end result is local first, local only. Privacy, sovereignty, some form of digital dignity, and some semblance of control in a disturbingly surveilled world. Five years from now, they will just sell boxes to deal with it all on our behalf. But however you slice it, hosting your own isn't easy and isn't cheap. So if anyone can make it work, more power to them. To quote the immortal words of, well, both east and west coast rappers, "we're all in the same gang".
View on Reddit #73603316

CheatCodesOfLife@reddit

Yeah if you're only doing sparse MoEs with a single user, get a mac.
View on Reddit #73601754

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #73601424

aeroumbria@reddit

Really depends on your use case. Macs still cannot do PyTorch development or ComfyUI well enough. And if you wanna do some gaming on the side, it is the golden age for dual GPU builds right now.
View on Reddit #73598860

jeffwadsworth@reddit

Money talks baby
View on Reddit #73598017

H0vis@reddit

Normies don't have the latest Macbook in this economy.
View on Reddit #73596437

Rockclimber88@reddit

It's because of NVIDIA's gatekeeping of VRAM and charging obscene amounts for relevant GPUs like RTX 6000 PRO with barely 96GB
View on Reddit #73593963

the-mehsigher@reddit

So it makes sense now why there are so many cool new “Free” open source models.
View on Reddit #73593340

esamueb32@reddit

Not for stable diffusion lol Macbook are so much slower
View on Reddit #73592736

holchansg@reddit

https://preview.redd.it/mpgrwbod2f7g1.png?width=1159&format=png&auto=webp&s=3bdeca312d3e4126d2628fc2d3894d7a862925b5 This is the normie one... can't get better than this... only the MX Ultra and Max have more bandwidth, and they don't have nearly as many TOPS in the NPU.
View on Reddit #73582338

Ill_Barber8709@reddit

M4 Pro has 270GB/s memory bandwidth. As far as I know, AI Max is 250GB/s
View on Reddit #73583724

getmevodka@reddit

M3 Ultra has 819GB/s, what's the point of arguing here? I don't get it
View on Reddit #73586124

holchansg@reddit

The Max/Ultra ones are fast, but a dedicated GPU is still better.
View on Reddit #73586395

getmevodka@reddit

Im thinking pro 6000 max q for christmas rn
View on Reddit #73587335

holchansg@reddit

And I'm covered in jealousy.
View on Reddit #73587862

getmevodka@reddit

😅 I live next to a university with an AI and robotics lab. The wants are very strong cause of that xD. My two 3090s would still do for some time, I guess
View on Reddit #73592429

Ill_Barber8709@reddit

M3 Ultra has barely less memory bandwidth than the RTX 3090, but MLX being 20% faster than GGUF, it performs better.
View on Reddit #73589626

holchansg@reddit

In some setups they are tied; in MoEs or multi-modal it's not even close... also the price difference is massive, the 3090 is dirt cheap.
View on Reddit #73590057

holchansg@reddit

But has less TOP/s, far less...
View on Reddit #73583898

Ill_Barber8709@reddit

And? Would you train a model on that thing? I guess not. Most of people here are using local LLMs for inference. Not training. And most people making actual model training won't use a house workstation anyway...
View on Reddit #73584518

holchansg@reddit

You do know they state TOPS at INT8, right? Meant for INFERENCE.
View on Reddit #73585314

Ill_Barber8709@reddit

Don't care. Compute power is not the bottleneck for inference. M4 Max has 546GB/s memory bandwidth.
View on Reddit #73585494

holchansg@reddit

Yes it is, though not by much. That's true only for some cases; try running some multi-modal models and see for yourself.
View on Reddit #73585873

onetimeiateaburrito@reddit

I'm just sitting here in my box-home made of poverty with an RTX 3070 and 8gb. Run some cool prompts for me, boys. Lol
View on Reddit #73592218

Bozhark@reddit

I am in this picture, twice
View on Reddit #73592026

CMDR-Bugsbunny@reddit

Yeah, I just sold my 2nd RTX A6000 from my Threadripper LLM Server. My stupid $2k refurbished MacBook Pro M2 Max with 96Gb RAM was fast enough. While 100+ T/s was cool - 30-40 T/s is still plenty fast enough and a LOT cheaper.
View on Reddit #73591730

PotentialFunny7143@reddit

https://preview.redd.it/5l0hetujmf7g1.png?width=800&format=png&auto=webp&s=4d5d922c732c5dd6b4d510d8573a2b6e99a81c36 The situation is out of control!
View on Reddit #73590717

Expensive-Paint-9490@reddit

Who cares, I am not installing a closed source OS on my personal machine.
View on Reddit #73590374

ForsookComparison@reddit (OP)

right on
View on Reddit #73590407

SamuelL421@reddit

(*Shhh, don't tell the normies, but half the fun of LocalLLaMA is getting an excuse to spend months assembling a workstation.*)
View on Reddit #73589758

ForsookComparison@reddit (OP)

*(I know)*
View on Reddit #73589890

ThatCrankyGuy@reddit

Man, fuck macs.. but also.. M-Chips.. How the hell has no one caught up to the caliber of these chips?
View on Reddit #73589670

ForsookComparison@reddit (OP)

I'm sure it's more complicated than that, but my feeble consumer understanding is that Windows-on-ARM is souring the experience and any mass-appeal that Qualcomm PC's could have - and so we keep getting these ludicrously expensive low-volume laptops that make no sense and a half-assed effort from everyone involved.
View on Reddit #73589858

CSharpSauce@reddit

My computer has been unstable as hell since I put the second 3090 in, but I don't think I could live without it now... :(
View on Reddit #73589031

tarruda@reddit

I'm far from a "normie" and had never once bought a single Apple product. But it is a fact that Apple Silicon is simply the most cost-effective way to run LLMs at home, so last year I bit the bullet and got a used Mac Studio M1 Ultra with 128GB on eBay for $2500. One of the best purchases I have ever made: this thing uses less than 100w and runs a 123B dense 6-bit LLM at 5 toks/second (measured 80w peak with asitop). Just to give an idea of how far Apple is ahead of the competition: M1 Ultra was released in March 2022 and still provides better LLM inference speed than the Ryzen AI MAX 395+, which was released in 2025. And Ryzen is the only real competition for the "LLM in a small box" hardware. I don't consider these monster machines with 4 RTX 3090s to be competing, as they use many times the amount of power. I truly hope AMD or Intel can catch up so I can use Linux as my main LLM machine. But it is not looking like it will happen anytime soon, so I will just keep my M1 Ultra for the foreseeable future.
View on Reddit #73586531

InspirationSrc@reddit

Is it? Maybe I'm wrong (please tell me if I am, so I can go and buy a Mac), but everywhere I look people say a MacBook isn't that fast for inference on 30B+ models and you're better off using two or more 3090s. And it's not going to work for tuning at all. And you can't even connect a GPU via Thunderbolt, that only works on Intel and AMD.
View on Reddit #73586367

DataGOGO@reddit

lol… no. 
View on Reddit #73585583

Apprehensive-End7926@reddit

Ooooh you’ve really triggered some folks with this one
View on Reddit #73584719

ForsookComparison@reddit (OP)

Gotta get that engagement.
View on Reddit #73584770

tgwombat@reddit

Oh boy, wait until this guy hears about the steady march of progress!
View on Reddit #73583063

Denny_Pilot@reddit

That's probably because the vram gets overflown and the CPU starts doing the work? In that case mac would really give a better speed just because for the price you can't get as much vram. Otherwise idk, the dedicated gpus are faster
View on Reddit #73582941

JLeonsarmiento@reddit

Team 🐺 here.
View on Reddit #73582047

Not_your_guy_buddy42@reddit

iirc she took really long to wolf out
View on Reddit #73581369

ForsookComparison@reddit (OP)

I don't watch the show but the template is gold
View on Reddit #73581456

Wrong-Historian@reddit

still PP to be ashamed of
View on Reddit #73581294