It's almost as if a device that's designed end-to-end by the same firm, largely stamped out of a single die, and built with a robotic degree of precision that no hand-assembly could achieve, using parts too finicky to be user-serviceable, might actually just be better, and that the guys assembling a "perfect" workstation are maximizing an aesthetic instead of maximizing performance.
If you saw a guy gluing together a car in his garage and actually thinking it might beat a Porsche you wouldn't even think he was insane. You'd just think he was stupid.
I'm pretty happy with my 512GB M3 Ultra compared to what I’d need to do for the same amount of VRAM with 3090s.
Spent a lot of money for it, but it sits on my desk as a little box instead of whirring like a jet engine and heating my office.
I wish I could do a cuda setup though. I feel like I’m constantly working around the limitations of my architecture instead of being productive building things.
I really think the jumps up the stack are:
512gb m3 $10k
4x RTX6000 pro Maxq $36k
4x H200NVL $130k (or at this point you're into second hand DGX stations)
I think it depends on your usecase and what you want to do.
$10k is a bit to spend on privacy (now), I'm hopeful that IF the public models ramp their cost, there will be more hardware options to run locally then.
I agree, your M3 Ultra 512GB is a LOT more energy efficient and cheaper than 21x 3090... But it's not faster than that 3090 card. Which is what the meme is hinting at.
If mac wasn't so slow I'd have got one too. All the hardware is weirdly only taking care of one aspect of the ai. Like Mac can run a big model slowly but is expensive.
Nvidia has speed and good drivers and many projects take advantage of cuda but the price per GB of vram is very high.
AMD sucks at drivers and almost none of the new gits work out of box with it and it is slow but it's like half the price of Nvidia and you don't need the big bucks a Mac investment would take.
"AMD sucks at drivers and almost none of the new gits work out of box with it"
The new gits work with Mac though? Thought they don't work because they're cuda based = needs Nvidia.
Maybe it's about the same for amd/mac? Like with both you are buying only inference. I think when you stay in the realm of comfyui and the toys that have had some time to mature they work ok after you get past the rocm hurdles but they are much slower on the amd and Mac. I'm kinda thinking more about when the new models drop or using some extra fast inference like vllm or exl3 or all the tts models that come out. They are all only going to work through Nvidia hardware from the jump as they were all developed on it, at least that's what it looks like when I'm installing them all the time.
Like I was saying I don't think you would want to train wan/qwen/flux or any of the tts models on an amd card or a mac. It sounds like hell just getting amd to infer but maybe it's just rocm. It's like every one of their cards has some different reason why this or that version of rocm doesn't work. It sounds so fucking tedious.
Not even close. I have an M3 Ultra and t/s isn't bad, but once you load up on context the PP time is just stupid slow and no one really talks about that part. I don't know what makes it so bad but it's garbage at higher context.
I will say I haven't tried the newest stuff, but it always felt different to me. They were great for chatbots and stuff like that, but if you want to do anything agentic, they seemed to be a bit lacking.
Possibly, but by the time that comes out the 3090 is 6 years old and a 5090 will still be 2x the memory bandwidth. And a 6090 not that far away (or already out)...
Neither is inherently a better solution than the other, each has its use. The point here was 'faster'... The Mac solution is a lot of things, but faster isn't one of them.
Isn't the M3 Ultra 815 GB/s bandwidth? That's less than a 3090 or 4090, but you should make up for the bandwidth issues by the monster amount of RAM allowing a huge context window to be held in memory (dollar for dollar).
I mean what does a system cost for 15x 4090 networked together?
yeah, I think so. There is a subtle difference between how much does thing cost and how much someone feels it 'should' cost
had just checked and in my location most of the 3090s are $800+ (after conversion) right now. Mind that there isn't a lot of them available, as they're all used cards
I got my 3090 for slightly less than $700 last year, as it had a broken fan and a caked cooler. I've replaced the fans and thermal pads and it's as good as new
I did the exact same thing, just a less lucky buy on some of them. One had a fan issue that the seller was trying to hide, but they had dropped the price to like $800. A new full set of fans is under $20 and because the fans were bad I did thermal pads too. Boy does it drop a lot more heat with the new pads. Put them on the back plate too.
I was lucky with the other 3 3090s and didn't have to put in that effort though.
No, a 3090 is a lot better for performance per dollar, Macs are expensive and the 3090 is very fast! No, Apple is king on memory per dollar vs VRAM. And neither is what the meme was saying.
The amount of companies actually running 3090 cards in datacenters is *extremely* limited, there are probably some, but nothing I would call 'professional' or 'Enterprise'. A H200 server with 8 cards costs half a million, has 1128GB of VRAM, uses about 7+kW, but is utilized in most cases almost 100%. Those servers are great for multi user loads.
An M3 Ultra isn't great at multiuser LLM loads, it's mostly used for either lab work or single user work, and even then the load is a fraction of what a dedicated H200 server does. That H200 server at idle draws more power than the M3 Ultra under load.
21x 3090 cards (+ hardware) are even more power hungry than a single H200 server. Way slower and less versatile. Not surprising for 5 year old hardware.
Hence I talked about energy efficiency, not per token, but over the whole time those machines are on and used by a single user (workstation).
A 3090/4090/5090 or even a 6000 Pro can be a great choice in certain scenarios where the model fits in the VRAM and produces good enough output. But in most cases, this is not the case. Thus you're generally better off with a solution that has more memory, but is slightly slower. Unless of course you have the money for H200 servers, then the equation becomes totally different.
We're also not talking about datacenter servers, we're talking about home workstations specifically.
Is there a workstation setup that can hold, power, and orchestrate enough 3090s for 512 GB RAM?
I can see getting 6 6000 Pros in a rig for significantly more money than an M3 Ultra.
Don’t discount how much power it takes for the Apple chip vs the 22 3090s it would take to get equivalent VRAM.
Back of the napkin math: it would take 22 3090s at 350 watts apiece, so 7,700 watts. Versus I think the M3 Ultra maxes out around 300 itself.
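For anyone who wants to redo that napkin math with their own numbers, here is a rough sketch. It assumes 24 GB of VRAM per 3090 and the 350 W per card mentioned above, and it only counts GPU board power (no CPU, PSU losses, or fans). The function names are made up for illustration.

```python
import math

def cards_needed(target_gb: float, gb_per_card: float) -> int:
    """Smallest number of cards whose combined VRAM reaches target_gb."""
    return math.ceil(target_gb / gb_per_card)

def total_board_power_w(cards: int, watts_per_card: float) -> float:
    """GPU board power only; ignores CPU, PSU losses, fans (assumption)."""
    return cards * watts_per_card

cards = cards_needed(512, 24)                # 512 GB target, 24 GB per 3090
gpu_watts = total_board_power_w(cards, 350)  # 350 W per card, as in the comment
print(cards, gpu_watts)                      # 22 cards, 7700 W
```

Peak board power is a worst case. Cards rarely all sit at full draw during single-user text generation, so treat this as an upper bound rather than a measured figure.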
Memory bandwidth doesn't scale like that...
Single card compute is useless already for inference. Imagine 22 times more compute. 22 times more useless.
> instead of whirring like a jet engine and heating my office
ngl, summers are tough... but I haven't turned on the apartment's heating for the last 2 winters, and I'm getting refunds on the utility bill at the end of the year.
Haha I just moved out of Europe where I had a good basement, to Asia where I'm hoping to find a 40m2 place for 4 people. Mayyybe I'll get a balcony for the server to live on.
Yeah it complicates things for sure. Air cooling would require intake filters and probably a custom case, and condensation could be an issue if I'm not using it 24/7.
That said the balconies here are enclosed, so the exposure is a lot lower.
Currently I'm leaning towards a split setup like an air conditioner uses, with radiator (big ass Mora IV) outside and water hoses going through the wall.
there is quite a difference in speed:
M4 (base) 120 GB/s
M4 Pro 273 GB/s
M4 Max 546 GB/s
Ultra would be around 900GB/s, and the faster the throughput, the faster the inference
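A rough way to see why bandwidth matters for token generation: if every generated token has to stream all active weights from memory once, then bandwidth divided by the size of those weights gives a ceiling on tokens per second. The sketch below uses the bandwidth figures listed above; the 16 GB model size is a hypothetical example (roughly a 32B dense model at about 4 bits per weight), and the function name is invented for illustration. Real speeds land below this bound because of KV cache reads, compute limits, and overhead.

```python
def max_decode_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens/s for batch size 1 (simplified model)."""
    return bandwidth_gb_s / active_weights_gb

# Bandwidth figures from the list above, in GB/s.
chips = {"M4": 120, "M4 Pro": 273, "M4 Max": 546}

# Hypothetical model: ~16 GB of weights read per token (assumption).
model_gb = 16.0

for name, bandwidth in chips.items():
    print(f"{name}: at most {max_decode_tps(bandwidth, model_gb):.1f} tok/s")
```

For MoE models, only the active experts are read per token, so `active_weights_gb` would be much smaller than the full model size.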
Maybe, but you have an even bigger issue with clustering multiple Mac Studios together over Thunderbolt 5 (or worse, over 10Gbit networking) and trying to run a Deepseek full model on there (without quantization).
It's always about the right tool for the job. And using a model that doesn't fit in the VRAM or unified memory is not using the right tool.
24 GB available at 936 GB/s vs 512 GB available at 819.3 GB/s. The cliff after the small memory of 1 or many 3090s fills up is pretty sharp. And the models that do fit are not smart enough for professional workloads other than some very basic stuff. I run 8bit GLM 4.6 at 200k context and still have 62 GB left for everything else. It's a beast.
I own the basic M4 mini. And on that machine I do basic hobby stuff and help my niece and nephew learn AI (under admin supervision). For that kind of stuff it's great. But I wouldn't push it beyond that...or can't.
Yeah...
M4 Mini bandwidth is 120gb/s.
The only Macs that are worth it are the Max and Ultra.
AMD AI 395 is cheaper and has the same bandwidth as the Pro, without the con of being ARM, dedicated TPU...
An Apple user is going to choose a Mac, and the Pro version at a minimum. Even the 800gb/s in my M3 Ultra isn't fast.
120gb/s for chat is rough. I expect a lot of people are disappointed. There's no point in buying a shared memory machine and running an 8B because it's the size that feels fast enough. Just buy the video card.
> I expect a lot of people are disappointed.
I doubt many people are running an LLM locally? Even when I run them on my M1 Pro MBP I get around 17 tokens/s which is sufficient for my needs - it's able to generate text faster than I can read it.
> blow it out of the water.
As I said, my own 32GB M2 Max runs Qwen3-30B-A3B 4Bit MLX at 65tps, while the desktop 5090 runs the same model Q4 at 135-200tps.
Not exactly blown out of the water.
Even the M3 Ultra 512GB is not faster than even a consumer 3090. And even a MacBook only fits an M4 Max, which is only faster if you're building your LLM 'workstation' with RTX 5060 cards...
LOL.
First, go run 600B+ parameters models on 3090s... Which you can on a single M3 Ultra 512GB.
Second, 3090 **TI** is 1000GB/s, 3090 is 900GB/s. M3 Ultra is 800GB/s but MLX is 20% faster than GGUF.
Third, M4 Max is a laptop chip. Show me a laptop with 128GB of 540GB/s memory...
You're just saying shit.
You’re missing the point entirely. I own both Mac and NVIDIA. NVIDIA wins, no competition. Mac is a good generalist/beginner device though, no one can hold that against them.
To sum it up simply
Better value proposition: Mac
Better performance: NVIDIA
Until you get to serious AI use. Then NVIDIA wins both
Yes, it does, but that's not the point. I'm a 'normy' with a Mac and even I say that if you think a Mac is faster than a properly built LLM workstation, you *really* suck at building LLM workstations. A 5 year old GPU (3090) is still faster than any Apple silicon in the LLM space. The advantage that an Apple brings is a LOT of memory for a reasonable price compared to VRAM with 'decent' performance and a very low power footprint (high efficiency).
Yeah, trying to work under those conditions would be painful. Wow.
Luckily I also have a quad RTX 6000 PRO rig, which does not suffer any such slow nonsense... and it also heats my coffee for me.
Standard M5 chips have added matmul acceleration, which significantly speeds up the prompt processing. You'd have to look for posts actually benchmarking M4 vs M5, but it was pretty impressive.
Actual token generation should be sped up as well, but prompt processing will be multiple times more efficient now
someone had posted about this before
[https://www.reddit.com/r/LocalLLaMA/comments/1mt3epi/m4\_max\_generation\_speed\_vs\_context\_size/?utm\_source=share&utm\_medium=web3x&utm\_name=web3xcss&utm\_term=1&utm\_content=share\_button](https://www.reddit.com/r/LocalLLaMA/comments/1mt3epi/m4_max_generation_speed_vs_context_size/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
I do it all the time with Qwen3 32B on my i5-1334U on a single stick of 48GB DDR5-5200. Takes like an hour to start responding and another hour to craft enough response for me to do something with it but it works alright. <1 tok/s.
I did not have to rewire my house, but for my 4x3090 workstation I had to get a 6 kW online UPS, since the previous one was only 900W. And a 5 kW diesel generator as a backup, but I already had it. The rig itself during text generation with K2 or DeepSeek consumes about 1.2 kW, and under full load (especially during image generation on all GPUs) it can be about 2 kW.
But important part, that I built my rig gradually... for example, in the beginning of this year I got 1 TB of RAM for $1600, and when I was upgrading to EPYC, I already had PSUs and 4x3090, which I bought one by one. I also highly prefer Linux, and need my rig for other things besides LLMs, including Blender and 3D modeling/rendering that can take advantage of 4x3090 very well and do some tasks that benefit from large disk cache in RAM or require high amounts of memory.
So, I wouldn't exchange my rig to a pair of 512 GB Macs with similar total memory, besides, my workstation in total spent costs is still less than even a single one. Of course, a lot depends on use cases, personal preferences, and local electricity costs. In my case, electricity costs are small enough to not matter much, but in some countries they are so high that using not so energy efficient hardware may not be an option.
No one is running GLM4.6 on 4x3090. My estimate places GLM4.7 at around 300-400GB of memory needed. GLM Air would run fine though, even with just an RTX Pro 6000, then you'll get \~2872 t/s for prompt processing, and \~112 t/s for decoding.
But again, prompt processing will be so terribly slow, that it won't matter! That is my exact point. Save the money, don't buy Apple hardware based on decoding speeds when prompt processing is what slows that hardware down, so much it's not usable for day-to-day work.
Instead, get real hardware that is meant for ML proper, and with support of the community at large, and run smaller models but run them multiple times, or with a harness with tools that makes it better.
Don't go by online public benchmarks, make your own benchmarks, create the tooling, and smaller models won't matter. Not sure how people aren't getting this yet.
What's the point of being able to run something when it'll take 5 minutes to get a very simple reply?
Anyways, I'm using my hardware for paid work for clients, I can't be half-assing things or waiting hours for stuff to finish. If you want to go for Apple hardware, do it, I'm not your boss :) But if you're aiming to do serious ML development or otherwise actually need performance from what you're able to run, you won't be using Apple hardware, at least today. Maybe in the future.
Again, please do share the prompt processing speed, which is exactly the part that is bog slow on Apple hardware. But every time you ask an Apple fanboi about the prompt processing speed, they tend to stop responding. Hoping you won't disappoint me like all the others before you.
> As I've said, prompt processing speed is enough for chat use
Yeah? What are the concrete numbers? As is typical for Apple fanbois, everyone says it's "fast enough" yet no one is providing concrete numbers. Let's compare, if you have the hardware in front of you.
For `GLM-4.5-Air-Q4_K_M` as a slightly related example, I'm getting ~2836 t/s for pp512 on llama-bench (`build: 7f8ef50cc (7209)`) and 110 t/s for tg128, on a RTX Pro 6000.
Please do provide actual concrete numbers so we can actually compare data instead of your vibes, otherwise please kindly stop trying to propagate the myth that Apple hardware makes sense for inference with the bigger models.
But of course, it's highly unlikely you'll actually come back with concrete numbers, and you probably have some excuse for it that makes sense. Which is exactly the problem, Apple fanbois keep saying "It's fast enough!" yet not a single individual has been able to give me some concrete comparable numbers. Please be one of the reasonable people who can provide actual data!
You just can't stop deflecting from the fact you can't run something at all, right?
And yeah, you know you can drop the numbers yourself, if you think "apple fanboiz are covering up". Personally I simply don't give a single fuck about pp speed if it's a "can run - can't run" situation. Enjoy your imaginary prompt processing speed numbers on your imaginary nvidia rig lol. They are high as fuck.
>you know you can drop the numbers yourself
Did you miss this part of my message.
And don't forget to compare numbers you post with zero tks you get on your rig.
> you know you can drop the numbers yourself
I don't know what that's supposed to even mean, you mean I intentionally lowered them? That wouldn't make sense, so surely you meant something else.
But anyways, what's the point? You clearly don't have the hardware yourself, or are embarrassed over your purchase and refuse to actually provide any data, so yeah. Not sure what you want from me? Just stop engaging if you don't want to have a faithful conversation by providing data to backup your statements. Otherwise this is all just a waste of time for everyone involved.
> 140tks
Hah, suddenly everyone understands why you have to pry out any sort of benchmarking data out of the Apple fanbois who spent $10K on a computer that does ~140tks t/s in prompt processing :|
But you know what, you do you, I'm not saying everyone should get a RTX Pro 6000, just that if you're aiming to do professional ML development, you really can't be on Apple hardware, you need at the very least something CUDA compatible. But it requires you to understand software engineering to grok this, so I'd understand if it's easier to go with Apple hardware.
> Now gtfo, you are wasting my time.
You can stop responding whenever you feel like, not sure if you've understood how this whole "internet" thing works like or not, but just in case; you don't have to respond, no one is forcing you.
>Hah, suddenly everyone understands why you have to pry out any sort of benchmarking data out of the Apple fanbois who spent $10K on a computer that does \~140tks t/s in prompt processing :|
This "prying" happens only in your head. No one is "prying" anything, I was just having fun watching you deflect and try to pretend there's a CONSPIRACY against you and everyone is hiding numbers that are available on a first google search. It was hilarious, made my day, thank you.
So, what is your prompt processing speed for glm 4.6. Still zero? As expected? Lmao.
Bye.
Must be an American thing, I'm too European to understand.
Well, actually I'm a former industrial electrician. So I fully understand that most houses here have a 3x230V 20-35A supply, then often divided into 10-13A sub-groups and 16A for appliances like dryer and washer. So not really an issue.
Electrical bill on the other hand is a completely different issue.
Imagine saying I can afford an M3 Ultra for no real reason other than I just want to, but can't afford to buy a NEMA 14-50 plug. If you can afford 16 GPUs you can easily pay to get a 50A 240V circuit. My servers all run off 240V anyway because they are actually way more efficient.
What a stupid arse cope response.
I find this response hilarious. Mac people say this like it matters. Like, who cares? Seriously. I want to get things done, don't Mac folks want to get things done? "Oh no, not if it means I'm using 40 extra watts, gee, I'd rather sit on my thumbs"
Stop.
For some people electricity costs are important. Not for me personally, but I know some people already have really high bills, makes sense they try to optimize for it if that's their situation.
And no, Apple hardware is excellent for some specific AI models that they have specifically worked to make run somewhat OK on their hardware, since the community isn't really focused on Apple hardware. If you want stuff that just runs, go for nvidia hardware, simple as that. And no limitations except your wallet in that case, which if you go for Apple hardware, probably already isn't an issue. Apple hardware will be dog slow and you'll regret getting it.
Imagine having a context window of 25 tokens and completely missing the fact conversation is about full gpu offload just to write another toxic comment from your throwaway account.
"Extra 40 watts", lol.
True, but different tools. My Mac is always on, frequently working and holds multiple LLM in memory. 8 watts idle, 300+ watts works, never makes a sound.
Big MOE models are particularly suited for shared memory machines, including Macs.
I do expect I will also have a CUDA machine in the next few years. But for me, a high end mac was a good choice for learning and fun.
Also Deepseek 3.2 is out now, demonstrating that you can make SOTA models with close to linear prompt processing. Mac and EPYC machines with a lot of RAM are only going to become more useful over the next couple of years IMO. Especially now that you can cluster Macs effectively.
Show me a laptop with 128GB of 546GB/s memory.
Price a desktop with 128GB of 546GB/s memory compared to Mac Studio M4 Max.
I won’t even talk about power efficiency.
Sure, they’re not meant for training. But most of us here only use inference anyway.
>Show me a laptop with 128GB of 546GB/s memory.
Laptop is not a workstation.
>Price a desktop with 128GB of 546GB/s memory
6x 3090 if you can get them at $500, or modded 3080 if you can't - $3000. Mobo, CPU and DDR - $1000. Power supply and case - up to $500. Total: $4500.
>I won’t even talk about power efficiency.
If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.
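The claim above is just energy = power × time. A minimal sketch, with an optional idle term added because a workstation that stays on all day also burns power between tasks. All wattages and durations below are hypothetical placeholders, not measurements of any particular machine.

```python
def energy_wh(active_w: float, active_s: float,
              idle_w: float = 0.0, idle_s: float = 0.0) -> float:
    """Energy in watt-hours for an active phase plus an optional idle phase."""
    return (active_w * active_s + idle_w * idle_s) / 3600.0

# Same task, one box 10x hungrier but 10x faster (hypothetical numbers).
slow_box = energy_wh(active_w=300, active_s=100)
fast_box = energy_wh(active_w=3000, active_s=10)
print(slow_box, fast_box)  # equal: per-task energy is identical

# Over a full hour with one such task and idle draw the rest of the time
# (idle wattages are assumptions for illustration only).
slow_hour = energy_wh(300, 100, idle_w=10, idle_s=3500)
fast_hour = energy_wh(3000, 10, idle_w=150, idle_s=3590)
print(slow_hour, fast_hour)
```

So the "exactly as power efficient" statement holds per task, while the whole-day picture depends on idle draw and how busy the machine actually is.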
> Laptop is not a workstation.
For inference? LOL
> 6x 3090 if you can get them at $500, or modded 3080 if you can't - $3000. Mobo, CPU and DDR - $1000. Power supply, cpu cooler, fans and case - up to $500. Total: $4500.
The M4 Max Studio 128GB cost $3,499.00
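Tallying the figures quoted just above, as a quick sanity check. The part prices are the ones from the quoted build and the $3,499 Mac Studio price from this comment; the 24 GB per 3090 is the card's VRAM. This ignores tax, shipping, and electricity, and the dictionary keys are just labels.

```python
# Prices in USD, taken from the quoted comment.
pc_build = {
    "6x used 3090": 3000,
    "mobo, CPU and DDR": 1000,
    "PSU, cooler, fans, case": 500,
}
mac_studio_usd = 3499.00

pc_total = sum(pc_build.values())
pc_vram_gb = 6 * 24          # six cards at 24 GB each
mac_memory_gb = 128          # unified memory

print(f"PC build: ${pc_total} for {pc_vram_gb} GB "
      f"(${pc_total / pc_vram_gb:.2f}/GB)")
print(f"Mac Studio: ${mac_studio_usd:.0f} for {mac_memory_gb} GB "
      f"(${mac_studio_usd / mac_memory_gb:.2f}/GB)")
```

Not all of the Mac's unified memory is usable as GPU memory, so the per-GB comparison is only approximate.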
> If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.
I run Qwen3-30B-A3B 4Bit MLX at 65tps on a 32GB M2 Max MacBook Pro. Best benchmarks of the desktop 5090 running this model Q4 were between 135 and 200tps. You're funny, but completely delusional.
See comments here https://www.reddit.com/r/LocalLLaMA/comments/1p7wjx9/rtx_5090_qwen_30b_moe_135_toks_in_nvfp4_full/
> Qwen3-30B-A3B 4Bit
189t/s on my single RTX3090 just now. That's running the Q4_K_M gguf.
(59t/s for a 0 context 'hi' prompt with the cpu-only build)
Adding more cards won't really help for a batch of 1 since the model is so light, memory bandwidth is the constraint (58 t/s with just my CPU).
Macbook is obviously the better form factor though :)
If you follow the link I put, you’ll see that’s exactly what I’m saying. OP was flexing about their results and I pointed out that their numbers were nothing to flex about.
TBH I’m genuinely surprised by the poor performance of this model on the 5090 considering the compute power and memory bandwidth.
You should explain to him how you get much better results with a much less powerful GPU ^_^
Yeah I clicked the link, just wanted to let you know that GPUs aren't usually that... slow.
I didn't do anything special for that ^ just ran the model in llama.cpp lol
>I run Qwen3-30B-A3B 4Bit MLX at 65tps on a 32GB M2 Max MacBook Pro. Best benchmarks of the desktop 5090 running this model Q4 were between 135 and 200tps. You're funny, but completely delusional.
Ah, if I had a dollar every time a person judges performance by a 0-length prompt, I would have an RTX 6000 Pro by now. IRL you're not working with short context, especially not if you're paying for Max/Ultra chips; and their prompt processing is terrible. With Qwen3 30B, a very light model, and a 30-40k long prompt, [M3 Ultra only gets](https://www.reddit.com/r/LocalLLaMA/comments/1kvd0jr/m3_ultra_mac_studio_benchmarks_96gb_vram_60_gpu/) \~400 tok/s PP, while dual 3080 [will get](https://www.reddit.com/r/LocalLLaMA/comments/1p0bbrl/rtx_3080_20gb_a_comprehensive_review_of_chinese/) 4000 tok/s PP at the same depth. This is exactly 10x faster.
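To put those PP numbers into wall-clock terms, here is a small sketch of the wait before the first token appears. It assumes prompt processing speed stays constant over the whole prompt (in practice it drops with depth) and uses the ~400 and 4000 tok/s figures from this comment; the 35k prompt length is just a point inside the 30-40k range mentioned.

```python
def prefill_seconds(prompt_tokens: int, pp_tok_s: float) -> float:
    """Time to process the prompt, assuming a constant PP rate (simplification)."""
    return prompt_tokens / pp_tok_s

prompt_tokens = 35_000  # somewhere in the 30-40k range from the comment

m3_ultra_wait = prefill_seconds(prompt_tokens, 400)     # ~400 tok/s PP
dual_3080_wait = prefill_seconds(prompt_tokens, 4000)   # 4000 tok/s PP

print(f"M3 Ultra: {m3_ultra_wait:.1f} s before the first token")
print(f"2x 3080:  {dual_3080_wait:.1f} s before the first token")
```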
Dude, I'm a developer. I spend my time processing big context.
> prompt processing is terrible
M4 and M3 generation yes. M5 architecture brings tensor cores to every GPU and prompt processing is now 4 times faster than M4.
> This is exactly 10x faster.
Ah, if I had a euro every time a person judges performance by one metric only, I would own a house in France by now.
>M5 architecture brings tensor cores to every GPU and prompt processing is now 4 times faster than M4.
Can I buy M5 with 128GB of memory? No? Come back when it becomes available, I will happily compare it to equivalently-priced Blackwell.
>Ah, if I had a euro every time a person judges performance by one metric only, I would own a house in France by now.
Surely, if I'm wrong, you would easily provide numbers that prove it.
> Surely, if I'm wrong, you would easily provide numbers that prove it.
You told me yourself that Nvidia were 10 times faster at **prompt processing**
I've shown you that a 5090 is barely 2 to 3 times faster than an M2 Max.
Hence, **one metric only**
If you would look into your usage stats, you'll see that generation length is typically multiple times shorter than prompts, basically for all usecases except essays or other creative writing. Thus it is the metric that makes the most impact. Besides, TG also drops at long prompts, and on exactly the same links you can see it's 35 for M3 Ultra and 70 for 2x3080 at depth \~30k. The difference is immense, and I'm not even talking about 5090.
Those might be *your* usage stats, but a lot of us do batched requests with massive prompts and very small output from the LLM. We need fast PP and don't care about TG.
Dude, I literally am arguing the whole conversation that RTX has massively faster PP and it is what matters the most. You are arguing against the wrong person.
>If a system consumes 10x more power, but does the same task 10x faster, then it's exactly as power efficient.
Will it really be 10x faster at concurrency 1.
>Will it really be 10x faster at concurrency 1.
Depends what you're doing. For diffusion, VibeVoice, training then yeah, 10x faster.
For a single user with sparse MoE models, maybe 3x or 4x faster.
Depends on the model and architecture, but yes, 2^(n) 3090s with TP (and less so with other odd/even numbers), especially vLLM/sglang, can be plenty faster, even at x4 and stock drivers. Here are some numbers on a 50-60K prompt with 3 different models:
https://preview.redd.it/gbim61iu6f7g1.png?width=1473&format=png&auto=webp&s=00a5b6bb35a4a832008eef27d9ae016e28af8e7b
These numbers are very exaggerated in favor of the prompt size tho. It's like "what color is the sky?" and "here's 50K personality prompt" or something. Most of the time, especially in agentic use with reasoning models, ratio is 5:1 or higher in favor of generation size.
And I'm looking at generation outputs... They are around mac level, give or take.
No it isn't, especially with reasoning models.
So what is the final speed increase if prompt size is equal to output, or prompt size is 3 times less than output.
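One way to answer that with the numbers already posted in this thread (about 400 vs 4000 tok/s PP and 35 vs 70 tok/s TG at ~30k depth for M3 Ultra vs 2x 3080) is to add prefill time and generation time per request. This is a simplified sketch: it assumes both rates stay constant regardless of prompt length, which is not true in practice, and the token counts are arbitrary examples.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    pp_tok_s: float, tg_tok_s: float) -> float:
    """Total time = prefill time + generation time, with constant rates."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s

# Rates quoted elsewhere in the thread (tok/s); treated as constants here.
M3_ULTRA = {"pp": 400.0, "tg": 35.0}
DUAL_3080 = {"pp": 4000.0, "tg": 70.0}

def speedup(prompt_tokens: int, output_tokens: int) -> float:
    """How many times faster the 2x 3080 finishes the same request."""
    slow = request_seconds(prompt_tokens, output_tokens, **{
        "pp_tok_s": M3_ULTRA["pp"], "tg_tok_s": M3_ULTRA["tg"]})
    fast = request_seconds(prompt_tokens, output_tokens, **{
        "pp_tok_s": DUAL_3080["pp"], "tg_tok_s": DUAL_3080["tg"]})
    return slow / fast

# Prompt-heavy, equal, and output-heavy (prompt is 3x smaller than output).
for prompt, output in [(30_000, 1_000), (30_000, 30_000), (10_000, 30_000)]:
    print(f"prompt={prompt}, output={output}: {speedup(prompt, output):.2f}x")
```

With these inputs the gap shrinks toward the TG ratio (2x) as output grows relative to the prompt, and widens toward the PP ratio (10x) when the prompt dominates.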
That's a valid point as long as you're not running MoE models. But those would want to load the full model weights in the RAM while the experts are active in the VRAM. At least for highest efficiency in $ / performance. Though anyone just interested in running a single LLM might as well run a dense model with 6 3090s to work off of (but someone like me running multiple LLMs alongside other agents benefits from having some number of those being MoEs, as opposed to one large model.)
The numbers I'm seeing for 32b models are \~18t/s on m4 max(546GB/s) vs 55t/s on 3090(936GB/s) with 2x tensor parallel. So about 3x faster, maybe 4x faster with 4xTP.
I'm using MLX version, which is known to be 20% faster than GGUF.
I have 65tps on the 30B version, which is an MoE, not a dense model.
23tps on 32B Qwen3
I've expanded my take with real numbers in [this reply](https://www.reddit.com/r/LocalLLaMA/comments/1pnfaqo/comment/nu7h889/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).
Depends on the area, like $300 each now I think? They're on Aliexpress.
Don't buy 'em unless you're happy to spend a day setting them up, fucking around with rocm setup, etc.
They're not plug and play like Nvidia or a Mac. They are much faster than a Mac for big dense models.
That M4 Mac with 128gb is like 5k. The 5090 is going to eat up 2500 and the memory another $1350 (yikes to the ram market). You've still got enough money to round out the rest of the system. It might be slightly more, but it will be easily 2x as fast.
Nobody cares about power efficiency in a workstation.
> That M4 Mac with 128gb is like 5k
M4 Max Mac Studio with 128GB is $3500.
> You've still got enough money to round out the rest of the system. It might be slightly more, but it will be easily 2x as fast.
How many RTX 5090 do you need to get 128GB of VRAM? You're saying shit.
> Nobody cares about power efficiency in a workstation.
Speak for yourself.
> You're fighting physics and computation. There's no magic formula that gets apple free matrix operations.
I don't even know what this is supposed to mean.
You could probably throw together 4x AMD V620(32gb@512GB/s) on an EATX x299 board for $3000 off of ebay. It won't have drivers for nearly as long, suck back way more power, and would sound like a jet engine with the blower fans on those server cards, but would train faster. Maybe I'm biased, my rig is basically that but half the price cause I got a crazy deal on the gpus :P
A "loaded macbook" that can run 120B at all will cost you like $3000. For that money you can assemble a PC that not only will load the same 120B model completely into GPUs (4-bit quantized, of course), but will also run it multiple times faster for a single request, and orders of magnitude faster if you have an agentic workload that can take advantage of parallel processing.
So the whole post is about "macbook is better than a workstation". A "wardrobe of GPUs" is, in fact, a workstation, and fits this particular conversation perfectly.
4x RTX 3080 20GB will cost $2000 including tax and shipping. $1000 for a pc case, motherboard, cpu, ram and psu is totally enough, you don't need any top-of-the-line components for that. The performance of this setup will be 10x of M3 Ultra in PP, and some 2x or more for TG for single request.
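A quick check of whether a 4-bit 120B model fits in that 4x 20 GB setup. This is a rough sketch: it counts raw weight bits only, while real quantized formats also store scales (so the effective bits per weight is somewhat above 4), and the KV cache and runtime buffers need room too. The function name is made up for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits / 8 (ignores metadata)."""
    return params_billion * bits_per_weight / 8

model_gb = weights_gb(120, 4)   # 120B parameters at 4 bits per weight
total_vram_gb = 4 * 20          # four RTX 3080 20GB cards
headroom_gb = total_vram_gb - model_gb

print(f"weights: {model_gb:.0f} GB, VRAM: {total_vram_gb} GB, "
      f"left for KV cache and overhead: {headroom_gb:.0f} GB")
```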
Sure, if that's your preference. Just don't go around the internet claiming that a macbook is faster, or that an equivalently-priced PC can't fit same-sized models.
No one claimed a MacBook is faster. I use both. My RTX5090 gets used quite often for embeddings models cause that’s about the only useful thing one can run with such a small pool of memory. There’s no usable LLM for any serious work that will fit in one.
My point was that it’s better to run larger more capable models than not at all. I can fit gpt-oss-120b in my M3 and it’s almost useable as a daily driver. Almost.
OP is referring to people building PC frankenbuilds and doing CPU/RAM offloading to squeeze large models into the shared pool of resources. At that point you’re not leveraging the speed of CUDA. Once you leave the GPU it’s a different ballgame. In these scenarios a high end Mac will outperform a PC with offloading.
M3 Ultra owner here. The only downside I see on Mac is video generation. Being able to get full models running on it is amazing!
The speed and prompt loading times are not truly crazy slow. It is ok, especially when it is running on a fraction of the power, with NOT A SINGLE NOISE or heat issue. Also, it's important to say that even without CUDA (which is a major downgrade, I know) things are getting better for Metal.
My doubt now is whether I buy a second one to get to the sweet spot of 1TB of RAM, wait for the next Ultra, or invest in a minimum machine with a single 6000 Pro to generate videos + images (I accept configuration suggestions for the last one).
\> NOT A SINGLE NOISE or heat issue
Is this really an issue for people? I have my workstation with a RTX Pro 6000 right next to me, I just ran a benchmark with lmstudio-community/GLM-4.5-Air-GGUF (glm4moe 106B.A12B Q4\_K) and even as the GPU is hitting max temperatures of \~72°C, I can barely hear it.
For configuration suggestions, you meant like hardware? It pairs nicely with Threadripper 9970X and fast I/O :)
Thx. I have mental issues, cacophony can drive me crazy. I know that the 6000 is more silent than the 3090/4090/5090. Thx for Threadripper. What I mean is that I want a minimum (€€€) body to run the 6000 without spending that much money.
> What I mean is that I want a minimum (€€€) body to run the 6000 without spending that much money.
To be honest, if you're just doing LLM inference, the other hardware won't matter much. You could be on PCIe 4, on AM4 with DDR4 and you won't notice much difference compared with everything on Gen 5 or even moving to sTR5 (I would know, literally just did this move some weeks ago).
Just make sure the PSU is good, and you have sufficient cooling, because this beast generates heat like no other GPU I've had before.
How bad is the video gen speed? Something like the 14B WAN, 720p 5s? I'm planning to buy a Mac Studio in the future mostly to run LLMs and I heard it's horrible for videos, but is it 'takes an hour' bad or 'will overheat, explode and not gen anything in the end' bad?
Prompt times not slow? Are you kidding me? I tried running GLM 4.5 Air on my M3 and it was taking 2-3 min / prompt. Sure, if you are using it as a chat bot it's not bad, but if you want to use it for anything major they are useless. I have a set of 5060 Tis that smoke it in every way and cost me 1/2 as much.
Some people told me it would really take over an hour for one animation, but if they can just keep mulling it over with no issues... I can start the gen, walk the dogs, come back and it's done haha
I used to run dense models with heavy CPU offloading when getting into locals, so time doesn't scare me as long as the hardware doesn't suffer too much 🥹
I hate Apple smartphones. This is my first Apple. But I need to say, the thing is a tank. I would say military-grade quality. It is built to last. In my lab there is one that is 20 years old and still working, and another one that is 11 years old and looks like it just popped out of the box. Mine is always under 480 GB+ of RAM use or 100% CPU use (bioinformatics) and it barely breaks a sweat. Don't evaluate Apple PCs by Apple's disposable gimmicks, they aren't the same. Bonus note, you have full control over it, not a single headache over drivers.
TIL you can buy a home mac with 512GB of basically vram(?) - it's half the VRAM speed of an rtx pro 6k but the fuck lol that's still insane? for $10k to $16k?
Not bad at all. I wonder if you can get one of those and run linux on it.
Might not be the perfect LLM device but a memory rich one. And one that idles at under 10 and maxes out under 270 Watts while staying silent.
There are those freaks (in the best possible way) at Asahi Linux that reverse engineered the Mac drivers and rebuilt them for Linux. I actually run a MacBook M1 Max on Asahi Fedora and it runs great. Unfortunately they only cover the M1 and M2 families so far.
But then I guess you'll lose MLX support, which is a boost in model performance on Mac.
my pro-6k can go at 450 watts and it's still silent :D
I wonder if mac has stuff that automatically quantizes the model during runtime? That would suck
Oh, two things here: Apple recently added support for stacking multiple Macs with "RDMA over Thunderbolt", so you could multiply these 512GB.
https://www.jeffgeerling.com/blog/2025/15-tb-vram-on-mac-studio-rdma-over-thunderbolt-5
And the next chip generation M5 is expected to bring extra neural accelerators within the GPUs
https://9to5mac.com/2025/11/20/apple-shows-how-much-faster-the-m5-runs-local-llms-compared-to-the-m4/
I have 7t/s TG and 140 t/s at 60k ctx with Devstral 2 123B 2.5bpw exl3 (it seems like quality is reasonable thanks to EXL3 quantization but I am not 100% sure yet).
Can Mac do that?
It really depends on what you're trying to do. MacBooks work ok on MOE models, but dense models not so much. My 5090+4080 pc is much faster with 70b models than what you can do with macs.
Thats why it depends on what youre trying to do. I use it for roleplaying and haven't found any of the moe models better than the 70b model i use.
https://huggingface.co/Steelskull/L3.3-Shakudo-70b
Id imagine the newer moe models are better for coding.
Yes, I can run a qwen3 235b moe in q6_xl and it's really nice for the expense I made. For comfy with qwen image it still performs, but my old 3090 runs laps around it while already being undervolted to 245 watts xD
You just need to download RAM Doubler. Install two copies of it and your RAM will quadruple.
https://preview.redd.it/86like0z1f7g1.jpeg?width=200&format=pjpg&auto=webp&s=7592a47f02d8b2a025e37f1cad502be8604245d4
Ran out of disk space installing more than four copies of ram doubler. Can I use Disk Doubler?
https://preview.redd.it/3mra854h4f7g1.png?width=1136&format=png&auto=webp&s=397e3889c6435e80fcf8de301ea7013f6f1821a1
Hey, joke all you want, but Stacker was legit, I would have never survived the 90s without stacker and the plethora of Adaptec controllers and bad sector disk drives I pulled out of the dumpsters of silicon valley.
Stacker actually made your computer run faster, as the HDD was so slow that reading the compressed file and unpacking it was way faster than reading the same file from raw disk.
Fun fact: unlike the whole "Download more RAM" meme, Ram Doubler software was a real thing back in those days, and it did actually increase how much stuff you could fit in RAM.
The way they worked was by dynamically compressing and decompressing data as it came in and out of RAM. Nowadays RAM compression is built into basically all modern operating systems so it would no longer do anything, but back then it made a real difference.
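To make the idea concrete, here is a toy sketch of compress-on-store / decompress-on-load. It is not how Ram Doubler was actually implemented (I don't know its internals); `PAGE_SIZE`, the function names and the sample data are all made up for illustration, and zlib just stands in for whatever compressor a real system would use.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


def compress_page(page: bytes) -> bytes:
    """Compress one page before parking it in the (hypothetical) compressed pool."""
    return zlib.compress(page, level=1)  # fast, low-effort setting


def decompress_page(blob: bytes) -> bytes:
    """Restore a page when the program touches it again."""
    return zlib.decompress(blob)


# A page of repetitive data, which is the case where this trick pays off.
page = (b"tile data " * 500)[:PAGE_SIZE]
stored = compress_page(page)
restored = decompress_page(stored)

print(f"{len(page)} bytes held in {len(stored)} bytes")
```

The catch is the same one people mention in this thread: every store and load costs CPU time, and pages that are already compressed or random-looking barely shrink at all.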
Some people reminisce about Woodstock, I reminisce about waiting in line at Fry's electronics to get Windows 95 at 12:01AM
The kids will never understand.
When I got engaged we were trying to set a date and August 24th came up. I said, "Perfect! I'll never forget our anniversary. It's the day Windows 95 was released!"
We're divorced now.
I was waiting in line at midnight for the release of Mac OS X. Club goers passed by and asked what we were doing. When we explained why they laughed at us like we were all dorks.
It really didn't do any ram compression, windows 95 did that. Yes, windows 95 did ram compression and those "ram doubler" just used placebo and doubling the page size by 2x. That's it...
The original [Ram Doubler](https://apple.fandom.com/wiki/RAM_Doubler) wasn't for Windows 95 though, it was for classic Mac OS and Windows 3.1. Neither of which had RAM compression built in.
You might be confusing Ram Doubler with [SoftRAM](https://en.wikipedia.org/wiki/SoftRAM), which was indeed just a scam. That was developed by an entirely different company though.
Yup.
Ram Doubler was the real deal.
Came at the cost of a little cpu, but that was a point in time most systems were more memory bound than cpu bound. 4-16mb of memory but 66-200mhz CPU. Taking a couple percent to add memory was a huge win, compared to virtual memory on slow 5200 rpm ide hard drives.
Oh yeah, having 4mb of ram was a ridiculous constraint bro.
You don't understand, I NEEDED to be able to run SCURK.
https://www.mobygames.com/game/2422/simcity-2000-urban-renewal-kit/
Didn't this actually work by compressing the ram or something?
I know it wasn't 2x but it was better if I recall than nothing? I swear I saw a yt vid on this once
It slowed the computer down anyway because the CPU ran at less than 50Mhz and only had one core. It had to do on the fly compression and decompression while running your apps and OS.
I'm always more concerned about quality over speed. Sure speed is nice, but throwing more compute at the model won't make it magically better at answering
I specifically said, the future of \*local\* home AI... will be a small box on the table. Sure, there will be lots of things provided from the cloud. However... history has shown that people still prefer to actually 'own' their stuff. There was a time in the 70's when people thought personal computing would be done on terminals connected to the huge machines of the time stationed elsewhere, and look what happened (spoiler: a personal computer in all of our pockets, plus portable laptops, etc). We have a natural drive towards distributed processing, just like we don't have a hive mind, etc. It's who we are. We go the cloud way only if that's the only option, but when other options are available, we go for distributed processing.
I know you did... I am saying the future of ALL ai will be cloud. History has shown that it really doesnt matter what you want, cost dictates everything. At some point home GPUs will not run the ai models and if you think you will be able to afford a datacenter tpu or whatever they run on in 10 years i have a bridge to sell you.
In the 70s not EVERYTHING was going to a service, now it all is. You will own nothing, it will all be a license, and sadly, not a single person will do
History also showed that we are not running excel and word on supercomputers via terminals. We opt for not the most powerful, but more convenient solutions. Anyway, this discussion is pointless, because I won't change your mind, and you won't change mine; and none of our theories are provable at the moment, just speculation. But thank you for engaging, anyway.
EXACTLY.. that's why bang for the buck a 128gb strix halo was my goto even though I could have afforded a spark or whatever. I'm just going to use this for inference, local testing, and enrichment processes. If I get really serious about training or whatever renting for a short span is a much better option.
100%. Buying your own hardware while these companies are literally burning money is insane. Renting is so insanely subsidized right now. It's not worth buying.
One could argue that now is the time to buy before hardware gets insanely expensive and everyone pulls out of consumer GPUs. But honestly, if I was really really serious about running intense stuff local. I'd probably drop 10-15k on a real AI machine.
Valid. Honestly built this project, so I had a good reason for a local inference box. For anything other than enrichment type stuff I use big models. [https://github.com/vmlinuzx/llmc/](https://github.com/vmlinuzx/llmc/)
Local stuff is a shit ton of fun to me to do learning on, and also build systems which require engineering imagination to work perfectly under constraints.
I got my finetune featured in a LLM safety paper from Stanford/Berkeley, it was trained on single local 3090 Ti and it was actually in the top 3 for open weight models in their eval - I think my dataset was simply well fit for their benchmark.
>However, on larger base models the best fine-tuning methods are able to improve rule-following, such as Qwen1.5 72B Chat, Yi-34B-200K-AEZAKMI-v2 *(that's my finetune)*, and Tulu-2 70B (fine-tuned from Llama-2 70B), among others as shown in Appendix B.
https://arxiv.org/abs/2311.04235
If you're doing base model training then yes. But if you're fine tuning 7b, 12b models you can get away with most consumer Nvidia GPUs. The same fine tuning probably takes 5 or 10 times longer with MLX-lm
Congratulations to the 3 people on here training models from scratch that no one will ever use. For everyone else, MLX can do everything, including fine tuning.
Who said people are training models for mass users? Most model training and fine-tuning is done for personal, college, or internal enterprise reasons. MLX-LM can do \*some\* of the things that CUDA-accelerated libraries like Unsloth/PEFT/torchtune/TensorFlow can do, but WAY slower.
It's disingenuous for you to pretend that no one does this and that MLX is just as capable or performant
lol why does everyone have to participate in fine tuning or training exactly? What a dumb ass gatekeeping hot take.
This would be like a carpentry sub trying to pretend that only REAL carpenters build their own saws and tools from scratch. In other words, you sound like an idiot.
Point to me where I made any gatekeeping statements.
My point is that people like OP don't consider the full range of this industry / hobby when they make blanket statements about which hardware is best
Do tell. What can't you do without CUDA? We can run inference, fine tuning, diffusion models, tts, stt, embeddings, etc without CUDA.
I suppose for the 0.000000001% of the world that is training models from scratch then CUDA matters.
The full version of automatic1111 with all the extensions, a whole host of txt to image software clients, many different LLM clients, txt to image merging, LLM lora training... off the top of my head.
I'm sure the list is like a page long. It's a lot of stuff.
That's an implementation detail of automatic1111 and nothing to do with what CUDA is or isnt capable of vs other platforms. I can use the same image gen models youre using on Apple hardware. Sure the generation times are slower, but it can still be done without CUDA.
You're still using automatic1111? The platform dinosaurs were using back when SD1 was popular?
The entire advantage of using CUDA is that it makes it easier for developers, hence why yes, it's used in a lot of software.
So, what software do you use to run z-image on mac?
And not just Cuda. The blackwell hardware is very needed for training full FP8 at least for now. But I have put hopes in ROCm, it's open source and promising.
Right…“my loud, hot, expensive Linux box is faster in benchmarks and images, therefore your quiet unified-memory machine that lets you think deeply without friction is bad.”
I wouldn't call myself a normie. Thing is even before the RAM shortage 128GB vRAM is crazy expensive and is attached to power hungry devices. The unified memory has the advantage that it fits and the speeds are just about good enough for certain tasks.
I would be really happy to be a "normie" if I had the money to be ;-) I have a MacBook Pro M3 with 24 GB from work. You need to spend way more than that (which was already expensive for my taste) and the speed is disappointing. In my dream there is a cheap M5 Ultra, in my dream....
I think that van would backfire on kidnappers, they'd find themselves instantly surrounded by a mob of ravenous savages tearing the van apart to get at the RAM in there.
I can't afford a PC to play the latest games that come out, even EU5 is laggy sometimes, let alone a local LLM. \*sigh\* I'm pro AI, but more like I think the future is Local and not Cloud, so I hop into the van too!
I upgraded my M1 pro with 16GB ram to an M4 max with 48GB for this very reason. It’s just so performant at anything I throw at it and portable that it’s worth the apple tax imo.
I assume you're talking about DDR5? I'm struggling with 3600MHz DDR4... (64 GB VRAM but still, can barely run a 70B model at Q4\_K\_M gguf at decent speed for fast inference \~30-60 seconds below 16k even... Is koboldcpp bottlenecking me?) I should've upgraded earlier, but went for a new monitor...
A 70B dense model is quite slow if it cannot fully fit in VRAM... For example, Kimi K2 is a 1T model but has just 32B active parameters, so in the case of CPU-only inference it will be faster than a 70B model.
And based on the 3600 MHz speed, you likely have dual channel RAM, which is almost four times slower than 8-channel DDR4 3200MHz RAM.
In any case, to run a model efficiently its context cache needs to be entirely in VRAM. Then prompt processing will be done on the GPUs and text generation speed will be much faster too.
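If you want to put rough numbers on this, here is a back-of-the-envelope sketch. It assumes 64-bit (8-byte) memory channels and that generating one token means reading every active weight once; it ignores KV cache reads, compute and real-world efficiency, so treat the results as ceilings, not predictions. The function names and the 4-bit figure are my own picks for illustration.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s: channels * MT/s * bytes per transfer."""
    return channels * mega_transfers * bus_bytes / 1000


def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


dual = peak_bandwidth_gbs(2, 3600)  # dual-channel DDR4-3600
octa = peak_bandwidth_gbs(8, 3200)  # 8-channel DDR4-3200
print(f"dual: {dual:.1f} GB/s, 8-channel: {octa:.1f} GB/s, ratio {octa / dual:.2f}x")

# CPU-only decode ceilings on the dual-channel box, assuming ~4 bits per weight
dense_70b = decode_ceiling_tps(dual, 70, 4)   # all 70B weights are active
moe_32b = decode_ceiling_tps(dual, 32, 4)     # only 32B active per token
print(f"70B dense: {dense_70b:.2f} t/s, 32B-active MoE: {moe_32b:.2f} t/s")
```

Under these assumptions the 8-channel setup has about 3.6x the bandwidth, and the 32B-active MoE has a ceiling a bit over 2x that of the 70B dense model on the same memory. The MoE still needs enough RAM to hold all of its weights, it just doesn't read them all for every token.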
Well, I seem to satisfy the requirements. I have four 3090s, they are sufficient to hold 160K context at Q8 with four full layers of Kimi K2 IQ4 quant (or alternatively could hold 256K context without full layers), and 1 TB of RAM. Seems to be sufficient for now. Good thing I purchased it at the beginning of this year while prices were good... otherwise at current RAM prices upgrading would be tough.
Can relate, I am the normie. I own a M4 Max (64GB) laptop and I kept wondering why people have to go to such lengths and expenses to run those 30B models, until I realized the reasons.
Not sure about the significantly better speeds, I guess for the same price you have a faster setup with nvidias gpu
but for sure it's WAY EASIER to buy, install and set up, costs less to run, doesn't consume a billion watts or replace your heater, makes way less noise and takes way less space
Okay, it's good that you have both because I have some questions.
How much vram do you get out of your dual 3090 setup?
Also, do you really need that? Because from what I've seen gpt oss 20b is the first model that I can call decent and I can run it on my gaming PC no problem. And it's a MOE one.
So I'm just thinking: MOE sounds like the biggest bang for the buck. Mac mini sounds like the biggest bang for the buck as well. If you combine them and hope that there will be better MOE models - it seems like a good choice for a small local setup, that does pretty much anything you need locally if you can't use a cloud model for some reason.
It makes no sense. If it said something about being able to run larger models and left out normies, that might work. Normies don't have 512 GB of unified memory.
The only thing worse than slow generation is slow prompt processing. And at least windows can run way more AI/ML stuff if you're into that. Can't say I'm jealous tbh.
This is the main reason I got a MBP 128GB... well, that & mobile video editing. I say this as a long-time Linux user. I still miss Linux as a daily driver, but can't argue with the local model capability of this laptop.
This can't be serious right? This can't be true. Is it because of the bottlenecks related to using multiple GPUs? Is there something else I'm missing? GDDR6/7 VRAM is so much higher speed than unified memory, how can macbooks be faster than custom multiGPU setups?
i have yet to decide between a \~10k mac ultra (m5/m3/m1) ? and a custom build. my impression is that "small" models could be a bit faster on a custom build but any "larger" model will quickly fall behind because a 10k GPU based build just won't be able to hold it proerly. educate me.
If you're looking at 10K you're close to affording a RTX Pro 6000, which will demolish any Mac by about 10x for any model that fits into 96GB VRAM
But if you overflow that 96GB it can fall down as far as 1/4 as fast, limited by the PCIe bandwidth
If you're into gaming the pro 6000 is also the fastest gaming gpu on earth, so there's that
thanks for the input - ok, so why should it be "10x" faster for smaller models? i'm thinking RTX Pro 6000 1.6TiB/sec mem bandwidth vs 0.8 TiB/sec on a Mac Studio Ultra should be about 2x. what am i missing?
Mac Studio does not support data types smaller than 16 bit. By running 8 bit quants you can double your effective bandwidth and memory capacity on nvidia cards, and if you're OK with losing some output quality a 4 bit quant increases it another 2x
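For the capacity side of this, a quick sketch of how bits per weight translate into memory needed. The 10% `headroom` reserved for KV cache and runtime overhead is an arbitrary assumption of mine, real quant formats carry some extra metadata, and this says nothing about which data types a given chip can actually compute in. The 80 GB pool is the 4x 20GB build mentioned earlier in the thread.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         headroom: float = 0.9) -> bool:
    """True if the weights fit in `headroom` of the memory pool (assumed 10% reserve)."""
    return weights_gb(params_billion, bits_per_weight) <= memory_gb * headroom


for bits in (16, 8, 4):
    size = weights_gb(120, bits)
    print(f"120B at {bits}-bit: {size:.0f} GB | "
          f"80 GB pool: {fits(120, bits, 80)} | "
          f"96 GB card: {fits(120, bits, 96)} | "
          f"512 GB unified: {fits(120, bits, 512)}")
```

Each halving of the bit width halves both the footprint and the bytes read per token, which is where the "double your effective bandwidth and capacity" framing comes from.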
It depends on what you consider to be a larger model.
Because yes, 9.5k Mac Ultra M3 has 512GB shared memory and nothing comes close to it at this price point. It's arguably the cheapest way to actually load stuff like Qwen3 480B, Deepseek and the likes.
But the problem is that the larger the model and the more context you put in the slower it goes. M3 Ultra has 800GB/s bandwidth which is decent but you are also loading a giant model. So, for instance, I probably wouldn't use it for live coding assistance.
On the other hand at 10k budget there's 72GB RTX 5500 or you are around a 1000 off from a complete PC with 96GB RTX Pro 6000. The latter is 1.8TB/s and also processes tokens much faster. It won't fit largest models but it will let you use 80-120B models with a large context window at a very good speed.
So it depends on your use case. If it's more of a "make a question and wait for the response" then Mac Studio makes a lot of sense as it does let you load the best model. But if you want live interactions (eg. code assistance, autocomplete etc) then I would prefer to go for a GeForce and a smaller model but at higher speed.
Imho if you really want a Mac Studio with this kind of hardware I would wait until the M5 Ultra is out too. Because it should have like 1.2-1.3TB/s memory bandwidth (based on the fact that the base M5 beats the base M4 by like 30% and Max/Ultra is just a scaled up version) and at that point you just might have both the capacity and the speed to take advantage of it.
>It's arguably the cheapest way to actually load stuff like Qwen3 480B, Deepseek and the likes.
It's the cheapest *reasonable* way to do it.
The actual cheapest way to do it is to pick up a used Xeon Scalable server (eg Dell R740) and stick 768GB of DDR4 in it. You get 6 memory channels for ~130GB/s bandwidth per cpu, and up to 4 CPUs per node, for an all out cost of barely $2000 (most of that being for the RAM, the cpus are less than $50). You can even put GPUs in them to run small high speed subagent models in parallel, or upgrade to as much as 6TB of RAM.
The primary downside is it will sound like 10 vacuum cleaners having an argument with 6 hairdryers.
They are super cheap right now because they are right around the age where the hyperscalers liquidate them to upgrade. Pretty soon they will probably start rising again if the AI frenzy keeps going
> any "larger" model will quickly fall behind because a 10k GPU based build just won't be able to hold it proerly [sic]
Based on this sentence alone I recommend *not* trying to understand screwdrivers and instead just buy the nice shiny Apple box. Plug in. Go brrr.
One day OpenAI will do a public tour of their datacenter and we'll realize it's been super-intelligent monkeys doing math problems on iPhones all along
Only thing I learned from this thread is that nobody knows what they're talking about according to somebody else, and that the old Mac vs. PC (or in this case, GPU) wars are still very much alive and kicking. lol
It's similar to doing a month of research to find the best android camera only for people around you to prefer their iphones for photos because they're more Instagram friendly.
Honestly, I don't see the issue with running local on Mac at all. The machines happen to be almost purpose-built to run inference.
Everyone started at zero two years ago with this stuff, and really, AI is the only true expert at AI.
Have the biggest rig on the block, or a Camry running locally on a Mini, the end result is local first, local only.
Privacy, sovereignty, some form of digital dignity, and some semblance of control in a disturbingly surveilled world.
Five years from now, they will just sell boxes to deal with it all on our behalf.
But however you slice it, hosting your own isn't easy and isn't cheap. So if anyone can make it work, more power to them.
To quote the immortal words of, well, both east and west coast rappers, "we're all in the same gang".
Really depends on your use case. Macs still cannot do PyTorch development or ComfyUI well enough. And if you wanna do some gaming on the side, it is the golden age for dual GPU builds right now.
https://preview.redd.it/mpgrwbod2f7g1.png?width=1159&format=png&auto=webp&s=3bdeca312d3e4126d2628fc2d3894d7a862925b5
This is the normie one... can't get better than this... only the MX Ultra and Max have more bandwidth, and they don't have nearly as many TOPS in the NPU.
And? Would you train a model on that thing? I guess not.
Most of people here are using local LLMs for inference. Not training.
And most people making actual model training won't use a house workstation anyway...
Yeah, I just sold my 2nd RTX A6000 from my Threadripper LLM Server. My stupid $2k refurbished MacBook Pro M2 Max with 96Gb RAM was fast enough.
While 100+ T/s was cool - 30-40 T/s is still plenty fast enough and a LOT cheaper.
I'm sure it's more complicated than that, but my feeble consumer understanding is that Windows-on-ARM is souring the experience and any mass-appeal that Qualcomm PC's could have - and so we keep getting these ludicrously expensive low-volume laptops that make no sense and a half-assed effort from everyone involved.
I'm far from a "normie" and never once before had bought a single Apple product.
But it is a fact that Apple Silicon is simply the most cost effective way to run LLMs at home, so last year I bit the bullet and got a used Mac Studio M1 Ultra with 128GB on eBay for $2500. One of the best purchases I have ever made: this thing uses less than 100w and runs a 123B dense 6-bit LLM at 5 toks/second (measured 80w peak with asitop).
Just to give an idea of how far Apple is ahead of the competition: the M1 Ultra was released in March 2022 and still provides better LLM inference speed than the Ryzen AI MAX 395+, which was released in 2025. And Ryzen is the only real competition for the "LLM in a small box" hardware. I don't consider these monster machines with 4 RTX 3090s to be competing, as they use many times the amount of power.
I truly hope AMD or Intel can catch up so I can use Linux as my main LLM machine. But it is not looking like it will happen anytime soon, so I will just keep my M1 ultra for the foreseeable future.
Is it? Maybe I'm wrong (please tell me if I am, so I can go and buy a Mac), but everywhere I look people say a macbook isn't that fast for inference with 30b+ models and you'd better use two or more 3090s.
And it's not going to work for tuning at all.
And you can't even connect GPU via thunderbolt, it only works on Intel and AMD.
That's probably because the vram gets overflown and the CPU starts doing the work? In that case mac would really give a better speed just because for the price you can't get as much vram. Otherwise idk, the dedicated gpus are faster
Calamero@reddit
ForsookComparison@reddit (OP)
TokenRingAI@reddit
ImJacksLackOfBeetus@reddit
ForsookComparison@reddit (OP)
ImJacksLackOfBeetus@reddit
crazymonezyy@reddit
wh33t@reddit
a_beautiful_rhind@reddit
El_Danger_Badger@reddit
CheatCodesOfLife@reddit
WithoutReason1729@reddit
aeroumbria@reddit
jeffwadsworth@reddit
H0vis@reddit
Rockclimber88@reddit
the-mehsigher@reddit
esamueb32@reddit
holchansg@reddit
Ill_Barber8709@reddit
getmevodka@reddit
holchansg@reddit
getmevodka@reddit
holchansg@reddit
getmevodka@reddit
Ill_Barber8709@reddit
holchansg@reddit
holchansg@reddit
Ill_Barber8709@reddit
holchansg@reddit
Ill_Barber8709@reddit
holchansg@reddit
onetimeiateaburrito@reddit
Bozhark@reddit
CMDR-Bugsbunny@reddit
PotentialFunny7143@reddit
Expensive-Paint-9490@reddit
ForsookComparison@reddit (OP)
SamuelL421@reddit
ForsookComparison@reddit (OP)
ThatCrankyGuy@reddit
ForsookComparison@reddit (OP)
CSharpSauce@reddit
tarruda@reddit
InspirationSrc@reddit
DataGOGO@reddit
Apprehensive-End7926@reddit
ForsookComparison@reddit (OP)
tgwombat@reddit
Denny_Pilot@reddit
JLeonsarmiento@reddit
Not_your_guy_buddy42@reddit
ForsookComparison@reddit (OP)
Wrong-Historian@reddit