Another how much ram should i get on an M4 Max MBP thread
Posted by flying_unicorn@reddit | LocalLLaMA | 42 comments
I'm gonna get an M4 Max MBP and I'm just debating how much RAM: 64GB or 128GB, leaning toward 64GB. My thought process is that any models I could only run with 128GB will just be too slow, and thus pointless, so 64GB is probably the sweet spot. Would you all agree?
As far as my use case: I play around with local LLMs a bit as a fun hobby, and I use them a bit to help me with my small business on the RTX 3080 Ti in my gaming rig. Cloud LLMs are an issue due to the sensitive nature of my work, but it also doesn't make sense for me to spend 10k on a dedicated LLM rig... yet.
I'm not interested in waiting for the m4 ultra mac studio.
NEEDMOREVRAM@reddit
+1
koalfied-coder@reddit
48gb to me is the best play. I purchased the 128 but I'ma return it. Either I'm getting a 48gb model or an air.
NEEDMOREVRAM@reddit
Why not the 64GB? Can you run Qwen 2.5 32b coder?
koalfied-coder@reddit
48GB has enough headroom for that. For me 48 is the sweet spot so I can code in JetBrains and run a smallish LLM. By the time I pay to upgrade to 64 it's just not worth it to me; I would sooner get a MacBook Air and a 3090 machine to offload inference and maybe training. Currently on my 128GB I only use about 24-36GB on the regular for work, so for me it's not worth it. If I was on an island with only one machine and no cloud options, then I would pay extra for RAM and all that. Btw the MacBook Air is pretty great for most everything. However if going M4, the 14" Pro is more compelling to me.
NEEDMOREVRAM@reddit
Well, if I can afford it I will probably get the 64GB M4. But will most likely get the 48GB.
I see that you're a coder...I'm considering starting to learn Python. Do you think it's too late now that AI can code?
MaxDPS@reddit
It’s not too late. At the very least, knowing how to code will help you write better prompts.
NEEDMOREVRAM@reddit
Why not the 64GB? Isn't the Pro less powerful for running LLMs?
Vegetable_Sun_9225@reddit
128, no question. My work MacBook is 64, my personal is 128, and I get annoyed by what I can't do with the work machine. Remember long context takes a lot of RAM, so even if you're using smaller models you can eat up the memory real fast with long context.
VibrantOcean@reddit
Would you say 128gb is worth 50% extra cost vs 96gb? (Mac Studio)
Vegetable_Sun_9225@reddit
Can you provide links to the two things you're comparing? Are you comparing the Max with the Ultra? 'Cause the M2 Ultra is crazy fast. The Ultra's memory bandwidth is like double.
VibrantOcean@reddit
It was Studio Max 64gb vs Studio Ultra 128gb on the Apple Configurator
Vegetable_Sun_9225@reddit
Yeah, you're getting a lot more than extra memory with the Ultra. The Max has 400 GB/s, the Ultra has 800 GB/s. LLMs are memory bandwidth bound, so your inference speed will be double. Definitely worth it to pay the extra for the Ultra.
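A rough sketch of that arithmetic: for single-stream decoding, (roughly) the whole set of active weights is streamed from memory for every generated token, so bandwidth puts a ceiling on tokens per second. The bandwidth figures are the ones quoted above; the model size is an assumption for illustration.

```python
# Rough ceiling on single-stream decode speed for a memory-bandwidth-bound LLM:
# tokens/s <= bandwidth / bytes_of_weights_read_per_token.
# Figures below are illustrative, not measurements.

def decode_ceiling_tps(model_size_gb: float, bandwidth_gbs: float) -> float:
    """Upper bound on tokens/second, ignoring compute and KV-cache traffic."""
    return bandwidth_gbs / model_size_gb

model_size_gb = 40.0  # e.g. a ~70B model at a ~4-5 bit quant (assumption)

for name, bw in [("Max (~400 GB/s)", 400.0), ("Ultra (~800 GB/s)", 800.0)]:
    print(f"{name}: <= {decode_ceiling_tps(model_size_gb, bw):.1f} tok/s")
```

Real-world numbers come in well below this ceiling, but the 2x ratio between the two chips carries over.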
VibrantOcean@reddit
Didn’t think about that. Good point. And I assume at these speeds every token per second will be a big QOL improvement.
koalfied-coder@reddit
Not for LLMs; a GPU cluster is a much better value. You can get a MacBook Air for the rest and be straight.
Strange-History7511@reddit
Yes
zra184@reddit
I got the 128GB, no regrets. However I forgot to increase the SSD size, wish I would’ve gone up to 2TB so I could fit more than a few 70b+ models.
flying_unicorn@reddit (OP)
Yeah, I'm thinking 2TB is the sweet spot for me. For anything else I'll use a cloud drive or an external SSD over Thunderbolt.
rythmyouth@reddit
At 2TB wishing I got 4
koalfied-coder@reddit
Umm, for your use case a Mac isn't great, coming from an M4 Max owner. 2-4 3090s/A5000s are about 5-10x faster when accounting for processing. While my Max is "usable," I wouldn't say it's pleasant. I would sooner get a GPU machine and a MacBook Air than a Max for LLMs.
flying_unicorn@reddit (OP)
Due to Apple limitations and the way I want to use an MBP, I'm stuck getting a Max anyway. I figure I'll use it for LLMs until I get to a point where building a dedicated LLM rig makes sense. I have a dual 4K display which is basically treated as 2x 4K displays, and I have 4 external displays at my desk off my docking station.
koalfied-coder@reddit
However, if buying a Mac, 64GB seems like the smart play, if not 48GB. That would allow Llama 3.1 8B to run quite nicely.
CompetitiveYak5863@reddit
I purchased the 128GB model. The main advantage I've found is it can run Wizard 8x22B from VRAM with good performance. MoE models are uniquely suited to the mix of high VRAM vs more limited compute you get.
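A rough sketch of why that works out, using approximate public parameter counts for an 8x22B-class MoE and an assumed quantization level:

```python
# Why MoE fits the "lots of RAM, modest compute" profile: all experts must be
# resident in memory, but only the routed experts run per token. Counts are
# approximate public figures for an 8x22B-class MoE; the quant is an assumption.

total_params_b  = 141   # ~141B parameters resident in memory
active_params_b = 39    # ~39B parameters actually used per token (2 of 8 experts)
bytes_per_param = 0.55  # ~4.4 bits/weight, a mid-size GGUF quant (assumption)

resident_gb  = total_params_b * bytes_per_param
per_token_gb = active_params_b * bytes_per_param

print(f"memory needed to hold the model: ~{resident_gb:.0f} GB")
print(f"weights touched per token:       ~{per_token_gb:.0f} GB")
# On a 128 GB Mac the ~78 GB of resident weights fit comfortably, while decode
# speed is governed by the ~21 GB of active weights streamed per token.
```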
koalfied-coder@reddit
Please, what are your tokens a second with a large context window and prompt processing?
CompetitiveYak5863@reddit
I asked it to summarize the Declaration of Independence:
Processing Prompt [BLAS] (1906 / 1906 tokens)
Generating (281 / 512 tokens)
CtxLimit:2187/4096, Amt:281/512, Init:0.01s, Process:20.61s (10.8ms/T = 92.47T/s), Generate:21.38s (76.1ms/T = 13.15T/s), Total:41.99s (6.69T/s)
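For what it's worth, those rates follow directly from the token counts and timings in the log line; a quick recomputation using the numbers as printed:

```python
# Recomputing the rates from the log line above.
prompt_tokens, process_s = 1906, 20.61
gen_tokens, generate_s   = 281, 21.38
total_s                  = 41.99

print(f"prompt processing: {prompt_tokens / process_s:.2f} tok/s")  # ~92.5
print(f"generation:        {gen_tokens / generate_s:.2f} tok/s")    # ~13.1
print(f"overall:           {gen_tokens / total_s:.2f} tok/s")       # ~6.7
```

The headline 6.69 T/s is the overall figure including prompt processing; pure generation is closer to 13 T/s.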
koalfied-coder@reddit
I mean 6.6 tokens a second isn't that bad. Assuming your machine is multi use it's a good value proposition.
GimmePanties@reddit
Declaration of Independence is like 1776 tokens so hardly a stretch
koalfied-coder@reddit
Exactly that's terrible now that you mention the context size
GimmePanties@reddit
Yeah, I have the 64GB, and I don't find myself using the larger models because life is too short for 6 t/s.
Life_Tea_511@reddit
how much did you pay for that?
davewolfs@reddit
Save your money seriously. Just get 48gb and skip the max if you want it purely for LLM.
TooCasToo@reddit
Same here.. m4max(40) 128G 4TB....
Kinda wish it was 256G....
More memory is Gooder and bestest. :)
fallingdowndizzyvr@reddit
Neither. Spend a little more and get a 192GB M2 Ultra.
koalfied-coder@reddit
https://github.com/ggerganov/llama.cpp/discussions/4167 This is a speed comparison.
koalfied-coder@reddit
TL;DR: Mac is slow AF. I'm selling my 128GB Mac to get a 48GB for general development work and smaller models.
SniperDuty@reddit
128GB - you won't regret it.
I'm running Llama-3.1-Nemotron-70B-Instruct-HF-abliterated-i1-GGUF at 6 tokens per second at around 100GB memory usage. On paper it said it should be much less than this. Also, MPS is kicking CUDA's ass on image gen and audio gen.
32b runs at about 18-20 tokens per second on full power mode.
koalfied-coder@reddit
This is not a true comparison. If the Mac has 5x the available VRAM, of course it will do better. Compare it VRAM to VRAM and I would bet NVIDIA claps it. My 4x A5000s are easily 20x as fast as my M4 Max with 128GB RAM.
chibop1@reddit
You can increase the max GPU memory limit to 56GB and run a 72B model, but it's pretty tight. If you want to keep other apps running while also running a 72B+ model with long context, definitely get 128GB.
datbackup@reddit
Definitely 128GB. It’s context size rather than number of parameters that will increasingly become more valuable. And more context requires more RAM.
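A rough sketch of that point: the KV cache grows linearly with context, on top of the weights. The dimensions below are assumptions for a Llama-70B-class model with grouped-query attention and an fp16 cache.

```python
# KV cache grows linearly with context; rough numbers for a Llama-70B-class
# model with grouped-query attention (dimensions below are assumptions).

layers, kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2  # fp16 cache, no cache quantization assumed

per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

for ctx in (4_096, 32_768, 131_072):
    gb = per_token_bytes * ctx / 1024**3
    print(f"{ctx:>7} tokens of context -> ~{gb:.1f} GB of KV cache")
# ~1.3 GB at 4k, ~10 GB at 32k, ~40 GB at 128k -- all on top of the weights,
# which is why long context pushes you toward the 128GB configuration.
```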
JacketHistorical2321@reddit
The most you can afford
ttkciar@reddit
If you can afford 128GB, get 128GB.
If you don't, you will eventually run up against a task which needs it, and will feel regretful and sad.
My T7910 has 256GB, and I have on occasion wished for a little more.
Dry_Parfait2606@reddit
Running an LLM at 1-2 t/s is still a real thing... If you get more precision out of it, I would say it's worth having the spare RAM...
If it doesn't rip a hole in your pocket, get more performance... If you know you're happy with models that fit in 64GB and you don't plan to test and play around with different models, then go for less RAM.
It all depends on the use case... My mindset was: give me the quality of ChatGPT-3 and I'm happy... I'm now more eager to stay open to what's coming in the future...
vert1s@reddit
I have a 96GB M2 Max and I don't find the large models too bad. You only get about 80% of that for the GPU by default, so you can run a 70B model at q8/q6 at about 2 tokens a second, or a 100-120B model at a lower quant at about the same speed.
64GB is going to limit you to the ~30B models, which might be fine for your needs.
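A quick sanity check on those fits; the ~80% usable fraction is the rough default mentioned above (it can be raised, as noted earlier in the thread), and the sizes are approximate, ignoring GGUF overhead and KV cache.

```python
# Rough check of what fits: weight size ~= parameters * bits / 8, and by default
# macOS only lets the GPU wire roughly 75-80% of unified memory (assumption;
# the exact fraction varies by machine and can be raised).

def model_gb(params_b: float, bits: float) -> float:
    """Approximate weight size in GB for params_b billion parameters."""
    return params_b * bits / 8

ram_gb, usable_frac = 96, 0.80
usable = ram_gb * usable_frac  # ~77 GB available to the GPU

print(f"usable GPU memory: ~{usable:.0f} GB")
print(f"70B at q8: ~{model_gb(70, 8):.0f} GB  (fits, little room for context)")
print(f"70B at q6: ~{model_gb(70, 6):.0f} GB")
print(f"30B at q8 on a 64 GB machine: ~{model_gb(30, 8):.0f} GB")
```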