Optimizing M2 Max 96GB for LLMs
Posted by No_Algae1753@reddit | LocalLLaMA | 19 comments
Hey everyone,
I'm the happy owner of a MacBook Pro M2 Max with 96GB of unified memory. I mostly use it for local LLM deployment, and it has been running pretty well so far. However, I feel like I might be missing some optimizations to get the most out of it.
My current setup:
- Backend: LM Studio (I know running llama.cpp via terminal might save a bit of RAM, but I really prefer the LM Studio interface and its ease of use)
My issues:
- I've noticed that Open WebUI gets increasingly slow as the context grows. Checking the LM Studio logs, it looks like the entire chat history is re-processed with every new prompt. Is there a way to prevent this? (There's a rough sketch of what I think the llama-server workaround looks like just below this list.)
- Is there a way to run macOS with less RAM headroom to free up more memory for the model? I've already increased the VRAM allocation from 75 GB to 93 GB in the settings (the sysctl I believe is behind that setting is sketched at the end of the post).
- Is there any way to prune the KV cache? For example, if I start a new chat in OpenCode/Open WebUI, the KV cache from the new conversation seems to just get added on top of the old one, so it keeps growing. Also, I was wondering why OpenCode is so much faster at long contexts than Open WebUI.
- One last thing: I don't know if this is my charger's fault, but the battery seems to drain even while I'm charging over MagSafe with a 140 W charger (a third-party one, not an original Apple charger with a MagSafe 3 cable). The charger sometimes draws more than 120 W, and I've seen it hit 140 W, but at other times the Mac is stuck at just 93 W and the battery drains.
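For the re-processing issue, my understanding (untested, and LM Studio's API may differ) is that llama.cpp's own llama-server can keep the KV cache between requests and only prompt-process the new suffix of the conversation, roughly like this:

```python
# Rough sketch against llama-server's native /completion endpoint (not LM Studio's API).
# With cache_prompt the server reuses the KV cache from the previous request, so only
# the newly appended part of the conversation gets prompt-processed each turn.
import requests

SERVER = "http://localhost:8080"   # assumes llama-server is running locally

history = (
    "### System: You are a helpful assistant.\n"
    "### User: hi\n### Assistant: Hello!\n"
)
new_turn = "### User: summarize our chat so far\n### Assistant:"

resp = requests.post(f"{SERVER}/completion", json={
    "prompt": history + new_turn,  # full text is sent, but the cached prefix is skipped
    "cache_prompt": True,          # reuse the KV cache built up by earlier requests
    "n_predict": 256,
})
print(resp.json()["content"])
```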
Are there any other optimizations or settings I should tweak?
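For anyone doing the VRAM bump outside of LM Studio's settings UI, my assumption is that the underlying knob is the iogpu.wired_limit_mb sysctl. A rough sketch (needs sudo, resets on reboot):

```python
# Rough sketch: compute a GPU wired-memory limit that leaves a few GB for macOS,
# then apply it with sysctl. Assumption: iogpu.wired_limit_mb is the knob behind
# LM Studio's "VRAM allocation" slider on Apple Silicon.
import subprocess

TOTAL_GB = 96
HEADROOM_GB = 3            # leave ~3 GB for macOS and background apps -- tune to taste
limit_mb = (TOTAL_GB - HEADROOM_GB) * 1024

subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={limit_mb}"], check=True)
print(f"GPU wired limit set to {limit_mb} MB until the next reboot")
```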
xFengle@reddit
Hi, slightly off topic, but how is the thermal throttling / running temperature / noise on your 96GB MacBook (I assume it's the 16-inch with 38 GPU cores)? I currently have a chance to buy a used MacBook M2 Max 96GB for LLMs. But I'm not just chatting with one model; I'm planning to run agentic workflows like a factory, sustained for long hours, up to 24/7. Also a bit of gaming when relaxing. So I'm a bit worried about this laptop's thermal behaviour. Thanks!
No_Algae1753@reddit (OP)
I was worried too. For some reason Apple doesn't really throttle the GPU, and the fans barely spin up even at 110°C. I installed Macs Fan Control and set it to watch the GPU cluster sensor: when it reaches 80°C the fans go to max. It gets noisy when you run an LLM, but at least the GPU doesn't reach 95°C+. So no, you won't be thermally throttled.
xFengle@reddit
Thanks for your insight. Wow. I guess both ways (noise vs. heat) are fine as long as it doesn't run for too long, but neither is ideal for a few hours of sustained sessions, let alone 24/7. For that reason I'm looking to buy a Mac Studio; however, in my country used ones are impossible to find, while the new models are expensive and you have to wait a few months 😅
No_Algae1753@reddit (OP)
Well, I do sometimes run LLMs for longer, for example for agentic tasks. As long as the fans are spinning, heat isn't really a problem. Imo anything under 95°C is "fine".
Kuane@reddit
Try omlx
It is amazing for Apple Silicon: https://github.com/jundot/omlx
No_Algae1753@reddit (OP)
What makes it so special? Why use this over LM Studio?
Kuane@reddit
It has prompt caching, which is very useful for large contexts with big injected prompts. And the developer actually uses Apple Silicon, development is super active, and bugs get fixed really fast.
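If you want to see roughly how that works under the hood, the mlx_lm Python package exposes the same idea. Sketch from memory (I don't know whether omlx wraps exactly this API, so check the docs):

```python
# Rough sketch with the mlx_lm package (not necessarily how omlx does it internally).
# The prompt cache keeps the KV state, so a big shared prefix is only prefilled once.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")  # placeholder repo
cache = make_prompt_cache(model)

# First turn: the long shared prefix gets processed and stored in the cache.
generate(model, tokenizer,
         prompt="You are a helpful assistant.\nUser: hi\nAssistant:",
         prompt_cache=cache, max_tokens=64, verbose=True)

# Later turns: pass only the new text; the cached tokens are not re-processed.
generate(model, tokenizer,
         prompt="\nUser: now answer in French\nAssistant:",
         prompt_cache=cache, max_tokens=64, verbose=True)
```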
No_Algae1753@reddit (OP)
I installed it. Do you know if omlx supports the GGUF format? And what are, in your opinion, the recommended settings to run models?
Kuane@reddit
It only runs MLX. Download an MLX model.
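Easiest is to grab one of the mlx-community uploads on Hugging Face. If you want to quantize something yourself, the mlx_lm package can convert a HF repo, roughly like this (sketch, double-check the current docs):

```python
# Rough sketch: convert and 4-bit quantize a Hugging Face model to MLX format.
# Repo id and output path are placeholders -- pick your own.
from mlx_lm import convert

convert(
    hf_path="Qwen/Qwen2.5-7B-Instruct",         # source weights on Hugging Face
    mlx_path="./qwen2.5-7b-instruct-mlx-4bit",  # local output directory
    quantize=True,                               # defaults to 4-bit
)
```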
No_Algae1753@reddit (OP)
Hey, I'm currently running it. The t/s is slightly faster. However, for some reason the prompt processing takes ages, even with no context.
No_Algae1753@reddit (OP)
Okay, thanks!
limitedink@reddit
I have the same machine as you. Using LM Studio & omlx, and MLX over GGUF. I'm waiting for MLX + turboquant and the Qwen3.6 models. MoE models seem to run better on Apple Silicon; someone posted about it recently. Get an official charger bro; I have no discharge issues and I'm perma plugged in.
No_Algae1753@reddit (OP)
Which LLMs do you use? Got some tips for me? I'm running Qwen 122B Q4_K_XL.
chicky-poo-pee-paw@reddit
you are going to run out of patience with t/s before you run out of memory
No_Algae1753@reddit (OP)
What do you mean? I'm pretty satisfied with ~20 t/s.
chicky-poo-pee-paw@reddit
What 70+ GB model are you getting 20 t/s with??
No_Algae1753@reddit (OP)
Qwen3.5 122B at Q4_K_XL
chicky-poo-pee-paw@reddit
Nice