Which model for 32GB M2 Max?
Posted by segdy@reddit | LocalLLaMA | 16 comments
I'd like to experiment before investing loads of money. I already have a MacBook Pro with 32GB RAM (M2 Max).
Which model would maximize versatility given this hardware?
DeepSeek, Gemma, Qwen? Which model size and quantization?
Focus is mostly on a personal agent (OpenClaw, ZeroClaw, etc.), followed by a lightweight Claude/ChatGPT replacement. Software development is not too important (I may just ask for help writing simple scripts here and there).
BC_MARO@reddit
Start with Qwen2.5 14B or Gemma 2 9B in Q4_K_M; they fit comfortably in 32GB and are solid for tool/agent work. For speed, keep the context smaller and run via llama.cpp + Metal.
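If it helps, this is roughly what I mean with llama-cpp-python (untested sketch; the GGUF filename is a placeholder for whatever Q4_K_M file you actually download):

```python
# Minimal llama-cpp-python setup; n_gpu_layers=-1 offloads all layers
# to the GPU (Metal on Apple Silicon).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-14b-instruct-q4_k_m.gguf",  # placeholder: use your actual file
    n_gpu_layers=-1,  # offload everything to Metal
    n_ctx=8192,       # modest context keeps unified memory free for the OS
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a shell one-liner that counts files in a directory."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```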
thatonereddditor@reddit
Uhhh...who's gonna tell him?
BC_MARO@reddit
32GB on an M2 Max is unified memory, so a 14B in Q4 is totally fine. If you meant 32GB of discrete VRAM, yeah, different story.
thatonereddditor@reddit
Buddy. Qwen 3.6 and Gemma 4 are out.
mycallousedcock@reddit
Run this
https://github.com/AlexsJones/llmfit
michaelkeithduncan@reddit
I really want an Alex Jones llm now
segdy@reddit (OP)
wow that looks amazing!
tmvr@reddit
The two options are Qwen3.6 35B A3B and the dense 3.6 27B. For both, take the largest quant you can fit alongside the context length you need. You will need to try the 27B and see if you are OK with the decode/tg speed though.
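If you'd rather not eyeball the decode speed, a crude timing loop gives you a number to compare (llama-cpp-python assumed; the model path is a placeholder):

```python
# Rough tokens/sec check: time one fixed-length generation and divide.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen3.6-27b-q4_k_m.gguf", n_gpu_layers=-1, n_ctx=4096)  # placeholder path

start = time.perf_counter()
out = llm("Explain unified memory in one paragraph.", max_tokens=200)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```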
MichaelDaza@reddit
I would run Qwen 3.5 9B, disable thinking, and max out the parameters. Adding a knowledge base that aligns with the topics you want to cover does a better job than relying on a larger parameter count. I like to disable thinking because this specific model uses up a lot of resources on it.
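For reference, Qwen3 exposes thinking as a chat-template switch in transformers, and I'd assume the newer releases keep it; a sketch (the model id is a stand-in for whatever you run):

```python
# Disable the <think> block via the chat template, the way Qwen3 documents it;
# assumes newer Qwen releases keep the same enable_thinking switch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # stand-in for the model you actually run
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize my notes on backups."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # skip reasoning tokens; saves time and memory
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```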
chicky-poo-pee-paw@reddit
Gemma 24B MoE, best quant under 24GB.
nrauhauser@reddit
I have a 16GB M1 Pro and I've been using Qwen3.5 for some experiments. It's just not enough machine to really do anything.
There is a 19GB on disk version of GLM4.7 that we've been using with a 24GB RTX 4090. Having 5GB of KV space is tight but doable. Your Mac is going to have similar resources when running.
I think this is all about to change drastically thanks to DeepSeek4 and TurboQuant. There's a pretty solid 4x reduction in KV RAM with TurboQuant and it complements the amazing changes in the latest DeepSeek.
So ... look for a DeepSeek that fits and be aware that the right tooling for running it is going to make a big difference - the model has internal gains, but TurboQuant is built into the harness. It gets to llama.cpp first, but you want something smooth ... maybe the HuggingFace framework, since you're experimenting?
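To see why the KV reduction matters so much here, the back-of-envelope math is layers x KV heads x head dim x context x bytes per element, doubled for K and V. A sketch with made-up architecture numbers (not the real GLM or DeepSeek configs):

```python
# Back-of-envelope KV-cache sizing; the architecture numbers are illustrative.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

fp16 = kv_cache_gb(48, 8, 128, 32768, 2.0)  # ~6 GB at fp16
q4 = kv_cache_gb(48, 8, 128, 32768, 0.5)    # ~1.5 GB at 4-bit: the ~4x win
print(f"fp16 KV: {fp16:.1f} GB, 4-bit KV: {q4:.1f} GB")
```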
former_farmer@reddit
Qwen 3.6 35B-A3B or 27B
Fit_Wheel5471@reddit
Gemma 4
getstackfax@reddit
With 32GB on an M2 Max, I’d treat it as a very solid experiment/local-ready machine, not a “run every huge model” box.
For your use case (personal agent, OpenClaw/ZeroClaw-style workflows, lightweight ChatGPT replacement, simple scripting), I'd start smaller and optimize for responsiveness.
I’d probably test:
- 7B–9B models for fast daily chat/tool use
- 14B-ish models if you want a stronger general assistant
- 20B–30B only if you’re okay with slower responses and tighter memory limits
- quantized models first, not full precision (rough fit math sketched below)
The important thing is matching the model to the job:
- fast local assistant: smaller model
- simple scripts: small/medium coder model
- bigger reasoning/planning: use cloud model when needed
- agent workflow testing: prioritize speed/reliability over max model size
I wouldn’t buy more hardware yet. Use the M2 Max to learn what you actually do locally, where it feels slow, and which tasks still need cloud escalation. Then let that workload decide the upgrade.
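To make the "fits in memory" part concrete, here's the rule-of-thumb arithmetic I use; the headroom numbers are guesses, not measurements:

```python
# Rule of thumb: weights ~= params (billions) * bits / 8 in GB, plus headroom
# for KV cache, the runtime, and macOS itself. All numbers are rough guesses.
def fits_in_unified(params_b, quant_bits, total_gb=32.0, os_reserve_gb=8.0, kv_gb=3.0):
    weights_gb = params_b * quant_bits / 8  # e.g. 14B at 4-bit ~= 7 GB
    return weights_gb + kv_gb <= total_gb - os_reserve_gb

for params, bits in [(9, 4), (14, 4), (27, 4), (35, 4), (70, 4)]:
    verdict = "fits" if fits_in_unified(params, bits) else "tight/no"
    print(f"{params}B @ {bits}-bit: {verdict}")
```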
hotsnot101@reddit
try looking at llamaperf.com for crowd-sourced benchmarks
flockonus@reddit
Echoing the sentiment of every other post here... Qwen3.6 27B. Get the highest quant you can fit, which is likely 4-bit or 5-bit in your case.