Mac Studio M3 Ultra 96GB useless?
Posted by Fluxx1001@reddit | LocalLLaMA | 7 comments
I am thinking of buying a used M3 Ultra 96GB from a friend for a reasonable price. However, 96GB doesn't seem like a natural fit for current LLM models.
For models around 70B, it looks like 128GB would be the better choice.
For smaller models around 20-30B, 96GB looks like overkill.
Should I go with it or look for an M3 Ultra or M5 Max with at least 128GB?
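Rough napkin math I'm using to compare (Python sketch; the 20% overhead factor and quant levels are my own illustrative assumptions, not benchmarks):

```python
# Back-of-envelope weight-memory estimate for a dense model.
# Assumptions (mine, not benchmarks): bytes per param = bits / 8,
# plus ~20% headroom for KV cache, activations, and runtime overhead.
def est_gb(params_b: float, bits: float, overhead: float = 1.2) -> float:
    weights_gb = params_b * bits / 8   # e.g. 70B at 4-bit -> ~35 GB of weights
    return weights_gb * overhead

for params, bits in [(70, 4), (70, 8), (32, 8), (27, 16)]:
    print(f"{params}B @ {bits}-bit: ~{est_gb(params, bits):.0f} GB")
```

By that math a 70B at 4-bit (~42GB) fits 96GB with room to spare, and even 8-bit (~84GB) squeaks in, so maybe it's not as awkward a size as I thought.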
Prudent_Sentence@reddit
The M5's AI accelerator blocks on its GPU would allow it to run circles around an M3 Ultra. I'm personally waiting for a 128GB M5 Max or Ultra Studio to appear.
Safe_Sky7358@reddit
Yeah, 128 is about the sweet spot; running models bigger than that is gonna be like watching a snail crawl lol
inthesearchof@reddit
What is a reasonable price? The new Mac Studio with an M5 Ultra will be a huge jump for AI compared to the M3 Ultra.
Direct_Turn_1484@reddit
There are a lot of models around 60 to 80 GB. That leaves room for some context too. Just don't expect it to run inference as fast as H100s or anything.
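For a sense of how much memory "some context" actually eats, here's a rough KV-cache sketch (the layer/head counts are illustrative placeholders for a ~70B-class dense model with grouped-query attention, not any specific model's specs):

```python
# Rough KV-cache sizing: 2 tensors (K and V) per layer, per token.
# Layer/head numbers are illustrative placeholders for a ~70B-class
# dense model with GQA; fp16 cache assumed (2 bytes per element).
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per   # bytes per token
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{per_token * ctx / 2**30:.1f} GiB of KV cache")
```

So a ~70GB model plus 32k of context is roughly 80GB total, which is exactly the kind of fit 96GB is good for.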
TokenRingAI@reddit
The best models in that size range right now are Qwen 122B and Qwen Coder Next, and you should be able to run them at 4-bit or 6-bit respectively on that hardware.
I have an RTX 6000 with 96GB VRAM, and as of 6 months ago, the models that fit in 96GB were barely capable of doing good work, but now I feel like 96GB is quite good, and in the next 6 months it will only get better.
GroundbreakingMall54@reddit
96GB is actually a sweet spot if you're not obsessed with running the absolute biggest models. Qwen3 32B at Q8 fits comfortably and honestly performs better than most 70B quants that barely squeeze into 128GB. Also Gemma 4 27B runs great on it. The M3 Ultra bandwidth is nuts for inference, so you'll get really solid tok/s.
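Quick napkin math on why the bandwidth matters: decode is roughly memory-bound, so a theoretical ceiling is bandwidth divided by bytes read per token (819 GB/s is Apple's quoted M3 Ultra figure; the model sizes are illustrative guesses):

```python
# Naive bandwidth-bound decode estimate: generating one token streams
# every active weight byte once, so tok/s <= bandwidth / bytes moved.
# 819 GB/s is Apple's quoted M3 Ultra figure; sizes are illustrative.
bandwidth_gbs = 819
models = [("32B dense @ 8-bit", 34),
          ("70B dense @ 4-bit", 40),
          ("~10B-active MoE @ 5-bit", 7)]
for name, gb_per_token in models:
    print(f"{name}: ceiling ~{bandwidth_gbs / gb_per_token:.0f} tok/s")
```

Real-world numbers land below those ceilings, but it shows why MoE models with a small active parameter count fly on this hardware.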
Makers7886@reddit
People keep saying things like this, but you just need to find what fits. IMO the best model for that VRAM would be this: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF in the IQ5_KS 77.341 GiB (5.441 BPW) flavor. If you need throughput/concurrency, then I'd probably be testing vLLM/SGLang and Qwen3.5 27B in FP16 with maximum unquantized context and see how that does.
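Quick fit check for that quant on a 96GB machine (assuming macOS's default ~75% GPU wired-memory cap, which I believe can be raised via the iogpu.wired_limit_mb sysctl):

```python
# Fit check for the 77.341 GiB IQ5_KS quant on a 96 GB Mac.
# Assumption: macOS caps GPU-wired memory at ~75% of RAM by default
# (tunable via `sudo sysctl iogpu.wired_limit_mb=...`).
total_gib = 96
model_gib = 77.341                   # GGUF size from the repo above
default_budget = total_gib * 0.75    # ~72 GiB usable out of the box
print(model_gib <= default_budget)   # False -> need to raise the wired limit
print(model_gib <= total_gib - 10)   # True -> fits with ~10 GiB kept for the OS
```

So it fits, but you'd want to bump the wired limit and accept that only ~8-9 GiB is left over for KV cache.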