Speed on M5 Pro 48 GB
Posted by Overall-Somewhere760@reddit | LocalLLaMA | View on Reddit | 3 comments
Hey guys!
How do you reckon a 30-50B model would run on a 48 GB M5 Pro? I'm waiting for the delivery and a bit curious how well it'll perform.
I haven't used a Mac for inference until now, only Linux with an RTX A5000. On that setup I get really good speeds with Qwen 3.5 35B, GLM 4.7 Air, and Gemma 26B — all around 80-100 tok/s with llama.cpp.
It will be a new architecture for me, so not sure what to expect. I'm guessing the unified memory can't compete with the GPU card speed.
Thanks!
defective@reddit
Check out the models you want to run and how large in GB the entire model is.
For instance, the model here: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
The 8-bit quant of it is 8.54 GB.
Now, take the reported memory bandwidth of the RAM you will be using (Google says 307 GB/s for M5 Pro). Divide this by the size of the model to get the theoretical maximum tokens/s for generation: 307 / 8.54 ≈ 35.9 tokens/s.
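The back-of-the-envelope math above fits in a couple of lines of Python (the 307 GB/s figure is the one quoted above, not an official spec, and real-world speeds will land below this bound):

```python
# Rough upper bound on generation speed for a dense model:
# every generated token streams the whole model through memory once.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q8 is about 8.54 GB; reported M5 Pro bandwidth ~307 GB/s.
print(round(max_tokens_per_sec(307, 8.54), 1))  # ~35.9 tokens/s
```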
If it's a Mixture of Experts, like Gemma 4 or a Qwen or most of the nice newer models, find out the active parameter count, figure out the ratio of active to total parameters, and apply that ratio to the size of the whole model to get the size to divide by.
Example: say Gemma 4 26B A4B, at Q4 we'll say 17 GB in size. 4 billion active parameters out of 26 billion is 4/26, or approximately 0.154 (15.4%). 15.4% of the 17 GB is about 2.6 GB, and 307 GB/s divided by 2.6 GB is about 118 tokens/s.
That math is rougher and less accurate, because "A4B" could be anything between 3.50B and 4.49B, and the same goes for the total parameter count, so there's lots of wiggle room in the ratio. But you get an estimate.
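The MoE version is the same sketch with the model size scaled by the active/total ratio first (the 26B-A4B / 17 GB numbers are the hypothetical example from above, not measured figures):

```python
def moe_max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                           active_params_b: float, total_params_b: float) -> float:
    # Only the active experts are read per generated token, so scale
    # the full model size by the active/total parameter ratio.
    active_size_gb = model_size_gb * (active_params_b / total_params_b)
    return bandwidth_gb_s / active_size_gb

# 26B total, 4B active, ~17 GB at Q4, 307 GB/s bandwidth.
print(round(moe_max_tokens_per_sec(307, 17, 4, 26)))  # ~117 tokens/s
```

The small gap versus the ~118 quoted above is just intermediate rounding (2.6154 GB vs. 2.6 GB of active weights).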
Also there's prompt processing speeds to worry about, but this should get you started.
Overall-Somewhere760@reddit (OP)
Wow, that was really, really helpful. Thank you so much!
LeRobber@reddit
Go look up your exact chip's memory bandwidth and compare it to your old card's.