Request: Someone with an M4 Macbook Pro Max 64GB

Posted by NEEDMOREVRAM@reddit | LocalLLaMA | View on Reddit | 20 comments

I know this thread is going to get downvoted to hell and back...

I'm trying to decide between the Macbook Pro 48GB and 64GB model.

If you have an M4 Macbook Pro Max with 64GB, can you download the 50GB Q5_K_M model: https://huggingface.co/mradermacher/Llama-3.1-Nemotron-70B-Instruct-HF-i1-GGUF

And let me know what your token and time-to-first token speed are? And can you have a ~8000 token conversation with it to see just how quickly it slows down?

If I could run the Nemotron Q5_K_M quant on a Macbook Pro at even ~4 tokens per second---there would be no reason to spin up the noisy and electricity guzzling AI server in the home office.

Thanks and I give you good karma thoughts for taking the time from your busy day to help out.

[-]

ChimataNoKami@reddit

I don’t want to download nemotron but I get 8tps on an M2 Max running lama 3.1 70b (40GB)

[-]

Guilty-Support-584@reddit

Hey, how many GB RAM does your MacBook have?

[-]

Guilty-Support-584@reddit

What context window size do you use for it to get 8 tokens per second?

[-]

Retnik@reddit

Backend: Koboldcpp the latest Frontend: Silly Tavern

At 8k context it took up 61gb of memory. KV cache actually didn't reduce it by much, down to 60gb. This might not fit on a 64gb. The system is using just under 11gb without anything loaded. I am pretty new to Macs, so maybe someone who knows might be able to squeeze it in a 64gb?

On an existing chat with 8k context filled: First generation with full context took 158 seconds (1.87 t/s). With context shift, the next generation took 38 seconds (6.22 t/s).

So context shift helps so much to make this a pleasant experience.

On a new chat: It took 88 seconds to generate (5.51 t/s).

Looks like I get mid 5-6 t/s generation speeds outside of prompt processing. Hope this was what you wanted to know!

[-]

NEEDMOREVRAM@reddit (OP)

Thanks. I don't know what to do. Save money and get the 48GB model and then hope the M5 Macbook Pro is better. Or just get the 64GB now and accept the speeds as is.

[-]

chibop1@reddit

Running the model you mentioned in q5_K_M on my m3Max 64GB, I get 58.13 tokens per second for prompt processing and 4.91 tokens per second for generation.

Honestly 5tk/s is not too bad.

Also I can load maximum 26k context length with flash attention. You can stretch to 27k if you don't mind computer freezing during inference. lol

% ./llama-cli -m ../models/Llama-3.1-Nemotron-70B-Instruct-HF.i1-Q5_K_M.gguf -c 25000 -n 1000 --temp 0.1 --top_p 0.9 --seed 1234 -fa -f ../text/llama-prompt.txt
llm_load_tensors: offloaded 81/81 layers to GPU
generate: n_ctx = 25088, n_batch = 2048, n_predict = 1000, n_keep = 1
......
llama_perf_sampler_print:    sampling time =      21.08 ms /  7978 runs   (    0.00 ms per token, 378498.91 tokens per second)
llama_perf_context_print:        load time =    1995.41 ms
llama_perf_context_print: prompt eval time =  129188.83 ms /  7510 tokens (   17.20 ms per token,    58.13 tokens per second)
llama_perf_context_print:        eval time =   95149.24 ms /   467 runs   (  203.75 ms per token,     4.91 tokens per second)
llama_perf_context_print:       total time =  224391.01 ms /  7977 tokens

[-]

NEEDMOREVRAM@reddit (OP)

lol thank you. Ok, yeah 5tk/s is halfway decent for my needs.

[-]

Retnik@reddit

Honestly, once it loads the context, even 3-4 t/s is pretty usable. I've been running Mistral Large Q5 at about 3.8 t/s and it's been a blast. But not everyone has the same patience as I do. It really shines when using MOE models, I'm just not sure if that's going to be a thing in the future over dense models.

Basically, it's really slow loading the prompt the first time. Context shifting really makes that irrelevant, unless you have to change the chat you are running.

Oh, and the thing is super quiet and cool. Hammering the little thing with 118gb used, it was only just warm to the touch. That's really where it shines.

And on a completely different side note for those reading this in the future, trying to decide if the next Mac is worth it, it runs games like Total War Warhammer 3 at an easy 90+ fps with settings maxed out on 1080p resolution.

[-]

chibop1@reddit

What was 1.87 t/s for? Was it typo? Do you have Flash Attention enabled?

I have m3 (not m4) 65GB, and I get 58.13tk/s for prompt processing and 4.91 for generation. Using 70b-nemotron-q5_K_M model, I can also pack up to 25k context length in 64GB with flash attention.

I wonder what speed bump you would get with flash attention if it wasn't enabled when you tested. Here's my log:

% ./llama-cli -m ../models/Llama-3.1-Nemotron-70B-Instruct-HF.i1-Q5_K_M.gguf -c 25000 -n 1000 --temp 0.1 --top_p 0.9 --seed 1234 -fa -f ../text/llama-prompt.txt
llm_load_tensors: offloaded 81/81 layers to GPU
generate: n_ctx = 25088, n_batch = 2048, n_predict = 1000, n_keep = 1
......
llama_perf_sampler_print:    sampling time =      21.08 ms /  7978 runs   (    0.00 ms per token, 378498.91 tokens per second)
llama_perf_context_print:        load time =    1995.41 ms
llama_perf_context_print: prompt eval time =  129188.83 ms /  7510 tokens (   17.20 ms per token,    58.13 tokens per second)
llama_perf_context_print:        eval time =   95149.24 ms /   467 runs   (  203.75 ms per token,     4.91 tokens per second)
llama_perf_context_print:       total time =  224391.01 ms /  7977 tokens

[-]

epigen01@reddit

Did it run cool or were the fans roaring?

[-]

Retnik@reddit

Fans were going pretty good, but it's much quieter than even a single 4090. When I lifted the mac from the desk, the desk underneath it was barely warm. I used to own a gaming laptop, that thing felt like the surface of the sun after I was done with it.

[-]

NEEDMOREVRAM@reddit (OP)

Ok, thanks. Was wondering if 50GB was too big for the 64GB model and if it will slow it down.

[-]

ChimataNoKami@reddit

You can run 50gb models but you need to bump the macOS user space vram limit and tune down the context size

[-]

NEEDMOREVRAM@reddit (OP)

and tune down the context size

How low? Like 4096?

[-]

ChimataNoKami@reddit

I don’t remember. You can also use KV quantization

[-]

bobby-chan@reddit

Performance of llama.cpp on Apple Silicon M-series #4167