Laptop inference speed on Llama 3.3 70B
Posted by siegevjorn@reddit | LocalLLaMA | View on Reddit | 75 comments
Hi, I'd like to start a thread for sharing laptop inference speeds when running llama3.3 70B, just for fun, and as a resource to lay out some baselines for 70B inferencing.
Mine has an AMD Ryzen 7 series CPU with 64GB of DDR5-4800 RAM and an RTX 4070 mobile with 8GB VRAM.
Here are my stats for Ollama:
NAME SIZE PROCESSOR UNTIL
llama3.3:70b 47 GB 84%/16% CPU/GPU 29 seconds from now
total duration: 8m37.784486758s
load duration: 21.44819ms
prompt eval count: 33 token(s)
prompt eval duration: 3.57s
prompt eval rate: 9.24 tokens/s
eval count: 561 token(s)
eval duration: 8m34.191s
eval rate: 1.09 tokens/s
How does your laptop perform?
jacekpc@reddit
I ran this prompt on my E5-2680 v4 CPU with quad-channel memory (512 GB of DDR4 in total).
I only have some ancient GPU in there so that the system POSTs; it was not used by Ollama.
total duration: 12m32.572623411s
load duration: 25.530167ms
prompt eval count: 26 token(s)
prompt eval duration: 9.631s
prompt eval rate: 2.70 tokens/s
eval count: 958 token(s)
eval duration: 12m22.915s
eval rate: 1.29 tokens/s
siegevjorn@reddit (OP)
That's not bad at all, I think.
https://www.intel.com/content/www/us/en/products/sku/91754/intel-xeon-processor-e52680-v4-35m-cache-2-40-ghz/specifications.html
Maximum memory throughput is 76.8 GB/s, which is quite decent.
You should try running DeepSeek V3 with that 512 GB of RAM!
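For anyone curious where those numbers come from, here's a rough back-of-envelope (peak bandwidth ≈ channels × 8 bytes per transfer × transfer rate; this assumes DDR4-2400 across four channels on the Xeon and DDR5-4800 across two on my laptop):

```python
# Peak memory bandwidth, roughly: channels * 8 bytes per transfer * MT/s
def peak_bandwidth_gb_s(channels: int, megatransfers_per_s: int) -> float:
    return channels * 8 * megatransfers_per_s / 1000

print(peak_bandwidth_gb_s(4, 2400))  # E5-2680 v4, quad-channel DDR4-2400 -> 76.8 GB/s
print(peak_bandwidth_gb_s(2, 4800))  # my laptop, dual-channel DDR5-4800 -> also 76.8 GB/s
```

Funny enough, both systems land at the same 76.8 GB/s ceiling, which lines up with the similar eval rates in this thread.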
jacekpc@reddit
Will do. In the meantime I tested another of my PCs (a mini PC) with a Ryzen 5 5600G and 64 GB (dual channel). I got the results below.
total duration: 12m19.013924279s
load duration: 17.715242ms
prompt eval count: 143 token(s)
prompt eval duration: 10.926s
prompt eval rate: 13.09 tokens/s
eval count: 747 token(s)
eval duration: 12m8.068s
eval rate: 1.03 tokens/s
They are not that far off from my workstation (E5-2680 v4).
Prompt:
Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent.
Model
llama3.3 70b Q4_K_M
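For reference, here is a minimal hand-written sketch of the kind of numpy code that prompt asks for (my own illustration, not any model's output), handy for sanity-checking the answers the models produce:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sgd(X, y, lr=0.1, epochs=100, seed=0):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):      # one sample at a time, shuffled each epoch
            error = sigmoid(X[i] @ w + b) - y[i]  # gradient of binary cross-entropy w.r.t. the logit
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# quick check on synthetic, linearly separable data
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = train_logistic_sgd(X, y)
print("train accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```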
drsupermrcool@reddit
Do you have flash attention turned on?
siegevjorn@reddit (OP)
I didn't have it turned on. How do you enable it in Ollama?
Ok_Time806@reddit
Yeah, via an environment variable: OLLAMA_FLASH_ATTENTION=true.
I think there was a PR to make it true by default, but haven't checked recently.
Educational_Gap5867@reddit
Damn, the MacBook may be slow compared to desktop Nvidia cards, but it eats other CPU-bound laptops for dinner. Unfortunately I can't test this; I don't have enough RAM for it. If you're up for testing a 32B I'd be down.
siegevjorn@reddit (OP)
Sure thing. Which 32B do you want to try?
Educational_Gap5867@reddit
Let’s do Qwen Coder?
MrPecunius@reddit
Macbook Pro, binned (12/16) M4 Pro, 48GB, using LM Studio
Qwen2.5-coder-14B-Instruct-MLX-4bit (~7.75GB model size):
- 0.41s to first token, 722 tokens, 27.11 t/s
Qwen2.5-coder-32B-Instruct-GGUF-Q5_K_M (~21.66GB model size):
- 1.32s to first token, 769 tokens, 6.46 t/s
Educational_Gap5867@reddit
That's really nice. I had seen some benchmarks where the MLX improvement was marginal, like 10%, compared to GGUF.
MrPecunius@reddit
There doesn't seem to be a difference with MLX on the M4 (non Pro, which I have in a Mac Mini), while it's a solid 10-15% gain on my now-traded-in M2 Macbook Air.
I haven't done any MLX/GGUF comparisons on the M4 Pro yet.
I'm quite pleased with the performance and the ability to run any reasonable model at usable speeds.
Educational_Gap5867@reddit
Oh damn you were comparing 14B to 32B my bad. I thought you got 30t/s on a 32B model lol 😂
MrPecunius@reddit
Overclocked to approximately lime green on the EM spectrum, maybe. :-D
Ruin-Capable@reddit
Fun fact, green light has approximately the same numerical value for both frequency and wavelength when frequency is measured in THz and wavelength is measured in nm.
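Quick check of the arithmetic: if wavelength in nm and frequency in THz are numerically equal, their common value is the square root of c expressed in nm·THz.

```python
# lambda(nm) * f(THz) = c = 299792458 m/s = 299792.458 nm*THz
c_nm_thz = 299_792.458
x = c_nm_thz ** 0.5
print(f"~{x:.0f} nm at ~{x:.0f} THz")  # ~548 nm / ~548 THz, squarely in the green
```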
siegevjorn@reddit (OP)
Sounds good. Here's my prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
Will follow up with the stats soon.
Educational_Gap5867@reddit
u/siegevjorn have you tried testing with speculative decoding? I don't know if they have speculative decoding in Ollama.
RichNugget@reddit
soon https://github.com/ollama/ollama/pull/8134/
siegevjorn@reddit (OP)
No idea either. Will look into it!
Educational_Gap5867@reddit
total duration: 1m22.526419s
load duration: 27.578958ms
prompt eval count: 45 token(s)
prompt eval duration: 4.972s
prompt eval rate: 9.05 tokens/s
eval count: 738 token(s)
eval duration: 1m17.366s
eval rate: 9.54 tokens/s
This isn't bad; it was like watching someone type really, really fast.
siegevjorn@reddit (OP)
That looks great. Can you share the specs of your MacBook?
Educational_Gap5867@reddit
M4 Pro (12 core) 48GB RAM.
siegevjorn@reddit (OP)
Thanks! It's interesting that M4 Pros have similar speed to the M4 Max MBPs reported here.
brotie@reddit
Nah, I have an M4 Mac and I get a 20-30 t/s response rate from Qwen 2.5 Coder.
siegevjorn@reddit (OP)
Oops, that's my mistake. The M4 Max use case was for llama3 70B. I'll delete my previous comment to avoid confusion.
brotie@reddit
Yeah, that's what I meant with my M4 Max t/s (sorry, autocorrect switched it to "mac"), but the performance boost is noticeable and the GPU is legit. Too bad I can't use it for MSFS, so I still need a 4070 Ti Super 😂
bornsupercharged@reddit
MacBook Pro M4 Max 128GB ram, running models off a Teamgroup MP44 NVMe in a TB4 enclosure.
Model
architecture qwen2
parameters 32.8B
context length 32768
embedding length 5120
quantization Q4_K_M
Prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
total duration: 52.365282792s
load duration: 24.173792ms
prompt eval count: 45 token(s)
prompt eval duration: 9.95s
prompt eval rate: 4.52 tokens/s
eval count: 780 token(s)
eval duration: 42.217s
eval rate: 18.48 tokens/s
NEEDMOREVRAM@reddit
I realize this may be asking a bit much of you...
But are you able to run this: https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q5_K_M
At 4-5 tokens per second? I'm debating between the 64GB model and the 128GB M4 Max.
bornsupercharged@reddit
MacBook Pro M4 Max 128GB ram, running models off a Teamgroup MP44 NVMe in a TB4 enclosure.
Model
architecture llama
parameters 122.6B
context length 131072
embedding length 12288
quantization Q5_K_M
Prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
total duration: 7m19.590320167s
load duration: 13.527167ms
prompt eval count: 21 token(s)
prompt eval duration: 4.335s
prompt eval rate: 4.84 tokens/s
eval count: 1424 token(s)
eval duration: 7m15.239s
eval rate: 3.27 tokens/s
NEEDMOREVRAM@reddit
Wow... so for an 86GB model it's only 3.27 tokens per second? And thank you for taking the time to download it and run the experiment. I was planning on selling my 3090s... but now I'm not so sure. I might just get the 64GB M4 Max and then just run what I can... which might be around a 50GB quant?
bornsupercharged@reddit
It might be slightly higher if I weren't running a ton of other apps (software engineer). I typically use qwen2.5-coder 32B Q4, which I find to be a good trade-off of speed and usability.
The nice thing about the Mac is that the RAM is shared with the GPU, but the M4 Max (up to 546 GB/s) has nowhere near as much bandwidth as a 3090 (up to 936.2 GB/s). Essentially, you can fit larger models in RAM than on a 24GB 3090, but it's slower. If time isn't a huge factor it does work, heh.
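To put that in rough numbers (a back-of-envelope only, assuming token generation is memory-bandwidth-bound and a ~43 GB file for a 70B Q4_K_M):

```python
# Ceiling on generation speed if every token has to stream the whole quantized model:
# eval rate <= memory bandwidth / model size
def eval_rate_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

for name, bw in [("M4 Max", 546.0), ("RTX 3090", 936.2)]:
    print(f"{name}: ~{eval_rate_ceiling(bw, 43.0):.0f} t/s ceiling for a ~43 GB 70B Q4_K_M")
```

Real numbers land well below that ceiling (compute, KV cache, overhead), but the ratio between devices roughly tracks the results reported in this thread.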
NEEDMOREVRAM@reddit
Can you define:
lol
I just need it to help me do grammar stuff for blog posts and client websites. So, I don't need 10 billion tokens a second. I'd be in hog heaven with just 5-6 tokens per second.
And can I get your opinion on what I'm planning on doing?
I briefly had the 48GB M4 MacBook Pro. I was running a small 22B model, a Docker container, and something else... and was hitting ~45GB of RAM usage.
So I'm now considering the 64GB model and trying to make it work, because I really don't have an extra $2k to throw at a 128GB M4 Max. I had to sell all my Apple stuff, including my iPhone, to get the $4,100 needed for the 64GB M4 Max.
But with a 50GB quant running alongside a Docker container, maybe VS Code (learning Python), and a VM with Kali (slowly dipping my toes into ethical hacking to see if it might be a good career choice ~5 years from now)... do you think 64GB will be pushing it?
My only other option is to sell a few 3090s on Marketplace, and then when my income is up in a few months, hopefully buy two more at cheaper prices once the 5090s are out.
bornsupercharged@reddit
That's a pretty big one to download, it'll take me a bit as HuggingFace downloads at roughly 40MB/s.
recitegod@reddit
Thanks for the stats, where do you get these numbers from?
siegevjorn@reddit (OP)
You can just run ollama with the --verbose flag (e.g. ollama run llama3.3:70b --verbose) and it will print the stats at the end.
Bidulol@reddit
Can someone point me to a link about how to run this test? I'm having trouble running ollama; it starts a localhost website, while I just want to use the weights in PyTorch or similar.
siegevjorn@reddit (OP)
Did you install ollama? It should be accessible through the command-line interface. You can also download the weights (.gguf format) from Hugging Face and build your own model from them.
Bidulol@reddit
I did, but it seems to start up a web server. I'll dig further, thanks.
PM_ME_YOUR_ROSY_LIPS@reddit
siegevjorn@reddit (OP)
Thanks for the info. It's interesting that deepseek-coder-v2:16b-lite is much faster than Qwen Coder 14B despite the similar model size. Do you happen to know the reason why?
PM_ME_YOUR_ROSY_LIPS@reddit
I think it's because of architectural differences and the quant (though the quant is less impactful). Even though the CPU/GPU offload split is similar, the utilization is different.
Ok_Warning2146@reddit
Your machine is likely a Zen 3 Ryzen 7 6800. That laptop has dual-channel DDR5-4800 RAM, which translates to 76.8 GB/s of memory bandwidth. A 3090 has 936 GB/s, which is 12.19x more. So getting 1 t/s seems normal when you combine the CPU with the 4070.
siegevjorn@reddit (OP)
It is a Zen 3 indeed. What's the inference speed of llama 3.3 70B Q4_K_M on a dual 3090 machine? I see some new laptops feature DDR5-6400 (102.4 GB/s), which may be a little faster, but not by that much.
Ok_Warning2146@reddit
This site says 16.29 t/s for 3.1 70B; 3.3 70B should be similar.
The fastest laptop now should be the Apple M4 Max 128GB, which has 546.112 GB/s of memory bandwidth.
siegevjorn@reddit (OP)
Someone posted a 9 t/s inference speed for that very laptop. 9 / 546 * 920 = 15.16 t/s, which is pretty similar to 16.29 t/s. Considering that Macs generally have lower core counts, it makes sense that the 3090 machine does a bit better than the scaled prediction.
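Same estimate as a tiny snippet (treating eval rate as roughly proportional to memory bandwidth, which is only a first-order approximation):

```python
# Scale a measured eval rate by the ratio of memory bandwidths.
def scale_eval_rate(measured_t_s: float, bw_measured_gb_s: float, bw_target_gb_s: float) -> float:
    return measured_t_s * bw_target_gb_s / bw_measured_gb_s

# 9 t/s measured on the M4 Max (546 GB/s), scaled to ~920 GB/s:
print(round(scale_eval_rate(9.0, 546.0, 920.0), 2))  # 15.16, close to the cited 16.29 t/s
```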
Durian881@reddit
My binned M3 Max (14/30) runs Qwen2.5 72B GGUF Q4_K_M at 5.5 tokens/sec and Mistral Large Q4 at 3 tokens/sec.
croninsiglos@reddit
Your prompt is important, but I used the prompt you had listed in a comment:
total duration: 1m48.493107584s
load duration: 31.374625ms
prompt eval count: 26 token(s)
prompt eval duration: 811ms
prompt eval rate: 32.06 tokens/s
eval count: 978 token(s)
eval duration: 1m47.649s
eval rate: 9.09 tokens/s
Typical performance I've seen ranges from 8.5 - 11 tokens per second on M4 Max
siegevjorn@reddit (OP)
That looks super. What are the specs of your M4 Max (CPU core / GPU core counts / RAM)?
croninsiglos@reddit
128 GB M4 Max, 16-core CPU, 40-core GPU.
It's the 16 inch, in case heat dissipation factors into throttling.
NEEDMOREVRAM@reddit
Can you run 80GB GGUF quants on your 128GB M4 Max at reasonable reading speeds? 4-5 tokens per second? And if so...how much context do you have before it shits the pot, so to speak?
laerien@reddit
Yes, llama3.3:70b-instruct-q8_0 GGUF (d5b5e1b84868), for example, weighs in at 74 GB and runs reasonably well with Ollama. I usually use MLX instead of GGUF. I do have my /etc/sysctl.conf set to iogpu.wired_limit_mb=114688 to dedicate a bit more to VRAM, but I haven't had context issues. Same system as OP: 128 GB 16" M4 Max, 16 core.
total duration: 3m11.942125625s
load duration: 31.911833ms
prompt eval count: 29 token(s)
prompt eval duration: 1.627s
prompt eval rate: 17.82 tokens/s
eval count: 1115 token(s)
eval duration: 3m10.281s
eval rate: 5.86 tokens/s
siegevjorn@reddit (OP)
Thanks for the info! I wonder how its performance would compare to a Mac Studio with an M2 Max (12-core CPU and 38-core GPU). Do you think the M2 Max Mac Studio would take a big performance hit?
croninsiglos@reddit
It shouldn’t be terribly different, only a couple tokens per second.
siegevjorn@reddit (OP)
Thanks! Enjoy your new MBP!
330d@reddit
Same prompt, same quant. M1 Max 64GB
Red_Redditor_Reddit@reddit
Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
64GB DDR4-3200 RAM
GPU disabled
dalhaze@reddit
8000 t/s with GPU disabled? I'm confused, where is the power coming from?
Red_Redditor_Reddit@reddit
It wasn't doing 8k t/s. There wasn't a system prompt, and maybe it's a weird divide-by-zero issue. The 0.7 t/s was what I was actually getting.
MrPecunius@reddit
Llama 3.3-70b-Instruct-GGUF-Q3_K_M
Macbook Pro with binned M4 Pro (12 cpu/16 gpu), 48GB RAM:
5.93s to first token
1099 tokens
2.95 tok/sec
Lots of other stuff is running, but memory pressure still shows green.
davewolfs@reddit
On M4 Max this is
8 t/s for Q4_K_M.
chibop1@reddit
Here's mine for an M3 Max 64GB with various prompt sizes, for llama-3.3-70b-q4_K_M and q5_K_M.
https://www.reddit.com/r/LocalLLaMA/comments/1h1v7mn/speed_for_70b_model_and_various_prompt_sizes_on/
siegevjorn@reddit (OP)
Thanks for the valuable info!
chibop1@reddit
Make sure everyone used the same prompt!
https://www.reddit.com/r/LocalLLaMA/comments/1h0bsyz/how_prompt_size_dramatically_affects_speed/
siegevjorn@reddit (OP)
Thanks! Updated the prompt. My initial stats are from something else; let me update them soon.
bornsupercharged@reddit
MacBook Pro M4 Max 128GB ram, running models off a Teamgroup MP44 NVMe in a TB4 enclosure.
Model
architecture llama
parameters 70.6B
context length 131072
embedding length 8192
quantization Q4_K_M
Prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
total duration: 1m55.291972708s
load duration: 27.476541ms
prompt eval count: 26 token(s)
prompt eval duration: 1.892s
prompt eval rate: 13.74 tokens/s
eval count: 864 token(s)
eval duration: 1m53.371s
eval rate: 7.62 tokens/s
siegevjorn@reddit (OP)
That looks impressive! How is the NVMe connected? Thunderbolt?
bornsupercharged@reddit
Yes, TB port on the MBP. I have a TB5 enclosure from Acasis coming in the next week or so, although I doubt going from TB4 to TB5 enclosure is going to do much (if anything) for ollama speeds.
siegevjorn@reddit (OP)
That makes sense. I mean, the $$ Apple charges for extra storage is just ridiculous. Having an external drive doesn't seem to affect inference speed in your case, probably thanks to the high speed of the TB port.
bornsupercharged@reddit
Yeah, my read speed is 3800MB/s over TB4; over TB5 it should reach at least 7500MB/s. Since it's all loaded into memory anyway, I think there's zero/minimal speed improvement to be had after the model is up and running. But at $200 for this external 4TB NVMe plus $40 for a TB4 enclosure, it's way cheaper than what Apple wanted for similar built-in storage. I went with 1TB of storage on the MBP since my last one was 2TB and I only used 300GB of it, thanks to my external drive. Money saved and allocated to RAM instead.
siegevjorn@reddit (OP)
Good call! This should be the way for all Mac users until Apple cuts down its prices for extra storage.
Its_not_a_tumor@reddit
I think you need to put the Q value for a proper comparison. I'm guessing yours is Q4.
siegevjorn@reddit (OP)
That's correct. Edited OP. Thanks!
ForsookComparison@reddit
Which quant? And are you splitting 8gb into the 4070 or running purely off of memory?
siegevjorn@reddit (OP)
Q4_K_M. I'm splitting.
siegevjorn@reddit (OP)
84% on CPU, and 16% on GPU.