Laptop inference speed on Llama 3.3 70B
Posted by siegevjorn@reddit | LocalLLaMA | View on Reddit | 75 comments
Hi, I'd like to start a thread for sharing laptop inference speeds when running llama3.3 70B, just for fun, and as a resource to lay out some baselines for 70B inferencing.
Mine has an AMD Ryzen 7 series CPU with 64GB of DDR5-4800 RAM and an RTX 4070 mobile with 8GB VRAM.
Here are my stats for Ollama:
NAME SIZE PROCESSOR UNTIL
llama3.3:70b 47 GB 84%/16% CPU/GPU 29 seconds from now
total duration: 8m37.784486758s
load duration: 21.44819ms
prompt eval count: 33 token(s)
prompt eval duration: 3.57s
prompt eval rate: 9.24 tokens/s
eval count: 561 token(s)
eval duration: 8m34.191s
eval rate: 1.09 tokens/s
How does your laptop perform?
jacekpc@reddit
I ran this prompt on my E5-2680 v4 CPU with quad-channel memory (512 GB of DDR4 in total).
I only have some ancient GPU in there so that the system POSTs; it was not used by Ollama.
total duration: 12m32.572623411s
load duration: 25.530167ms
prompt eval count: 26 token(s)
prompt eval duration: 9.631s
prompt eval rate: 2.70 tokens/s
eval count: 958 token(s)
eval duration: 12m22.915s
eval rate: 1.29 tokens/s
siegevjorn@reddit (OP)
That's not bad at all, I think.
https://www.intel.com/content/www/us/en/products/sku/91754/intel-xeon-processor-e52680-v4-35m-cache-2-40-ghz/specifications.html
Maximum memory throughput is 76.8 GB/s, which is quite decent.
You should try running DeepSeek V3 with that 512 GB of RAM!
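For anyone curious where those numbers come from, here's a rough back-of-envelope (peak bandwidth ≈ channels × 8 bytes per transfer × transfer rate; this assumes DDR4-2400 across four channels on the Xeon and DDR5-4800 across two on my laptop):

```python
# Peak memory bandwidth, roughly: channels * 8 bytes per transfer * MT/s
def peak_bandwidth_gb_s(channels: int, megatransfers_per_s: int) -> float:
    return channels * 8 * megatransfers_per_s / 1000

print(peak_bandwidth_gb_s(4, 2400))  # E5-2680 v4, quad-channel DDR4-2400 -> 76.8 GB/s
print(peak_bandwidth_gb_s(2, 4800))  # my laptop, dual-channel DDR5-4800 -> also 76.8 GB/s
```

Funny enough, both systems land at the same 76.8 GB/s ceiling, which lines up with the similar eval rates in this thread.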
jacekpc@reddit
Will do. In the meantime I tested another of my PCs (a mini PC) with a Ryzen 5 5600G and 64 GB (dual channel). I got the results below.
total duration: 12m19.013924279s
load duration: 17.715242ms
prompt eval count: 143 token(s)
prompt eval duration: 10.926s
prompt eval rate: 13.09 tokens/s
eval count: 747 token(s)
eval duration: 12m8.068s
eval rate: 1.03 tokens/s
They are not that far off from my workstation (E5-2680 v4).
Prompt:
Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent.
Model
llama3.3 70b Q4_K_M
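For reference, here is a minimal hand-written sketch of the kind of numpy code that prompt asks for (my own illustration, not any model's output), handy for sanity-checking the answers the models produce:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sgd(X, y, lr=0.1, epochs=100, seed=0):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n_samples):      # one sample at a time, shuffled each epoch
            error = sigmoid(X[i] @ w + b) - y[i]  # gradient of binary cross-entropy w.r.t. the logit
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# quick check on synthetic, linearly separable data
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = train_logistic_sgd(X, y)
print("train accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```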
drsupermrcool@reddit
Do you have flash attention turned on?
siegevjorn@reddit (OP)
I didn't have it turned on. How do you enable it in Ollama?
Ok_Time806@reddit
Yeah, via an environment variable: OLLAMA_FLASH_ATTENTION=true.
I think there was a PR to make it true by default, but haven't checked recently.
Educational_Gap5867@reddit
Damn, the MacBook may be slow compared to desktop Nvidia cards, but it eats other CPU-bound laptops for dinner. Unfortunately I can't test this; I don't have enough RAM for it. If you're up for testing a 32B I'd be down.
siegevjorn@reddit (OP)
Sure thing. Which 32B do you want to try?
Educational_Gap5867@reddit
Let’s do Qwen Coder?
MrPecunius@reddit
Macbook Pro, binned (12/16) M4 Pro, 48GB, using LM Studio
Qwen2.5-coder-14B-Instruct-MLX-4bit (~7.75GB model size):
- 0.41s to first token, 722 tokens, 27.11 t/s
Qwen2.5-coder-32B-Instruct-GGUF-Q5_K_M (~21.66GB model size):
- 1.32s to first token, 769 tokens, 6.46 t/s
Educational_Gap5867@reddit
That's really nice. I had seen some benchmarks where the MLX improvement was marginal, like 10%, compared to GGUF.
MrPecunius@reddit
There doesn't seem to be a difference with MLX on the M4 (non Pro, which I have in a Mac Mini), while it's a solid 10-15% gain on my now-traded-in M2 Macbook Air.
I haven't done any MLX/GGUF comparisons on the M4 Pro yet.
I'm quite pleased with the performance and the ability to run any reasonable model at usable speeds.
Educational_Gap5867@reddit
Oh damn you were comparing 14B to 32B my bad. I thought you got 30t/s on a 32B model lol 😂
MrPecunius@reddit
Overclocked to approximately lime green on the EM spectrum, maybe. :-D
Ruin-Capable@reddit
Fun fact, green light has approximately the same numerical value for both frequency and wavelength when frequency is measured in THz and wavelength is measured in nm.
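Quick check of the arithmetic: if wavelength in nm and frequency in THz are numerically equal, their common value is the square root of c expressed in nm·THz.

```python
# lambda(nm) * f(THz) = c = 299792458 m/s = 299792.458 nm*THz
c_nm_thz = 299_792.458
x = c_nm_thz ** 0.5
print(f"~{x:.0f} nm at ~{x:.0f} THz")  # ~548 nm / ~548 THz, squarely in the green
```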
siegevjorn@reddit (OP)
Sounds good. Here's my prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
Will follow up with the stats soon.
Educational_Gap5867@reddit
u/siegevjorn have you tried testing with speculative decoding? I don't know if they have speculative decoding in Ollama.
RichNugget@reddit
soon https://github.com/ollama/ollama/pull/8134/
siegevjorn@reddit (OP)
No idea either. Will look into it!
Educational_Gap5867@reddit
total duration: 1m22.526419s
load duration: 27.578958ms
prompt eval count: 45 token(s)
prompt eval duration: 4.972s
prompt eval rate: 9.05 tokens/s
eval count: 738 token(s)
eval duration: 1m17.366s
eval rate: 9.54 tokens/s
This isn't bad; it was like watching someone type really, really fast.
siegevjorn@reddit (OP)
That looks great. Can you share the specs of your MacBook?
Educational_Gap5867@reddit
M4 Pro (12 core) 48GB RAM.
siegevjorn@reddit (OP)
Thanks! It's interesting that M4 Pros have similar speed to the M4 Max MBPs reported here.
brotie@reddit
Nah, I have an M4 Mac and I get a 20-30 t/s response rate from Qwen 2.5 Coder.
siegevjorn@reddit (OP)
Oops, that's my mistake. The M4 Max use case was for llama3 70B. I'll delete my previous comment to avoid confusion.
brotie@reddit
Yeah, that's what I meant with my M4 Max t/s (sorry, autocorrect switched it to "mac"), but the performance boost is noticeable and the GPU is legit. Too bad I can't use it for MSFS, so I still need a 4070 Ti Super 😂
bornsupercharged@reddit
MacBook Pro M4 Max 128GB ram, running models off a Teamgroup MP44 NVMe in a TB4 enclosure.
Model
architecture qwen2
parameters 32.8B
context length 32768
embedding length 5120
quantization Q4_K_M
Prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
total duration: 52.365282792s
load duration: 24.173792ms
prompt eval count: 45 token(s)
prompt eval duration: 9.95s
prompt eval rate: 4.52 tokens/s
eval count: 780 token(s)
eval duration: 42.217s
eval rate: 18.48 tokens/s
NEEDMOREVRAM@reddit
I realize this may be asking a bit much of you...
But are you able to run this: https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q5_K_M
At 4-5 tokens per second? I'm debating between the 64GB model and the 128GB M4 Max.
bornsupercharged@reddit
MacBook Pro M4 Max 128GB ram, running models off a Teamgroup MP44 NVMe in a TB4 enclosure.
Model
architecture llama
parameters 122.6B
context length 131072
embedding length 12288
quantization Q5_K_M
Prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
total duration: 7m19.590320167s
load duration: 13.527167ms
prompt eval count: 21 token(s)
prompt eval duration: 4.335s
prompt eval rate: 4.84 tokens/s
eval count: 1424 token(s)
eval duration: 7m15.239s
eval rate: 3.27 tokens/s
NEEDMOREVRAM@reddit
Wow... so for an 86GB model it's only 3.27 tokens per second? And thank you for taking the time to download it and run the experiment. I was planning on selling my 3090s... but now I'm not so sure. I might just get the 64GB M4 Max and then just run what I can... which might be around a 50GB quant?
bornsupercharged@reddit
It might be slightly higher if I weren't running a ton of other apps (software engineer). I typically use qwen2.5-coder 32B Q4, which I find to be a good trade-off of speed and usability.
The nice thing about the Mac is that the RAM is shared with the GPU, but the M4 Max (up to 546 GB/s) has nowhere near as much bandwidth as a 3090 (up to 936.2 GB/s). Essentially, you can fit larger models in RAM than on a 24GB 3090, but it's slower. If time isn't a huge factor it does work, heh.
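To put that in rough numbers (a back-of-envelope only, assuming token generation is memory-bandwidth-bound and a ~43 GB file for a 70B Q4_K_M):

```python
# Ceiling on generation speed if every token has to stream the whole quantized model:
# eval rate <= memory bandwidth / model size
def eval_rate_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

for name, bw in [("M4 Max", 546.0), ("RTX 3090", 936.2)]:
    print(f"{name}: ~{eval_rate_ceiling(bw, 43.0):.0f} t/s ceiling for a ~43 GB 70B Q4_K_M")
```

Real numbers land well below that ceiling (compute, KV cache, overhead), but the ratio between devices roughly tracks the results reported in this thread.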
NEEDMOREVRAM@reddit
Can you define:
lol
I just need it to help me do grammar stuff for blog posts and client websites. So, I don't need 10 billion tokens a second. I'd be in hog heaven with just 5-6 tokens per second.
And can I get your opinion on what I'm planning on doing?
I briefly had the 48GB M4 MacBook Pro. I was running a small 22B model, a Docker container, and something else... and was hitting ~45GB of RAM usage.
So I'm now considering the 64GB model and trying to make it work, because I really don't have an extra $2k to throw at a 128GB M4 Max. I had to sell all my Apple stuff, including my iPhone, to get the $4,100 needed for the 64GB M4 Max.
But with a 50GB quant running alongside a Docker container, maybe VS Code (learning Python), and a VM with Kali (slowly dipping my toes into ethical hacking to see if it might be a good career choice ~5 years from now)... do you think 64GB will be pushing it?
My only other option is to sell a few 3090s on Marketplace, and then when my income is up in a few months, hopefully buy two more at cheaper prices once the 5090s are out.
bornsupercharged@reddit
That's a pretty big one to download, it'll take me a bit as HuggingFace downloads at roughly 40MB/s.
recitegod@reddit
Thanks for the stats, where do you get these numbers from?
siegevjorn@reddit (OP)
You can just run ollama with the --verbose flag (e.g. ollama run llama3.3:70b --verbose) and it will print the stats at the end.
Bidulol@reddit
Can someone point me to a link about how to run this test? I'm having trouble running ollama; it starts a localhost website, while I just want to use the weights in PyTorch or similar.
siegevjorn@reddit (OP)
Did you install ollama? It should be accessible through the command-line interface. You can also download the weights (.gguf format) from Hugging Face and build your own model from them.
Bidulol@reddit
I did, but it seems to start up a web server. I'll dig further, thanks.
PM_ME_YOUR_ROSY_LIPS@reddit
siegevjorn@reddit (OP)
Thanks for the info. It's interesting that deepseek-coder-v2:16b-lite is much faster than Qwen Coder 14B despite the similar model size. Do you happen to know the reason why?
PM_ME_YOUR_ROSY_LIPS@reddit
I think it's because of architectural differences and the quant (though the quant is less impactful). Even though the CPU/GPU offload split is similar, the utilization is different.
Ok_Warning2146@reddit
Your machine is likely a Zen 3 Ryzen 7 6800. That laptop has dual-channel DDR5-4800 RAM, which translates to 76.8 GB/s of memory bandwidth. A 3090 has 936 GB/s, which is 12.19x more. So getting 1 t/s seems normal when you combine the CPU with the 4070.
siegevjorn@reddit (OP)
It is a Zen 3 indeed. What's the inference speed of llama 3.3 70B Q4_K_M on a dual 3090 machine? I see some new laptops feature DDR5-6400 (102.4 GB/s), which may be a little faster, but not by that much.
Ok_Warning2146@reddit
This site says 16.29 t/s for 3.1 70B; 3.3 70B should be similar.
The fastest laptop now should be the Apple M4 Max 128GB, which has 546.112 GB/s of memory bandwidth.
siegevjorn@reddit (OP)
Someone posted a 9 t/s inference speed for that very laptop. 9 / 546 * 920 = 15.16 t/s, which is pretty similar to 16.29 t/s. Considering that Macs generally have lower core counts, it makes sense that the 3090 machine does a bit better than the scaled prediction.
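Same estimate as a tiny snippet (treating eval rate as roughly proportional to memory bandwidth, which is only a first-order approximation):

```python
# Scale a measured eval rate by the ratio of memory bandwidths.
def scale_eval_rate(measured_t_s: float, bw_measured_gb_s: float, bw_target_gb_s: float) -> float:
    return measured_t_s * bw_target_gb_s / bw_measured_gb_s

# 9 t/s measured on the M4 Max (546 GB/s), scaled to ~920 GB/s:
print(round(scale_eval_rate(9.0, 546.0, 920.0), 2))  # 15.16, close to the cited 16.29 t/s
```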
Durian881@reddit
My binned M3 Max (14/30) runs Qwen2.5 72B GGUF Q4_K_M at 5.5 tokens/sec and Mistral Large Q4 at 3 tokens/sec.
croninsiglos@reddit
Your prompt is important, but I used the prompt you had listed in a comment:
total duration: 1m48.493107584s
load duration: 31.374625ms
prompt eval count: 26 token(s)
prompt eval duration: 811ms
prompt eval rate: 32.06 tokens/s
eval count: 978 token(s)
eval duration: 1m47.649s
eval rate: 9.09 tokens/s
Typical performance I've seen ranges from 8.5 - 11 tokens per second on M4 Max
siegevjorn@reddit (OP)
That looks super. What are the specs of your M4 Max (CPU core / GPU core counts / RAM)?
croninsiglos@reddit
128 GB M4 Max, 16-core CPU, 40-core GPU.
It's the 16 inch, in case heat dissipation factors into throttling.
NEEDMOREVRAM@reddit
Can you run 80GB GGUF quants on your 128GB M4 Max at reasonable reading speeds? 4-5 tokens per second? And if so...how much context do you have before it shits the pot, so to speak?
laerien@reddit
Yes, llama3.3:70b-instruct-q8_0 GGUF (d5b5e1b84868), for example, weighs in at 74 GB and runs reasonably well with Ollama. I usually use MLX instead of GGUF. I do have my /etc/sysctl.conf set to iogpu.wired_limit_mb=114688 to dedicate a bit more to VRAM, but I haven't had context issues. Same system as OP: 128 GB 16" M4 Max, 16 core.
total duration: 3m11.942125625s
load duration: 31.911833ms
prompt eval count: 29 token(s)
prompt eval duration: 1.627s
prompt eval rate: 17.82 tokens/s
eval count: 1115 token(s)
eval duration: 3m10.281s
eval rate: 5.86 tokens/s
siegevjorn@reddit (OP)
Thanks for the info! I wonder how its performance would compare to a Mac Studio with an M2 Max (12-core CPU and 38-core GPU). Do you think the M2 Max Mac Studio would take a big performance hit?
croninsiglos@reddit
It shouldn’t be terribly different, only a couple tokens per second.
siegevjorn@reddit (OP)
Thanks! Enjoy your new MBP!
330d@reddit
Same prompt, same quant. M1 Max 64GB
Red_Redditor_Reddit@reddit
Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
64GB DDR4-3200 RAM
GPU disabled
dalhaze@reddit
8000 t/s with GPU disabled? I'm confused, where is the power coming from?
Red_Redditor_Reddit@reddit
It wasn't doing 8k t/s. There wasn't a system prompt, and maybe it's a weird divide-by-zero issue. The 0.7 t/s was what I was actually getting.
MrPecunius@reddit
Llama 3.3-70b-Instruct-GGUF-Q3_K_M
Macbook Pro with binned M4 Pro (12 cpu/16 gpu), 48GB RAM:
5.93s to first token
1099 tokens
2.95 tok/sec
Lots of other stuff is running, but memory pressure still shows green.
davewolfs@reddit
On M4 Max this is
8 t/s for Q4_K_M.
chibop1@reddit
Here's mine for an M3 Max 64GB with various prompt sizes, for llama-3.3-70b-q4_K_M and q5_K_M.
https://www.reddit.com/r/LocalLLaMA/comments/1h1v7mn/speed_for_70b_model_and_various_prompt_sizes_on/
siegevjorn@reddit (OP)
Thanks for the valuable info!
chibop1@reddit
Make sure everyone used the same prompt!
https://www.reddit.com/r/LocalLLaMA/comments/1h0bsyz/how_prompt_size_dramatically_affects_speed/
siegevjorn@reddit (OP)
Thanks! Updated the prompt. My initial stats are from something else; let me update them soon.
bornsupercharged@reddit
MacBook Pro M4 Max 128GB ram, running models off a Teamgroup MP44 NVMe in a TB4 enclosure.
Model
architecture llama
parameters 70.6B
context length 131072
embedding length 8192
quantization Q4_K_M
Prompt:
"Write a numpy code to conduct logistic regression from scratch, using stochastic gradient descent."
total duration: 1m55.291972708s
load duration: 27.476541ms
prompt eval count: 26 token(s)
prompt eval duration: 1.892s
prompt eval rate: 13.74 tokens/s
eval count: 864 token(s)
eval duration: 1m53.371s
eval rate: 7.62 tokens/s
siegevjorn@reddit (OP)
That looks impressive! How is the NVMe connected? Thunderbolt?
bornsupercharged@reddit
Yes, TB port on the MBP. I have a TB5 enclosure from Acasis coming in the next week or so, although I doubt going from TB4 to TB5 enclosure is going to do much (if anything) for ollama speeds.
siegevjorn@reddit (OP)
That makes sense. I mean, the $$ Apple charges for extra storage is just ridiculous. Having an external drive doesn't seem to affect inference speed in your case, probably thanks to the high speed of the TB port.
bornsupercharged@reddit
Yeah, my read speed is 3800MB/s over TB4; over TB5 it should reach at least 7500MB/s. Since it's all loaded into memory anyway, I think there's zero/minimal speed improvement to be had after the model is up and running. But at $200 for this external 4TB NVMe plus $40 for a TB4 enclosure, it's way cheaper than what Apple wanted for similar built-in storage. I went with 1TB of storage on the MBP since my last one was 2TB and I only used 300GB of it, thanks to my external drive. Money saved and allocated to RAM instead.
siegevjorn@reddit (OP)
Good call! This should be the way for all Mac users until Apple cuts down its prices for extra storage.
Its_not_a_tumor@reddit
I think you need to put the Q value for a proper comparison. I'm guessing yours is Q4.
siegevjorn@reddit (OP)
That's correct. Edited OP. Thanks!
ForsookComparison@reddit
Which quant? And are you splitting 8gb into the 4070 or running purely off of memory?
siegevjorn@reddit (OP)
Q4_K_M. I'm splitting.
siegevjorn@reddit (OP)
84% on CPU, and 16% on GPU.