Laptop LLM performance - beware of the power settings!
Posted by YordanTU@reddit | LocalLLaMA | View on Reddit | 17 comments
It's a pity I was so negligent, but I want to share this in case someone struggles with the same issue.
My wife and I both have Lenovo gaming laptops:
- Ryzen 5, 16GB RAM, 3050 Ti 4GB
- i5, 16GB RAM, 4060 8GB
Logically, if a model fits entirely in VRAM, machine 2 runs it noticeably faster. BUT anything beyond 7B that is only partially offloaded to VRAM runs at less than 0.2 T/s and takes 2-3 minutes to output the first token on machine 2! Meanwhile, machine 1 runs Qwen 2.5 14B quite acceptably at around 2 T/s.
I tried changing NVIDIA/CUDA drivers and llama.cpp settings, and nothing helped. Then I checked the Windows "power settings" and changed the preset from "Balanced" to "Performance". It was the CPU/RAM side of the machine that killed all the fun. Now I get 5-10 T/s with a 14B model and 26/49 layers on the GPU.
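In case it helps anyone, the same switch can also be done from an elevated PowerShell prompt. A sketch, assuming Windows' built-in powercfg tool and llama.cpp's llama-cli; the model filename is just an example, and 26 layers is the OP's number, not a recommendation:

```shell
# list the available power schemes and their GUIDs
powercfg /list

# activate the built-in "High performance" scheme (alias SCHEME_MIN)
powercfg /setactive SCHEME_MIN

# then run llama.cpp with partial offload, e.g. 26 layers on the GPU
llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf -ngl 26 -p "Hello"
```

The power plan mainly affects CPU frequency scaling, which matters exactly when layers spill out of VRAM and the CPU has to do part of the work.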
imtusharraj@reddit
Anyone using a MacBook? How's the performance?
soulefood@reddit
Not the same model, but on an M4 Max with 128GB RAM I get about 6 t/s generation with Llama 3.3 70B at 8-bit.
Master-Meal-77@reddit
Beware of Windows
Everlier@reddit
Also beware of Windows in general
YordanTU@reddit (OP)
Agree, but in my case it's (still) needed.
MoffKalast@reddit
Tfw Wine gives the app you need to use a rating of "Garbage"
squeasy_2202@reddit
Debatable. Dual boot is a thing.
Top-Salamander-2525@reddit
Especially if you’re a Russian oligarch.
paulirotta@reddit
Or a bird
Everlier@reddit
Or both
ortegaalfredo@reddit
Beware, you might cook your notebook.
Beneficial-Yak-1520@reddit
Do you have any experience with this happening?
I was under the impression that the performance setting doesn't overclock the CPU or GPU (at least on Windows). So I'd expect thermal throttling to slow the CPU down when the temperature rises?
MoffKalast@reddit
fr fr no cap
YordanTU@reddit (OP)
typo ;)
Adjustsglasses@reddit
Using Q4 with vLLM on Linux. It works well.
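For anyone curious, a minimal sketch of serving a 4-bit quantized model with vLLM's OpenAI-compatible server; the model repo and the AWQ backend are assumptions for illustration, not necessarily what the commenter used:

```shell
# serve a 4-bit AWQ-quantized model on an OpenAI-compatible endpoint
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --port 8000
```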
brahh85@reddit
In this line of advice: people who use the CPU for inference should try Q4_0_8_8 models, since many CPUs support AVX2/AVX512 and that quant seems to be optimized for them.
To check on Linux whether your CPU supports AVX:
grep avx /proc/cpuinfo
a_beautiful_rhind@reddit
Also try overriding the TDP limits on your GPU/CPU.