Laptop LLM performance - beware of the power settings!
Posted by YordanTU@reddit | LocalLLaMA | View on Reddit | 17 comments
It's a pity I was so negligent, but I want to share this in case someone struggles with the same issue.
My wife and I both have Lenovo gaming laptops:
- Ryzen 5, 16GB RAM, 3050 Ti 4GB
- i5, 16GB RAM, 4060 8GB
Logically, if a model fits entirely in VRAM, machine 2 runs it noticeably faster. BUT anything beyond 7B that is only partially offloaded to VRAM runs at less than 0.2 T/s and takes 2-3 minutes to output the first token on machine 2! Meanwhile, machine 1 runs Qwen 2.5 14B quite acceptably at around 2 T/s.
I tried changing NVIDIA/CUDA drivers and llama.cpp settings, and nothing helped. Then I checked the Windows "power settings" and changed the preset from "Balanced" to "Performance". It was the CPU/RAM side of the machine that killed all the fun. Now I get 5-10 T/s with a 14B model and 26/49 layers on the GPU.
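In case it helps anyone, the same switch can also be done from an elevated PowerShell prompt. A sketch, assuming Windows' built-in powercfg tool and llama.cpp's llama-cli; the model filename is just an example, and 26 layers is the OP's number, not a recommendation:

```shell
# list the available power schemes and their GUIDs
powercfg /list

# activate the built-in "High performance" scheme (alias SCHEME_MIN)
powercfg /setactive SCHEME_MIN

# then run llama.cpp with partial offload, e.g. 26 layers on the GPU
llama-cli -m qwen2.5-14b-instruct-q4_k_m.gguf -ngl 26 -p "Hello"
```

The power plan mainly affects CPU frequency scaling, which matters exactly when layers spill out of VRAM and the CPU has to do part of the work.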
imtusharraj@reddit
Anyone using a MacBook? How's the performance?
soulefood@reddit
Not the same model, but on an M4 Max with 128GB RAM I get about 6 t/s generation with Llama 3.3 70B at 8-bit.
Master-Meal-77@reddit
Beware of Windows
Everlier@reddit
Also beware of Windows in general
YordanTU@reddit (OP)
Agree, but in my case it's (still) needed.
MoffKalast@reddit
Tfw Wine gives the app you need to use a rating of "Garbage"
squeasy_2202@reddit
Debatable. Dual boot is a thing.
Top-Salamander-2525@reddit
Especially if you’re a Russian oligarch.
paulirotta@reddit
Or a bird
Everlier@reddit
Or both
ortegaalfredo@reddit
Beware, you might cook your notebook.
Beneficial-Yak-1520@reddit
Do you have any experience with this happening?
I was under the impression that the performance setting doesn't overclock the CPU or GPU (at least on Windows). So I'd expect thermal throttling to slow the CPU down when the temperature rises?
MoffKalast@reddit
fr fr no cap
YordanTU@reddit (OP)
typo ;)
Adjustsglasses@reddit
Using Q4 with vLLM on Linux. It works well.
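For anyone curious, a minimal sketch of serving a 4-bit quantized model with vLLM's OpenAI-compatible server; the model repo and the AWQ backend are assumptions for illustration, not necessarily what the commenter used:

```shell
# serve a 4-bit AWQ-quantized model on an OpenAI-compatible endpoint
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --port 8000
```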
brahh85@reddit
In this line of advice: people who use the CPU for inference should try Q4_0_8_8 models, since many CPUs support AVX2/AVX512 and that quant seems to be optimized for them.
To check on Linux whether your CPU supports AVX:
grep avx /proc/cpuinfo
a_beautiful_rhind@reddit
Also try overriding the TDP limits on your GPU/CPU.