LM Studio + Snapdragon Laptops = Bad experience
Posted by Andrew_C0@reddit | LocalLLaMA | 16 comments
Hello. I've recently been running into an issue that I'm unable to debug or fix whatsoever.
Using the latest version of LM Studio (0.3.30) on my Snapdragon laptop (a Slim 7X, the 32GB RAM version), I get a pretty great experience the first time I run LM Studio. I recently tried the Qwen3 1.7B model just to test it out, and I get around 50 tokens/s, which is great.
However, that only works the first time the model is loaded. Afterwards, if I eject the model and load another one (let's say Qwen3 4B), I get somewhere around 0.02 tokens/s. I just don't get why. If I reload the same 1.7B model, I get the same token performance.
What I've noticed is that rebooting the laptop and loading the model again fixes the issue (for whatever model I load first, including Qwen3 Coder 30B), but as soon as I eject and load another model, the speed is always under 1 t/s until I reboot.
I haven't altered any settings; I just downloaded the model, loaded it, and that's it.
I had the same experience with a Surface Laptop 7 in the past, on an older version of LM Studio, but after some updates it was somehow fixed.
Any help fixing this is greatly appreciated.
xrvz@reddit
BeTtEr ThAn ApPlE sIlIcOn. [insert mocking Spongebob image]
Your problem is Snapdragon.
SkyFeistyLlama8@reddit
Snapdragon user here: don't use LM Studio. I've been running LLMs on Snapdragon X Elite and X Plus laptops for a year now and I get the best performance using llama.cpp.
If you want to use GGUF models, use `llama-server` from the llama.cpp release archive. Use `--no-mmap` to load the entire model into RAM without swapping. Use the adreno-opencl ZIP if you want to run GGUF models on the GPU (slower than the CPU but draws only 15-20 W max); use the windows-arm64 ZIP if you want to use the CPU (highest performance, but it gets really hot). A sketch of both invocations is below.
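For reference, a minimal sketch of what those invocations look like (the model filename, context size, and port are placeholders, not recommendations):

```powershell
# CPU build (windows-arm64 ZIP): load the GGUF fully into RAM and serve it
.\llama-server.exe -m .\Qwen3-4B-Q4_K_M.gguf --no-mmap -c 4096 --port 8080

# GPU build (adreno-opencl ZIP): additionally offload all layers to the Adreno GPU
.\llama-server.exe -m .\Qwen3-4B-Q4_K_M.gguf --no-mmap -ngl 99 -c 4096 --port 8080
```

Once it's running, any OpenAI-compatible client can talk to it at http://localhost:8080.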
You can also run small 4B models on the NPU using Nexa SDK. There are a couple of posts on here showing how to do that. I think it's great for smaller models because it uses 10 W max.
Andrew_C0@reddit (OP)
I managed to solve it in the end. It seems the `Best power efficiency` power plan was working as intended, throttling the CPU so much that it also slowed down LM Studio and the tokens/s a lot.
I do appreciate your recommendations though; I'll consider them.
SkyFeistyLlama8@reddit
Yes, that too. If you're using CPU inference, set it to Performance mode, but the laptop will get very hot.
You can use efficiency mode with GPU inference. It will be slower than CPU inference but it keeps the laptop cool.
Elegant_Hyena8442@reddit
I like the idea of Snapdragon as a notebook processor, mainly because of the battery life. The problem is the lack of optimization.
Andrew_C0@reddit (OP)
It will take time, but there's real progress in many areas and the gap between x86 and Arm64 is shrinking. It's just that Microsoft should push harder, as Apple did with its M chips.
hegosder@reddit
I think it's using the CPU instead of the NPU for the second model.
It seems like it can't unload the first model the way it should.
I might be wrong on this but here's what I think:
I assume LM Studio wants a big, empty, contiguous region of RAM to paste the model into. When you first start the computer, nothing is using that area, right? So it says, okay, let's paste the model from 1 GB to 4 GB. But then, say, Windows allocates something at 4.1 GB after you offload the model. When you start another model, LM Studio thinks: oh, I can't, there's not enough empty area for that. So it falls back to your SSD as virtual RAM, resulting in very low speed.
So, what I think you should try is this: after using the 1B model, unload it and load a 700M model. See if it's fast or not.
Andrew_C0@reddit (OP)
I was thinking that might be the issue, but while loading the first model I still have close to 20 GB of free RAM, so space shouldn't be a concern. And, as far as I can see in Task Manager, ejecting a model properly frees the used RAM. Even manually closing the LM Studio instance (including killing all its processes) and opening it again doesn't help; a manual reboot seems to be the only thing that consistently restores a manageable token rate.
hegosder@reddit
Uhmm, sorry for the lame explanation. I made this image to explain things better.
This is called RAM fragmentation.
When you "eject" the model, the memory is freed, but the operating system doesn't automatically shuffle everything around to create a large free block. That freed space is now available, but it's in the same fragmented state as I tried to show in my diagram.
Andrew_C0@reddit (OP)
I get your point now; the picture really helps. I tried again, starting fresh with a reboot, loading 4B and then 1.7B. Unfortunately, I got the same behaviour. Then my laptop battery was running low, so I plugged it in, tried your method above, and now it worked! What was the problem? The laptop was running on battery in the `Best power efficiency` mode, and that's why it was so slow. I feel so stupid for not accounting for this, honestly. Changing the power plan to `Better performance` / `Best performance` made the difference, it seems, both on battery and plugged in.
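For anyone who wants to make that switch from a terminal instead of the Settings UI, here's a hedged sketch using the classic powercfg scheme aliases (on Windows 11, the power-mode slider that shows `Best power efficiency` / `Best performance` may be separate from these legacy plans, so check Settings > System > Power & battery as well):

```powershell
# Show the available legacy power schemes and their GUIDs
powercfg /list

# SCHEME_MIN is the built-in alias for the High performance plan
powercfg /setactive SCHEME_MIN
```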
hegosder@reddit
Niice! Glad it works. It seems like the best way for you is loading the biggest model available when you freshly start your computer.
Andrew_C0@reddit (OP)
Just by changing the power plan, I was able to hot-swap between Qwen3-Coder-30B and gpt-oss-20B in a couple of seconds at 20-30 tokens/s, so I guess the Windows power plan really was bottlenecking LM Studio's performance in the end.
Ok_Cow1976@reddit
It seems the first model was kept in your system memory. You need to turn that option off in LM Studio's settings. There also seems to be a "Keep model in memory" option when loading the model. Check those things out.
Andrew_C0@reddit (OP)
I tried that, and as expected it reduces performance quite a bit, but loading the same model a second time / loading another model still results in the same behaviour as in the OP. Thanks though.
darkpigvirus@reddit
Instead of rebooting, try killing the specific task in Task Manager... maybe, idk. Something like the one-liner below.
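If you'd rather do it in one shot from a terminal, something like this should work (the process image name is a guess; check the Details tab in Task Manager for the real one):

```powershell
# Force-terminate every process with this image name
# ("LM Studio.exe" is an assumption - verify it in Task Manager)
taskkill /IM "LM Studio.exe" /F
```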
Andrew_C0@reddit (OP)
Yeah, I tried it out, but that doesn't look like it solves the issue for now, unfortunately. Thanks for the suggestion though.