A simple "hack" to speed up prompt processing for Qwen 3.5/3.6 in LM Studio
Posted by GrungeWerX@reddit | LocalLLaMA | View on Reddit | 14 comments
Increase your CPU Thread Pool Size to your processor's maximum thread count. LM Studio had mine set to 10; I'm running an i7-12700K, which has 20 threads, so I set it to 20. That doubled, and in some cases nearly tripled, my prompt processing speed, and now things are flying at over 100K context. I'm still getting 25+ tok/sec at high context since I can still max out my GPU offload.
For those interested, I'm using the UD-Q5_K_XL quants for both 3.5 and 3.6.
Sadly, it doesn't seem to help with Gemma 4 31B, and your mileage may vary with other models, but it works well with Qwen.
Hope this helps someone else out.
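If you want to check what your machine actually reports before dialing the setting in, here's a quick sketch (Python, standard library plus optional psutil; nothing here is specific to LM Studio):

```python
import os

# Logical threads are roughly what the "CPU Thread Pool Size" setting maps to.
# An i7-12700K reports 20 (8 P-cores with hyper-threading + 4 E-cores).
print(f"Logical threads: {os.cpu_count()}")

# Physical core count needs a third-party helper such as psutil; on some
# systems the sweet spot is closer to the physical (or P-core) count.
try:
    import psutil
    print(f"Physical cores: {psutil.cpu_count(logical=False)}")
except ImportError:
    pass
```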
pepedombo@reddit
Seems you forgot to post your GPU setup 😄 LMS pumps the CPU when you're not offloading all layers, same as llama.cpp.
GrungeWerX@reddit (OP)
RTX 3090 Ti. All layers were offloaded; I was just getting slow prompt processing.
pepedombo@reddit
You have the layers offloaded, but I'd bet LMS put your context in system RAM or shared memory.
Fully offloaded layers + context should mean 10-20% CPU usage. LMS does something wrong: it fails to utilize the full VRAM and pushes things into shared/system memory, which shows up as a heavy CPU workload. I used to fight with that on my multi-GPU setup.
In pure llama.cpp I can set the context, and if it doesn't fit it will simply crash 😄 After I switched to llama.cpp I noticed I could set an even bigger f16 context than I could in LMS. Well, since LMS couldn't utilize the available VRAM, I'm not surprised.
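One way to sanity-check the spill theory while a long prompt is processing: watch dedicated VRAM. If there's still headroom while the CPU is pegged, the context probably landed in shared/system memory. A minimal sketch, assuming the nvidia-ml-py (pynvml) package and a single GPU at index 0:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 assumed
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
gib = 1024 ** 3
print(f"Dedicated VRAM used: {mem.used / gib:.1f} / {mem.total / gib:.1f} GiB "
      f"(free: {mem.free / gib:.1f} GiB)")
pynvml.nvmlShutdown()
```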
Iory1998@reddit
I have the same CPU, but LM Studio never uses more than 60% of my CPU. Other backends use 90%+, and I have no idea why.
rootdood@reddit
Q2_K_XL with Q8_0 KV cache, 40 GPU layers, 20 CPU threads, 256K context, 70 TPS. It can do anything I throw at it. Maybe not the smartest, but it's fast enough to be at least interactive, and it hasn't wasted 10x the time going in the wrong direction with no user feedback. More VRAM is the only answer for larger quants.
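For reference, here's the back-of-the-envelope math on why 256K context plus a quantized KV cache is the kind of combination that either fits in VRAM or quietly spills. The model dimensions below are placeholders, not the actual Qwen numbers:

```python
# KV cache size: K and V each store ctx_len * n_kv_heads * head_dim values
# per layer, times bytes per element.
def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 1024 ** 3

# f16 KV uses 2 bytes per element; Q8_0 is roughly 1.06 (8.5 bits incl. scales).
for name, bpe in [("f16", 2.0), ("Q8_0", 1.0625)]:
    print(name, round(kv_cache_gib(256_000, 48, 8, 128, bpe), 1), "GiB")
```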
codehamr@reddit
Good catch. LM Studio's default thread setting is conservative and leaves a lot on the table for hybrid CPU/GPU inference, especially on prefill where the CPU side actually matters.
Worth noting this mostly helps when you're partially offloading. If the model fits fully on GPU, threads barely move the needle. The Gemma 4 difference is probably the architecture: its attention layout means the CPU side does less work to begin with, so more threads don't help.
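If you want numbers instead of vibes, you can separate prefill (time to first token) from decode speed when flipping the thread setting. A rough sketch against an OpenAI-compatible local server; I'm assuming LM Studio's default endpoint at http://localhost:1234/v1, and the model name is a placeholder:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
long_prompt = "Summarize this:\n" + ("lorem ipsum " * 4000)  # big prefill

start = time.perf_counter()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="local-model",  # placeholder; use whatever your server reports
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

print(f"prefill (time to first token): {first_token_at - start:.1f}s")
print(f"decode: {n_chunks / (end - first_token_at):.1f} chunks/s (~tok/s)")
```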
GrungeWerX@reddit (OP)
I was doing full offload and it helped a LOT. Didn’t really help for partial offload for me…it dramatically slowed down token speed…
bonobomaster@reddit
This value is heavily dependent on the CPU / RAM subsystem; sometimes a higher setting can even cost performance.
But it's a parameter worth testing / benchmarking for your own system.
And, as a former LM Studio enjoyer: learning llama.cpp pays off nicely performance-wise.
Ariquitaun@reddit
How about vLLM? I've been abusing LM Studio, but I'd rather set things up properly as IaC and just relegate LM Studio to exploring new stuff.
Sufficient_Prune3897@reddit
It's faster, but it doesn't support hybrid CPU/GPU inference, so it's kinda useless for local models for me.
bonobomaster@reddit
vLLM is absolutely on my to-do / to-learn list!
GrungeWerX@reddit (OP)
I tried llama.cpp last week and just got frustrated with it. For me, it was noticeably slower, and things I'm used to in LM Studio, like being able to see how quickly it's processing a prompt or the actual context size of the discussion, either require additional extensions or just don't work, period.
I was initially drawn to the built-in STT options, but the extra steps, clunky experience, and worse performance (despite it supposedly being so much faster) turned me off. Might give it a go next time I have patience to blow.
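Those stats aren't actually gone in llama.cpp, they're just not in your face: llama-server's native /completion endpoint returns a "timings" object with prompt processing and generation speeds (exact field names can differ between builds). A minimal sketch, assuming a server already running on the default port 8080:

```python
import json
import urllib.request

payload = json.dumps({
    "prompt": "Explain KV cache in one paragraph.",
    "n_predict": 128,
}).encode()
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Prompt-processing and generation speeds, as reported by the server itself.
print(json.dumps(result.get("timings", {}), indent=2))
```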
LetsGoBrandon4256@reddit
Try some values between 10 and 20. There's a big chance the actual peak of the curve is somewhere in between.
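A sweep like that is easy to script with llama-cpp-python instead of clicking through LM Studio. A sketch under placeholder paths and context size; the model is reloaded for each setting so the thread count applies cleanly:

```python
import time
from llama_cpp import Llama

PROMPT = "lorem ipsum " * 2000  # long prompt so prefill dominates the timing

for n_threads in (10, 12, 14, 16, 18, 20):
    llm = Llama(
        model_path="model.gguf",   # placeholder path
        n_gpu_layers=-1,           # offload everything that fits
        n_ctx=32768,
        n_threads=n_threads,
        verbose=False,
    )
    start = time.perf_counter()
    llm(PROMPT, max_tokens=16)     # short generation; we care about prefill
    print(f"{n_threads:>2} threads: {time.perf_counter() - start:.1f}s")
    del llm                        # free VRAM before the next run
```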
bonobomaster@reddit
If you are on Nvidia hardware, the CUDA variant of llama.cpp should be faster, especially because you can enable n-gram speculative decoding and the like.
For the next time you have extra patience: get yourself a precompiled CUDA 12 variant, and don't forget to download and move the CUDA 12.4 DLLs into the same directory as llama-server.exe if you're on Windows.
https://github.com/ggml-org/llama.cpp/releases
P.S.: CUDA 13 variant was nothing but problems for me. Not recommended.
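And for when that patience arrives, a minimal sketch of launching the precompiled llama-server from a script. Paths are placeholders; the flags shown (-m, -ngl, -c, -t, --port) are the common ones, check --help on your build:

```python
import subprocess

server = subprocess.Popen([
    r"C:\llama.cpp\llama-server.exe",  # placeholder path to the CUDA 12 build
    "-m", r"C:\models\model.gguf",     # placeholder model path
    "-ngl", "99",                      # offload all layers to the GPU
    "-c", "32768",                     # context size
    "-t", "20",                        # CPU threads (the setting from the OP)
    "--port", "8080",
])
print("Web UI with live prompt-processing stats: http://127.0.0.1:8080")
server.wait()
```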