LM Studio CPU thread pool size vs. tk/s with some MoE layers offloaded to CPU

Plastic-Stress-6468@reddit

Interesting, thanks for sharing. Maybe you can try something like CPU Lasso to peg all background processes to a few dedicated cores and reserve the remaining cores for inference only. The 3900X uses a dual-chiplet design, so not having to pipe things across chiplets might yield more performance. Try to peg llama.cpp to cores on the same CCD and peg all other processes to the other CCD.
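
On Linux, the pinning idea above could be sketched roughly like this. It is a minimal sketch, not a tested recipe: the helper names are made up for illustration, and the numbering scheme (physical core i is logical CPU i, its SMT sibling is i + 12, CCD 0 holds cores 0-5) is an assumption that should be checked with `lscpu -e` on the actual machine.

```python
import os


def ccd_cpu_ids(ccd_index, cores_per_ccd=6, total_cores=12, include_smt=False):
    """Return the logical CPU ids assumed to belong to one CCD.

    ASSUMPTION: physical core i is logical CPU i and its SMT sibling is
    logical CPU i + total_cores. Verify the real layout with `lscpu -e`.
    """
    first = ccd_index * cores_per_ccd
    ids = set(range(first, first + cores_per_ccd))
    if include_smt:
        ids |= {i + total_cores for i in ids}
    return ids


def pin_current_process(cpu_ids):
    """Pin this process to cpu_ids (Linux only).

    Child processes started afterwards inherit the affinity mask.
    """
    usable = set(cpu_ids) & os.sched_getaffinity(0)
    if not usable:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, usable)
    return usable
```

Calling `pin_current_process(ccd_cpu_ids(0))` in a small launcher script before starting the inference process would keep it on one CCD under the assumed layout. Whether that actually helps is exactly what the benchmark would have to show.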

Clear-Ad-9312@reddit

I actually tried this on Linux and it hurt performance. It might be that my memory and CPU are closely matched in speed, or that hyperthreading, where 2 logical cores share one physical core, works best when Linux decides dynamically which logical core does the work. A physical core can only drive a few memory lanes at a time, so 2 logical cores sharing the same memory lane don't add much speed. With MoE, however, I imagine each memory lane gets saturated dynamically. So the default llama.cpp config of half the logical cores (the same number as physical cores) is the sweet spot for performance. That is with Intel's hyperthreading. I imagine each CPU and RAM configuration has a different sweet spot. I also wonder whether core pinning is worth it at all on a different setup.

Plastic-Stress-6468@reddit

Hey, I think you are mixing up two concepts. Hyperthreading on Intel and SMT on AMD are fundamentally the same thing: a clever scheduler trying to saturate cores by presenting a virtual thread to achieve async compute parallelism. What I was talking about is that AMD Zen CPUs use a chiplet design, which is two CCDs (chiplets) glued together with an interconnect so they can talk to each other, and together they present to the motherboard as one single CPU rather than two NUMA nodes. There are actual cores on both chiplets, not conceptual virtual cores used by the scheduler.

tony__Y@reddit

bottlenecked by memory bandwidth

FatheredPuma81@reddit

This.

SnooPaintings8639@reddit

Can anyone explain why the tps falls as we add more than 6 cores? Also, the growth for prompt processing would probably be more linear.

KageYume@reddit

The 3900X's 12 cores are divided into 2 groups (CCX - core complex), and each group has its own cache. So when a task involves more than 6 cores, inter-CCX communication has to be used, which slows everything down.

mlhher@reddit

As it is not immediately visible from the chart: only physical cores provide a speedup, virtual cores do not (they actually slow things down). I assume the person's CPU has 5 or 6 physical cores and the rest are virtual (common for desktop Intel CPUs).

bonobomaster@reddit (OP)

The person has 12 physical cores and 24 logical cores. ;)

Zc5Gwu@reddit

Right but he should have seen scaling to 6 but it seems to drop at 3…

Plastic-Stress-6468@reddit

It's almost certainly because there is additional latency for data being piped between the 3900X's two CCDs.

bonobomaster@reddit (OP)

Yeah, I'll probably check that out in the future. I'm interested in that result as well. But the feeling that prompted this little benchmark hasn't acted up with the thread count for prompt processing. ;) I think higher is better there.

bonobomaster@reddit (OP)

I did a little benchmark of the CPU thread pool size option in LM Studio vs. output speed in tk/s with some MoE layers offloaded to CPU, because I always had a feeling that a higher thread count was detrimental to performance.

For this particular benchmark I used qwen3.6-35b-a3b@MXFP4, but my feeling was the same regardless of quant, number of forced CPU layers or MoE model. I don't know if it's the same with dense models with offloaded layers.

I did 5 runs for each thread count and averaged the results. For my particular CPU the happy place was 5 threads.

Prompt: Write 25 random words. Output as a numbered list.
enable_thinking: false (needed roughly the same token count for each run)
GPU offload: Set to 40 layers (all)
Forced CPU layers: Set to 16 layers

CPU: 12 core / 24 threads AMD Ryzen 9 3900X
RAM: 84 GB of very slow DDR4 @ 2933 MHz
GPU: 5070 TI
VRAM: 16 GB GDDR7

Variables not included in this little "experiment":
- Does the number of MoE layers forced to the CPU have an influence on the sweet spot of threads? My feeling from past usage says no, but who knows?!
- The number of tokens varied from around 120 to 150 tokens per run
- Everything I missed and you can think of ;)
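
For anyone repeating this, the "several runs per thread count, then average" step could be sketched roughly as follows. This is only an illustration of the bookkeeping, not what OP's spreadsheet does; the function names are hypothetical, and the tk/s values have to come from your own measurements.

```python
from statistics import mean, stdev


def summarize(runs_by_threads):
    """Map thread count -> (mean tk/s, sample standard deviation).

    runs_by_threads: dict such as {5: [tk/s of run 1, run 2, ...], ...}
    """
    summary = {}
    for threads, rates in runs_by_threads.items():
        spread = stdev(rates) if len(rates) > 1 else 0.0
        summary[threads] = (mean(rates), spread)
    return summary


def sweet_spot(summary):
    """Thread count with the highest mean tk/s (lowest count wins ties)."""
    return max(sorted(summary), key=lambda threads: summary[threads][0])
```

Keeping the standard deviation next to the mean makes it easier to see whether two neighbouring thread counts really differ or are within run-to-run noise.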

GreaterThanLess@reddit

I've tested this out on my system and it shows a similar curve. I only did 1 run per thread count, but the prompt I used had it generating more, at 2000-3000 tokens, so it smooths out okay.

Model: Qwen3.6-35B-A3B-UD-Q4_K_XL
CPU: Ryzen 3900x
RAM: 96GB DDR4 @ 2133MT/s
GPU: 4070 TI 12 GB

https://preview.redd.it/sxi1cq0srzvg1.png?width=605&format=png&auto=webp&s=e7e45fe59c33692626635c11088a878d32ed40a1

bonobomaster@reddit (OP)

Huh, same CPU, slower RAM, higher sweet spot thread count... that would be interesting. But I have a hunch that the 1000-token variance of your 2000-3000 token answers could introduce a hefty measurement error. That's why I did 5 runs each with a shorter output, to get a more uniform tokens per second / context length variance. And in a laptop system, temperature could also become more of a factor with longer inference times. For example, when an output at the real sweet spot is 3000 tokens long, speed degrades while inferring through the larger context, maybe even from temperature, which reduces tk/s and could therefore skew the result and hide the real sweet spot. The same goes for a spot that isn't sweet but only had to generate 2000 tokens: less speed degradation, quicker result, higher tk/s, false positive for being sweet. :D Inference speed isn't linear and changes with the growing context length.

aaronr_90@reddit

Did you start at 12 threads and work your way down, or did you start at 1 and go up, or was it random? If you do it again, you should reverse the ordering. Could your CPU and GPU be thermal throttling? I don’t think that is enough of a loss to be thermal throttling, but just putting it out there.

bonobomaster@reddit (OP)

Nowhere near throttling. I just did a separate test of 5 fast successive runs:

GPU: went from 32.5 °C idle to 37.5 °C @ 5 threads and dropped back to baseline in a handful of seconds. @ 12 threads the GPU only reached 36.9 °C.
CPU: went from 36.2 °C to 47.3 °C core temp @ 5 threads and dropped back to baseline in a handful of seconds. @ 12 threads the max core temp was 49.9 °C.

Order was 1 thread to 12 threads. The time I took to enter the results into my spreadsheet was pretty much enough to cool everything down, so the temperature variation seems to be without relevant consequence. I absolutely won't repeat the experiment. ;)

aaronr_90@reddit

Ok, I believe you. But I also gave it a go to see what the sweet spot on my laptop was. I have a Dell XPS laptop with an 8 GB 4070. I used Qwen3.5 35B A3B, offloaded all layers to GPU and then forced 33 layers onto the CPU (the sweet spot before shared mem goes up and t/s goes down). I sampled thread counts in random order, but in the end I did not see much of a difference between high thread counts done up front and at the end. https://preview.redd.it/ck3fj479xyvg1.png?width=563&format=png&auto=webp&s=77b0feb5d8e974b60fdedbd9b77d6d21b529e22c I fully expected my results to show a drop-off like yours did, due to overhead and diminishing returns. This took 5 minutes. I would honestly recommend everyone do a test like this if you are curious.
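
A randomized run order like the one described above could be generated along these lines. This is a small sketch with a hypothetical helper name; it only builds the schedule and leaves the actual measuring to whatever tool you use.

```python
import random


def run_schedule(thread_counts, repeats, seed=None):
    """Return a shuffled list in which each thread count appears `repeats` times.

    Shuffling spreads warm-up and thermal drift across all thread counts
    instead of letting it pile up on the counts that are tested last.
    """
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    schedule = [t for t in thread_counts for _ in range(repeats)]
    rng.shuffle(schedule)
    return schedule
```

For example, `run_schedule(range(1, 13), repeats=5)` gives 60 runs in mixed order, so a slow drift over the session shows up as noise rather than as a fake trend along the thread-count axis.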

bonobomaster@reddit (OP)

Very interesting. Do you have DDR5 RAM by any chance? I guess I only have RAM bandwidth for about 5 threads' worth of compute! :D

aaronr_90@reddit

I do! I have 64 GB of DDR5.

denoflore_ai_guy@reddit

Maybe I should start doing this so my cpu hits above 3% during each prompt processing and inference output.

bonobomaster@reddit (OP)

Are you sure, you want more CPU involved? :|

denoflore_ai_guy@reddit

Right now my home PCIe bus is pinned completely at 100% during inference. I have no more performance to squeeze out of that. The only thing I can do is offload to CPU.

usuallyalurker11@reddit

Looks like a thread count of 3 is the sweet spot.

bonobomaster@reddit (OP)

Sweet spot for power efficiency?

usuallyalurker11@reddit

I think so. Looks like the tk/s is bottlenecked by memory bandwidth, so a thread count above 3 is just a waste of power.

Clear-Ad-9312@reddit

In my testing, dropping to 3 threads halves the speed. I am testing an Intel CPU with hyperthreading enabled and 2 channels of DDR4. It has 12 logical cores with hyperthreading, but only 6 physical cores. The sweet spot for me is matching the number of physical cores, which is what llama.cpp uses by default (half the logical cores). If the CPU is strong enough, then memory can be the bottleneck, but if memory and CPU match in performance (or the CPU is weaker), then the number of physical cores that map to each memory lane is more important.
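
To find the physical core count to match against, something like the following could work on Linux. It is a sketch under assumptions: it relies on the kernel's `thread_siblings_list` files under `/sys/devices/system/cpu/`, and the function names are invented for illustration.

```python
import glob


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,6' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def count_physical_cores(sibling_lists):
    """Count distinct sibling groups; each group is one physical core."""
    return len({frozenset(parse_cpu_list(s)) for s in sibling_lists})


def read_sibling_lists():
    """Read one sibling list per logical CPU from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists
```

`count_physical_cores(read_sibling_lists())` then gives a starting point for the thread count; the benchmark still decides whether fewer threads do better on a given machine.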

Wetbikeboy2500@reddit

In my testing, the biggest thing is to look for CPU bottlenecks, which are usually wattage or heat. Just for some historical context, I had found 8 cores to be the best for my machine for MoE offload. [https://www.reddit.com/r/LocalLLaMA/comments/1kaqx3x/comment/mppms06/](https://www.reddit.com/r/LocalLLaMA/comments/1kaqx3x/comment/mppms06/) Since then, after a lot of llama-bench runs and spot checking with llama-cli, 7 threads is slightly more stable for me. Also, not all cores are made equal. For Intel, there is the Intel Extreme Tuning Utility, where you can see which cores actually perform best and get information on wattage draw and temperature limits while running. I assume there is something similar for AMD. I found a temperature bottleneck, so I got a new cooler and it has been running much better. The main reason it runs better for me is that the CPU can hold its max boost clock without temperature limits throttling the cores. When I start to throw more cores at it, the performance stays the same, but I slowly see the boost clock go down alongside it, negating any gains. I have 8 GB VRAM and 64 GB DDR5. For Qwen3.6 with 34 layers offloaded to CPU, I get around 600-1000 pp/s and 43 tg/s with 7 threads.

Equivalent_Job_2257@reddit

This chart is most probably limited to one single machine on earth.

bonobomaster@reddit (OP)

Yeah, you are probably right. As I wrote, 5 threads is the happy place for my CPU. I should have written "my system" though, but I hadn't realized yet that this chart showed me my RAM bandwidth bottleneck rather than a CPU overhead problem. I'm learning... ;)

Equivalent_Job_2257@reddit

That's fine!

Iory1998@reddit

Exactly! I wish it was that easy. On my machine, pool size doesn't do squat.

gigaflops_@reddit

Thank you for doing this experiment and telling us which variables were and were not controlled. When I was a noob, I had to basically test everything you did myself. Hopefully this post gets upvoted a lot and shows up in Google search results for people trying to learn. If anyone else is wondering: the reason for this plateau and subsequent drop-off is that the bottleneck for generation tokens/sec on typical hardware is memory bandwidth. In this experiment, it took ~5 CPU cores to have enough compute to perform the math as fast as new numbers could be sent over from RAM. Adding more CPU cores after that increases compute without changing memory bandwidth, so the cores in use sit idle part of the time while they wait for slow RAM to give them something to do, plus there's a small overhead cost of coordinating the work between a bunch of cores.
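
The plateau part of that explanation can be put into a toy model. This is a back-of-the-envelope sketch only: it assumes OP's "2933 MHz" means DDR4-2933 (2933 MT/s) in dual-channel mode with 64-bit channels, the function names are made up, and the per-thread rate and cap are inputs you would have to estimate from your own runs.

```python
import math


def peak_bandwidth_gbs(mt_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second).

    ASSUMPTION: dual channel, 8 bytes (64 bits) per transfer per channel.
    """
    return mt_per_s * 1e6 * channels * bus_bytes / 1e9


def modeled_tks(threads, per_thread_tks, bandwidth_cap_tks):
    """Toy model: tk/s grows linearly with threads until the memory cap."""
    return min(threads * per_thread_tks, bandwidth_cap_tks)


def saturation_threads(per_thread_tks, bandwidth_cap_tks):
    """Smallest thread count at which the toy model reaches the cap."""
    return math.ceil(bandwidth_cap_tks / per_thread_tks)
```

Under those assumptions `peak_bandwidth_gbs(2933)` comes out at about 46.9 GB/s, and real sustained bandwidth is lower than that theoretical peak. The model only reproduces the plateau; the drop-off beyond the sweet spot (coordination overhead, cross-chiplet traffic) is not captured.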

MmmmMorphine@reddit

I think it's usually better to pin the cores to the job once you find the number you need. Actually, I'd use the next lower number to leave a bit of bandwidth for actually using the computer (may not be relevant on a server) and to accommodate the inevitable background processes, so you don't pogo your t/s.

bonobomaster@reddit (OP)

Ah, very interesting. I didn't think that far but your explanation absolutely makes sense. I probably saturated my RAM bandwidth with only 5 threads / cores... *sad slow ram noises*

_VirtualCosmos_@reddit

Btw, what does "MoE layers offloaded to CPU" mean if you are already processing it with the CPU?

eesnimi@reddit

That was the main drawback of LM Studio for me: it felt like a 6-thread CPU only ran 3. With pure llama.cpp I can run 5-6 and get much more CPU utilization. I'm still keeping LM Studio, but no longer as the API server; I went with llama-swap for that. LM Studio is still nice for model discovery and quick testing. It makes it easy to check out the smaller, more obscure models and their capabilities.

mp3m4k3r@reddit

Is this tokens/sec for generation or prompt processing?

bonobomaster@reddit (OP)

For generation.

dreamai87@reddit

This could be the split between performance cores and efficiency cores. If you set LM Studio to use performance cores by default, you will always get better throughput.

bonobomaster@reddit (OP)

Oldschool AMD doesn't do this shit! :D We are all equal here! ;)

moahmo88@reddit

I think the quality of the model is important.

42 Comments