LM Studio CPU thread pool size vs. tk/s with some MoE layers offloaded to CPU

Posted by bonobomaster@reddit | LocalLLaMA | View on Reddit | 42 comments

Plastic-Stress-6468@reddit

Interesting, thanks for sharing. Maybe you can try something like CPU lasso to peg all background processes to a few dedicated cores and reserve the remaining cores just for inferencing. The 3900X uses a dual-chiplet design, so not having to pipe things across chiplets might yield more perf benefits. Try to peg llama.cpp to only the cores on one CCD and peg all other processes to the other CCD.
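
For anyone on Linux who wants to try the pinning idea without extra tools, here is a minimal sketch in Python. The core numbering is an assumption (logical CPUs 0-11 are the physical cores in CCD order, and the SMT sibling of core i is i + 12); the helper names `ccd_cpus` and `pin_process` are made up for illustration. Check the real layout with `lscpu -e` before relying on it.

```python
import os


def ccd_cpus(ccd_index, cores_per_ccd=6, total_cores=12, include_smt=False):
    """Return the logical CPU ids assumed to belong to one CCD.

    Assumption: logical CPUs 0..total_cores-1 are the physical cores in CCD
    order, and the SMT sibling of core i is i + total_cores.
    """
    first = ccd_index * cores_per_ccd
    cpus = set(range(first, first + cores_per_ccd))
    if include_smt:
        # Add the assumed SMT sibling of every physical core on this CCD.
        cpus |= {c + total_cores for c in cpus}
    return cpus


def pin_process(pid, cpus):
    """Restrict a process to the given logical CPUs (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is only available on Linux")
    os.sched_setaffinity(pid, cpus)
```

Calling `pin_process(pid, ccd_cpus(0))` with the pid of the inference process would keep it on the first CCD; a pid of 0 means the calling process itself.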
View on Reddit #83809947

Clear-Ad-9312@reddit

I actually tried this on Linux and it negatively affected performance. It might be that my memory and CPU match closely in speed, or that with hyperthreading, where 2 logical cores share the same physical core, the two can work in tandem while Linux dynamically decides which logical core does the work. A physical core can only handle a few memory lanes at a time, and 2 logical cores sharing the same memory lane do not increase speed efficiently enough. With MoE, however, I imagine each memory lane gets saturated dynamically. So the default llama.cpp config of half the logical CPU cores (the same number as physical cores) is the sweet spot for performance. That is with Intel's hyperthreading technology. I imagine each CPU and RAM configuration will have a different sweet spot. I also wonder if core pinning is even worth it at all on a different setup.
View on Reddit #83825313

Plastic-Stress-6468@reddit

Hey I think you are mixing up two concepts. Hyperthreading for Intel and SMT for AMD are fundamentally the same thing, just a clever scheduler trying to saturate cores by presenting a virtual thread to achieve async compute parallelism. What I talked about was the fact that AMD Zen cpus use a chiplet design which is two CCDs (chiplets) glued together using an interconnect so they can talk to each other, and together present to the MB as one single CPU as opposed to two NUMA nodes. There are actual cores on both chiplets, not conceptual virtual cores used for the scheduler.
View on Reddit #83861626

tony__Y@reddit

bottlenecked by memory bandwidth
View on Reddit #83812355

FatheredPuma81@reddit

This.
View on Reddit #83861321

SnooPaintings8639@reddit

Can anyone explain why the tps falls as we add more than 6 cores? Also, similar growth for prompt processing would probably be more linear.
View on Reddit #83809695

KageYume@reddit

The 3900X's 12 cores are divided into 2 groups (CCX - core complex), and each group has its own cache. So when a task involves more than 6 cores, inter-CCX communication has to be used, which slows everything down.
View on Reddit #83855737

mlhher@reddit

As it is not immediately visible from the chart: only physical cores provide a speedup, virtual cores do not (they actually slow things down). I assume the person's CPU has 5 or 6 physical cores and the rest are virtual (common for desktop Intel CPUs).
View on Reddit #83810543

bonobomaster@reddit (OP)

The person has 12 physical cores and 24 logical cores. ;)
View on Reddit #83811280

Zc5Gwu@reddit

Right but he should have seen scaling to 6 but it seems to drop at 3…
View on Reddit #83810964

Plastic-Stress-6468@reddit

It's almost certainly because there is additional latency for data being piped between the 3900X's two CCDs.
View on Reddit #83810575

bonobomaster@reddit (OP)

Yeah, I'll probably check that out in the future. I'm interested in that result as well. But my feeling, which prompted this little benchmark, hasn't acted up with the thread count for prompt processing. ;) I think higher is better there.
View on Reddit #83810192

bonobomaster@reddit (OP)

I did a little benchmark of the CPU thread pool size option in LM Studio vs. output speed in tk/s with some MoE layers offloaded to CPU, because I always had a feeling that a higher thread count was detrimental to performance.

For this particular benchmark I used qwen3.6-35b-a3b@MXFP4, but my feeling was the same regardless of quant, number of forced CPU layers, or MoE model. I don't know if it's the same with dense models with offloaded layers.

I did 5 runs for each thread count and averaged the results. For my particular CPU the happy place was 5 threads.

Setup:
- Prompt: Write 25 random words. Output as a numbered list.
- enable_thinking: false (needed roughly the same token count for each run)
- GPU offload: set to 40 layers (all)
- Forced CPU layers: set to 16 layers
- CPU: AMD Ryzen 9 3900X, 12 cores / 24 threads
- RAM: 84 GB of very slow DDR4 @ 2933 MHz
- GPU: 5070 Ti
- VRAM: 16 GB GDDR7

Variables not included in this little "experiment":
- Does the number of MoE layers forced to the CPU influence the sweet spot of threads? My feeling from past usage says no, but who knows?!
- The number of tokens varied from around 120 to 150 per run.
- Everything I missed and you can think of ;)
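
For anyone repeating this, the "5 runs per thread count, averaged" step can be sketched in a few lines of Python. This is only one way to do the bookkeeping, not how the chart above was produced; `summarize` and `sweet_spot` are illustrative names, and the numbers in the example are placeholders, not measurements from this thread.

```python
from statistics import mean, stdev


def summarize(runs_by_threads):
    """Map thread count -> (mean tk/s, sample standard deviation)."""
    summary = {}
    for threads, rates in runs_by_threads.items():
        # stdev needs at least two samples; report 0.0 for a single run.
        spread = stdev(rates) if len(rates) > 1 else 0.0
        summary[threads] = (mean(rates), spread)
    return summary


def sweet_spot(summary):
    """Thread count with the highest mean tk/s; ties go to the lower count."""
    return min(summary, key=lambda t: (-summary[t][0], t))


# Placeholder values for illustration only, NOT the results from the chart.
example = {
    4: [10.0, 10.2, 9.8],
    5: [11.0, 11.1, 10.9],
    6: [10.5, 10.4, 10.6],
}
print(sweet_spot(summarize(example)))
```

Keeping the standard deviation next to the mean makes it easier to tell whether neighbouring thread counts really differ or are within run-to-run noise.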
View on Reddit #83808185

GreaterThanLess@reddit

I've tested this out on my system and it shows a similar curve. I only did 1 run per thread count, but the prompt I used had it generating more, at 2000-3000 tokens, so it smooths out okay.

- Model: Qwen3.6-35B-A3B-UD-Q4_K_XL
- CPU: Ryzen 3900X
- RAM: 96 GB DDR4 @ 2133 MT/s
- GPU: 4070 Ti 12 GB

https://preview.redd.it/sxi1cq0srzvg1.png?width=605&format=png&auto=webp&s=e7e45fe59c33692626635c11088a878d32ed40a1
View on Reddit #83827638

bonobomaster@reddit (OP)

Huh, same CPU, slower RAM, higher sweet-spot thread count... that would be interesting. But I have a hunch that the 1000-token variance of your 2000-3000 token answers could introduce a hefty measurement error. That's why I took 5 runs each with a shorter output, to keep tokens per second and context length more uniform across runs. And in a laptop system, temperature could also become more of a factor with longer inference times.

When an output at the real sweet spot is, for example, 3000 tokens long, speed degrades while inferring through the larger context (maybe temperature plays a part too), which reduces tk/s and could skew the result and hide the real sweet spot. The same goes for a spot that isn't sweet but only had to generate 2000 tokens: less speed degradation, quicker result, higher tk/s, false positive for being sweet. :D

Inference speed isn't linear and changes with the growing context length.
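
To illustrate the point about output length, here is a toy model in Python. It assumes each token costs a fixed base time plus a small amount that grows linearly with its position in the context; that linear slowdown is purely an assumption for illustration, and the real curve depends on model, backend, and hardware.

```python
def average_tps(n_tokens, base_s_per_token, slowdown_s_per_token):
    """Average tk/s over a generation where token n costs
    base_s_per_token + slowdown_s_per_token * n seconds (toy model)."""
    total_s = sum(
        base_s_per_token + slowdown_s_per_token * n for n in range(n_tokens)
    )
    return n_tokens / total_s


# Same hypothetical machine and settings, only the output length differs.
short_run = average_tps(2000, base_s_per_token=0.05, slowdown_s_per_token=1e-5)
long_run = average_tps(3000, base_s_per_token=0.05, slowdown_s_per_token=1e-5)
print(short_run > long_run)
```

Under this model the shorter answer always reports the higher average tk/s, even though nothing about the thread count changed, which is why comparing runs of very different lengths can mislead.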
View on Reddit #83833022

aaronr_90@reddit

Did you start at 12 threads and work your way down, or did you start at 1 and go up, or was it random? If you do it again, you should reverse the ordering. Could your CPU and GPU be thermal throttling? I don’t think that is enough of a loss to be thermal throttling, but just putting it out there.
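
One way to take ordering effects out of the picture is to shuffle the whole schedule instead of sweeping up or down. A minimal Python sketch, where `run_order` is just an illustrative helper name:

```python
import random


def run_order(thread_counts, repeats, seed=None):
    """Return a shuffled list of (thread_count, repeat_index) pairs."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    schedule = [(t, r) for t in thread_counts for r in range(repeats)]
    rng.shuffle(schedule)
    return schedule


# 12 thread counts x 5 repeats, in random order.
for threads, repeat in run_order(range(1, 13), repeats=5, seed=0):
    pass  # run the benchmark for this thread count here and record tk/s
```

With the runs interleaved like this, slow drift from heat or background load gets spread across all thread counts instead of piling up at one end of the chart.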
View on Reddit #83808866

bonobomaster@reddit (OP)

Nowhere near throttling. I just did a separate test of 5 fast successive runs:

- GPU @ 5 threads: went from 32.5 °C idle to 37.5 °C and dropped back to baseline in a handful of seconds
- GPU @ 12 threads: only reached 36.9 °C
- CPU @ 5 threads: core temp went from 36.2 °C to 47.3 °C and dropped back to baseline in a handful of seconds
- CPU @ 12 threads: max core temp was 49.9 °C

Order was 1 thread to 12 threads. The time I took to enter the results into my spreadsheet was pretty much enough to cool everything down, so the temperature variation seems to be without relevant consequence. I absolutely won't repeat the experiment. ;)
View on Reddit #83809840

aaronr_90@reddit

Ok, I believe you. But I also gave it a go to see what the sweet spot on my laptop was. I have a Dell XPS laptop with an 8 GB 4070. I used Qwen3.5 35B A3B, offloaded all layers to GPU, and then forced 33 layers onto the CPU (the sweet spot before shared mem goes up and t/s goes down). I sampled the thread counts in random order, and in the end I did not see much of a difference between high thread counts done up front and at the end.

https://preview.redd.it/ck3fj479xyvg1.png?width=563&format=png&auto=webp&s=77b0feb5d8e974b60fdedbd9b77d6d21b529e22c

I fully expected my results to have a drop-off as yours did, due to overhead and diminishing returns. This took 5 minutes. I would honestly recommend everyone do a test like this if you are curious.
View on Reddit #83814044

bonobomaster@reddit (OP)

Very interesting. Do you have DDR5 RAM by any chance? I guess I only have enough RAM bandwidth for about 5 threads' worth of compute! :D
View on Reddit #83815717

aaronr_90@reddit

I do! I have 64 gb of DDR5
View on Reddit #83816668

denoflore_ai_guy@reddit

Maybe I should start doing this so my cpu hits above 3% during each prompt processing and inference output.
View on Reddit #83821616

bonobomaster@reddit (OP)

Are you sure, you want more CPU involved? :|
View on Reddit #83822728

denoflore_ai_guy@reddit

Right now my home PCIE bus is pinned completely to 100% on inference. I have no more performance to squeeze out of that. Only thing I can do is offload to cpu
View on Reddit #83826294

usuallyalurker11@reddit

Looks like a thread count of 3 is the sweet spot.
View on Reddit #83809086

bonobomaster@reddit (OP)

Sweet spot for power efficiency?
View on Reddit #83809974

usuallyalurker11@reddit

I think so. Looks like the tk/s is bottlenecked by memory bandwidth, so a thread count above 3 is just a waste of power.
View on Reddit #83814628

Clear-Ad-9312@reddit

In my testing, dropping to 3 threads halves the speed. I am testing an Intel CPU with hyperthreading enabled and 2 channels of DDR4. It has 12 logical cores with hyperthreading, but only 6 physical cores. The sweet spot for me is matching the number of physical cores, which is half the logical cores and the default in llama.cpp. If the CPU is strong enough, then memory can be the bottleneck, but if memory and CPU match in performance (or the CPU is weaker), then the number of physical cores that map to each memory lane is more important.
View on Reddit #83824697

Wetbikeboy2500@reddit

In my testing, the biggest thing is to look for CPU bottlenecks, which are usually wattage or heat. Just for some historical knowledge, I had found 8 cores to be the best for my machine for MoE offload: https://www.reddit.com/r/LocalLLaMA/comments/1kaqx3x/comment/mppms06/

Since then, with a lot of llama-bench and spot checking llama-cli, 7 threads is slightly more stable for me. Also, not all cores are made equal. For Intel, there is the Intel Extreme Tuning Utility, where you can see which cores are actually the best performing and get information about wattage draw and temperature limits while running. I assume there is something similar for AMD.

I found bottlenecks in temperature, so I got a new cooler and it has been running much better. The main reason it runs better for me is that the CPU can maintain its max boost clock without temperature limits throttling the cores. When I start to throw more cores at it, the performance does stay the same, but I slowly see the boost clock go down alongside it to negate any gains.

I have 8 GB VRAM and 64 GB DDR5. For Qwen3.6 with 34 layers offloaded to CPU, I get around 600-1000 pp/s and 43 tg/s with 7 threads.
View on Reddit #83823548

Equivalent_Job_2257@reddit

This chart is most probably limited to one single machine on earth. 
View on Reddit #83814157

bonobomaster@reddit (OP)

Yeah, you are probably right. As I wrote, 5 threads is the happy place for my CPU. I should have written "my system", though, but I hadn't realized yet that this chart showed me my RAM bandwidth bottleneck rather than a CPU overhead problem. I'm learning... ;)
View on Reddit #83818247

Equivalent_Job_2257@reddit

That's fine!
View on Reddit #83822753

Iory1998@reddit

Exactly! I wish it was that easy. On my machine, pool size doesn't do squat.
View on Reddit #83815131

gigaflops_@reddit

Thank you for doing this experiment and telling us which variables were and were not controlled. When I was a noob, I basically had to test everything you did myself. Hopefully this post gets upvoted a lot and shows up in Google search results for people trying to learn.

If anyone else is wondering: the reason for this plateau and subsequent drop-off is that the bottleneck in generation tokens/sec on typical hardware is memory bandwidth. In this experiment, it took ~5 CPU cores to have enough compute that the cores could perform math as fast as new numbers could be sent over from RAM. Adding more CPU cores after that increases compute without changing memory bandwidth, so the cores in use sit idle part of the time while they wait for slow RAM to give them something to do, plus there's a small overhead cost of coordinating the work between a bunch of cores.
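
The plateau part of this explanation can be written down as a tiny roofline-style model in Python. It is a sketch under assumptions: dual-channel memory with a 64-bit bus per channel, the RAM speed figure treated as MT/s, and per-thread speed and memory cap as hypothetical inputs you would have to estimate yourself. It only models the plateau, not the drop-off from coordination overhead.

```python
import math


def peak_bandwidth_gbs(mt_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes each channel moves bus_bytes per transfer (8 bytes = 64-bit bus).
    """
    return mt_per_s * 1e6 * channels * bus_bytes / 1e9


def modeled_tps(threads, per_thread_tps, bandwidth_cap_tps):
    """Compute scales with thread count until the memory-bound cap is hit."""
    return min(threads * per_thread_tps, bandwidth_cap_tps)


def saturation_threads(per_thread_tps, bandwidth_cap_tps):
    """Smallest thread count at which the modeled speed reaches the cap."""
    return math.ceil(bandwidth_cap_tps / per_thread_tps)


# Theoretical peaks for the two DDR4 speeds mentioned in this thread,
# assuming dual channel in both cases.
print(peak_bandwidth_gbs(2933), peak_bandwidth_gbs(2133))
```

Real sustained bandwidth is lower than the theoretical peak, so treat these numbers as an upper bound rather than a prediction.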
View on Reddit #83809973

MmmmMorphine@reddit

I think it's usually better to pin the cores to the job once you find the number you need. Actually I'd use the next lower number to leave a bit of bandwidth for actual use of the computer (may not be relevant if a server) and to accommodate the inevitable background processes needed so you don't pogo your t/s
View on Reddit #83818958

bonobomaster@reddit (OP)

Ah, very interesting. I didn't think that far but your explanation absolutely makes sense. I probably saturated my RAM bandwidth with only 5 threads / cores... \*sad slow ram noises\*
View on Reddit #83810456

_VirtualCosmos_@reddit

Btw, what does "MoE layers offloaded to CPU" mean if you are already processing it with the CPU?
View on Reddit #83817706

eesnimi@reddit

That was the main drawback of my LM Studio use: it felt like a 6-thread CPU only ran 3. With pure llama.cpp I can run 5-6 and get much more CPU utilization. I'm still keeping LM Studio, but no longer as the API server; I went with llama-swap for that. LM Studio is still nice for model discovery and quick testing. It makes it easy to check out the smaller, more obscure models and their capabilities.
View on Reddit #83813040

mp3m4k3r@reddit

Is this tokens/sec for generation or prompt processing?
View on Reddit #83811963

bonobomaster@reddit (OP)

For generation.
View on Reddit #83812081

dreamai87@reddit

This could be the split between performance cores and efficiency cores. If you set LM Studio to use the performance cores by default, you will always get better throughput.
View on Reddit #83811042

bonobomaster@reddit (OP)

Oldschool AMD doesn't do this shit! :D We are all equal here! ;)
View on Reddit #83811118

moahmo88@reddit

I think the quality of the model is important.
View on Reddit #83810366