Thread for CPU-only LLM performance comparison
Posted by MLDataScientist@reddit | LocalLLaMA | 46 comments
Hi everyone,
I could not find any recent posts comparing CPU-only performance across different CPUs. With recent advancements in CPUs, we are seeing incredible memory bandwidth: 12-channel DDR5-6400 on EPYC 9005 gives 614.4 GB/s theoretical bandwidth, and AMD has announced that Zen 6 CPUs will have 1.6 TB/s of memory bandwidth. The future of CPUs looks exciting, but for now I wanted to test what we already have. I need your help to see where we stand with CPUs currently.
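For reference, theoretical peak bandwidth is just channels × transfer rate × 8 bytes per channel:
12 channels × 6400 MT/s × 8 bytes = 614.4 GB/s (EPYC 9005, DDR5-6400)
8 channels × 3200 MT/s × 8 bytes = 204.8 GB/s (my EPYC 7532 below)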
For this CPU-only comparison, I want to use ik_llama - https://github.com/ikawrakow/ik_llama.cpp . I compiled and tested both ik_llama and llama.cpp with MoE models like Qwen3 30B3A Q4_1, gpt-oss 120B Q8, and Qwen3 235B Q4_1. ik_llama is at least 2x faster in prompt processing (PP) and 50% faster in text generation (TG).
For this benchmark, I used Qwen3 30B3A Q4_1 (19.2GB) (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_1.gguf) and ran ik_llama on Ubuntu 24.04.3.
ik_llama installation:
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
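If you previously built with GPU support and want to be sure the binary is CPU-only, a clean rebuild with the CUDA backend disabled should do it (flag name taken from upstream llama.cpp's CMake options, so double-check it against ik_llama's if it errors):
rm -rf build
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j $(nproc)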
llama-bench benchmark (make sure GPUs are disabled with CUDA_VISIBLE_DEVICES="" just in case you compiled with GPU support):
CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /media/ai-llm/wd_2t/models/Qwen3-30B-A3B-Q4_1.gguf --threads 32
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen3moe ?B Q4_1 | 17.87 GiB | 30.53 B | CPU | 32 | 0 | pp512 | 263.02 ± 2.53 |
| qwen3moe ?B Q4_1 | 17.87 GiB | 30.53 B | CPU | 32 | 0 | tg128 | 38.98 ± 0.16 |
build: 6d2e7ca4 (3884)
GPT-OSS 120B:
CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /media/ai-llm/wd_2t/models/GPT_OSS_120B_UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf -mmp 0 --threads 32
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| gpt-oss ?B Q8_0 | 60.03 GiB | 116.83 B | CPU | 32 | 0 | pp512 | 163.24 ± 4.46 |
| gpt-oss ?B Q8_0 | 60.03 GiB | 116.83 B | CPU | 32 | 0 | tg128 | 24.77 ± 0.42 |
build: 6d2e7ca4 (3884)
So, the requirement for this benchmark is simple:
- Required: use CPU-only inference (no APUs, NPUs, or built-in GPUs allowed)
- Use ik_llama (any recent version) if possible, since llama.cpp will be slower and understate your CPU's performance
- Required model: run the standard llama-bench benchmark with Qwen3-30B-A3B-Q4_1.gguf (the 2703 version should also be fine as long as it is Q4_1) and share the command with output in the comments, as I did above.
- Optional (not required but good to have): run a CPU-only benchmark with GPT-OSS 120B (file here: https://huggingface.co/unsloth/gpt-oss-120b-GGUF/tree/main/UD-Q8_K_XL) and share the command with output in the comments.
I will start by adding my CPU performance in this table below.
Motherboard | CPU (physical cores) | RAM size and type | Channels | Qwen3 30B3A Q4_1 TG (t/s) | Qwen3 30B3A Q4_1 PP (t/s)
---|---|---|---|---|---
AsRock ROMED8-2T | AMD EPYC 7532 (32 cores) | 8x32GB DDR4 3200 MHz | 8 | 39.98 | 263.02
I will check comments daily and keep updating the table.
This awesome community is the best place to collect such performance metrics.
Thank you!
wasnt_me_rly@reddit
MB: Dell T630
CPU: 2x E5-2695 v4 18c / 36t
RAM: 8x64GB DDR4-2400 ECC
Channels: 4 per CPU, 8 total
IK_LLAMA w/o HT
Not sure why the build is reporting as unknown, but it was synced and built today, so it's the latest.
wasnt_me_rly@reddit
Also ran the same w/ llama.cpp
LLAMA.CPP w/o HT
As this was compiled with the CUDA library, CUDA was disabled via a run-time switch.
MLDataScientist@reddit (OP)
Great! Thank you!
Klutzy-Snow8016@reddit
The command I used was
CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /path/to/Qwen3-30B-A3B-Q4_1.gguf -mmp 0 --threads (number of physical CPU cores)
in all cases.
Gigabyte B85M-D3H - Core i7-4790K - 32GB DDR3 1333 (Dual Channel) - Linux bare metal:
Asus TUF B450M-Plus Gaming - Ryzen 7 2700 - 32GB DDR4 3200 (Dual Channel) - wsl within Windows:
Gigabyte B550 AORUS ELITE AX V2 - Ryzen 7 3700X - 128GB DDR4 3200 (Dual Channel) - Linux bare metal:
Gigabyte B450 Aorus M - Ryzen 7 5800X3D - 128GB DDR4 3200 (Dual Channel) - wsl within Windows:
The 2700 is way slower than the 3700X, apparently.
MLDataScientist@reddit (OP)
Thank you for multiple results. I will add them soon.
Rynn-7@reddit
Motherboard | CPU (physical cores) | RAM | Channels | Qwen3 30B3A Q4_1 TG | Qwen3 30B3A Q4_1 PP
:--|:--:|:--|:--:|:--:|--:
AsRock ROMED8-2T | AMD EPYC 7742 (64 cores) | 8x64GB DDR4 3200 MT/s | 8 | 37.05 | 358.97
Rynn-7@reddit
Since OP and I have nearly the same hardware, we can look at how CPU cores affect performance.
Doubling the core count from 32 to 64 cores has no effect on token/second generation rates, but it speeds up time to first token by approximately 70%.
MLDataScientist@reddit (OP)
Great results, thanks! I will add these results to the post soon.
mike95465@reddit
Asrock x399 Taichi
Threadripper 1950x (16C 32T)
64GB DDR4 3000 MHz CL22 (4x16GB, quad channel)
Qwen3-30B-A3B-Q4_1.gguf
latest ik_llama
| Threads | pp512 t/s (±) | tg128 t/s (±) |
| ------- | ------------- | ------------- |
| 8 | 55.63 ± 0.22 | 20.32 ± 0.14 |
| 10 | 64.87 ± 0.31 | 21.63 ± 0.15 |
| 12 | 75.64 ± 1.22 | 21.62 ± 0.63 |
| 14 | 77.32 ± 2.91 | 21.59 ± 0.55 |
| 16 | 70.36 ± 4.54 | 21.16 ± 0.66 |
| 18 | 64.98 ± 0.22 | 20.35 ± 0.19 |
| 20 | 72.01 ± 0.21 | 20.34 ± 0.23 |
| 22 | 75.21 ± 0.33 | 20.31 ± 0.26 |
| 24 | 84.00 ± 0.46 | 20.21 ± 0.25 |
| 26 | 86.13 ± 0.55 | 19.41 ± 0.32 |
| 28 | 86.64 ± 0.26 | 18.04 ± 0.15 |
| 30 | 87.09 ± 0.65 | 15.62 ± 0.55 |
| 32 | 90.14 ± 0.64 | 9.66 ± 0.87 |
MLDataScientist@reddit (OP)
Thank you!
Secure_Reflection409@reddit
OP: I was thinking of getting this CPU, but those numbers are not super exciting. Have you measured memory bandwidth?
MLDataScientist@reddit (OP)
Yes, in a triad bench I was getting 145 GB/s. I am sure there is a way to improve this, but I have not looked into BIOS settings. Theoretical for 8-channel DDR4-3200 is 8 × 3200 MT/s × 8 bytes = 204.8 GB/s, so at 90% efficiency we should get ~184 GB/s. But I need to work with the BIOS.
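For anyone who wants to measure their own triad number, a minimal run of the classic STREAM benchmark looks roughly like this (a sketch: assumes gcc with OpenMP and an array size well above total L3; -mcmodel=medium is needed for large static arrays):
# stream.c is from https://www.cs.virginia.edu/stream/
gcc -O3 -march=native -fopenmp -mcmodel=medium -DSTREAM_ARRAY_SIZE=200000000 stream.c -o stream
OMP_NUM_THREADS=32 ./stream   # report the Triad line (MB/s)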
Rynn-7@reddit
Be sure to post any findings here. I've only been messing with AI locally for about a month now, but any improvement I can make to my system would be huge. Right now I have almost identical T/s to your rig (our hardware is largely similar).
MLDataScientist@reddit (OP)
Also, my CPU is not water cooled. I am using just a Dynatron U2 cooler.
gapingweasel@reddit
tbh it makes old server junk way more interesting... those dusty EPYCs/Xeons with fat memory channels you see on eBay suddenly look like budget LLM toys... it's crazy that decommissioned gear can outpace shiny new desktop CPUs for this niche.
zipzag@reddit
Especially because it's the typical always-on server use that makes these monsters so unattractive.
I expect 120B to run much slower with a large context window, and to need an extra 10+ GB.
jmager@reddit
I think it would be very useful to try different thread counts. I found with my 7950X (two channels) that I actually got worse performance if the thread count got too large. In my case, the best performance was with 2 threads per memory channel. I'd suspect it's all an interplay between memory latency and thread starvation, and more data could help us capture that relationship.
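If anyone wants to collect that data, llama-bench takes comma-separated lists for its parameters (at least in upstream llama.cpp; ik_llama's llama-bench appears to behave the same), so a single sweep like this gives one row per thread count:
CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /path/to/Qwen3-30B-A3B-Q4_1.gguf -mmp 0 -t 4,8,12,16,24,32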
Otherwise-Loss-8419@reddit
This is what I get on my PC running a manual RAM OC.
CPU: 14900K @ 5.6 GHz P-core, 4.8 GHz ring
RAM: 48GB DDR5 @ 7600. Gets about 119GB/s bandwidth and 46.8ns latency measured by Intel MLC.
Motherboard is Asrock z790 riptide wifi
Running kernel 6.16.5-zen on Arch with the cpu governor set to performance.
llama.cpp:
ik_llama.cpp:
It would possibly perform a bit better with hyper-threading, but I don't really want to enable it just for a benchmark.
Some notes/observations
E-cores absolutely ruin performance on both pp and tg. --threads 24 performs worse than --threads 4. So, on Intel, it's best to only use the P-cores.
Doing taskset helps a bit (~5%) with ik_llama.cpp, but doesn't change anything on llama.cpp. Not sure why.
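For reference, the pinning I mean is along these lines (core numbering is an assumption for my 14900K with HT off, where the P-cores enumerate as 0-7; check lscpu on your own system):
CUDA_VISIBLE_DEVICES="" taskset -c 0-7 ./build/bin/llama-bench -m /path/to/Qwen3-30B-A3B-Q4_1.gguf -mmp 0 --threads 8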
MLDataScientist@reddit (OP)
Thank you! Those are very good numbers for Intel 14900k.
milkipedia@reddit
My kit:
Lenovo P620 workstation (proprietary AMD Castle Peak)
CPU: AMD Ryzen Threadripper PRO 3945WX 12-Cores
Memory: 128 GB 288-Pin, DDR4 3200MHz ECC RDIMM (8 x 16GB)
Qwen3-30B-A3B-Q4_1 on ik_llama.cpp:
gpt-oss-120b-UD-Q8_K_XL on ik_llama.cpp:
Git commit log info for ik_llama.cpp, since I'm not sure how else to share version info for my build environment:
milkipedia@reddit
For comparison's sake, because I haven't yet figured out how to tune ik_llama.cpp to produce significantly better performance than plain vanilla llama.cpp...
Qwen3-30B-A3B-Q4_1 on llama.cpp:
gpt-oss-120b-UD-Q8_K_XL on llama.cpp:
Git commit log info llama.cpp:
MLDataScientist@reddit (OP)
Thank you! Oh, the gpt-oss 120B performance is interesting. Not sure why you are getting 2 t/s in ik_llama and ~14 t/s in llama.cpp.
In my case, I was getting ~16 t/s in llama.cpp, but ik_llama compiled with the command in the post gave me ~25 t/s.
milkipedia@reddit
A couple of weeks back, I tried a bunch of different tuning parameters to see if I could get a different outcome, using the ggml.org MXFP4 quant. Maybe the DDR4 RAM is the limiting factor here; I really don't know. Thankfully, I have an RTX 3090 GPU that speeds this up quite a lot, or else gpt-oss-120b would not be usable at all for me.
I don't recall the command I used to compile ik_llama.cpp, so let me give it a try with what you posted and see if the results differ.
TechnoRhythmic@reddit
Great thread. Can you also add higher context length benchmarks? There is a simple flag for it, I think.
MLDataScientist@reddit (OP)
Good point. I will add 8k context as well.
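Something like this should do it; -p controls the prompt length llama-bench tests, so a pp8192 run approximates long-context prefill (a sketch, exact behavior may differ a bit between ik_llama and llama.cpp builds):
CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /path/to/Qwen3-30B-A3B-Q4_1.gguf -mmp 0 --threads 32 -p 8192 -n 128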
Steus_au@reddit
Is it only for geeks, or is it possible to test on Win10?
MLDataScientist@reddit (OP)
You should be able to compile ik_llama on Windows and run the same tests.
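Roughly this, from a Developer Command Prompt with Git and CMake installed (a sketch I haven't verified on Windows myself; with the Visual Studio generator the binaries usually land under build\bin\Release):
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release
build\bin\Release\llama-bench.exe -m C:\path\to\Qwen3-30B-A3B-Q4_1.gguf --threads 16
Set --threads to your physical core count, and if you built with CUDA, clear CUDA_VISIBLE_DEVICES first.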
ttkciar@reddit
I need to update this table with more recent models' performances, but it's where I've been recording my pure-CPU inference with llama.cpp on my laptop (i7-9750H CPU) and ancient Xeon (dual E5-2660v3):
http://ciar.org/h/performance.html
MLDataScientist@reddit (OP)
You should definitely test ik_llama. You will see a good speedup.
lly0571@reddit
I don't have a Q4_1 model right now; the Q4_K_XL quants I am using could be slower.
This is my PC; it doesn't have enough RAM to run GPT-OSS-120B.
Motherboard: MSI B650M Mortar
RAM: 2 x 32GB DDR5 6000
CPU: Ryzen 7 7700(8c)
This is my server. I think there is some config issue here, as using 64 threads is much slower; maybe I should enable HT.
Motherboard: Tyan S8030GM2NE
RAM: 8 x 64GB DDR4 2666
CPU: 1S EPYC 7B13 (64c, HT disabled manually)
MLDataScientist@reddit (OP)
Yes, there is definitely something wrong with the server in your case. You should get better results than my server.
MLDataScientist@reddit (OP)
Thank you!
Secure_Reflection409@reddit
Maybe try more threads?
Pentium95@reddit
I used the ik_llama.cpp sweep bench to test every thread count with my Ryzen 9 5950X (16 cores, 32 threads, 64MB L3) and 4x16GB DDR4 3800 MHz. The thread count that gave me the best PP and TG speed was 7, with CPU + GPU inference. I never tested CPU-only, though; due to the importance of L3 cache usage, I think the sweet spot is not going to be above 9 threads. Linux Fedora. In many posts I've seen, lots of users recommend "physical cores - 1", and that was correct with my older CPU (6 cores, 12 threads), where 5 was the sweet spot. I tried to understand why 7 threads give me better performance than 15 threads, and I found it is connected with the huge amount of time "wasted" on L3 cache misses caused by threads constantly loading and unloading LLM weights from system memory.
MelodicRecognition7@reddit
This is correct only for generic low-core gaming CPUs; it does not hold for server CPUs.
https://old.reddit.com/r/LocalLLaMA/comments/1ni67vw/llamacpp_not_getting_my_cpu_ram/nehqxgv/
https://old.reddit.com/r/LocalLLaMA/comments/1ni67vw/llamacpp_not_getting_my_cpu_ram/nehnt27/
Rynn-7@reddit
I haven't seen this in my personal testing on an EPYC CPU. I only see a very moderate drop in token generation speed by utilizing every thread in the system, roughly 5% less.
In return, I get a massive reduction in TTFT.
Secure_Reflection409@reddit
It prolly ain't optimal for any CPU.
The definitive way is to check CPU utilisation and increment or decrement from there. You want to be as close to 100% as possible without hitting 100, IMHO.
For me, on a 7800X3D, that's 12 threads but I did see at least one benchmark respond better with 16.
It's an 8 core / 16 thread processor.
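One way to watch it (assuming the sysstat package is installed) is to leave a per-core monitor running in a second terminal while llama-bench runs:
mpstat -P ALL 2   # or just htop; you want the worker cores near 100% without everything pegged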
Pentium95@reddit
Yeah, 2 memory channels; memory bandwidth is a huge bottleneck. CPU inference needs at least 8 memory channels with 5600 MHz modules to really get decent speeds.
KillerQF@reddit
For such a table, it would be useful to include the name of the framework (ik_llama, llama.cpp, ...) and the version.
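llama-bench already prints the build at the bottom of its output (e.g. "build: 6d2e7ca4 (3884)" above); otherwise, something like this from inside the repo pins it down:
git log -1 --oneline        # commit the binary was built from
git remote get-url origin   # confirms whether it's ik_llama.cpp or llama.cpp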
chisleu@reddit
MLDataScientist@reddit (OP)
Well, no. Any CPU should be fine for this benchmark as long as you have 20GB+ CPU RAM for qwen3 30B3A.
NoFudge4700@reddit
I cannot read it on my phone. How many tokens per second did you get, and what context window did you set?
MLDataScientist@reddit (OP)
Qwen3 30B3A Q4_1 runs at ~40 t/s with 263 t/s prompt processing (CPU only).
NoFudge4700@reddit
That is decent performance. I have an Intel 14700KF and 32 GB of DDR5 RAM. Can I pull the same stats?
MLDataScientist@reddit (OP)
Not sure. I think you might not get ~40 t/s with two-channel memory; I have 8-channel memory with a server CPU. Please run llama-bench and share the results here.
NoFudge4700@reddit
Will do, thanks.