Thread for CPU-only LLM performance comparison

Posted by MLDataScientist@reddit | LocalLLaMA

Hi everyone,

I could not find any recent posts comparing CPU-only performance across different CPUs. With recent advancements, we are seeing incredible memory bandwidth: 12-channel DDR5-6400 on EPYC 9005 gives 614.4 GB/s of theoretical bandwidth. AMD has also announced that Zen 6 CPUs will reach 1.6 TB/s of memory bandwidth. The future of CPUs looks exciting, but for now I wanted to test what we already have, and I need your help to see where we currently stand.
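
For reference, that 614.4 GB/s figure is just channels × transfer rate × 8 bytes per 64-bit channel; a quick sanity check in the shell:

# theoretical peak = channels * MT/s * 8 bytes per 64-bit channel
echo "$((12 * 6400 * 8)) MB/s"   # 614400 MB/s = 614.4 GB/s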

For this CPU-only comparison, I want to use ik_llama - https://github.com/ikawrakow/ik_llama.cpp . I compiled and tested both ik_llama and llama.cpp with MoE models like Qwen3 30B-A3B Q4_1, gpt-oss 120B Q8, and Qwen3 235B Q4_1. In my tests, ik_llama is at least 2x faster at prompt processing (PP) and about 50% faster at text generation (TG).
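
If you want to reproduce the ik_llama vs. llama.cpp comparison yourself, stock llama.cpp builds the same way (the model path below is illustrative, point it at wherever your GGUF lives):

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /path/to/Qwen3-30B-A3B-Q4_1.gguf --threads 32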

For this benchmark, I used Qwen3 30B-A3B Q4_1 (19.2GB) (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_1.gguf) and ran ik_llama on Ubuntu 24.04.3.
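
If you don't have the model yet, one way to grab it (assuming you have the huggingface_hub CLI installed):

pip install -U huggingface_hub
huggingface-cli download unsloth/Qwen3-30B-A3B-GGUF Qwen3-30B-A3B-Q4_1.gguf --local-dir .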

ik_llama installation:

git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
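
Before benchmarking, it's worth matching --threads to your physical core count (SMT threads usually don't help for this workload). To check:

# physical cores = Core(s) per socket x Socket(s)
lscpu | grep -E '^(Socket|Core|Thread)'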

llama-bench benchmark (make sure GPUs are disabled with CUDA_VISIBLE_DEVICES="" in case you compiled with GPU support):

CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /media/ai-llm/wd_2t/models/Qwen3-30B-A3B-Q4_1.gguf --threads 32

| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen3moe ?B Q4_1               |  17.87 GiB |    30.53 B | CPU        |      32 |    0 |         pp512 |    263.02 ± 2.53 |
| qwen3moe ?B Q4_1               |  17.87 GiB |    30.53 B | CPU        |      32 |    0 |         tg128 |     38.98 ± 0.16 |

build: 6d2e7ca4 (3884)

GPT-OSS 120B:

CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /media/ai-llm/wd_2t/models/GPT_OSS_120B_UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf -mmp 0 --threads 32
| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| gpt-oss ?B Q8_0                |  60.03 GiB |   116.83 B | CPU        |      32 |    0 |         pp512 |    163.24 ± 4.46 |
| gpt-oss ?B Q8_0                |  60.03 GiB |   116.83 B | CPU        |      32 |    0 |         tg128 |     24.77 ± 0.42 |

build: 6d2e7ca4 (3884)
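
Note the -mmp 0 flag above: it disables mmap, so the full ~60 GiB of weights is loaded into RAM before the run. A quick headroom check before launching:

free -g   # want comfortably more than 60 GiB free for weights + KV cache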

So, the requirement for this benchmark is simple: run the llama-bench command above with Qwen3 30B-A3B Q4_1 (CPU only) and post your PP and TG numbers along with your motherboard, CPU, RAM type, and channel count.

I will start by adding my own CPU performance to the table below.

| Motherboard | CPU (physical cores) | RAM size and type | Channels | Qwen3 30B3A Q4_1 TG (t/s) | Qwen3 30B3A Q4_1 PP (t/s) |
| --- | --- | --- | ---: | ---: | ---: |
| ASRock ROMED8-2T | AMD EPYC 7532 (32 cores) | 8x32GB DDR4-3200 | 8 | 38.98 | 263.02 |

I will check comments daily and keep updating the table.

This awesome community is the best place to collect such performance metrics.

Thank you!