Thread for CPU-only LLM performance comparison

Posted by MLDataScientist@reddit | LocalLLaMA | View on Reddit | 46 comments

Hi everyone,

I could not find any recent posts about CPU only performance comparison of different CPUs. With recent advancements in CPUs, we are seeing incredible memory bandwidth speeds with DDR5 6400 12 channel EPYC 9005 (614.4 GB/s theoretical bw). AMD also announced that Zen 6 CPUs will have 1.6TB/s memory bw. The future of CPUs looks exciting. But for now, I wanted to test what we already have. I need your help to see where we stand with CPUs currently.

For this CPU only comparison, I want to use ik_llama - https://github.com/ikawrakow/ik_llama.cpp . I compiled and tested both ik_llama and llama.cpp with MoE models like Qwen3 30B3A Q4_1, gpt-oss 120B Q8 and qwen3 235B Q4_1. ik_llama is at least 2x faster prompt processing (PP) and 50% faster in text generation (TG).

For this benchmark, I used Qwen3 30B3A Q4_1 (19.2GB) (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_1.gguf) and ran ik_llama in Ubuntu 24.04.3.

ik_llama installation:

git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)

llama-bench benchmark (make sure GPUs are disabled with CUDA_VISIBLE_DEVICES="" just in case if you compiled for GPUs):

CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /media/ai-llm/wd_2t/models/Qwen3-30B-A3B-Q4_1.gguf --threads 32

| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen3moe ?B Q4_1               |  17.87 GiB |    30.53 B | CPU        |      32 |    0 |         pp512 |    263.02 ± 2.53 |
| qwen3moe ?B Q4_1               |  17.87 GiB |    30.53 B | CPU        |      32 |    0 |         tg128 |     38.98 ± 0.16 |

build: 6d2e7ca4 (3884)

GPT-OSS 120B:

CUDA_VISIBLE_DEVICES="" ./build/bin/llama-bench -m /media/ai-llm/wd_2t/models/GPT_OSS_120B_UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf -mmp 0 --threads 32
| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| gpt-oss ?B Q8_0                |  60.03 GiB |   116.83 B | CPU        |      32 |    0 |         pp512 |    163.24 ± 4.46 |
| gpt-oss ?B Q8_0                |  60.03 GiB |   116.83 B | CPU        |      32 |    0 |         tg128 |     24.77 ± 0.42 |

build: 6d2e7ca4 (3884)

So, the requirement for this benchmark is simple:

Required: use CPU only inference (No APUs, NPUs, or build-in GPUs allowed)
use ik-llama (any recent version) if possible since llama.cpp will be slower for your CPU performance
Required model: Run the standard llama-bench benchmark with Qwen3-30B-A3B-Q4_1.gguf (2703 version should also be fine as long as it is Q4_1) and share the command with output in the comments as I shared above.
Optional (not required but good to have): run CPU only benchmark with GPT-OSS 120B (file here: https://huggingface.co/unsloth/gpt-oss-120b-GGUF/tree/main/UD-Q8_K_XL) and share the command with output in the comments.

I will start by adding my CPU performance in this table below.

Motherboard	CPU (physical cores)	RAM size and type	channels	Qwen3 30B3A Q4_1 TG	Qwen3 30B3A Q4_1 PP
AsRock ROMED8-2T	AMD EPYC 7532 (32 cores)	8x32GB DDR4 3200Mhz	8	39.98	263.02

I will check comments daily and keep updating the table.

This awesome community is the best place to collect such performance metrics.

Thank you!

[-]

wasnt_me_rly@reddit

MB: Dell T630

CPU: 2x E5-2695 v4 18c / 36t

RAM: 8x64gb DDR4-2400 ECC

Channels: 4 per CPU, 8 total

IK_LLAMA w/o HT

Not sure why build is reporting as unknown but it was sync'ed and built today so its the latest.

# /usr/src/ik_llama.cpp/build/bin/llama-bench  -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -mmp 0 --threads 36 -ngl 0
| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| gpt-oss ?B MXFP4 - 4.25 bpw    |  59.02 GiB |   116.83 B | CPU        |      36 |    0 |         pp512 |   109.71 ± 10.37 |
| gpt-oss ?B MXFP4 - 4.25 bpw    |  59.02 GiB |   116.83 B | CPU        |      36 |    0 |         tg128 |     11.30 ± 0.04 |

build: unknown (0)

# /usr/src/ik_llama.cpp/build/bin/llama-bench  -m Qwen3-30B-A3B-Instruct-2507-UD-Q4_K_XL.gguf -mmp 0 --threads 36 -ngl 0
| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen3moe ?B Q4_K - Medium      |  16.47 GiB |    30.53 B | CPU        |      36 |    0 |         pp512 |   180.18 ± 14.99 |
| qwen3moe ?B Q4_K - Medium      |  16.47 GiB |    30.53 B | CPU        |      36 |    0 |         tg128 |     15.97 ± 0.46 |

build: unknown (0)

# /usr/src/ik_llama.cpp/build/bin/llama-bench  -m Qwen3-30B-A3B-Q4_1.gguf -mmp 0 --threads 36 -ngl 0
| model                          |       size |     params | backend    | threads | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen3moe ?B Q4_1               |  17.87 GiB |    30.53 B | CPU        |      36 |    0 |         pp512 |    183.84 ± 9.64 |
| qwen3moe ?B Q4_1               |  17.87 GiB |    30.53 B | CPU        |      36 |    0 |         tg128 |     15.71 ± 0.05 |

build: unknown (0)

[-]

wasnt_me_rly@reddit

Also ran the same w/ llama.cpp

LLAMA.CPP w/o HT

As this was compiled with the CUDA library, it was disabled via a run-time switch.

# CUDA_VISIBLE_DEVICES="" /usr/src/llama.cpp/build/bin/llama-bench -m ./gpt-oss-120b-mxfp4-00001-of-00003.gguf -mmp 0 --threads 36 -ngl 0
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |   0 |    0 |           pp512 |         59.04 ± 2.04 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |   0 |    0 |           tg128 |          5.16 ± 0.04 |


# CUDA_VISIBLE_DEVICES="" /usr/src/llama.cpp/build/bin/llama-bench -m ./Qwen3-30B-A3B-Instruct-2507-UD-Q4_K_XL.gguf -mmp 0 --threads 36 -ngl 0

ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  16.47 GiB |    30.53 B | CUDA       |   0 |    0 |           pp512 |       105.95 ± 10.56 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.47 GiB |    30.53 B | CUDA       |   0 |    0 |           tg128 |         12.62 ± 0.17 |

build: 45363632 (6249)

# CUDA_VISIBLE_DEVICES="" /usr/src/llama.cpp/build/bin/llama-bench -m ./Qwen3-30B-A3B-Q4_1.gguf -mmp 0 --threads 36 -ngl 0
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
| model                          |       size |     params | backend    | ngl | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_1          |  17.87 GiB |    30.53 B | CUDA       |   0 |    0 |           pp512 |         92.32 ± 2.50 |
| qwen3moe 30B.A3B Q4_1          |  17.87 GiB |    30.53 B | CUDA       |   0 |    0 |           tg128 |         10.72 ± 0.25 |

build: 45363632 (6249)

[-]