Please help improve CPU-only inference speed
Posted by HumanDrone8721@reddit | LocalLLaMA | View on Reddit | 52 comments
This is a request for help for people who want to run very large models locally at Q8 or better quanta at all costs; in my case, the cost is inference speed.
So I have 512GB of DDR4 ECC 2666 with a Threadripper Pro 3945WS that gives me ca. 5-7 tok/second for MiniMax-2.7 with the llama.cpp CPU backend. Yes, it probably feels like torture for the ADHD generation, but I'm using it for processing LARGE specs and planning, and it steers a Qwen-3.6-27B for implementation and testing. Of course I tried low-bit quanta first, but the drop in quality was not worth the marginal increase in speed.
So I was wondering if someone has any "tricks", unmerged PRs or hidden gems (I get that CPU-only inference is not the most popular topic right now, but maybe there are some half-forgotten GitHub repos somewhere) to maximize inference speed without sacrificing the model weights.
Also, another topic of interest is upgrading the bottom-of-the-barrel CPU to a 5975. While everyone emphatically says that inference speed is memory-bandwidth bound, I see that during PP and also during inference all the cores are at 100% load. Here even the cloud models give contradictory answers, from "no significant increase" to doubling the speed. I really want to hear it from someone who actually did this.
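For reference, this is roughly how I measure the baseline (the model path and thread counts below are just placeholders, not my exact setup):

```
# Sweep thread counts with llama.cpp's bundled llama-bench and compare
# prompt processing (pp) vs token generation (tg) speeds per thread count.
./llama-bench -m /models/minimax-q8_0.gguf -t 8,12,16,24 -p 512 -n 128
```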
czktcx@reddit
When a CPU core is waiting on a memory read, it's still categorized as "busy"...
Usually attention is much more bandwidth-bound than the FFN, and when the context is long enough it's also more compute-intensive than the FFN, so consider adding a GPU to do the attention part.
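If you want to see it yourself, something like this shows how much of that "busy" time is actually stalled on memory (event availability and names vary by CPU and kernel; the pgrep target is just an example):

```
# A high stalled-cycles-backend / cycles ratio means the cores are mostly
# waiting on RAM, not computing, even though they show 100% load.
perf stat -e cycles,instructions,stalled-cycles-backend \
  -p "$(pgrep -n llama-server)" -- sleep 30
```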
HumanDrone8721@reddit (OP)
The tests with a "little potato" GPU, a 12GB RTX 4070Ti, are done and show some improvement.
Tomorrow I'll do a "potato" 24GB RTX 3090FE, a "strong potato" RTX 4090 and soon a "dual-potato" 2x RTX3090.
The idea is: for the given constraints (Q8 quantization, DDR4-2666 bandwidth, 3945WS CPU), what configuration uses them most optimally, so nothing is left on the table? While it's obvious in hindsight that some GPU assist is needed, what % of the model's size should be in VRAM instead of RAM?
Sadly it seems that I wasn't properly understood and was mostly hit with the obvious: "Bring on the VRAM cannon, add some AMD rotten potatoes to the stew, or whatever, just add more VRAM" and "Bro, just use IQ2, it's practically the same...", the well-known expensive or sloppy non-solutions to an optimization problem.
Anyway, the tests so far showed some results that were unexpected to me, and when I'm done I'll try to make a new summary post.
czktcx@reddit
FFNs' compute/bandwidth requirements don't depend on context length, so they can all stay in CPU/RAM. Attention is sensitive to context, so it's better placed in GPU/VRAM.
Which means, in theory, the VRAM you need just has to cover the KV cache and the QKV projection weights (and some compute buffer).
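To get a rough feel for the size (the layer/head numbers here are made up for illustration, not MiniMax's real config):

```
# KV cache bytes ~ 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx
# Example: 60 layers, 8 KV heads, head_dim 128, f16 cache, 32k context
echo $(( 2 * 60 * 8 * 128 * 2 * 32768 ))   # about 8e9 bytes, i.e. roughly 7.5 GiB
```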
Btw, quantization type also impacts performance: 3-bit weights may not always be faster than 4-bit, especially when it takes too many calculations to extract the data from the compressed form.
po_stulate@reddit
Get a tiny 8GB GPU with VRAM faster than your 2666 RAM, offload the KV cache to it, and you can leave everything else in your slow CPU RAM.
HumanDrone8721@reddit (OP)
I was thinking of it, but how do I do this? I have a 4080Ti with 12GB collecting dust, so I can use that and report back. Any special command-line options for llama.cpp?
po_stulate@reddit
I have a 6-year-old Dell laptop with a 4GB GTX 1650 Ti GPU and an i7-10750H CPU. If I put the KV cache in CPU RAM, I get 7 tokens/s for gemma-4-26b-a4b Q4_K_XL, but if I put it in GPU RAM, it's 20 tokens/s. I'm using llama.cpp on Linux.
For comparison, I also have a M4 Max 128GB laptop, and the same model Q6_K_XL runs at 75 tokens/s, so with how old the laptop is, 20 tokens/s feels really not that bad.
d1722825@reddit
How do you put the KV cache in GPU RAM with llama.cpp? Could you share the command-line arguments? Thanks.
po_stulate@reddit
For a MoE model, I do `-ngl 999 --n-cpu-moe <N>`, and make sure there's no `--no-kv-offload` in your command. You adjust `N`, which is the number of layers to keep in your CPU RAM; everything else will be on your VRAM, including the KV cache. First set `N` to offload all layers to CPU, check how much VRAM is used, and then slowly reduce `N` until VRAM is almost full.
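A full invocation might look something like this (model path, context size and the starting N are placeholders to adapt to your setup):

```
# Everything on the GPU except the experts of N MoE layers; the KV cache stays in VRAM.
# Lower --n-cpu-moe step by step while watching VRAM usage (e.g. with nvidia-smi).
llama-server -m /models/minimax-q8_0.gguf \
  -ngl 999 --n-cpu-moe 60 \
  -c 32768
```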
rorowhat@reddit
Will the KV cache keep growing... so eventually it falls out of it?
po_stulate@reddit
I know that some models (like gemma4) that use SWA will just crash when the context grows so large that VRAM can't hold it anymore. You can use `--swa-full` to preallocate the maximum context size for these models.
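For example (model path and context size are placeholders):

```
# Preallocate the full SWA cache up front so a growing context can't crash the run.
llama-server -m /models/gemma.gguf -ngl 999 -c 16384 --swa-full
```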
My_Unbiased_Opinion@reddit
Kvcache on GPU is the way. Throw in the 4080 and PP will fly.
HumanDrone8721@reddit (OP)
Well, I really hope you're right; I'll throw in even a bit more if the first tests bring some improvement.
Also, in parallel, I've sadly observed that there is no vLLM equivalent that can run safetensors on a non-AVX512, CPU-only setup AND support the new models, not even Qwen-3.5 for a test. The Vimo fork path looked promising but is obsolete :(. So it seems that I have to stay with llama.cpp for the moment and throw some GPUs at it in the hope that at least this will work.
My_Unbiased_Opinion@reddit
Yeah, this is how I set up Derestricted 120B on a system with 64GB RAM and 24GB VRAM. I did use LM Studio. KV cache in VRAM, with some experts on it too, even made it fast enough for multi-turn tool calls. Ran this until I switched to Qwen 3.5 35B heretic-quantized, completely in VRAM.
jikilan_@reddit
I guess this means: just plug in a fast GPU (>=24GB) and, with the amount of RAM you have, it will gain a few more single-digit tokens per second.
Regarding the llama.cpp parameters, nowadays a lot of the default values are enough for most people. E.g. --fit is turned on by default, so you really don't need to set ngl or moe anymore, unless you know what you are doing.
HumanDrone8721@reddit (OP)
If I had that many fast GPUs with >= 24GB VRAM, I would not have made this post, maybe another one related to cooling ;).
The question is: are you sure that --fit will put the K/V cache in VRAM and not something else?
jikilan_@reddit
What I am trying to point out in my previous post is: pair it with at least 1 GPU and your machine will fly 😁 because MiniMax is a MoE model.
Don’t need to trust me just try it yourself. It won’t hurt.
HumanDrone8721@reddit (OP)
I will definitely do so; the question was about the suggested explicit loading of the K/V cache into VRAM, and I was wondering:
1) Will 12GB of VRAM make a difference?
2) Will --fit do it by default, or are other parameters needed for llama-server?
jikilan_@reddit
Typically --fit is smart enough to put the important layers into the GPU first. I don't know? 🤷♂️ 12GB is maybe small given MiniMax is huge. Try with your current GPU and see how much it improves.
Just run llama-server with -c 0 and don't put anything else, because extra options might override the default values. I really suggest reading --help. A lot of things are already on by default and don't need to be set explicitly, e.g. --jinja.
HumanDrone8721@reddit (OP)
Well, right now I'm collecting baseline performance numbers with llama.cpp and ik_llama using llama-benchy.
Afterwards, I'll go through the hassle of installing the NVIDIA stuff (the current OS is a "pure" Debian), then I'll try to cable in the 4080Ti and see what it brings.
Bootes-sphere@reddit
For CPU-only inference at that scale, you're hitting the fundamental limits of DDR4 bandwidth—5-7 tok/sec is actually reasonable given your memory speed. A few practical suggestions: try `llama.cpp` with `-ngl 0` and experiment with different thread counts (start with physical cores only), use lower precision quantization if accuracy permits, and consider splitting inference across multiple processes to maximize cache utilization. If you need faster turnaround times without hardware upgrades, cloud inference providers like Groq or Together offer Llama models at pennies per token with response times in milliseconds—sometimes a hybrid approach (local for privacy-critical work, cloud for speed) beats pure local optimization
HumanDrone8721@reddit (OP)
Thank you Claude.
GlitteringChemical87@reddit
You need to maximize your memory bandwidth, which llama.cpp is not going to do for you by default; it just launches as many threads as there are physical cores and lets the scheduler shuffle them around from logical core to logical core. It's a mess, and this is why it seems like you get more t/s when using fewer threads and there's a sweet spot. Good news is you can maximize both the size and the speed of that sweet spot.
Check the layout of the L3 cache for your processor, and only assign as many threads as there are unique L3 cache blocks; don't let llama.cpp use all cores as is the default.
Then make sure to use `--cpumask` to assign one core per L3 cache block (usually the first one), as well as call llama-server with `--numa numactl`, and launch llama-server using `numactl --interleave=0-N --physcpubind ... sh -c "[llama-server command]"` where N is the number of NUMA nodes minus 1. Ignore the interleave option if there's only one node, of course. Interleave will make sure parts of the model loaded in RAM don't all end up on the same NUMA node while threads are distributed across them, especially when the model is much smaller than the size of your RAM.
On the dual EPYC 7282 (total 32 cores, 64 threads/PUs, 2 NUMA nodes) with 256GB DDR4 ECC 2666, the command looks something like this:
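Something along these lines (the core IDs, cpumask and model path are illustrative placeholders, not my exact values; build your own from the lstopo output, and note recent llama.cpp spells the flag `--cpu-mask`):

```
# One thread pinned per L3 cache block (here cores 0,4,...,28),
# model pages interleaved across both NUMA nodes.
numactl --interleave=0-1 --physcpubind=0,4,8,12,16,20,24,28 \
  sh -c "llama-server -m /models/model-q8_0.gguf -t 8 --cpu-mask 0x11111111 --numa numactl"
```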
On the dual EPYC 7642 (total 96 cores, 192 threads/PUs, 2 NUMA nodes), also with 256GB DDR4 ECC 2666, the command is a tad more daunting (but the exact same concept applies).
It used to be that the "sweet spot" was `-t 8` without any memory bandwidth optimization, and I thought that was preposterous for a 96-core beast. Use `lstopo -c` to see the NUMA/L3/L2/L1 layout wrt cores/PUs and to build the cpumask, and monitor which NUMA node the model is being offloaded onto with `numactl -H`. You'll probably want to drop the memory cache between runs when trying to figure this all out, with `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`.
HumanDrone8721@reddit (OP)
The poor 3945WS is seen as one numa node.
alphatrad@reddit
Why would anyone bother with this when you can get an RX 7900 XTX off eBay for $750 and be generating at 77 tps?
HumanDrone8721@reddit (OP)
Huh? In which universe?!?
alphatrad@reddit
The one in which I live. Maybe you're not in the US. The first option on eBay.
I have been running dual RX 7900 XTXs for a while. They're equal to running a 3090 without the markup.
But the point being:
Bits per weight ≠ tokens per second, so dequantization overhead can cancel out file-size savings. Which it sounds like you've experienced.
Q4_K_M is the CPU-safe default: balanced quality, CPU-optimized kernels.
"Contrary to the behavior observed on our GPU setup, the KQ kernels outperform the IQ kernels on the Intel CPU. Accordingly, we suggest trying the KQ models first if you plan to run on CPUs. [byteshape CPU blog]"
You're kind of in a spot where, the reality is, at most you maybe squeeze 1-5 more tps out of that by flipping some switches.
But the ultimate answer is, buy a GPU.
HumanDrone8721@reddit (OP)
Well, the clanker consensus is that 24GB of sub-mediocre VRAM will not bring anything of substance; we will test this this weekend, with a real 3090 eventually + a 4080Ti = 36GB of actual VRAM, which is something like 15% of the model size, but maybe it will produce some speed improvement.
Anything under Q8 is proven junk for complex tasks, and these models are anyway sensitive to quantization, so kindly please do not bring this up anymore.
an0maly33@reddit
Sub-mediocre VRAM? Even a partial GPU offload is a night and day difference.
alphatrad@reddit
The OP is a clown. He can't afford a GPU. Wants to run full models but isn't bothering to ask, hey how can I do tensor parallelism with 8 3090's
HumanDrone8721@reddit (OP)
Maybe when one can offload a significant part of the layers and/or the K/V cache; for a 270B model, 12GB or even 24GB will not make much of a difference (5-10%). But the moment I'm done with the Vimo branch of vLLM (it would be fast, but it doesn't seem to support new models and parsers; it's some kind of obsolete version fermenting there slowly), I'll test, in order, an RTX 4080Ti, an RTX 3090, and both.
Let's see the night-and-day thing; I'll be so happy for the clankers and common sense to be wrong :)
alphatrad@reddit
Ok sure. 🙄
I'll just be over here enjoying my subpar VRAM getting shit done in real time.
HumanDrone8721@reddit (OP)
Q4 of an already skinny model? Discarded. It may be useful for "role playing" and "email campaigns", but that's it; show me a MiniMax or any other model >250B at Q8 and we'll talk otherwise...
alphatrad@reddit
Dude, your post history shows you're nothing but chasing engagement. Get out of here, Mr. Running-on-the-CPU. Some of us are shipping stuff every day. Who knows what you're doing, being a clown? Talking out his ass nonstop?
Some of us are doing fine-tunes and benching these regularly on custom benchmarks using real-world tools.
BUT TELL US MORE ABOUT CPU inference
alphatrad@reddit
Ok sure 🙄
pmttyji@reddit
Q8 is too much for CPU-only inference. Go for Q4 (IQ4_NL or IQ4_XS)
HumanDrone8721@reddit (OP)
Thanks for the answer I guess; you didn't fully read my post, but it's OK, I will repeat it here once more:
I did try using small quanta; the marginal speed increase was not worth the catastrophic loss of quality. I'm using this big slow model for complex planning and evaluation, not for "creative writing".
tmvr@reddit
You should at least go down to Q6_K or Q6_K_XL, there is no drop in quality there compared to Q8. Decode (tg) is bandwidth limited, but prefill (pp) is compute limited, so you can get some increase there if you have a faster CPU. You can also try ik_llama, which has about 30% faster prefill on the CPU from what I've seen when I looked at it. Not sure if mainline llama.cpp already got that speed as well, just compare the latest builds of both. Then, as someone said, put in at least an 8GB GPU; if you have that 4080Ti 12GB it will help a lot to have it and to use the `-fit` parameter with llama.cpp.
HumanDrone8721@reddit (OP)
I did go down to Q4 and also tried Q6; the degradation was very visible and not acceptable for my use case. Going to low quanta is the low-effort, low-hanging fruit and of course I tested it first.
ik_llama and the 12GB 4080Ti are next to come after the benchmark suite is done on the current stock llama.cpp.
tmvr@reddit
It is your experience so it's hard to argue with, but when someone says there was very visible degradation between Q8_0 and Q6_K, it always sounds very suspicious, like something does not add up. Even just measuring the differences between the two (PPL and KLD), it's hard to see any noticeable difference in the numbers.
pmttyji@reddit
ik_llama could give you some boost. Hope your CPU has AVX512 support. There's a build variable for that before compiling it.
ik_llama documentation to get boost:
https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/parameters.md
Thireus' fork has pre-built releases(including AVX512 version).
https://github.com/Thireus/ik_llama.cpp/releases
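The build itself follows the usual llama.cpp cmake flow; a rough sketch (option names can differ between versions, so check the docs above):

```
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_NATIVE=ON   # native build enables AVX512 only if the CPU supports it
cmake --build build --config Release -j
```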
HumanDrone8721@reddit (OP)
Nope, nothing in the Threadripper Pro line has AVX512 support until the 7000 series, which is far beyond my wallet's capacity.
BigYoSpeck@reddit
Have you limited threads to physical cores rather than also using the SMT threads? My Ryzen 5900X performs better limiting threads to the 12 physical cores rather than the 24 threads available
Adding a GPU would help as you can still offload expert layers to CPU but you should get a prompt processing boost and even a mild token generation boost
HumanDrone8721@reddit (OP)
Yes, the latest llama.cpp detects and sets this by default.
lemondrops9@reddit
When I was testing things with CPU only, I found 4-6 threads was best. If you max it out, then your OS and other programs will be fighting for the same threads.
MelodicRecognition7@reddit
yes this is correct https://files.catbox.moe/5w3eqh.png
MelodicRecognition7@reddit
The PP speed has a linear dependence on CPU performance; the more powerful the CPU, the faster the PP tps.
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1qxgnqa/running_kimik25_on_cpuonly_amd_epyc_9175f/o3w9bjw/
korino11@reddit
You need RAM tweaks. Update the AGESA to the latest one; there are new refresh modes, Fine Granularity and Mixed. Use Mixed! You also need to lower latency and tweak the timings.
reto-wyss@reddit
The problem with your CPU is that you don't really get 8 channels in the first place because of how the chiplets connect. I don't know which SKU is the cutoff in the 3000 series, but for 5000 you need at least the 5965WX to get full memory bandwidth.
3945WX is good for driving the PCIe lanes, not so good for compute.
I had fiddled with CPU inference using my 5965WX + 512gb 2400, but I have no recent experience.
HumanDrone8721@reddit (OP)
ik_llama is coming; it's installed at the latest master commit and compiled, and I'll start the benchmarks ASAP.
But I would really like to know more about the memory channels and the different AMD CPUs. Could you please explain more? A "go read this site" pointer is also OK.
MaybeIWasTheBot@reddit
What are you using to serve models? ik_llama.cpp is a good starting point, since it has aggressive CPU optimizations compared to llama.cpp
HumanDrone8721@reddit (OP)
For the moment, just the "vanilla" llama.cpp with the CPU backend. Do you have any special compile parameter settings for ik_llama?
Because I did try to use it, but there was no actual visible improvement in inference speed.