Dual CPU Penalty?
Posted by jsconiers@reddit | LocalLLaMA | View on Reddit | 20 comments
Should there be a noticeable penalty for running dual CPUs on a workload? Two systems running the same version of Ubuntu Linux, on Ollama with Gemma3 (27b-it-fp16). One has a Threadripper 7985 with 256GB memory and a 5090. The second is a dual Xeon 8480 system with 256GB memory and a 5090. Regardless of workload, the Threadripper is always faster.
humanoid64@reddit
For non-AI stuff, the company I'm at moved from building dual-socket Epyc to single-socket Epyc, because at high load two single-socket Epycs perform better than one dual-socket Epyc, assuming your workload fits in the RAM of a single socket. For our use case (many VMs) it was a no-brainer. Reason: if your VM or application thread is on CPU 1 but the memory it's working on is on CPU 2, performance suffers badly. That's the main NUMA challenge in a nutshell. There are a lot of CPU pinning tricks, but when you have a lot of systems it turns into a lot of time / management / cost, and you're way better off with just more single-socket systems.
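For illustration, a minimal Python sketch of the pinning idea (the core range is a made-up example; check your actual layout with lscpu or numactl --hardware):

```python
import os

# Hypothetical layout: NUMA node 0 owns cores 0-31 on this box.
# Check the real topology with `lscpu` or `numactl --hardware`.
NODE0_CORES = set(range(0, 32))

# Pin the current process (and the threads it spawns) to node 0's cores.
# This only constrains where the threads *run*; with Linux's default
# first-touch policy their allocations usually land on the same node,
# but strict memory binding still needs numactl --membind or libnuma.
os.sched_setaffinity(0, NODE0_CORES)   # pid 0 = this process

print("running on cores:", sorted(os.sched_getaffinity(0)))
```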
cantgetthistowork@reddit
What about dual epyc to increase the GPU count?
Marksta@reddit
Dual Epyc doesn't get you more PCIe lanes. It's 128 per CPU, and with 2 CPUs, 128 of the 256 PCIe lanes (half) are used to link the two CPUs together with 4x xGMI. So you still only have 128 PCIe lanes, but now they're split between the two CPUs, and there's a latency penalty when one GPU talks to another GPU across CPU nodes.
There are data-parallel strategies that could exploit the situation for big theoretical gains, but the software just isn't there yet. Don't go dual CPU until you hear about NUMA-aware support landing in the big engines.
cantgetthistowork@reddit
This mobo takes up to 19 GPUs. The highest a single-CPU board can go is 14, with the ROMED8-2T.
https://www.asrockrack.com/general/productdetail.asp?Model=ROME2D32GM-2T
Marksta@reddit
Oh, I guess so. Looks like 7002 and up do get some extra PCIe lanes, 128 up to 160. It still faces the NUMA issue though. I just moved from dual CPU to single; too many extra variables and settings to mess around with while also trying to balance standard inference settings.
cantgetthistowork@reddit
According to ChatGPT, EPYC doesn't use lanes for the interconnect:
EPYC CPUs use Infinity Fabric for CPU-to-CPU communication—not PCIe
➤ How it works:
EPYC dual-socket platforms do not use PCIe lanes for CPU interconnect.
Instead, they use Infinity Fabric over a dedicated coherent interconnect, called xGMI (inter-socket Global Memory Interconnect).
This link is completely separate from the 128 PCIe lanes provided by each EPYC CPU.
Marksta@reddit
Sounds like it's super obviously wrong then? It's probably confusing the protocol semantics with the physical traces or something. The lanes are 100% being 'repurposed': these are the same CPUs that had 128 PCIe lanes, and when placed in a 2-CPU board they don't have 128 PCIe lanes anymore. They went somewhere... the xGMI interconnect. Sort of like Ethernet as a physical cable vs. Ethernet as a protocol.
humanoid64@reddit
Likely using a pcie switch chip
ttkciar@reddit
Getting my dual Xeons to perform well has proven tricky. It's marginally faster to run on both vs just one, after tuning inference parameters via trial-and-error.
It would not surprise me at all if a single-socket newer CPU outperformed an older dual-socket, even though "on paper" the dual has more aggregate memory bw.
Relevant: http://ciar.org/h/performance.html
Agreeable-Prompt-666@reddit
Have you tried interleaving, either with numactl or forcing it in the bios?
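For reference, forcing interleave from userspace looks roughly like this (the Python wrapper is purely for illustration; the binary name, model path and thread count are placeholders for whatever you actually run):

```python
import subprocess

# numactl --interleave=all stripes memory pages round-robin across both
# nodes: you give up best-case local bandwidth for a predictable average,
# which helps some models and hurts others.
cmd = [
    "numactl", "--interleave=all",
    "./llama-cli",                      # placeholder backend binary
    "-m", "gemma3-27b.gguf",            # placeholder model path
    "-t", "32",                         # thread count to tune per box
    "-p", "Hello",
]
subprocess.run(cmd, check=True)
```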
ttkciar@reddit
I had messed with numactl a while back, but couldn't remember if I'd tried interleaving. Tried it with Gemma3-27B just now and, alas, it dropped from 2.50 tokens/sec to just 1.77 tokens/sec, but then I tried it with Phi-4 and performance improved from 3.85 tokens/sec to 4.19 tokens/sec!!
I'm going to try it with my other usual models and see where it's a win. Thanks for the tip!
jsconiers@reddit (OP)
Any tips?
ttkciar@reddit
Only to fiddle with NUMA and thread settings until you find your hardware's "sweet spot". I don't know what those options are for ollama; I'm strictly a llama.cpp dweeb.
Also, Gemma3 is weird for performance, mostly because of SWA (sliding-window attention). If you have Flash Attention enabled, try disabling it. That will increase memory consumption, but for pure-CPU inference disabling it improves Gemma3's speed.
I'm sure you already know this, but since you asked: the Q3 quant with a reduced context limit will let you fit everything in 32GB of VRAM, so if you feel like throwing money at the problem, you could buy a second GPU. That would make the CPU and main memory almost completely irrelevant.
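Not an ollama recipe, but on the llama.cpp side the knobs above look roughly like this (flag spellings vary between llama.cpp versions and the file names are placeholders, so treat it as a sketch rather than gospel):

```python
import subprocess

cmd = [
    "./llama-cli",
    "-m", "gemma3-27b-q3_k_m.gguf",   # smaller quant so more fits in VRAM
    "-ngl", "99",                     # offload as many layers as VRAM allows
    "-c", "8192",                     # reduced context to stay inside 32GB
    "-t", "32",                       # CPU threads -- find the sweet spot
    "--numa", "distribute",           # NUMA mode (distribute/isolate/numactl)
    # no flash-attention flag here, since leaving it off was the faster
    # option for Gemma3 on pure-CPU runs
    "-p", "Hello",
]
subprocess.run(cmd, check=True)
```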
Street_Teaching_7434@reddit
My experience is similar to what others have said in this thread. Getting NUMA to play nicely is quite annoying and only gives a marginal speed increase over just using one of the two CPUs. If you really want to, kTransformers is the only way to use NUMA properly, and if you have enough memory to load the model once per CPU (2x the usual memory) so there is no foreign RAM access, it's actually quite fast. If speed for a single request is less important than total throughput, it is still way faster to just run two separate processes of whatever your inference backend is, one on each CPU.
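A sketch of that last option, two independent backends each pinned to its own socket (binary name, model path and ports are placeholders):

```python
import subprocess

# One server per NUMA node: each keeps its weights in local RAM (so 2x
# total memory use), and requests can be load-balanced across the two ports.
procs = []
for node, port in [(0, 8080), (1, 8081)]:
    cmd = [
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "./llama-server", "-m", "model.gguf", "--port", str(port),
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```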
Rich_Repeat_22@reddit
The dual 8480 should be faster if you use Intel AMX, kTransformers, and set up NUMA properly. Assuming all 8 RAM channels on both CPUs are populated and accessible, that's around 700GB/s total.
There are guides on how to set this up properly.
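Back-of-envelope version of that bandwidth figure (assuming DDR5-4800 with all 8 channels populated per socket; actual numbers depend on the DIMM configuration):

```python
# Theoretical peak memory bandwidth for a dual Xeon 8480 box,
# assuming DDR5-4800 and 8 channels per socket.
transfers_per_s = 4.8e9        # DDR5-4800: 4.8 GT/s per channel
bytes_per_transfer = 8         # 64-bit channel
channels_per_socket = 8
sockets = 2

per_socket = transfers_per_s * bytes_per_transfer * channels_per_socket / 1e9
print(f"{per_socket:.0f} GB/s per socket, {per_socket * sockets:.0f} GB/s total")
# ~307 GB/s per socket, ~614 GB/s aggregate -- and the aggregate only
# counts if each socket mostly reads from its own local memory.
```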
xoexohexox@reddit
Numa numa numa distribute!
MDT-49@reddit
If you haven't already, look into NUMA and your configuration.
MengerianMango@reddit
Check the interconnect speed in the BIOS. Not sure it'll help, but it's one thing to look at.
mxmumtuna@reddit
People who do this for a living weep about managing NUMA performance.
Agreeable-Prompt-666@reddit
Dual sockets are harder to set up: not only do you need to use the right software/switches, but the BIOS can also unlock more performance options.