Are x86_64 CPU Core ID Numbers Reported by the Kernel the Physical Core IDs?
Posted by mkvalor@reddit | linux | 3 comments
APOLOGIA: As measured below, "low" reported core-to-core latency is 41 ns; "high" latency is 62 ns. So feel free to decide I'm silly for caring about these deltas and stop reading here. 😂
tl;dr - for "reasons", I'm trying to be a control freak about which processes get pinned to which cores on a single-die, multicore CPU. But I suspect I'm getting "fooled" by the Linux kernel's enumeration of the core IDs, such that it's handing me virtual IDs, or at least IDs that change across reboots (plain reboots or reboots that pick up a kernel update). Is this true? And if so, can I configure anything to force the kernel to always give me the actual physical CPU IDs?
Rationale: CPUs of the type I'm using (see below) have many cores connected by an internal set of ring buses. This puts some cores much 'nearer' on the "bus line" to the on-die PCIe blocks (which carry signals from e.g. Ethernet adapters). So I'd like to pin the processes which consume external Ethernet traffic to cores closer to the PCIe blocks on the ring bus. Then I'd like to pin other processes (which manipulate this ingested information) to the cores closest to those initial cores, and so on.
BACKGROUND: I'm running current Fedora 40 on a system with a single Intel Xeon Gold 6338 (3rd-gen Xeon Scalable) with 32 physical cores / 64 hw threads (RAM & storage are normal & unremarkable). I tend to run the system package updater once per week for all software and I reboot the system if the kernel gets updated. There are probably only about 20 additional packages installed on the system besides base, including things like vim, make, GCC - so it's pretty minimal.
MY EFFORTS: I'm fiddling around with some soft 'realtime' programming, using shell scripts to pin processes (single-threaded programs I've written in C or Rust) to core IDs with 'taskset', then setting the scheduler and priority for those running PIDs with 'chrt'. So far, all of this has worked as expected. I don't use any kernel API calls in my programs to change cores or scheduling or anything like that. When I look at top or htop (actually I like to use 'atop'), I see that my processes are indeed pinned to the cores I specified.
HOWEVER: I recently became aware of independent projects which claim to measure core-to-core latencies on CPUs, so I decided to try one published on GitHub: andportnoy/core-to-core-latency. It spit out a useful CSV file with my core-to-core latencies. I was surprised to discover that the program showed higher latencies between core IDs which should be much closer to one another on the internal ring buses. But I'm no hardware guru, so I thought, "Well maybe that's just how it works, or maybe there's a bug in the program, or maybe something else is going on I simply don't know about." By the way, the system was reasonably idle whenever I ran this.
BUT THEN: I used the Fedora package manager to update my system, which included a kernel update, and I rebooted the machine to pick up the new kernel. For no particular reason, I re-ran the above program afterwards and found that it spit out a different set of core latency relationships in the CSV output. But the pattern of the latencies was surprisingly similar; only the core IDs had changed. In other words: instead of (earlier run) physical core 29 with its associated hyperthread 61 showing the lowest latencies to many (but not all) cores near it on the physical ring bus, this time it was core 17 with associated hyperthread 49 which showed the very same lower latencies to new core IDs somewhat near it on the bus. I got a bit wise and made a shell script to run the 'measure' program (produced by the GitHub project) multiple times in a row, sending the output to /dev/null except for the final run. All the latencies reported in the final CSV output did go down modestly, but the "new" basic relationship of lower latencies to core 17 remained.
THE QUESTION(S): When I use 'taskset' on a system with a single CPU that has multiple cores on a single die, may I assume that the core ID numbers I pass in as parameters should map to the actual physical core IDs on the die? If not (by default), is there some combination of configuration setting or kernel boot parameter, or kernel build configuration that could force this to be true? Finally (for extra credit 😁), would anyone with knowledge of Intel hardware who took a few moments to examine the 'measure.c' source file in the GitHub repo care to offer an opinion as to why, for example, cores 0 - 6, which are fairly close to one another on the CPU ring bus might report average to high latencies among each other, compared with some other core much closer to the center or the far right of the die seemingly exhibiting magically lower latencies to many of its nearby neighbors?
Extreme gratitude to all who made it this far in my post. I appreciate the privilege of having access to this community!
uraniumingot@reddit
Assuming the cores really are interconnected through a "ring bus" (PCIe is not a bus, it's a network): with one NUMA node, there would be no measurable difference in core-to-core latencies because memory access depends on the bus clock. It doesn't matter if the cores are physically closer; a single bus broadcast is always a fixed time slice. You would not see any difference in main-memory access speed.
What could be different: L2 cache access latency, depending on the number of cores sharing an L2 cache; if the data is stored in a directory-based cache it could also be different.
A NIC over a PCIe bus will always be slow because: 1. the memory region used for memory-mapped IO cannot be cached; 2. the DMA from the NIC always goes to main memory (assuming you don't have something like Intel DDIO, where data lands in the LLC); and 3. for common protocols such as TCP/UDP, the kernel uses RSS to determine which core gets to process the in-kernel traffic, which may affect cache behavior depending on how the hardware hashes the traffic.
What you can do is minimize the number of cores that service the NIC interrupt, but there is no good way to hard-pin which kernel threads process the network traffic.
mkvalor@reddit (OP)
I see where you're coming from, but I feel that maybe I didn't fully explain the ring-bus concept. The image at this link displays a schematic of a CPU very similar to mine, except that it has eight additional cores. As you can see, there are indeed different distances between cores, and this affects things such as cache snooping, which can be an issue during, e.g., SHMEM segment cache-line access contention.
uraniumingot@reddit
What GitHub project are you referring to? What is the access latency you are trying to measure? If it's not in the CPU cache, how do you disable/trick the hw prefetcher on the processor?
Regarding Ethernet traffic: packets are delivered via DMA even on a 15-dollar NIC. Everything is in main memory, the exception being the head and tail pointers of the hw tx/rx queues (which are in device memory). The core that posts to device memory is also not guaranteed to be the pinned application core. Under light load, your core pinning should have minimal effect on your access latency (maybe at most 10 ns for 64 packets). Everything else is most likely just variance. It's hard to say for sure without knowing what the benchmark you are running is doing.