[Chips and Cheese] Investigating Split Locks on x86-64
Posted by WHY_DO_I_SHOUT@reddit | hardware | View on Reddit | 10 comments
Alternative_Spite_11@reddit
This was the first C and C article where I was just like "what did I just read and why did I read it?" How does a piece of data get split across two cache lines anyway?
DesperateAdvantage76@reddit
Easiest way is to disable the compiler's automatic padding, i.e. pack your structs.
pdp10@reddit
Gotta hand-pack those structs.
jaaval@reddit
Pretty much only if you do it intentionally. Like create a packed struct with 62 bytes of padding followed by a 4-byte variable: that variable would be split between two cache lines. I think this is only possible if you tell the compiler not to fix your stupid code.
Let's see if I can ELI5ish the point of the article.
Atomic operation means the operation is guaranteed to execute fully, without interference from any other operation. This is actually very difficult when two different CPU cores might be reading and writing the same memory. Look at the simple C statement i++. It has to read what is in i, add 1, and write the result back. What if you need that to be atomic? To make i++ atomic you need to stop the other cores from accessing that memory until the operation is done. The original x86 CPUs had a physical pin (LOCK#) that would signal the whole memory bus to lock when an operation needed atomicity. This of course hurts performance, but not that much on those old CPUs.
Modern CPUs do some internal cache coherence magic that is much more efficient. But if an atomic operation crosses a cache line boundary it needs two data fetches, and the coherence system can't handle that. Instead the CPU falls back to locking the bus, analogous to the old way, until the operation is done. And two data accesses means slow. This is a split lock.
x86 allows these kinds of misaligned atomic memory operations (whereas ARM, for example, does not). This is usually not a relevant problem, because you have to do something pretty intentionally bad to make a modern compiler emit a misaligned atomic operation. Normally the compiler pads data structures so that each variable is naturally aligned, and a naturally aligned variable can never cross a cache line.
But it turns out this can be a security issue: it is possible to do a kind of denial-of-service attack where a badly behaving program kills the performance of a large part of the system. The latest x86 processors notify the OS when a process does a split lock, so the OS can throttle or kill it.
The article, probably just out of interest, looks at how badly intentionally triggered split locks hurt performance.
phire@reddit
It's somewhat easy to do accidentally in C/C++ when you allocate memory via a custom memory allocator.
The system malloc (and other allocators) usually returns memory that's already aligned to the size of the biggest atomic operation by default, but sometimes people forget that x86 supports a 16-byte (128-bit) compare-and-exchange with CMPXCHG16B (or assume it won't be used) and configure their allocators to hand out allocations with only 64-bit or 32-bit alignment.
Game engines especially love providing their own custom allocators, for performance reasons.
R-ten-K@reddit
This is a fairly common issue. Many data structures don't map cleanly to cache line boundaries: either they're misaligned or simply too large. When a data element straddles two cache lines, the CPU has to perform an extra cache access, which increases latency.
The article specifically looks at this in the context of atomic operations spanning cache lines. The CPU may need to take a split lock on the memory bus to ensure those sequential accesses remain consistent. That's expensive: it can stall other cores in the system until the lock completes.
The usual optimization is to ensure proper alignment via the compiler (or by manually structuring data) so that critical data doesn't cross cache line boundaries. The tradeoff is that this can introduce architecture-specific code paths.
ImpossibleFrosting2@reddit
If I'm not mistaken, in modern x86 CPUs cache lines are 64 bytes long. So if you have a large enough structure (>64 bytes), or a shorter one, e.g. a 16-byte structure (such as a vector) that starts at the 60th byte of a cache line, it will cross the line boundary.
LAwLzaWU1A@reddit
You are correct in most of the things you wrote. I just want to add a minor nitpick.
The important extra detail is that a split-lock issue is not just "a struct is bigger than 64 bytes", but that the specific value being accessed atomically ends up straddling two cache lines.
So for example, if you do a locked 8-byte operation on a value that starts at offset 60 in a 64-byte cache line, then 4 bytes are in one line and 4 bytes are in the next. That's the problematic case as you wrote.
A large struct can span multiple cache lines, but that alone is not enough to make it a split lock. A struct larger than 64 bytes just means the struct as a whole occupies more than one cache line. The locked instruction does not target the entire struct as one unit unless the operand itself is that large. It targets a specific memory operand, such as a particular field inside the struct.
So what matters is whether the bytes for that specific locked field cross the cache-line boundary, not whether the surrounding struct happens to be large.
maxitoon@reddit
Thank you for expanding on the above.
nittanyofthings@reddit
If you read to the end, he claims some games use split locks. They run fine on Windows, but the default Linux settings are aggressively cloud-computing friendly and introduce a 10 ms delay on every split lock. This decimates those games' performance. Really shows why you need a gaming-focused distro to make sure Linux is tuned for it.
Split locks can cause DoS situations in cloud computing if unchecked.