How Linux 7.0 Broke PostgreSQL: The Preemption Regression Explained
Posted by teivah@reddit | programming | View on Reddit | 58 comments
I wrote about a recent case where Linux 7.0 cut a PostgreSQL benchmark's throughput in half. I tried to explain it from first principles. Please let me know what you think :)
angelicosphosphoros@reddit
Never use spinlocks in userspace.
PostgreSQL should just use a futex-based mutex, which would have the same performance as a spinlock almost always.
ants_a@reddit
Futexes need two atomics, spinlocks can be released with an unlocked write. For uncontended locks in hot paths it can be a significant difference. The freelist lock is normally not contended, because having something in the freelist is a transient state that will disappear quickly under allocation pressure, and the empty check is executed without a lock. In fact next release will eliminate that mechanism altogether. And indeed - using futexes eliminates a bit of spinning, but doesn't fix the regression.
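For the curious, the asymmetry described here can be sketched as a minimal test-and-set spinlock (illustrative only, not PostgreSQL's actual s_lock implementation; the names are made up): acquire needs an atomic read-modify-write, but release is a single plain store.

```c
#include <stdatomic.h>
#include <assert.h>

/* Minimal test-and-set spinlock sketch. */
typedef struct { atomic_int locked; } uspinlock;

static void uspin_lock(uspinlock *s) {
    /* Acquire side: an atomic read-modify-write (exchange). */
    while (atomic_exchange_explicit(&s->locked, 1, memory_order_acquire))
        ;  /* spin; a real lock would add a CPU pause hint and backoff */
}

static void uspin_unlock(uspinlock *s) {
    /* Release side: one store with release ordering. No second atomic
       RMW, unlike a futex, which must detect sleeping waiters. */
    atomic_store_explicit(&s->locked, 0, memory_order_release);
}
```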
The benchmark causing the regression was the perfect storm to trigger the issue - large memory, no huge-pages, very large number of clients and a short empty cache run. Just to underline how unreasonable that configuration is - the per-process page tables are 2x the size of the buffer pool.
But there is one thing I don't yet understand about the minor page fault causing lock holding process getting descheduled explanation - the page fault happens after releasing the lock. Does ARM not retire preceding instructions before handling the fault? That sounds exactly like the speculative execution security problems Intel had a while back.
myaut@reddit
Use adaptive mutexes then: spin for some cycles, then fall back to the futex.
ants_a@reddit
That's how any real world futex implementation works, it still has to use atomics on release to not miss any wakeups.
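A minimal sketch of such a mutex, in the style of Drepper's "Futexes Are Tricky" three-state design (0 = free, 1 = locked, 2 = locked with waiters); the `fmutex`/`futex_op` names are made up. The point above is visible in the unlock path: release has to be an atomic exchange, not a plain store, or a waiter could be missed.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <assert.h>

/* 0 = free, 1 = locked, 2 = locked with waiters */
typedef struct { atomic_int state; } fmutex;

static long futex_op(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void fmutex_lock(fmutex *m) {
    int c = 0;
    /* Fast path: one CAS for the uncontended case, no syscall. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex_op(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void fmutex_unlock(fmutex *m) {
    /* Release MUST be an atomic RMW: a plain store could miss a
       waiter that went to sleep concurrently. Only wake if the
       state said "waiters present". */
    if (atomic_exchange(&m->state, 0) == 2)
        futex_op(&m->state, FUTEX_WAKE_PRIVATE, 1);
}
```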
admalledd@reddit
In this case, Restartable Sequences would probably be an even better fit, but yes. Generally, any concept of "user space locks between threads/processes" is highly likely to be wrong at a fundamental level. I and others were shocked/confused that PostgreSQL uses userspace spinlocks at all, given how well known their failure modes are.
HighRelevancy@reddit
Go write them a patch about it then.
pellets@reddit
For anyone not reading past the headline, it is sensationalist. PostgreSQL isn't broken with Linux 7. A kernel change degraded performance. Some configuration changes can restore it.
Although the word "broken" implies it, it's also not clear that real-world performance would degrade. One benchmark having a problem doesn't make something broken.
danted002@reddit
Well, it kind of did: a 50% performance regression because a flag was removed from the kernel is a bit odd. Reading the article, I don't quite understand why that flag got removed in the first place.
happyscrappy@reddit
It's not clear to me why having run til block threads was incompatible with "modern CPU architectures".
admalledd@reddit
In the specific case here, restartable sequences seems likely the better fit, and has been supported for nearly a decade now as all-around better than userspace spinlocks for shmem. RSEQ behind the scenes does use some CPU/transactional-memory magic for its performance.
FWIW, when this happened, there was general surprise that PostgreSQL was relying on something generally recommended against (user-space spinlocks) for about thirty years. The issue isn't blocking threads or such, it is the attempt at user-space locking at all.
happyscrappy@reddit
From the link:
'The actual ABI is unfortunately only available in the code and selftests.'
'Allows to implement per CPU data efficiently. Documentation is in code and selftests. :('
And those are on top of the uses section where it says that restartable sequences are good for implementing userspace restartable sequences. That's a non-explanation explanation.
Personally I don't agree anything that isn't really documented is a better fit. It doesn't sound like it's actually fully supported. You want your code to keep working for a relatively long period, you need something supported.
You have to build your own kernel to turn this on apparently. I think in that case I might just be tempted to turn run til block processes back on instead.
I do agree that this "ask for a timeslice extension" API probably would do a good job of reducing the instances of this problem occurring to near zero. It's not a bad design for solving the problem. I just don't think using this esoteric functionality is a good path.
I would also say run til block is recommended against.
I'm not saying I'm for user-space locking. But as I said in my other post, if the purpose of this machine is really to just run this one service then you might as well act like you own the place. Because you do. Really nothing is off limits as long as you are willing to spend the time maintaining it.
This still doesn't explain why linux thinks that run til block is incompatible with modern CPU architectures. It was never recommended before. It's hard to see why it went from not recommended to not available.
admalledd@reddit
You keep using the term "run [un]til block", and while the words themselves make sense on their own and can be guessed at, your usage here clearly implies something other than what I inferred; can you clarify? I was assuming you were just using the term as an alias for what PREEMPT_NONE did before, but your phrasing suggests you think it is something different.
PREEMPT_NONE was just a simple choice server workloads had on the default scheduler. If your server code relied on _NONE for critical sections, it was already broken/buggy. That is why RSEQ and friends exist; RSEQ was added as a syscall in 4.18 (~2018), and had existed in pure userspace (generally only available on RT-enabled CPUs/kernels prior to 4.4) via CPU-specific intrinsics (source: wrote some back in ~2012 and maintained them until the product was end-of-life'd in ~2016).
There is nearly two decades of prior art on how to not need spinlocks in userspace. Further, how much this impacts real-world performance in PostgreSQL workloads is wildly overblown. It is one hyper-specific and generally unrealistic benchmark. "Fixing it" as a sysadmin, if you are impacted, can be as simple as "why haven't you turned on HugePages yet?" and "update PG to v19 or later, where this spinlock was removed even before the LKML thread started".
happyscrappy@reddit
Well, I'm not certain exactly what PREEMPT_NONE does because I don't use linux that way. But I think it is the same as run til block. Run till block is a very common concept across operating systems, especially with lower-level ones (no VM, "real time" meaning not really real time, but lower latency).
Here's what run til block does in the shortest explanation possible:
Once your task is scheduled in, it will run until it makes a syscall.
This is what I think this explanation is saying too, it just overexplains it:
'when it makes a syscall, blocks on I/O, or explicitly sleeps'
Doing I/O or sleeping requires making a syscall. So all 3 of those are just the same as saying make a syscall.
In systems without paging (as I mentioned above), the processor core is truly yours at user level until you make a syscall. Interrupts can interrupt you, like timers or I/O interrupts. But the kernel will not reschedule on the way out of those handlers, so it will always return to your process going back to user level.
If you put in paging you have to add some weird extra specific stuff, basically going back to a reflexive definition of "if you're not blocked, the user level CPU is yours". So that means if you get a page fault the processor might be taken away while your page is readied (loaded) but after it's ready it'll reschedule you back in.
PREEMPT_NONE to me sounds like it is this latter definition. It's no guarantee you can't block, but if you aren't blocked, you'll be running.
I use run till block because it is a broad OS concept, not just a linux thing. And to be honest, I'm not a linux expert. My OS knowledge is more in other areas. So maybe it just makes me comfortable to write it?
Right. User space code really should not be doing this. If you write enough user space code like that you'll find later it's near impossible to fix it to work with preemption. You just have too many unexpressed dependencies on not being preempted to fix them all. And if you only fix 99.9% of them you won't be reliable, ever.
Need is a tricky term sometimes. Sort of like broken is, as used here. Many of those cases show you don't need them, but they can still help your performance. This is not me saying you should use them, but this kind of thing is kind of a heuristic proof, right? 99% of the time you don't need them because of futexes. 99% of the time that's no good, you don't need them because of another technique. But there are always corner cases. As you mention slightly below, it helps here, but this is an edge case of an edge case, right?
Well, this is really the trick, right? PostgreSQL as a project has different constraints. They want to work well on all systems. Whereas to the sysadmin of a performance-critical machine that only does that one thing, the real fix can be part of a kernel configuration.
This is on top of how the best way to solve contention isn't necessarily the same for all levels of simultaneous threading. The right solution for 96 threads on 96 cores may be unnecessary overhead for 4 threads on 8 cores. Once you have a full, mission-critical system to optimize top-to-bottom then more customization can be the best way to get results.
admalledd@reddit
I am familiar with a few ways of doing embedded development. For reference, PREEMPT_NONE does not, and has not, meant "never possible to preempt/work-steal a user-thread" in Linux; it was just a very specific promise within the scheduler to basically prefer other threads first. Granted, lots of handwaving/generalizations there, but while similar, that is very different from loose co-operative scheduling, which sounds like what you are calling run-until-blocked. (Note: hard co-operative scheduling is a whole different topic that is moot here; if you have hard co-op scheduling you'd also never spin-lock, because either you have no other cores to wait for, or you can easily know which job/core to wait on, so you can use IPIs or atomics or other time-slice techniques.)
If you have a server with 64GB+ of memory and aren't enabling, and maybe even forcing, the use of huge pages for memory-hungry things like a SQL server, that's on you. As the LKML thread and various other reproductions pointed out, simply using huge pages, which is already strongly recommended, bypasses the problem entirely; the problem was the horrific number of 4KB TLB misses while 120GB of pages were being initialized. Once the system was in steady state, the performance regression was no longer 50% but far less, around 5% (yes, still not great, but that is trying to do 100K+ transactions per second without, you know, doing anything close to correct as a sysadmin or DBA setting things up).
I will stand by the position that spinlocks in userspace on multi-tasking kernels are 100% always wrong, and have been wrong for decades, and so far every example you've tried to bring up is not a multi-tasking server kernel. My job is often machine/assembly-level performance engineering; there is a reason I mention restartable sequences and TSX and friends, as those have been the general alternative to userspace spinlocks for over a decade. Before that, it was indeed a bit more specialized, as you would have to go case by case on why the hell you were even thinking of spinlocking at all.
happyscrappy@reddit
I tell everyone not to use them. I have done so in this thread. But the best way to be wrong is to say "never" or "always".
I didn't say anything that I meant only to apply to one type of kernel. Even when I described one type of kernel I made it clear I was only describing what it does, not saying it was the only way it would be done.
If ever I found a spinlock in the code I was working with, I would go ask the person who wrote it why they thought it was a good idea. The answers were virtually always more along the lines of "I didn't know any better" than "well, I tried the good ways first and..."
SlinkyAvenger@reddit
Modern CPUs have instructions/mechanisms that make "spinlocking" a complete waste in like, 98% of cases. The article even discusses a potential code modification for Postgres that would potentially return performance by leveraging an OS feature.
happyscrappy@reddit
Like what? Please help me understand.
It's talking about wiring the memory. This is very drastic.
Neither of these two things you say explains how run til block threads are incompatible with modern CPU architectures. Even what is suggested as a (drastic) fix is just another way to do it. It doesn't explain how the old way is not suitable.
Blutkoete@reddit
I'm not thinking it's broken, I'm just confused that nobody at PostgreSQL appears to be following kernel development, or else they would have spotted this beforehand.
teivah@reddit (OP)
The title is a bit sensationalist, I don't disagree but I explored nonetheless why Linux 7 had a significant impact on a specific benchmark.
Yes but it has tradeoffs. It's not about restoring exactly how things work before.
mss-cyclist@reddit
Thanks for sparing us our time
anydalch@reddit
Yeah, it's called a fucking futex. What are we doing using spinlocks in 2026? The whole goddamn point of a futex is that, if the critical section is short, you get the "fast userspace" part and never suspend the thread, so you get the same fast case as a spinlock but without the catastrophic behavior in the slow case.
programming-ModTeam@reddit
No content written mostly by an LLM. If you don't want to write it, we don't want to read it.
teivah@reddit (OP)
What?? I spent literally 20 hours on that post!
winston_the_69th@reddit
FWIW, I appreciated the read.
teivah@reddit (OP)
Thanks, I really appreciate it.
winston_the_69th@reddit
Also enjoyed your Go book - small world!
teivah@reddit (OP)
lol, indeed :)
teivah@reddit (OP)
And no content? I literally explained all the core principles both on Linux and PostgreSQL to explain why this issue occurred. You may not like it, no worries at all, but your comment is both insulting and plain wrong.
happyscrappy@reddit
Trying to cheat the scheduler.
I fought the law and the law won.
4K pages just don't seem appropriate anymore. Apple's ARM chips (maybe Qualcomm's too?) use 16K pages and it seems like a better compromise for today's systems.
To prevent the "3x explosion in CPU use" shown here if you use spinlocks (don't use spinlocks), you should only spin some bounded number of times before going into a blocking lock. Since you can't share the same mutex between a spinlock and a regular lock, that pretty much means going to usleep() after a few non-blocking spin attempts. And that will block you.
But by doing this you won't have n-1 threads burning all their CPUs because a thread was preempted while holding the spinlock.
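That spin-then-sleep idea can be sketched roughly like this (the spin bound and sleep duration are arbitrary illustrative numbers, and the function name is made up):

```c
#include <stdatomic.h>
#include <unistd.h>
#include <assert.h>

enum { SPIN_TRIES = 64 };  /* arbitrary bound for illustration */

/* Try the lock a bounded number of times, then usleep() instead of
   burning the rest of the timeslice. This caps the CPU wasted when
   the lock holder has been descheduled. */
static void spin_then_sleep_lock(atomic_int *lock) {
    for (;;) {
        for (int i = 0; i < SPIN_TRIES; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(lock, &expected, 1))
                return;  /* acquired */
        }
        usleep(100);  /* give the (possibly preempted) holder a chance to run */
    }
}
```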
Futexes are really fast. Using spinlocks seems perilous. I do know database companies often cheat on this stuff though. At some point if the entire system's existence is only to run your program you might as well start acting like you own the place. Because you do.
ants_a@reddit
It's not a naive spinlock implementation, it already does limited spinning with randomized exponential backoff. Futexes would have had the same throughput regression. Getting descheduled while holding a contended lock is bad for throughput either way.
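For readers unfamiliar with the technique: limited spinning with randomized exponential backoff looks roughly like this (an illustrative sketch, not PostgreSQL's actual s_lock code; the delay bounds are made-up numbers):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <assert.h>

/* Try to acquire, spinning with a randomized, exponentially growing
   delay between attempts so contending CPUs desynchronize.
   Returns 1 on acquire, 0 if the caller should fall back to sleeping. */
static int backoff_try_lock(atomic_int *lock, int max_rounds) {
    int delay = 1;
    for (int round = 0; round < max_rounds; round++) {
        int expected = 0;
        if (atomic_compare_exchange_weak(lock, &expected, 1))
            return 1;  /* acquired */
        /* Wait a random number of iterations up to the current cap,
           then double the cap. */
        int wait = delay / 2 + rand() % (delay / 2 + 1);
        for (volatile int i = 0; i < wait; i++)
            ;  /* busy-wait; a real lock would issue a CPU pause hint */
        if (delay < 1 << 12)
            delay *= 2;
    }
    return 0;  /* give up spinning */
}
```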
happyscrappy@reddit
The performance regression I was referring to was specifically the other threads burning up their CPU threads. And your link says that's what happens:
'Using futexes has a bit lower throughput but also reduces CPU usage a bit for the same amount of work, which is about what you'd expect.'
It has about the same throughput in this specific test because they have the same number of database threads (96) as CPU hardware threads. It would not be the same if you have more database threads. Although perhaps the lesson there is that if you can service more than one client per thread (and this example can), then don't have more threads than you have hardware threads.
As to the it not being a naive implementation. Great, glad to see it. But the behavior described in the article is that of a naive implementation and that's what I was referring to improving.
The "real fix" if you have this much contention is really what is at the end of the article. Just do your work as if you owned the resource and then try to swap it in with an atomic swap. If that swap fails then reset to the start and do your work again based upon the updated values and try to swap in again. It certainly has its own issues but it doesn't hold off others if you get descheduled while making an update.
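On a simple counter, that compute-then-swap pattern might look like this (an illustrative sketch of the general technique, not the article's proposed code):

```c
#include <stdatomic.h>
#include <assert.h>

/* Do the work against a snapshot, then try to publish it with a single
   atomic compare-and-swap. If someone else updated the value first,
   redo the work from their value and try again. No lock is ever held,
   so a descheduled updater never holds anyone else off. */
static int optimistic_add(atomic_int *val, int delta) {
    int old = atomic_load(val);
    int new_val;
    do {
        new_val = old + delta;  /* "do your work as if you owned it" */
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(val, &old, new_val));
    return new_val;
}
```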
The last sentence is just straight bunk. The kernel didn't break userspace. Performance dropped, but the same old code still works correctly unmodified. Kernels never guaranteed they wouldn't slow you down while handing system resources to other tasks. They didn't break any contract they had with userspace.
ants_a@reddit
Throughput was lower because preempting processes that are holding short-lived highly contended locks is a bad idea for throughput. The benchmark specification from the regression report said 1024 user connections.
Correct fix is indeed to not have contended locks. This specific lock is usually not contended in performance sensitive applications, which is how it had survived for so long. However it will be gone in PostgreSQL 19 for unrelated reasons.
I completely agree with the assessment that not breaking userspace does not mean zero performance regressions.
tesfabpel@reddit
PostgreSQL should also have used mmap() with MAP_HUGETLB | MAP_HUGE_1GB so that, for 120GB, it would only have 120 entries in the page table.
ants_a@reddit
It does that by default (though the default is 2MB pages). There is a fallback to default pages in case huge pages are not available.
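That try-huge-pages-then-fall-back behavior can be sketched roughly like so (a simplified illustration assuming an anonymous shared mapping; PostgreSQL's real shared-memory setup is more involved, and the function name is made up):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>
#include <assert.h>

/* Ask for a huge-page-backed shared mapping first; if no huge pages
   are reserved on the system, mmap() fails and we retry with the
   default page size. */
static void *alloc_buffer_pool(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        /* Fallback: regular (e.g. 4KB) pages. */
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p;
}
```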
happyscrappy@reddit
I'd question if that would ever work. In order for a 1GB mapping to work, every logical address would have to have the same low 30 bits as the physical address behind it. And arranging things so that is the case is awfully difficult. In practice it would mean every task would have to start on a 1GB physical memory boundary. And that's pretty wasteful of real RAM for a system to do. You could arrange it if you compile your own kernel, though. At that point you're really talking about specialized systems more than "I'm just gonna install this app".
ants_a@reddit
It works fine for the specific case of a large shared mapping for the buffer pool with the huge pages reserved at boot time.
ToaruBaka@reddit
I feel like the only actual solution to this is adding a new madvise flag, MADV_AVOID_RESCHEDULE_ON_MINOR_FAULT, which would be ignored if the selected scheduler doesn't support it. I'm not going to rip on using spin locks in userspace; regardless of what people keep saying, they do have their purposes.
This feels like edge case behavior related to paging more so than the scheduler - IMO page faults shouldn't result in a task switch for lazy page allocation, because 99% of the time you're popping a physical page off a free list or bumping a pointer. Those operations are fast - it doesn't really make sense (IMO) to go through the hassle of switching to a different task after such a minor kernel operation - especially for something "lazy".
ants_a@reddit
I think the actual solution is to get rid of contended locks. This specific locking path was removed already by an unrelated change because it was not pulling its weight. But for similar cases, I wonder if there's a reasonably cheap way to make sure the lock release store gets retired before the page fault is taken. Then a simple "don't do stuff that can create page faults while holding spinlocks" rule would be enough.
ToaruBaka@reddit
So... "don't access memory while holding a spinlock"? Lol?
I'm 99% sure that's impossible to implement, especially if the fault is within the locked region. Maybe you could do something with userfaultfd from another process, but that sounds super hacky.
ants_a@reddit
"Don't access memory for the first time in this process." That's entirely feasible as a policy given the highly limited use of spinlocks in postgres. Doesn't avoid the kernel swapping the page out, but such is life. And apparently it also doesn't avoid a page fault from a memory access outside of the locked region preempting the process, which is what appears to be happening here.
lerliplatu@reddit
Imho, the article linked in its sources is better written than this one. This one feels like a summary of the other.
kurisaka@reddit
Better written? It's an AI from top to bottom.
teivah@reddit (OP)
No it’s not. I’m an experienced writer. I’ve been writing online for more than a decade, my book was published before AI: https://www.goodreads.com/book/show/58571862. Your comment is insulting.
kurisaka@reddit
I think you misunderstood (and it looks like the mods did too): I was talking about the link in u/lerliplatu's reply, not your post.
ants_a@reddit
It's still insulting, just to a different experienced writer.
teivah@reddit (OP)
Ah, I’m sorry for overreacting. Mods deleting my post for saying it was written by an LLM made me quite sad. Sorry about that…
AxelLuktarGott@reddit
I've read both now. As a simple consumer of Postgres, I liked OP's article. It explained a lot of concepts that I wasn't familiar with, which the other article assumed you know.
I'm glad that I read both. Repetition is good to make the knowledge stick.
teivah@reddit (OP)
Thanks for the comment. I won't challenge it, I really liked it as well :)
I have a different audience, though; thebuild.com is focused on PostgreSQL, so its readers are more expert. Instead, I tried explaining things from first principles (TLB, pages, preemption, etc.).
lerliplatu@reddit
Fair, thanks for writing!
TheAlaskanMailman@reddit
Such an interesting read. Thanks a lot. Really appreciate it.
teivah@reddit (OP)
Thank you very much :)
unicodemonkey@reddit
The article introduces the TLB before the concept of VM pages - shouldn't it be the other way around?
Pheasn@reddit
This was a nice read. Not sure what the LLM accusation is about, didn't seem sloppy at all.
teivah@reddit (OP)
Thanks a lot man 🥹
Takeoded@reddit
I wanna see the v6 spinlock vs v7 with a proper mutex lock. Spinlocks are usually the wrong approach anyway..
Blutkoete@reddit
Are all databases hit by this problem or is PostgreSQL written in a special way that requires this feature?
And is no PostgreSQL developer testing against incoming kernels to raise a hand before release?
razialx@reddit
This was a good write up. Thank you for sharing it.
teivah@reddit (OP)
That's nice of you, thank you very much.