The Linux Kernel has removed PREEMPT_NONE and PREEMPT_VOLUNTARY.
Posted by InfinitesimaInfinity@reddit | linux | View on Reddit | 37 comments
PREEMPT_NONE previously existed as a way to gain more throughput on almost all workloads at the expense of some added latency. That made it better for most server workloads, which value throughput more than latency.
For workloads that spend most of their time in spinlocks, it could actually deliver significantly lower latency than the other preemption options as well.
According to Salvatore Dipietro, some PostgreSQL workloads have approximately half the performance when using PREEMPT_LAZY vs PREEMPT_NONE.
The Linux kernel maintainers have responded that PostgreSQL should adopt the "RSEQ timeslice extension", which lets a process ask the kernel to delay preemption for a short period of time. (The default delay is 5 microseconds.)
However, this solution is not perfect. First, it would require changes that would make PostgreSQL unable to run on machines without an up-to-date kernel, dropping support for all kernels below version 7. Second, it would still hurt performance on such workloads; it would merely hurt it less.
shinyfootwork@reddit
This should be fixed by:
assface@reddit
Nearly every DBMS uses spin locks. It is the correct way to implement a latch.
TerribleReason4195@reddit
Ah, there is my daily dose of Linux drama.
wyldcraft@reddit
Where's Lunduke to turn this molehill into a mountain?
oshaboy@reddit
Probably too busy complaining about all the transgenders
Addianis@reddit
Nah, he's currently attacking another content creator over their tumblr from 7 years ago.
HunsterMonter@reddit
To be fair, the racist homophobic loli blog might be worth criticising, though I don't think he is exactly the right person to do so (if that's the 7 year old blog he is talking about)
ang-p@reddit
Hahaha.. Every paragraph of that is just "and there's more..."
Never liked the dweeb's stuff, but hadn't realised he was just a content creator who had one day decided Linux was just his latest "content".
clearlybreghldalzee@reddit
While using graphics stack built by trans devs lmao
Turbulent_Fig_9354@reddit
no one in this story is trans or woke enough to get his attention
Helmic@reddit
i'm still upset cachyOS isn't on his list of woke projects but he did get his fans to send death threats to their discord mods for saying it's not OK to be shitty to queer people, so i guess that's sorta validating in a horrifying way.
Misicks0349@reddit
Thats OK! We can invent some!
gplanon@reddit
Your comment is not productive and nothing about OP is dramatic.
adoodle83@reddit
Shouldn’t it be on the kernel devs to prove that the proposed change, which breaks backwards compatibility for applications like PostgreSQL, is warranted by some major benefit?
redundant78@reddit
this isn't really a backwards compatibility break in the traditional sense. the kernel guarantees userspace ABI stability (syscalls etc), but internal kernel config options like preemption models have never been part of that contract. distros pick their preemption model at build time, and users/apps don't directly depend on it. the postgresql regression is real but it's more of a "your workload was accidentally benefiting from a specific kernel config" situation than an actual API break.
azrazalea@reddit
The major benefit is vastly simplified scheduler code.
InfinitesimaInfinity@reddit (OP)
Technically, it does not break compatibility. It merely means that servers which would have used PREEMPT_NONE can no longer run as fast.
ropid@reddit
Did you see the comments in that chain saying it's not noticeable on x86_64 CPUs? The report came from an ARM machine. Maybe that's why the problem wasn't noticed before this report.
TropicalAudio@reddit
And only if a rather silly configuration is used. It's a bit like complaining that swap is getting slower in a new kernel version - if you were banking on your swap being fast, you're probably doing something wrong.
Existing-Tough-6517@reddit
When you lie and say "technically", what you actually mean is that it in no way breaks backward compatibility and that you didn't know what those words meant.
adoodle83@reddit
Oh, just reading through some of the other comments from people who know more about the details than me, this appears to only noticeably impact certain architectures.
Existing-Tough-6517@reddit
Prove it to whom? Do you write their paychecks? Can they not have their own fork of Linux if they like? Incidentally, I'm not sure why you think this breaks backward compatibility.
aioeu@reddit
You've missed out a key aspect to all of this. The performance regressions reportedly go away if PostgreSQL uses huge pages.
Given there is a readily available mitigation in current versions of PostgreSQL, I don't think the kernel developers are going to be likely to hold on to code they've been trying to remove.
ChrisTX4@reddit
There’s some context needed for this, too. The workload whose throughput was reportedly halved was a stress test with 1024 clients hammering an AWS VM with a 96-core Graviton4 CPU (so 96 threads for Postgres) and 120GB of shared_buffers. The disks were 12 SSDs with 32k IOPS each.
In this very specific and artificial benchmark, this resulted in a 49% performance drop. The thing is, for this to become a problem, a huge number of page faults has to occur, disrupting the lock holders and leading to lock contention. So it requires a machine with a massive number of clients issuing requests at the same time, which then produces a large number of page faults when using standard 4 kilobyte pages. However, in such a setup you would gain a sizeable amount of performance from huge pages anyway, and you would not see anything else running on a machine with those specs, as it would be silly to install anything else on a server handling that many clients with those kinds of resources.
If you have a known amount of memory your database will receive, then allocating almost all of it as huge pages was always the reasonable thing to do, as not using huge pages meant leaving a significant amount of performance on the table.
Here is an article about using huge pages; it includes a test at the end where, with just 180 clients, they see a 9-20% performance boost. Note that the article is not about the kernel 7.0 regression, just about tuning Postgres in older setups. Huge pages won't give more than a 1-5% speedup on smaller setups, but you should still use them for a few reasons, such as reduced memory fragmentation.
Either way, for a large-scale setup like the one the AWS engineer built, not using huge pages would be absolutely stupid, and that has nothing to do with kernel 7.0.
tldr: the performance regression only happens in a synthetic stress test against a huuuge database instance that is also misconfigured. Plus, Postgres can use RSEQ timeslices to recover the performance even in this benchmark if they so desire. That is why this was considered not to break user space.
This is a massive nothing burger.
randomBugHunter@reddit
Yo these guys straight up haters
deepthought-64@reddit
What?
ilep@reddit
Exactly. Running huge databases without hugepages is a bonkers configuration to begin with.
Additionally, it looks like there is a secondary effect regarding buffered vs. direct IO:
https://lore.kernel.org/lkml/20260403191942.21410-1-dipiets@amazon.it/T/#u
Kevin_Kofler@reddit
Funny. After the years-long resistance against supporting kernel preemption at all, now they make it mandatory.
Was the Linux kernel suddenly taken over by GNOME? Desupported hardware, restrictions on getting file systems into the kernel, and now this: it looks very much like a concerted effort to remove features from the kernel.
DotJaded996@reddit
What happened to don't break userspace?
TheBendit@reddit
echo huge_pages=on >> postgresql.conf
If you have a 256GB database, it will now be faster than before the change was made.
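One caveat to that one-liner: huge pages also have to be reserved on the kernel side, and the pool has to be big enough to cover shared_buffers, or Postgres falls back to (or refuses to start without) regular 4 KiB pages. A sketch, assuming a 2 MiB huge page size and the 120GB shared_buffers from the benchmark above; the file paths here are the usual defaults, not taken from the thread:

```shell
# Confirm the huge page size first (typically 2048 kB on x86_64/ARM):
grep Hugepagesize /proc/meminfo

# Reserve enough 2 MiB pages to cover shared_buffers:
# 120 GiB / 2 MiB = 61440 pages.
sysctl -w vm.nr_hugepages=61440

# Persist the reservation across reboots:
echo 'vm.nr_hugepages = 61440' >> /etc/sysctl.d/99-postgres.conf

# Make PostgreSQL fail loudly instead of silently falling back
# to small pages if the pool is too small:
echo 'huge_pages = on' >> /path/to/postgresql.conf
```

Setting `huge_pages = on` rather than the default `try` is what surfaces misconfiguration at startup instead of as a mystery performance regression later.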
foobar93@reddit
How does it break userspace?
corbet@reddit
Need I say that LWN covered this situation in detail a month ago...? :)
julioqc@reddit
oh no
Notosk@reddit
...anyways
fellipec@reddit
With the ability to use other schedulers, I think this is not a big deal
aioeu@reddit
That's patently false. It could just continue to use the old behaviour when the extension isn't available.
InfinitesimaInfinity@reddit (OP)
I suppose that you have a point. I have added an edit to correct this.