Heads up: Kernel 6.19 regression causing silent crash loops in MongoDB (and eating Btrfs storage)
Posted by newaaa41@reddit | linux | View on Reddit | 15 comments
I just spent my entire weekend chasing a "phantom" storage leak on my Fedora homelab (Threadripper 2920X) and figured I’d share the findings in case anyone else is seeing weird disk usage after updating to Kernel 6.19.7.
TL;DR: There's a regression in the 6.19 memory management subsystem that causes MongoDB’s tcmalloc to SIGSEGV every ~30 seconds. If you're on Btrfs with snapshots, each crash triggers a WiredTiger journal recovery that generates hundreds of megabytes of new CoW extents. I lost 800GB of space in 4 days.
The symptoms:
du reported 60GB, but btrfs fi usage showed the drive was nearly full (1TB).
docker inspect was totally useless—it showed exit code 0 for the crashing container because it was restarting so fast.
docker events was the only thing that showed the constant "die (exit code 139)" loop.
I eventually had to do a binary-search isolation test on my 70+ containers to find the culprit. It wasn't my config or a bug in the app—rolling back to Kernel 6.18 fixed it instantly.
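If you ever have to do the same bisect, it looks roughly like this. Everything below is mocked (the container names, the culprit, and the crash check are stand-ins); on a real host the "does it still crash?" check would be watching docker events --filter event=die for a minute per round:

```shell
#!/usr/bin/env bash
# Sketch of a binary-search isolation test over ~70 containers.
# Stop half, see if the crash loop persists, recurse into the guilty half.
set -euo pipefail

containers=()
for i in $(seq 1 70); do containers+=("svc-$i"); done
culprit="svc-42"   # hypothetical: the one container that actually segfaults

# Mock check: "does the crash loop still happen with this subset running?"
# On a real host this would be `docker events --filter event=die` instead.
subset_crashes() {
  for c in "$@"; do
    [[ "$c" == "$culprit" ]] && return 0
  done
  return 1
}

rounds=0
candidates=("${containers[@]}")
while (( ${#candidates[@]} > 1 )); do
  half=$(( ${#candidates[@]} / 2 ))
  left=("${candidates[@]:0:half}")
  right=("${candidates[@]:half}")
  if subset_crashes "${left[@]}"; then
    candidates=("${left[@]}")
  else
    candidates=("${right[@]}")
  fi
  rounds=$((rounds + 1))
done
echo "culprit: ${candidates[0]} found in $rounds rounds"
# → culprit: svc-42 found in 6 rounds
```

The point is the round count: ~7 rounds instead of testing 70 containers one by one, even if each round costs you a minute of watching the event stream.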
If you're running MongoDB on Btrfs, you might want to hold off on the 6.19 update or keep a close eye on your snapshot growth. I wrote a post-mortem with the full debugging steps and logs here if you're interested in the deep dive:
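A rough way to keep an eye on that growth is a loop like the one below. The path is just an example for my layout; note that plain du undercounts shared CoW extents, so on Btrfs you'd get truer numbers from btrfs filesystem du -s on the snapshot directory:

```shell
# Watch how fast a snapshot/data dir grows over time.
# SNAP_DIR is an example path; point it at your snapshot subvolume.
# Plain `du` can't see extent sharing — `btrfs filesystem du -s` can.
SNAP_DIR="${SNAP_DIR:-/tmp}"
for _ in 1 2 3; do
  size_kib=$(du -sk "$SNAP_DIR" 2>/dev/null | cut -f1)
  echo "$(date +%T) ${size_kib} KiB"
  sleep 1
done
```

If the number climbs by hundreds of MB while your actual dataset is idle, you're probably hitting the same recovery-churn loop I did.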
AutoModerator@reddit
This submission has been removed due to receiving too many reports from users. The mods have been notified and will re-approve if this removal was inappropriate, or leave it removed.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
gringer@reddit
But is it web scale?
ChrisTX4@reddit
This isn’t a kernel bug. MongoDB uses tcmalloc under the hood, which makes use of Restartable Sequences (RSEQ). Tcmalloc intentionally violates the RSEQ API in order to implement an optimisation they consider somewhat critical, but because they’ve been violating the API, the introduction of upstream changes to RSEQ wasn’t blocked here, even though it broke tcmalloc and, by extension, MongoDB.
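Two quick read-only checks to see whether a given box is even in this code path (the config path is the usual distro location and may differ on yours; this is a sketch, not a diagnosis):

```shell
# Is RSEQ compiled into the running kernel?
kcfg=$(grep 'CONFIG_RSEQ' "/boot/config-$(uname -r)" 2>/dev/null \
  || echo "kernel config not readable here")
echo "$kcfg"
# glibc 2.35+ registers an rseq area for every thread by default;
# tcmalloc's per-CPU cache optimisation is built on top of rseq.
glibc_line=$(ldd --version 2>/dev/null | head -n 1 || echo "ldd not available")
echo "$glibc_line"
```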
The tcmalloc developers are currently looking for workarounds that would allow them to retain this trick. Their upstream issue can be found here.
julioqc@reddit
No source, no reference, just AI slop
newaaa41@reddit (OP)
this happened to me, I am the source
https://jira.mongodb.org/browse/SERVER-122741
julioqc@reddit
You or not, you have no reference. Your post looks like a promotional AI slop piece. Either you're one of many online grifters, or you need to work on your editorial skills.
b4k4ni@reddit
Maybe add the link to the blog. Your thread felt a bit too much like promoting the blog/webpage.
C0rn3j@reddit
Where's the bug report link?
newaaa41@reddit (OP)
https://jira.mongodb.org/browse/SERVER-122741
this is the bug report
alvas_man@reddit
That one seems to be a duplicate of this one: https://jira.mongodb.org/browse/SERVER-121912
C0rn3j@reddit
Thanks.
As per https://jira.mongodb.org/browse/SERVER-121912, this does not seem to be a kernel regression but a bug in tcmalloc that just happens to be exposed more on a newer kernel.
Both_Reaction6450@reddit
bug reports are for people who actually understand what they're talking about instead of just posting random crash dumps to reddit first
kansetsupanikku@reddit
for a scenario made up by an LLM? that would make it even worse if the OP filed a report just to support the story
Wartz@reddit
Fake LLM bot news
alvas_man@reddit
This is a real bug in MongoDB; I was hit by it too. The write-up is probably LLM though, yes.