Heads up: Kernel 6.19 regression causing silent crash loops in MongoDB (and eating Btrfs storage)
Posted by newaaa41@reddit | linux | View on Reddit | 15 comments
I just spent my entire weekend chasing a "phantom" storage leak on my Fedora homelab (Threadripper 2920X) and figured I’d share the findings in case anyone else is seeing weird disk usage after updating to Kernel 6.19.7.
TL;DR: There's a regression in the 6.19 memory management subsystem that causes MongoDB’s tcmalloc to SIGSEGV every ~30 seconds. If you're on Btrfs with snapshots, each crash triggers a WiredTiger journal recovery that generates hundreds of megabytes of new CoW extents. I lost 800GB of space in 4 days.
The symptoms:
du reported 60GB, but btrfs fi usage showed the drive was nearly full (1TB).
docker inspect was totally useless—it showed exit code 0 for the crashing container because it was restarting so fast.
docker events was the only thing that showed the constant "die (exit code 139)" loop.
I eventually had to do a binary-search isolation test on my 70+ containers to find the culprit. It wasn't my config or a bug in the app—rolling back to Kernel 6.18 fixed it instantly.
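If you ever have to do the same bisect, it looks roughly like this. Everything below is mocked (the container names, the culprit, and the crash check are stand-ins); on a real host the "does it still crash?" check would be watching docker events --filter event=die for a minute per round:

```shell
#!/usr/bin/env bash
# Sketch of a binary-search isolation test over ~70 containers.
# Stop half, see if the crash loop persists, recurse into the guilty half.
set -euo pipefail

containers=()
for i in $(seq 1 70); do containers+=("svc-$i"); done
culprit="svc-42"   # hypothetical: the one container that actually segfaults

# Mock check: "does the crash loop still happen with this subset running?"
# On a real host this would be `docker events --filter event=die` instead.
subset_crashes() {
  for c in "$@"; do
    [[ "$c" == "$culprit" ]] && return 0
  done
  return 1
}

rounds=0
candidates=("${containers[@]}")
while (( ${#candidates[@]} > 1 )); do
  half=$(( ${#candidates[@]} / 2 ))
  left=("${candidates[@]:0:half}")
  right=("${candidates[@]:half}")
  if subset_crashes "${left[@]}"; then
    candidates=("${left[@]}")
  else
    candidates=("${right[@]}")
  fi
  rounds=$((rounds + 1))
done
echo "culprit: ${candidates[0]} found in $rounds rounds"
# → culprit: svc-42 found in 6 rounds
```

The point is the round count: ~7 rounds instead of testing 70 containers one by one, even if each round costs you a minute of watching the event stream.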
If you're running MongoDB on Btrfs, you might want to hold off on the 6.19 update or keep a close eye on your snapshot growth. I wrote a post-mortem with the full debugging steps and logs here if you're interested in the deep dive:
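A rough way to keep an eye on that growth is a loop like the one below. The path is just an example for my layout; note that plain du undercounts shared CoW extents, so on Btrfs you'd get truer numbers from btrfs filesystem du -s on the snapshot directory:

```shell
# Watch how fast a snapshot/data dir grows over time.
# SNAP_DIR is an example path; point it at your snapshot subvolume.
# Plain `du` can't see extent sharing — `btrfs filesystem du -s` can.
SNAP_DIR="${SNAP_DIR:-/tmp}"
for _ in 1 2 3; do
  size_kib=$(du -sk "$SNAP_DIR" 2>/dev/null | cut -f1)
  echo "$(date +%T) ${size_kib} KiB"
  sleep 1
done
```

If the number climbs by hundreds of MB while your actual dataset is idle, you're probably hitting the same recovery-churn loop I did.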
AutoModerator@reddit
This submission has been removed due to receiving too many reports from users. The mods have been notified and will re-approve if this removal was inappropriate, or leave it removed.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
gringer@reddit
But is it web scale?
ChrisTX4@reddit
This isn’t a kernel bug. MongoDB uses tcmalloc under the hood, which makes use of Restartable Sequences (RSEQ). Tcmalloc intentionally violates the RSEQ API in order to implement an optimisation they consider somewhat critical, but because they’ve been violating the API, the introduction of upstream changes to RSEQ wasn’t blocked here, even though it broke tcmalloc and, by extension, MongoDB.
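Two quick read-only checks to see whether a given box is even in this code path (the config path is the usual distro location and may differ on yours; this is a sketch, not a diagnosis):

```shell
# Is RSEQ compiled into the running kernel?
kcfg=$(grep 'CONFIG_RSEQ' "/boot/config-$(uname -r)" 2>/dev/null \
  || echo "kernel config not readable here")
echo "$kcfg"
# glibc 2.35+ registers an rseq area for every thread by default;
# tcmalloc's per-CPU cache optimisation is built on top of rseq.
glibc_line=$(ldd --version 2>/dev/null | head -n 1 || echo "ldd not available")
echo "$glibc_line"
```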
The tcmalloc developers are currently looking for workarounds that would allow them to retain this trick. Their upstream issue can be found here.
julioqc@reddit
No source, no reference, just AI slop
newaaa41@reddit (OP)
this happened to me, I am the source
https://jira.mongodb.org/browse/SERVER-122741
julioqc@reddit
You or not, you have no reference. Your post looks like a promotional AI slop piece. Either you're one of many online grifters, or you need to work on your editorial skills.
b4k4ni@reddit
Maybe add the link to the blog. Your thread felt a bit too much like promoting the blog/webpage.
C0rn3j@reddit
Where's the bug report link?
newaaa41@reddit (OP)
https://jira.mongodb.org/browse/SERVER-122741
this is the bug report
alvas_man@reddit
That one seems to be a duplicate of this one: https://jira.mongodb.org/browse/SERVER-121912
C0rn3j@reddit
Thanks.
As per https://jira.mongodb.org/browse/SERVER-121912, this does not seem to be a kernel regression but a bug in tcmalloc that just happens to be exposed more on a newer kernel.
Both_Reaction6450@reddit
bug reports are for people who actually understand what they're talking about instead of just posting random crash dumps to reddit first
kansetsupanikku@reddit
for a scenario made up by an LLM? that would make it even worse if the OP filed a report just to support the story
Wartz@reddit
Fake LLM bot news
alvas_man@reddit
This is a real bug in MongoDB; I was hit by it too. The write-up is probably LLM though, yes.