BTRFS chunk tree corruption on UGREEN DXP2800 NAS, orphaned block groups blocking mount, standard repair tools failing
Posted by osoatwork@reddit | sysadmin | 15 comments
Running a UGREEN DXP2800 NAS (Intel N100, UGOS/Debian-based) with two 8TB WD Red drives in BTRFS RAID1. After a power loss, Volume 1 mounted read-only with chunk tree corruption.
**Current state:**
- Both drives pass SMART
- `btrfs check --chunk-root 29573120 --repair --force` successfully opens the filesystem and repairs extent references
- Two orphaned block groups remain that cause it to abort: `Block group[4769591590912]` and `Block group[4770665332736]` — "didn't find relative chunk"
- Filesystem will not mount
**What I've tried:**
- `btrfs rescue chunk-recover` — device busy
- `btrfs rescue zero-log` — couldn't open ctree
- `btrfs check --repair` with all 4 backup chunk roots from superblock
- `--clear-space-cache v2` — completed successfully
- `--init-extent-tree` — crashes with assertion error
- SystemRescue live USB for unmounted repair — auto-reboots before repair can complete
**Specific question:** How do I remove or fix these two orphaned block groups? Is there a way to manually delete them from the chunk tree, or force BTRFS to ignore them on mount?
Any help appreciated.
ledow@reddit
Restore from backup.
BTRFS is a mess and a disaster waiting to happen. Virtually unrepairable without stupendous amounts of help, and you want to keep the original corruption for rollback because most "repairs" make the thing worse.
Plus it's well-documented to corrupt regularly if it nears filling up the storage device.
I wouldn't touch btrfs with a bargepole given previous experiences.
The only people who stand a chance will be those expert developers frequenting the various btrfs mailing lists, and you know what? It's not worth their or your time.
Restore from backup. If no backup, then learn the lesson - you can't just "fix" things that easily. Recover what you can (good luck, not much supports recovery from btrfs except the btrfs tools you already have), try not to run too many commands (unless you retain before and after copies), and consider NOT using btrfs in the future.
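One way to keep a "before" copy, as suggested above, is to image each member device to scratch storage before running anything that writes. A minimal sketch, assuming `dd` is available and there is enough free space for a full-size image; the function name `backup_image` and all paths are hypothetical placeholders, not anything from the original post:

```shell
#!/usr/bin/env bash
# Sketch only: copy a device (or file) to an image before attempting repairs.
# backup_image is a hypothetical helper name; paths below are placeholders.

backup_image() {
    local src="$1" dst="$2"
    # Never overwrite an existing image, so the "before" copy stays intact.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing image: $dst" >&2
        return 1
    fi
    # conv=fsync flushes the output to disk before dd reports success.
    dd if="$src" of="$dst" bs=1M conv=fsync
}

# Example invocation (run as root, with the filesystem unmounted):
#   backup_image /dev/sdX /mnt/scratch/sdX-before.img
```

With RAID1, each of the two member drives would need its own image.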
The mailing lists are full of people trying to recover data from borked btrfs images and I rarely see success.
dustojnikhummer@reddit
And here I thought BTRFS is considered a mature filesystem for network storage? AFAIK Synology uses it even on rackstations?
I guess me running ext4 because "that's what I'm used to" was the right choice?
kiler129@reddit
If you want stability you use ZFS.
ledow@reddit
It's that nonsense which I've always pushed back on.
BTRFS has stuff that Ext4 doesn't, for sure, but all these fancy "new" filesystems are nowhere near as mature as stuff like Ext4.
BTRFS has serious problems in out-of-space conditions, leading to corruption. Their advice is basically "never let it run out of space" which, with snapshots and file streams and all kinds of other metadata, is not something which you can monitor simply.
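For context, the figure usually watched here is unallocated device space rather than the free space `df` reports, since btrfs needs unallocated space to create new chunks. A rough sketch, assuming the `Device unallocated:` line printed by `btrfs filesystem usage -b`; the helper names, the threshold, and the mount point are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: read the unallocated byte count from `btrfs filesystem usage -b`.

# Reads usage output on stdin and prints the "Device unallocated" value in bytes.
unallocated_bytes() {
    awk '/Device unallocated:/ { print $3; exit }'
}

# Succeeds (exit 0) when the byte count is below the given threshold.
low_on_unallocated() {
    local bytes="$1" threshold="$2"
    [ "$bytes" -lt "$threshold" ]
}

# Example on a live system (mount point and 50 GiB threshold are placeholders):
#   free="$(btrfs filesystem usage -b /volume1 | unallocated_bytes)"
#   low_on_unallocated "$free" $((50 * 1024 * 1024 * 1024)) && echo "low on unallocated space"
```

This does not capture everything, since metadata can still fill up inside already-allocated chunks, which is part of why monitoring is not simple.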
I don't count ZFS on Linux as mature, either. I'm sure the FS is well-established but the open ports are still relatively new.
I've never had an Ext4 trash itself to the point that the filesystem was unrecoverable (maybe a file or two), but on pretty much every other FS (including NTFS), I've seen that exact scenario. "Wipe it and start again".
Just bought some brand-new QNAP NAS's and they come with either ZFS or Ext4, no option for btrfs, and ZFS is not the default. There's a reason for that.
People enable btrfs because they want snapshotting and fancy features and it works fine... until it doesn't.
A lifetime of IT has taught me: Don't play with your filesystems. It's just not worth it.
ledow@reddit
Oh, and anecdote time:
Still suffering trauma from my DOS days when a friend was using our very-expensive Windows 3.1 PC and helping with some file maintenance.
We needed to delete a folder to clear some space, and we triple-checked everything, and we were booted into DOS to do so, and then they deleted the files from DOS.
We got suspicious quite quickly because, even with an old IDE hard drive, the deletes were taking too long. We killed the delete command (Ctrl-C / Ctrl-Break at the time) and went looking. Our root filesystem was empty. C:\ showed almost nothing.
Ran chkdsk and discovered "cross-linked chains" on the FAT16 filesystem. Basically the filesystem was linking directories together erroneously and had gotten corrupt. Subdirectories were actually pointing to the root of the drive. So we had:
C:\ which contained C:\DOS, etc. and C:\WIN31 - which contained C:\WIN31\DOS and C:\WIN31\WIN31 and so on... basically it thought that the subdirectory WIN31 contained everything that was actually in the root, and so generated an infinite loop when deleting.
So we had things like:
C:\WIN31\WIN31\WIN31\WIN31\WIN31\WIN31\WIN31\AUTOEXEC.BAT and so on.
And, of course, deleting a file from those subfolders deleted it from the ROOT of the C:\ drive too. So it had deleted virtually everything on the drive, and was spinning trying to delete an infinitely-nested subdirectory.
Fortunately, our Ctrl-C happened in time and we were still running (somehow!).
We were able to fix the problem with chkdsk, and we had had the sense to partition our drive, so the other drive letters where the data was were fine, but we had to recreate the AUTOEXEC, CONFIG.SYS, DOS folders, etc. that were on there before we dared reboot. Fortunately, our REAL Windows install was on C:\WINDOWS (at the time you could customise it) and the cross-linked chain was on C:\WIN31 (a copy of Windows that we kept around as a backup), so it never managed to delete our Windows install or folders named more than W.
We managed to fix it (a LOT of checking and a single reboot, believe it or not), and that computer ran for years afterwards, but we were about to murder our friend. But they hadn't really done anything wrong. They were deleting an innocent subdirectory that just happened to have got cross-linked to the root, so the machine ran off and deleted the entire drive thinking it was a subdirectory of what we wanted to delete.
dustojnikhummer@reddit
Hardlinked cyclic nested subfolders... I have managed both, but never at the same time. And that was on modern Windows, even that was a pain to fix.
Yeah don't fuck with your filesystem, amen to that. Surprised you still had stuff like checkdisk at that point. Also, this is exactly why Windows prevents you from fucking around in its own folders. The memes "I am the Admin" are wrong. You aren't and you shouldn't be. Same as rm -rf / now requiring --no-preserve-root
Computers do what you tell them to do, not what you want them to do.
dustojnikhummer@reddit
I think ZFS is considered mature on BSD/Unix and relatively new on the Linux side. TrueNAS uses it and so far it looks rock solid, but as you said it isn't perfect.
And what about XFS? Enterprise Linux uses it by default
rejectionhotlin3@reddit
ZFS is your friend. BTRFS is a neat concept but very poor overall. See linux lore and Linus's comments.
Unnamed-3891@reddit
This. I am really sad that this is still true, but it is. ZFS is fucking amazing and I love it, but I get why it’s not everywhere due to licensing concerns. BTRFS, despite existing for almost 20 years and having been officially declared "stable" for 12, has never really been that.
tobias3@reddit
To all the btrfs haters: This thing does not have ECC memory, so it may as well have been memory corruption. Also idk if the Linux kernel is a reasonable one.
In a professional setting you don't want a "repair" that randomly restores some data. You'd want to properly restore from backup.
To OP: Contact your vendor. Posting on the btrfs mailing list might also help you. You already might have made the situation worse via the repair commands.
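If it does go to the mailing list, a report usually needs the kernel and btrfs-progs versions plus read-only diagnostics. A minimal sketch that gathers them without writing to the filesystem; `run_section` and `collect_report` are hypothetical helper names and `/dev/sdX` is a placeholder:

```shell
#!/usr/bin/env bash
# Sketch only: gather read-only diagnostics into one text report.

# Print a header, run the command, and keep going even if it fails.
run_section() {
    echo "### $*"
    "$@" 2>&1 || echo "(command failed or not available)"
    echo
}

collect_report() {
    local device="$1"
    run_section uname -a
    run_section btrfs --version
    run_section btrfs filesystem show
    # dump-super only reads the superblock; -f adds backup roots and the system chunk array.
    run_section btrfs inspect-internal dump-super -f "$device"
}

# Example (as root): collect_report /dev/sdX > btrfs-report.txt
```

Kernel messages from the failed mount attempt (for example from `dmesg`) are worth attaching alongside this output.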
sambodia85@reddit
UGREEN, N100. Surely this is one for r/homelab
Altusbc@reddit
Good luck to the OP in recovering from this corruption. IMO, btrfs has way too many issues to be used in any sense of production systems.
FalconDriver85@reddit
SLES 16 made it the default file system.
Deploying the standard images on Azure, I found out Btrfs was the default.
I’m really really tempted to do a setup on Hyper-V with LVM and XFS and use it as base image to deploy on Azure…
Sroni4967@reddit
the post literally says they already tried chunk-recover (device busy). for the orphaned block groups specifically, have you tried mounting with rescue=all,ro? also btrfs inspect-internal dump-tree -t chunk might help you see exactly what's going on with those two block groups before you try anything more destructive. if none of that works honestly I'd post to the btrfs mailing list with the dump-tree output, the devs are pretty responsive to corruption cases like this
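Roughly, those two suggestions look like the sketch below. It assumes a kernel new enough to accept `rescue=all`, and that `dump-tree` prints chunk items with keys of the form `(FIRST_CHUNK_TREE CHUNK_ITEM <offset>)`; `has_chunk_item`, `/dev/sdX` and `/mnt/recovery` are hypothetical placeholders. The two offsets are the block group addresses from the original post:

```shell
#!/usr/bin/env bash
# Sketch only: read-only inspection, nothing here writes to the filesystem.

# Succeeds if the chunk tree dump on stdin has a CHUNK_ITEM at the given offset.
has_chunk_item() {
    local offset="$1"
    grep -q "CHUNK_ITEM ${offset})"
}

# Read-only rescue mount (as root):
#   mount -o ro,rescue=all /dev/sdX /mnt/recovery
#
# Dump the chunk tree once, then check both block group addresses:
#   btrfs inspect-internal dump-tree -t chunk /dev/sdX > chunk-tree.txt
#   for bg in 4769591590912 4770665332736; do
#       if has_chunk_item "$bg" < chunk-tree.txt; then
#           echo "chunk item present for $bg"
#       else
#           echo "no chunk item for $bg"
#       fi
#   done
```

If the read-only mount succeeds, copying the data off comes before any further repair attempt.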
GallowWho@reddit
Can you try mounting the files using an Arch Linux USB environment? Trying a newer kernel might help with the kernel panic