Need someone who's real good with mdadm...

Posted by beboshoulddie@reddit | linuxadmin | View on Reddit | 28 comments

Hi folks,

I'll cut a long story short - I have a NAS which uses mdadm under the hood for RAID. I had 2 out of 4 disks die (monitoring fail...) but was able to clone the recently faulty one to a fresh disk and reinsert it into the array. The problem is, it still shows as faulty in when I run mdadm --detail.

I need to get that disk back in the array so it'll let me add the 4th disk and start to rebuild.

Can someone confirm if removing and re-adding a disk to an mdadm array will do so non-destructively? Is there another way to do this?

mdadm --detail output below. /dev/sdc3 is the cloned disk which is now healthy. /dev/sdd4 (the 4th missing disk) failed long before and seems to have been removed.

/dev/md1:
        Version : 1.0
  Creation Time : Sun Jul 21 17:20:33 2019
     Raid Level : raid5
     Array Size : 17551701504 (16738.61 GiB 17972.94 GB)
  Used Dev Size : 5850567168 (5579.54 GiB 5990.98 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Thu Mar 20 13:24:54 2025
          State : active, FAILED, Rescue
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : 1
           UUID : 3f7dac17:d6e5552b:48696ee6:859815b6
         Events : 17835551

    Number   Major   Minor   RaidDevice State
       4       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      faulty   /dev/sdc3
       6       0        0        6      removed

[-]

Superb_Raccoon@reddit

Recall your tapes. (No backups? Prepare 3 envelopes...) You don't say what level RAID you have but only 0+1 has a chance of surviving.

Next time, have a hot spare, and of, course fix the monitoring.

[-]

michaelpaoli@reddit

Oh, let's see. If I recall correctly, there's some type of assume clean/good or the like. The (potential) downside of that, is if it's not actually clean/good, or has missing or corrupted data ... but other than that, I think it ought work. So, let me see if I can throw together a quick test/demo of it - and in this case it will actually be good/clean - I can't speak for your actual drives or their data. So - I think maybe I'll (mostly) skip the comments on it, and just show commands/output. And may omit/trim output some fair bit for brevity/clarity (and space savings).

# cd $(mktemp -d /tmp/md.demo.XXXXXXXXXX) && df -h .
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           512M  768K  512M   1% /tmp
# (for d in 0 2 3; do truncate -s 134217728 "$d" && losetup -f --show "$d"; done)
/dev/loop0
/dev/loop2
/dev/loop3
# mdadm --create --level=raid5 --raid-devices=3 /dev/md53 /dev/loop[023]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md53 started.
# grep . /sys/block/md53/size
516096
# factor 516096
516096: 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 7
# perl -e 'print(2**(13+9),"\n");'
4194304
# dd bs=4194304 if=/dev/random of=/dev/md53 count=63 status=none
# < /dev/md53 sha512sum > sha512sum
# mdadm /dev/md53 --fail /dev/loop2 --remove /dev/loop2
# losetup -d /dev/loop2 && rm 2
# mdadm /dev/md53 --fail /dev/loop3 --remove /dev/loop3
mdadm: set device faulty failed for /dev/loop3:  Device or resource busy
# mdadm --detail /dev/md53 | sed -ne '/Major/,$p'
    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       -       0        0        1      removed
       3       7        3        2      faulty   /dev/loop3
# mdadm --stop /dev/md53
# mdadm --examine /dev/loop3 | grep -F -e '  State' -e 'Data Offset'
    Data Offset : 4096 sectors
          State : clean
# mdadm --zero-superblock /dev/loop3
# mdadm --examine /dev/loop3
mdadm: No md superblock detected on /dev/loop3.
# uuid=$(mdadm --examine /dev/loop0 | awk '/Array UUID/ {print $(NF);}')
# mdadm --create --level=raid5 --uuid="$uuid" --raid-devices=3 --force /dev/md53 /dev/loop0 missing /dev/loop3
mdadm: /dev/loop0 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Sat Aug 30 19:29:47 2025
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md53 started.
# < /dev/md53 sha512sum | cmp - sha512sum && echo MATCHED
MATCHED
# sed -ne '/^md53 :/,/^ *$/{/^ *$/q;p;}' /proc/mdstat
md53 : active raid5 loop3[2] loop0[0]
      258048 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [U_U]
# mdadm --detail /dev/md53 | sed -ne '/Major/,$p'
    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       -       0        0        1      removed
       2       7        3        2      active sync   /dev/loop3
# (d=2; truncate -s 134217728 "$d" && losetup -f --show "$d")
/dev/loop2
# mdadm /dev/md53 --add /dev/loop2
mdadm: added /dev/loop2
# sed -ne '/^md53 :/,/^ *$/{/^ *$/q;p;}' /proc/mdstat
md53 : active raid5 loop2[3] loop3[2] loop0[0]
      258048 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
# mdadm --detail /dev/md53 | sed -ne '/Major/,$p'
    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       3       7        2        1      active sync   /dev/loop2
       2       7        3        2      active sync   /dev/loop3
# < /dev/md53 sha512sum | cmp - sha512sum && echo MATCHED
MATCHED
# [ "$uuid" = "$(mdadm --detail /dev/md53 | awk '/UUID/ {print $(NF);}')" ] && { echo MATCHED; unset uuid; }
MATCHED
#

And, well, not enough space left to add my comments here, so shall reply to this to continue that.

[-]

michaelpaoli@reddit

And, continuing from my earlier comment above:

So, though I marked the device in the array as faulty, I wasn't able to get it to show an unclean state, so, I took the more extreme measure of wiping the superblock (--zero-superblock) - so md would have no idea of the status or nature of any data there. Then I recreated the array - exactly as before, except starting with one device missing. In that case, with raid5, there's no parity to be written, or any data other than superblock metadata, so, in creating exactly the same, the structure and layout is again exactly the same, and the data is otherwise untouched, and only the metadata/superblock is written. And since we've given md no reason to presume or believe anything is wrong with our device that has no superblock at all, it simply writes out our data. At that point we have operational started md raid5 in degraded state, with one missing device. The rest is highly straight forward - I just show some details that the data exactly matches what was on the md device before, also preserved (Array) UUID, and showed some status bits of the recovered array in degraded state, and after adding replacement device and allowing it time to sync, again final status and again a check showing the data still precisely matched, and that we've got same correct (Array) UUID for the md device. Easy peasy. ;-) Uhm, yeah, when it doubt, generally good to test on not actual production data and devices. And, if nothing else, with loop devices, that can be pretty darn easy and convenient to do.

Note also, you've got version 1.0, so if you actually try something like (re)creating the array on those devices, be sure to do it with exact same version and exact same means/layout of creation - except have at least one device missing when so doing - so it doesn't start calculating and writing out parity data. In fact with sparse files, you could pretty easily test it while consuming very little actual space to do so ... at least until one adds last missing device and works to go from degraded to full sync, and calculates and writes out all that parity data - then the space used would quickly largely balloon (up to bit more than eating the full size of one of the devices). You can also test it by putting some moderate bit of random (or other quite unique) data on there first (but again, with one device missing, so it doesn't calculate and write out parity), and read that data early in your testing (and save it or hash thereof), and likewise when you have all devices in the array healthy except for missing one drive. Yeah, also be sure the order of any such (re)creation of array is exactly the same way - otherwise the data would likely become toast (or at least upon writes to array, or resync when md writes parity to the array). In your case, you can probably avoid recreating the array, and using --assume-clean. Also, I tried to assemble array with less drives than needed to at least start it in degraded mode - I don't know that md has a means of doing that (I didn't find such means). Seems there should be, on running array, means to unmark a drive from being in failed/faulty state ... but I'm not aware of a more direct way to do that.

[-]

beboshoulddie@reddit (OP)

As I replied to another commenter, I've spent some time today setting up a VM with 4 disks with a similar configuration to my real life issue.

If I fail and remove one disk (disk '4' from my real life scenario), then fail another disk (disk '3'), the array remains readable (as expected, it's degraded but accessible, but won't rebuild).

If I unmount, stop and re-assemble the array with the --force flag, and using only disks 1-3 then that seems to preserve my data and clear the faulty flag (and i am avoiding using --add which does seem destructive).

I can then use --add on the 4th (blank) drive to start the rebuild.

Does that seem sane?

[-]

michaelpaoli@reddit

Yes, seems like a sane plan, and of course be sure you've well tested that scenario. And as I pointed out, can well emulate that, with sparse files, loopback devicesf, etc. Even copy the exact same metadata off the existing where that's readable - just be sure to then use those either on another host, or change the UUIDs and other bits so they don't at all conflict on the same host.

[-]

beboshoulddie@reddit (OP)

Hey there, just wanted to say thanks for your detailed write up again - unfortunately on the old md version on this NAS device the re-assemble wasn't resetting the fail flag on that 3rd disk. I performed the superblock zero you outlined and it worked, I've now been able to insert my 4th disk and start the rebuild.

An absolute hero, I owe you a pint if you're ever in the UK. 🍻

[-]

michaelpaoli@reddit

Cool, glad to hear it worked! I figured it probably would. Always good to test/verify, a lot of what gets put on The Internet and (so called) "social media" ... uhm, yeah, ... that. Yeah, I was dealing with different version on the labeling - though theoretically that would still behave the same - using the correct version labeling, etc. ... but of course also sounds like I'm probably using a much newer version of mdadm, so that could also potentially make a difference.

Yeah, sometimes mdadm can be a bit tricky - it doesn't always give one all the low-level access one might needs/want to do certain things ... for better and/or worse. But in a lot of cases, there are, if nothing else, effective work-arounds to get the needed done, if there isn't a simpler more direct way with more basic mdadm commands or the like. I suppose also, e.g. with digging into source code and/or maybe even md(4), probably also feasible to figure out how to set the device state to clean, e.g. stop array, change state of unclean device to clean, then restart array.

[-]

Dr_Hacks@reddit

It's possible way, but as i said in 2nd answer with additional info request - you should NEVER do things like --force reassemble unless full backup, it's very dangerous try cause we don't know specs of raid and device order, need to find this first from device details and mdstat

However, data recovery software like, again, r-studio, can try to assemble resulting block device from this state in minutes with \~0 risk of data loss.

[-]

Eiodalin@reddit

As someone who has recovered from Raid5 disk failures, i know for a fact that you need to remove the failed drive with the command

mdadm --manage /dev/md0 --remove /dev/sdc1

However since you have no spare disk in the array already that might become a situation of data loss

[-]

uzlonewolf@reddit

I know for a fact that --add will be destructive and wipe everything on the drive. I think --re-add is your best bet but I have never done it myself and don't know if it'll recover your array.

If you have another drive that's not part of the array (doesn't have to be that big) you can emulate something like a "dry run" by redirecting all writes to an overlay file so the original disks are not touched.

[-]

beboshoulddie@reddit (OP)

I've spent some time testing on a VM with a few small disks that I assembled into a RAID 5 config. If I fail and remove one disk (disk '4'), then fail another disk (disk '3'), the array remains readable (as expected, it's degraded but accessible, but won't rebuild).

I can then use --add on the 4th (blank) drive to start the rebuild.

Does that seem sane?

[-]

uzlonewolf@reddit

Yep, that looks good!

[-]

Dr_Hacks@reddit

moved from mdadm raid5(10 years) to testing btrfs raid5 just week ago cause of really bad mdadm cli and block

RIAD5 is 3xN disks raid, you CAN NOT make 4 disks raid5(unlike most of hardware conrollers using stripes for each disk, but even hardware raid5 with 4 disks will be a mess by the size), it's will be just raid 5 degraded like this or raid5 3 disks and 1 spare.
You DONT need to remove anything to test and restore, just read everything from md1 like dd|pv>/dev/null or rsync to safe place and thats all needed to test(better to do ACTUAL backup with this to avoid duplicate access if remaining disks have some bad sectors). YOU NEED THIS FIRST
You MUST NOT replace faulty disk this way like you did, it's ALREADY MARKED AS FAILED if it can write data, on it's metadata, in md terms you need to remove disk by mdadm and reinsert as fresh, ONLY AFTER that resync will start

mdadm --manage /dev/md1 --fail /dev/sdc3

mdadm --manage /dev/md1 --remove /dev/sdc3

mdadm --grow /dev/md1--raid-devices=3

mdadm --manage /dev/md1 --add /dev/sdc3

If [2] is ok or it's reading just fine you can start [3] now already, nothing missed, it's raid5 2/3 disks alive array.

[-]

uzlonewolf@reddit

OP has a RAID5 array with 2 drives failed. Attempting to manage and add drives like you suggest will result in the array being destroyed and all data lost.

[-]

Dr_Hacks@reddit

OP has a RAID5 array with 2 drives failed

Wrong (c)

[-]

uzlonewolf@reddit

Raid Devices : 4
Working Devices : 2

Did you not read the OP?

[-]

Dr_Hacks@reddit

RTFM above, you're so bad "admin" , that you can't even realize that RAID5 on 4 drives md is impossible, 4th - spare, if not - it's ALREADY DESTROYED cause of wrong OP actions, he'll need to recover manually after, marking replaced failed(even recovered) as good on active raid is worst idea ever, it's more about "go to data recovery specialists".

even mdadm clearly says it

 Active Devices : 2
Working Devices : 2
 Failed Devices : 1

cause there is no spare in stats

And there is no way to "destruct" md array. It won't let you.

[-]

uzlonewolf@reddit

RAID5 on 4 drives md is impossible

Complete bullshit. Please go learn RAID basics before spouting off this nonsense. RAID5 works just fine with 4 disks (the data is striped across 3 of them and the 4th is used for parity).

And when the array is in a failed state, doing --add on a disk that is required but was removed/marked failed WILL destroy the array.

[-]

beboshoulddie@reddit (OP)

This is crazy - RAID 5 is a minimum of 3 disks but can be any number.

4 works fine, as does 20.

RAID 5 stripes the parity across all drives with tolerance for 1 failure. It is not dependant on the number of drives, apart from the minimum.

[-]

Dr_Hacks@reddit

P.S. Thanx for corrections, totally agree with type of raid determination by size. It's 3 to 1 raid5 with 3/4 capacity.

[-]

Dr_Hacks@reddit

3/4 capacity it's not raid 5 at all, it's either DOUBLE raid5 with ABC BCD xor groups, or stripes, and no, mdadm NOT using striped structure to make raid5/6 from any number of disk , so it's double raid5.

[-]

Dr_Hacks@reddit

It can't be not power of 3(raid5) in md by default, but can be in latest kernels/tools, or if created using flags - its ALREADY broken, it's NOT RAID5 and its unrecoverable using mdadm way,thats why i'm asking about /proc/mdstat to check for spares. The only way to recover it from totally failed fake raid5(let me guess - newer mdadm will create raid5 3/4 capacity, that's impossible for raid5) with stripes - reassemble in r-studio, manual try recreate array without rebuild and so on, but anyway, NOT WITHOUT BACKUP.

     Array Size : 17551701504 (16738.61 GiB 17972.94 GB)
  Used Dev Size : 5850567168 (5579.54 GiB 5990.98 GB)

gives bad intentions, as i've checked 3/4 fake raid5 from mdadm gives exact 3/4 capacity thats's impossible to recover from >1 drive failure(not spare), and thats it. So best guess if it's really happened - dont mess with mdadm, as i said above. start with backup then mount loop as r/o and try to force reassemble or just go with data recovery software that can recognize soft-raid and reconstruct mappings independent of md state.

[-]

skat_in_the_hat@reddit

Get on another linux box, create 4 files, write a file system to each file. Set them up the same way in mdadm. Then do what you're trying to do with that setup.

[-]

beboshoulddie@reddit (OP)

Replied to another commenter, I've spent some time today doing just that and found the following.

If I fail and remove one disk (disk '4' from my real life scenario), then fail another disk (disk '3'), the array remains readable (as expected, it's degraded but accessible, but won't rebuild).

I can then use --add on the 4th (blank) drive to start the rebuild.

Does that seem sane?

[-]

Dr_Hacks@reddit

P.S. if you want to "just clear FAULTY flag" - it's officially impossible. but it's easy done using destroy+forced reassembling of array without resync. Cause of unknown disk order it's can be done only if you have backed up every disk.

[-]

Einaiden@reddit

Here is my recommendation: clone all four drives to new drives and work on those clones. If you can, it might be easier if you can clone them to disk images and assemble the raid from the disk images.

Now you can test forcing the array to start without destroying recovery options.

[-]

beboshoulddie@reddit (OP)

Unfortunately I don't have the ability to do that, I have a clone of disk 3 but that's all I have space for. Disk 4 has failed entirely so can't be cloned.

[-]

uzlonewolf@reddit

If you have another drive that's not part of the array (doesn't have to be that big) you can emulate something like a "dry run" by redirecting all writes to a device mapper snapshot file so the original disks are not touched.

https://unix.stackexchange.com/a/67681/125151 <- Do that for all 3 drives and pass mdadm the /dev/mapper/... virtual devices to test the re-adding.