How to gracefully swap a failing SAS in a RAID5 array on a Poweredge PERC controller?
Posted by Snot-p@reddit | sysadmin | View on Reddit | 50 comments
Hi all,
In a bit of a situation where I can use some guidance on hardware I inherited. I have 5 1.2TB SAS drives in a RAID5 array on an older Poweredge R540 on a PERC H740P hardware RAID controller.
One of the five drives in the RAID5 is throwing SMART errors and is in a predictive failure state but is still online for now. I have an identical 1.2TB SAS listed ready as a global hot spare on this PERC controller. It's not dedicated to that RAID5 array.
I'm assuming it's incredibly bad practice to just yank the failing drive and force an array failover onto that global hot spare, since I'd be risking a punctured stripe during the rebuild. From reading, I see you're supposed to do a Replace Member on the PERC. The issue: from what I can see, iDRAC exposes none of that. There's no way to mark a drive for Replace Member and kick off a safe preemptive rebuild onto the hot spare.
I see that you can use PERCCLI to kick off a Replace Member - is this just a Dell utility that runs on the Hypervisor? Is this the right way of going about this? Or are people just yanking a drive and letting the array do the work after immediately slapping in a new healthy drive?
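For anyone searching later, here's a rough sketch of what that might look like. PERCCLI is Dell's rebadge of Broadcom's StorCLI, and "Replace Member" maps to what StorCLI calls a copyback. The controller/enclosure/slot IDs below (c0, e32, s2, s5) are placeholders, and the exact syntax should be verified against `perccli64 /c0 show help` on your box before running anything:

```shell
# Identify the failing member and the hot spare (note the EID:Slt column)
perccli64 /c0 show

# "Replace Member" = copyback in StorCLI terms: copy the failing member
# (e32:s2) onto the spare (e32:s5) while the array stays fully redundant.
# IDs here are placeholders - substitute the ones from the listing above.
perccli64 /c0/e32/s2 start copyback target=32:5

# Watch progress
perccli64 /c0/e32/s2 show copyback
```

The appeal of this over a pull-and-rebuild is that the array never drops out of redundancy while the copy runs.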
Thanks
lutiana@reddit
You pull the failing drive, put in the new drive and walk away. Your disk access will be very slow until the new drive has been completely re-silvered (ie the data is rebuilt from the parity).
These servers are designed to let you do exactly this, so as to eliminate downtime from a failed drive, whether from removal or straight-up failure.
That said, contact Dell support to verify this if you must, but also make sure your backups are solid and reliable.
Smooth-Zucchini4923@reddit
This. I spent literally hours trying to read the manual and BIOS to try to figure out how to tell the RAID array that I was about to pull the disk, and it should replace it with the new disk.
It's really as simple as lutiana says: you pull the failing disk, and insert the new disk in the same position in the array as the old disk.
CeldonShooper@reddit
I know this goes without saying but please for the love of everything that's holy: Check carefully that you are 100% absolutely sure you pull exactly the right drive.
If you have only one drive of redundancy and you accidentally pull a healthy drive, the controller has to rebuild it from parity on a now non-redundant array that still contains the ailing drive. If that drive finally dies mid-rebuild, your whole array is dead in the water.
Zenkin@reddit
iDRAC allows you to "blink" a HDD. Use the feature!
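If you have shell access to the iDRAC, racadm can do this too. A hedged sketch - the FQDD names (Disk.Bay.N, Enclosure.Internal.0-1, RAID.Integrated.1-1) vary per system, so list yours first:

```shell
# List physical disks to get the exact FQDDs on this machine
racadm storage get pdisks

# Blink the LED on the suspect drive (FQDD below is a placeholder)
racadm storage blink:Disk.Bay.2:Enclosure.Internal.0-1:RAID.Integrated.1-1

# Turn it back off once you've confirmed the bay
racadm storage unblink:Disk.Bay.2:Enclosure.Internal.0-1:RAID.Integrated.1-1
```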
jrhoades@reddit
Be very careful doing this - a failed drive will already have an error light lit. Blinking the light with a Dell PERC may override that error light, causing your on-site sysadmin to pull the wrong drive. Yes, we did this. Twice.
SlayTalon@reddit
There's nothing worse than the silence on the other side of the phone call after the client says they're pulling the drive, then the distinct sounds of one drive being removed, being immediately placed back in, and another drive being removed.
CeldonShooper@reddit
Yeah it's like a surgeon amputating the wrong leg.
TreborG2@reddit
Guys, you're forgetting: this admin has already marked a drive as a hot spare. If it's marked as a global hot spare for this (or any) RAID container, then as soon as the failed or failing drive is pulled, the RAID controller should go into full failure-replacement mode and automatically pick up the assigned hot spare.
It's not "pull the failing one out and then put something in its place" - you pull the failing one out and the hot spare takes over.
The question becomes when you put the new drive in and make it the new hot spare ...
Expensive_Plant_9530@reddit
To be fair, you can put another drive in, but what you'd be doing is adding an additional hot spare, which you then have to configure after the fact.
But as soon as you pull the failing drive, yeah the hot spare is going to start rebuilding the array if the array is set up correctly, lol.
Euler007@reddit
To do it gracefully make sure to bend the knees and do a short bow after removing the failed drive.
lutiana@reddit
Well, I used to sacrifice a chicken, but my boss got tired of having to clean the blood out of the server room. So now I just do a ritual dance.
TundraGon@reddit
I spray holy water on each server and network equipment, every morning.
/s - in case it's not clear.
kvorythix@reddit
pull the bad drive, let the controller finish marking it failed, then swap in the new one and monitor the rebuild. don't force it unless you already know the array is healthy enough to take the hit
BlotchyBaboon@reddit
Dell support can definitely tell you the right way to do this.
I can tell you that in the past I've yanked the drive and shoved the new one in. But really, don't trust me. There's probably a better way. I don't know your exact setup or what's involved, so my advice is terrible.
Regardless... it doesn't sound like a fun Friday afternoon thing, so I feel for ya.
Snot-p@reddit (OP)
Unfortunately that Poweredge is on an island of its own and is close to a decade old at this point. No active support contract or even iDRAC license (ugh).
Luckily... the drive has been recovering from the errors, so I have some time to think before I act. The array is still in a healthy state, so I'm just trying to think this through before I really make a mess of things and have to even think about the off-site backups.
Expensive_Plant_9530@reddit
I wouldn’t delay this beyond what’s necessary or else the drive might fail unexpectedly and force a rebuild on you anyway.
If you have backups, I would just initiate a rebuild/pull the drive.
If you don’t, get backups sorted ASAP, then do it.
I’d also be ordering a new hot spare.
Secret_Account07@reddit
Ouch
So how do you even console in or reach mgmt console?
Snot-p@reddit (OP)
Luckily whoever set up this box beforehand correctly placed the hypervisor on a separate virtual disk not belonging to the array.
The array that's in trouble is slated just for the Hyper-V stuff. So worst case I just RDP back in and rebuild off of the Azure storage account backups. Although one of the VMs is a PDC... so that'd be a headache, as we all know restoring a DC from a backup never ever goes well. That'd mean rebuilding a DC this weekend, which I'm really not trying to do lol.
Otherwise I still have the basic iDRAC functionality even without a license.
JoDrRe@reddit
You might be able to use OMSA (OpenManage) to get you some more info or some of the tasks. No license or support contract needed. Dell moved away or is moving away from it for the newer models but it’s still viable for old ones.
cntry2001@reddit
Do full on-site backups of the server
Make sure the PERC firmware is updated
Reboot the server
Blink the drive with errors to make sure you don't yank the wrong drive
Yank bad drive
Put in new drive
Have beer
saltysomadmin@reddit
Time to setup an on-site backup before you pull the drive.
No_Yesterday_3260@reddit
As long as you don't pull the wrong one, put it in, pull it out, put it in, then also pull another one out and put it in.
Had a colleague at my previous job who apparently panicked when having to do a simple NAS drive switchout :P
Barrerayy@reddit
Just swap the drive with a new one and let the array resilver?
StiffAssedBrit@reddit
Is the DC VM small enough to move to a volume that is not on the failing array? I would move as much as possible off that volume until the failed disk is replaced.
unethicalposter@reddit
Every server I support is redundant. You pull and replace and walk away. And friends don't let friends use RAID5... I know you said it's 10 years old, but still.
SalzigHund@reddit
If you have backups and don’t have a bunch of drives, RAID 5 is perfectly fine.
unethicalposter@reddit
You're not wrong, if the drives are small, but raid6 or 10 are really the only ways to go these days
wastewater-IT@reddit
I prefer to force the disk offline via the iDRAC CLI then replace it, not sure if it's any different than just pulling the drive but it makes me feel better: https://www.dell.com/support/kbdoc/en-us/000202557/kb-how-to-take-physical-disk-offline-using-idrac-racadm?msockid=279e9e4c983363a41cc98859997f6228
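For reference, the flow in that KB is roughly the following. This is a hedged sketch only - the subcommand names vary by iDRAC firmware version and the FQDDs are placeholders, so check them against the linked KB and your own `get pdisks` output before running anything:

```shell
# Confirm which disk is the failing one and note its FQDD
racadm storage get pdisks -o -p State

# Force the failing disk offline (placeholder FQDD), then apply the
# queued storage operation with a realtime job on the controller
racadm storage forceofflinepd:Disk.Bay.2:Enclosure.Internal.0-1:RAID.Integrated.1-1
racadm jobqueue create RAID.Integrated.1-1 --realtime
```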
flecom@reddit
This is the proper way, prepare for removal then remove
I've of course never done this and just pull the drive
Master-IT-All@reddit
Yes, it would be better to softly tell it to go down rather than ripping it out. Thanks for this, I couldn't remember how.
CountyMorgue@reddit
Please do a full backup and test restore just in case the rebuild fails.
TechMonkey13@reddit
Yank the old bitch out, shove the new bitch in.
fulafisken@reddit
If you can use the remote console on the iDRAC, maybe you could reboot, enter the PERC options from there, and soft-fail the bad drive. It's better to rebuild the array before removing the drive that hasn't yet failed. It could save you if another drive fails during the rebuild.
Agromahdi123@reddit
On Dell machines, anything orange can be pulled while running and anything blue needs a shutdown. iDRAC licenses (even old ones) can be bought on eBay, and for really old ones you can find the license file or start a trial. Either way, the RAID controller would be accessed from the BIOS.
Snot-p@reddit (OP)
Thanks all. I'm gonna just bite the bullet and yank and replace. I do hear people's concern about the risk of failure during the rebuild being rather high due to age... I'm sketched out too. But it's a rock and a hard place. Regardless, the array is going to fall into a degraded state at some point, so I might as well rip the band-aid off now.
My backups are confirmed good for a critical SQL server - but otherwise it hosts my PDC and Entra Connect VM's which if I lose those...I'm gonna have a long weekend because that'll mean just having to rebuild a DC. Praying for a good outcome.
Thanks again for the help
dinominant@reddit
Verify and TEST your backups. A RAID5 with a drive that's faulty from age will likely lose a 2nd drive during recovery/rebuild and go offline.
Recovering a 2-disk raid5 failure is doable if all drives are working and the bad sectors are distributed randomly. But hardware raid controllers will refuse to help with that.
nitroman89@reddit
I had a Dell R720 and R730 with perc controllers in my home lab. I was running esxi on them at the time, if I remember right there's a utility you can use to interact with the raid otherwise boot it up into the web bios or whatever it's called.
I think the utility is called perc-cli which should be a rebranded version of storcli from LSI.
Like the one comment says, you should be able to swap it and the rebuild will start.
Thomas w. Something has a website with a bunch of commands like "storcli /call show".
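A few of the status commands from that cheat sheet, assuming controller 0 (the slot and enclosure IDs are placeholders - adjust to whatever `/call show` reports):

```shell
storcli64 /call show                  # summary of all controllers, VDs, PDs
storcli64 /c0/vall show               # virtual disk state (Optl/Dgrd/Rbld)
storcli64 /c0/eall/sall show          # state of every physical drive
storcli64 /c0/e32/s2 show rebuild     # rebuild progress on one slot
```

On a Dell PERC the same commands should work with the perccli64 binary in place of storcli64.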
countsachot@reddit
Pop latch, Yank drive, coffee time. Check status at refill time.
Hot Spares rock.
Master-IT-All@reddit
The hot spare isn't there for you to use when doing maintenance to replace a failing drive. It's there as part of the array in case a drive fails suddenly without warning. Only then does it come online as one of the array's data disks.
To use the hot spare that way, you'd need to break it off the array and remove it from the cage, then remove the failing drive and replace it with the hot spare.
Then initiate the rebuild.
You can't rebuild an array until you remove the drive. Maybe there's a software interface that can do all of that virtually and mark the drive as failed, but the fastest way is to just pull the drive, put in a new one, and let it go.
Kind_Ability3218@reddit
lol you dont.
Baselet@reddit
Why are you here?
loosebolts@reddit
I've only ever pulled the disk and replaced it. The array and controller are designed to deal with that exact situation.
Puzzled-Formal-7957@reddit
Pull and replace and get another spare on hand immediately... cause you're going to have more failures soon.
ntrlsur@reddit
Best case: shut down the server, pull the bad drive, put in the replacement drive, and it will rebuild. What I typically do is just pull the drive: I unlatch the connector and slide it partially out, and when it finishes spinning down I pull it completely. Then I insert a new drive in another slot and make it the new global hot spare. If you replace the drive in the same slot, it will want to rebuild again. I find it easier on the drives to just rebuild once.
qkdsm7@reddit
Global hot spare --- Is it specifically assigned to ANOTHER array, and that's why it didn't already take over?
As others posted, confirm backups, verify backups, triple check backups----- and fail it over.
Complete-Mission-636@reddit
Yank and put in new.
SmartDrv@reddit
Make sure you have good backups first before you do anything. Rebuild is stressful on the remaining drives and it is always possible that you lose another drive part way through. Or if things like raid scrubbing aren’t properly configured, you may find your healthy volume isn’t so healthy during the rebuild causing it to fail. See Linus Tech Tips for examples of raid gone wrong lol
Rex_Bossman@reddit
I'm in the same boat on an R740. Waiting on a replacement drive from Dell that was supposed to ship next-day on Wednesday. I'm just going to wait until end of day and swap them out; I figured that's why they are hot swap drives right? Fingers crossed.
compu85@reddit
My only beef with pulling the drive is that it will begin to spool up the hot spare, then you have to wait for that to finish rebuilding before it will move the new disk into the array. In the past I've called Dell support with questions like this, and even for out of warranty servers they were excellent and offered step by step direction on stabilizing the array.
Rio__Grande@reddit
If you pull the bad drive, the global hot spare should take its spot immediately. On older servers like the 13th gen, I would sometimes need to manually start a rebuild.
We used that family of PERC extensively in the 14th gen. Never used SSDs for our arrays, only SAS HDDs. I don't think that makes much difference though.
I imagine it shouldn't take long to rebuild that array