We encountered a failed logical disk on a Dell Perc SAS controller. After a quick review we discovered that two disks out of the four configured for RAID5 had failed. This event triggered the Perc controller to put the logical disk offline. Now what…
Everyone knows that when using a raid5 distributed partity with 4 disks the maximum redundancy is losing 1 disk. With two failed disks data loss is usually inevitable. SO, if this is also the case with your machine, please realize your chances of recovering are slim. This article will not magically increase the chances you have on recovering. The logic of the Dell Perc SAS controller actually might.
First off, I will not accept any responsibility for damage done by following this article. Its content is intended to offer the troubleshooting engineer an possible solution path. Key knowledge is needed to interpret your situation correctly and with that the applicability of this article.
TIP: Save any data still available to you in a read-only state.
(If you have read only data, this article does not apply to you!)
What do you need?
Obviously you need to have two replacement disks available.
You also need to have a iDRAC (Dell remote access card) or some other means to access the systemlog.
You need to have physical access to the machine (to replace the disks and review the system behavior)
What to do?
Our specific setup: Controller 0 -Logical volume 1, raid5 + Disk 0:2 Online + Disk 0:3 Failed + Disk 0:4 Online + Disk 0:5 Failed
The chance both problematic disks 0:3 and 0:5 failed simultaneously is near to zero. What I mean to say by this is that disks 0:3 and 0:5 will have failed in a specific order. This means that the disk who failed first will have ancient data on it. In order to make an recovery attempt we need actual and not historical data. To this end we first need to identify the disk that failed first. This will be the disk that we will be replacing shortly.
Identifying the order in which the disks failed
Luckily most Dell machines ship with a Dell Remote Access Card (Drac). HP and other vendors have similar solutions. The iDRAC keeps an system log. In this log the iDRAC will report any system events. This also goes for the events triggered by the Perc SAS controler. Enter the iDRAC interface during boot <CTR+?> and review the eventslog. Use the timestamps in this log to identify the first disk that failed. Below an example of the Log output:
In our case, disk 0:5 failed prior to disk 0:3. Be absolutely sure that you identify the correct disk. We want the most current data to be used for a rebuild. If this is for any reason historical data, you will end up with corrupted data. Write the disk number of the disk that failed first on a piece of paper. This is the disk that needs to be replaced with a new one. This could be a stressful situation (for your manager), so be mindful that a stressed manager chasing you gut could confuse you. You do not want to mix the disks up, so keep checking your paper and do not second guess but check if your not sure.
Exit the IDRAC interface and reboot the machine and enter the Perc controller, usually <CTR+?> during boot. Note that the controller also reports the logical volume being offline. If this is the case, enter the Physical Disks (PD) page (CTR+N in our case). Also note here that disks 0:3 and 0:5 are in a failed state. Select disk 0:3 and force this disk online using the operations menu (F2 in our case) and accept the warning. DO NOT SELECT or ALTER THE DISK WITH HISTORICAL DATA (0:5)!!!
Now physically replace disk 0:5 with the spare you have available. If all is well, you should notice that the controller is automatically starting a rebuild (LEDS flashing fanatically). Review your lcd-screen and note that disk 0:5 is now in a rebuilding state. Most controllers let you review the progress. On our controller the progress was hidden on the next page of the disk details in the physical disk (PD) page, which was reachable using the tab key. Wait for the controller to finish. (This can take quite a while, clock the time between % and muliply that with 100 then divide that with 60 to get the idea. Get a coffee or good night sleep).
Once the controller is finished it will in most cases place the replaced disk 0:5 in an OFFLINE state and the forced online disk (0:3) back in FAILED state. Now use the operations menu to force DISK 0:5 (rebuild disk) online and note the logical volume becoming available in a degraded state. Reboot the machine and wait for the OS to boot.
Well the logical volume should be available to the OS. This doesnt mean there is any readable data left on the device. Usually this will become apparent during OS boot. Most operating systems will perform a quick checkdisk during mount. Most errors will be found there. One of two things can happen:
1) Your disk is recovered but unclean and will be cleaned by the OS after which the boot will be successful or…
2) the disk is corrupted beyond the capabilities of a basic scan disk.
In the latter case you might want tot attempt additional repair steps and perform a OS partition recovery. In most cases, if this is your scenario, the chance you will successfully recover the data is very slim.
I hope you, like me, successfully recovered your disk.
(Thanks to the failure imminent detection and precaution functions the Dell Perc controller implement)