In a separate article, I shared some observations about how to set up a RAID 5 to make failure less likely. These observations were based on thousands of successful data recovery cases. But if your RAID has already failed, here are 12 RAID 5 failure tips to help you out.
When a RAID device is inaccessible, it is common for IT professionals to feel somewhat responsible. Sometimes a client is screaming at the top of their lungs at them. Their whole business has ground to a halt without this array. And they are losing thousands of dollars for every hour of downtime. Worst case scenarios – such as losing a client or facing litigation – can creep into the IT professional’s psyche. The urge to get the RAID up and running quickly can be overwhelming. It’s important to try to relax and to avoid taking any action without fully understanding its consequences.
The RAID card, assuming it isn’t smoked, likely knows a lot more about the situation than you do. If you try to initiate a process and the card says it can cause catastrophic data loss, believe it and don’t do it.
All RAID card manufacturers produce high quality manuals. They explain the inner workings of the card. They give useful tips on configurations and troubleshooting. They give insight into the sometimes arcane and non-descriptive error messages and warnings. And they typically take 30-60 minutes to read. If you didn’t read the manual when you installed the card, read the thing! If someone threw it in the trash, you can probably find it in ten seconds using your favorite search engine.
While outside the scope of this document, our guide to RAID 5 data recovery demonstrates how RAID 5 works and how your data is stored on a RAID 5 array.
A rebuild doesn’t repair anything in the file system or make any data accessible that previously wasn’t. Any data that’s missing will not magically appear after a rebuild. It doesn’t fix any corrupt files or partitions. It won’t make your server boot if it wasn’t booting in the first place. A good rule of thumb is to never initiate a rebuild unless all your data is currently accessible and 100% functional. Another good rule of thumb is that unless you’ve physically replaced a drive, there probably isn’t much point in doing a rebuild in the first place. Presumably, the RAID card took a drive offline because it was troubled. If you force it online and make it the rebuild target, there’s a decent chance the RAID card is going to kick it offline again soon anyway.
Unless the array is accessible and all of the important, recently updated data is proven valid, never run any RAID rebuilds. A RAID 5 rebuild simply takes the current state of affairs on a degraded array and restores redundancy. It does this by doing XOR calculations on the degraded set and writing the calculated values onto the new, healthy drive. If the current state of affairs is that the array is not mounting, a rebuild may actually render that state permanent. The array will no longer be degraded, but the newly redundant array will be full of corrupted garbage.
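To make the XOR point concrete, here is a minimal sketch in Python of a single stripe on a 3 drive RAID 5. The chunk contents and the `xor_blocks` helper are made up for illustration; a real controller works on much larger chunks and rotates parity between drives. It shows that a rebuild faithfully recovers a missing chunk when the surviving data is good, and just as faithfully produces a consistent but wrong result when the surviving data is already damaged.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a hypothetical 3 drive RAID 5: two data chunks plus parity.
d0 = b"INVOICES"
d1 = b"PAYROLL!"
parity = xor_blocks(d0, d1)

# Healthy case: the drive holding d1 is replaced, and the survivors
# (d0 and parity) reproduce the lost chunk exactly.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1 == d1)

# Damaged case: d0 was already wrong before the rebuild started
# (zeroed here purely as an example of bad data).
bad_d0 = b"\x00" * 8
garbage_d1 = xor_blocks(bad_d0, parity)
print(garbage_d1 == d1)

# The stripe is now fully redundant again (XOR of all members is zero),
# yet the "rebuilt" chunk is not the original data.
print(xor_blocks(bad_d0, garbage_d1, parity) == bytes(8))
```

The last check is the uncomfortable part: from the controller's point of view the array is healthy again, even though what it rebuilt is garbage.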
I can’t tell you how many times we’ve had clients notice that two hard drives in a RAID 5 had failed and simply replace both drives (annihilating the previous volume) because they knew they had a solid recent backup. After the annihilation they restore hundreds of GB of data from the backup onto the new array. Then they realize that the backup was corrupted, incomplete, or many months old. This scenario is easily avoided by testing your backup on a storage array that has nothing to do with the hard drives inside the original failed array. Don’t make a rushed decision to restore to the only available working drives. Instead, explain to the client that your game plan is to source a new array, test all the backups, and then deal with the dead array.
While it may seem like common sense to many, I’ve seen many scenarios where we call a client mid-recovery to ask where the missing drive is. They inform us that the drive was dead, not even detecting in the controller, so they sent it back to the manufacturer for a warranty replacement. They assume we shouldn’t need it, because it’s a RAID 5 and we only need n-1 drives. Then we let them know that one of the drives they sent us was actually taken offline by the array many months ago, and that the drive they returned was the one that died most recently, because their array had been running degraded for months. The process of retrieving a drive that has been returned to a manufacturer is horrible and usually fruitless.
When a RAID device has failed, a common response from the manufacturer is to send a replacement RAID card, often at a large expense to the consumer. But if the drives are detected, the RAID card is probably OK. In other words, if it’s telling you stuff, it’s probably fine. If you can’t get the drives to detect, it’s possible the RAID card or the motherboard has issues.
We’ve seen many scenarios where an IT professional has yanked a hot-spare to use in a new storage array, fully confident that it never engaged and is blank. Again, verify your backups are current and consistent on another volume completely unrelated to the failed array before utilizing any of the failed array’s drives, including hot-spares.
Until a backup is verified, I’d say to never force an offline drive online. The array likely took it offline for a reason: it was failing! Unless you know exactly when it was removed, and know for a fact that zero critical files were updated after that point, it’s just a bad idea. If a drive failed many days or months ago, any data of relevant size written since that “stale epoch” will appear “corrupted.” The newly updated data won’t actually be “corrupted”; a more appropriate term would be “incomplete.”
Say you have a 3 drive array and the stripe size is 64 KB. Now, you force a drive that failed months ago online. Any file bigger than 192 KB is virtually guaranteed to have stripes of its binary run list residing across all three drives. Any file bigger than 192 KB that has been created or updated subsequent to the initial drive failure is guaranteed to be full of “holes” and essentially useless. There would be a 1/3 chance that the actual file definitions of any file created or updated since the failure would be corrupted or missing.
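The arithmetic behind those “holes” can be sketched in a few lines of Python. This is only an illustration: it assumes a left-asymmetric parity rotation, a file stored contiguously starting at a chunk boundary, and drive 1 as the stale member. The names `data_drive` and `drives_touched` are hypothetical, and your controller's layout may well differ.

```python
CHUNK = 64 * 1024   # 64 KB stripe size, as in the example above
N_DRIVES = 3
STALE_DRIVE = 1     # assumption: the drive that dropped out months ago


def data_drive(chunk_index: int, n_drives: int = N_DRIVES) -> int:
    """Drive holding a given data chunk, assuming left-asymmetric rotation."""
    stripe, pos = divmod(chunk_index, n_drives - 1)
    parity_drive = (n_drives - 1) - (stripe % n_drives)
    # Data chunks occupy the non-parity drives in ascending order.
    members = [d for d in range(n_drives) if d != parity_drive]
    return members[pos]


def drives_touched(offset: int, length: int) -> list[int]:
    """Member drive for each data chunk covered by a byte range."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    return [data_drive(c) for c in range(first, last + 1)]


# A 1 MB file written after the failure, starting at the beginning of the volume.
layout = drives_touched(0, 1024 * 1024)
stale_chunks = layout.count(STALE_DRIVE)
print(f"{stale_chunks} of {len(layout)} chunks would be read from the stale drive")
```

Under these assumptions, 5 of the file's 16 chunks come from the stale drive, roughly one third, scattered through the file as 64 KB holes of months-old content.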
Often in these situations the operating system will notice these inconsistencies in the file system. It will run a “helpful” check-disk subroutine to “repair” these problems. These were not corruptions to be fixed; they were inconsistencies caused by plugging a stale drive into the array. These “repairs” will permanently destroy valuable current data across all member drives, not just the “stale” one.
If you are not completely certain, the odds of you guessing parity, rotation, stripe or offset configurations correctly are tiny. Guessing incorrectly can be catastrophic. The operating system may notice array or file system “corruption” and start running “repairs,” which will be catastrophic. The file system is indeed corrupted from the operating system’s point of view. The problem is that it only appears corrupted because you have the wrong configuration. After these “repairs” are complete, it will be too late to salvage any of the file definitions that were “repaired.”
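For a rough sense of how tiny those odds are, the following Python sketch simply counts combinations for a 4 drive array. The candidate lists are assumptions chosen for illustration (four commonly named parity layouts and a handful of plausible stripe sizes); real controllers add further unknowns, such as the data start offset, that make the search space even larger.

```python
from itertools import permutations, product

# Assumed candidate values, for illustration only.
LAYOUTS = ["left-symmetric", "left-asymmetric", "right-symmetric", "right-asymmetric"]
STRIPE_SIZES_KB = [16, 32, 64, 128, 256, 512, 1024]


def candidate_configs(drives):
    """Every (drive order, stripe size, parity layout) combination."""
    return list(product(permutations(drives), STRIPE_SIZES_KB, LAYOUTS))


configs = candidate_configs(["A", "B", "C", "D"])
print(len(configs))                                   # 24 orders * 7 sizes * 4 layouts = 672
print(f"chance of a blind guess being right: 1 in {len(configs)}")
```

Only one of those combinations is correct, and every wrong one presents the operating system with something that looks like a corrupted file system.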
It is alarming the number of folks we talk to who have removed all the individual members of an array and plugged them into USB chassis to run data recovery software to try to recover data. Not only is this a waste of time, but it could be highly destructive as well. The operating system has no concept that it is looking at a portion of a RAID. It may automatically “fix corruptions” in the partition table, indexes, or master file table. There’s a high probability the drive will show up as unallocated or available space. Some misinformed IT staff may actually “initialize” the independent drive with a new volume in order to “access its data.”
These drives weren’t corrupted in the first place, so “fixing” the “corruptions” will typically lead to massive data loss. Running off-the-shelf data recovery software on 1/3 of a 3 drive RAID 5 will yield 1/3 of the file definitions. None of the run-list entries will be correct (file definitions only make sense in the context of the full partition). And the only data yielded will be extremely tiny file definitions where the data was “resident” to the file-definition itself (tiny ini files or log files).
Don’t panic. Approach these situations with full knowledge of how RAID 5 works. If the RAID configuration utility warns you that you are about to destroy all the data with a particular action, don’t do it. You should read and understand the RAID manufacturer’s manual before doing anything. You should only rebuild to a newly added drive if the volume is currently peachy but running degraded. Don’t re-use any of the drives in the failed volume until verifying your backups on a different set of hardware. If more than one drive is offline, remove the drives from the array and contact a data recovery professional to assist you.