
Should I RAID Rebuild? When Rebuilding a RAID Is OK, and When It’s Not OK

In the data recovery business, we hear it all the time. 

“The data volume was missing, so I did a RAID rebuild.”

“The partition would not mount, so I did a RAID rebuild.”

“All my data was corrupted, so I did a RAID rebuild.”

These quotes go to show how little most people understand about what a RAID rebuild is and what it accomplishes.

The RAID card/controller has no idea what your partitions look like, doesn’t care about your photos, and has no idea if your data is corrupted or not. It has one job: take a bunch of disks and present them as one logical volume. It doesn’t look at that volume, inspect that volume, or understand that volume, nor does it want to.

When a RAID has redundancy (like a RAID-5, RAID-6, or RAID-1), the basic philosophy is that the RAID controller will utilize some percentage of each individual disk to supply that redundancy.

There is only one scenario where it makes sense to rebuild a volume:

The logical state of the union is good. The volume mounts, the system boots great, none of the data is corrupted – but one physical member disk is either out of sync or is a new replacement.

If the logical state of the union is bad, a RAID rebuild is incredibly dangerous.

Our data recovery specialists commonly see error conditions where an IT service provider noticed 2 failed drives in a RAID-5. They pick one at random and force it online. Now the volume boots, but lots of the recent data is missing or corrupted. This is because the member they just forced online failed 4 years ago and nobody noticed. So, the operating system can boot because it hasn’t moved or changed substantially in 4 years, but all the data that has been created or updated over that period is missing or corrupted.

Now, the IT service provider doubles down on their mistake and decides to run a RAID rebuild. What a RAID rebuild accomplishes in this scenario is to take the current logical state of the union which is, again, very bad, and then make that bad data redundant by calculating and writing parity data to the other member disk that was offline. That member is the one that failed yesterday and contained the data required to have a healthy volume. The RAID rebuild trashes all that recent data in a quite permanent way.

At the end of the rebuild process, the RAID is redundant, hurray! But the logical state of affairs is horrific, as you now have perhaps 2 good disks with bad parity, a disk that fell offline 4 years ago, and a disk that contains a bunch of rebuilt garbage and trashed XOR. But the RAID controller is quite happy that everything is perfectly synchronous and redundant. Sometimes being redundant isn’t good!
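To make the failure mode concrete, here is a minimal sketch of a single stripe in a 4-drive RAID 5, assuming plain XOR parity and made-up four-byte chunks. The chunk values and variable names are illustrative only and do not come from any real controller. It shows what a rebuild computes when a stale member has been forced online.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe as it looked years ago: three data chunks plus parity.
old_data = [b"AAAA", b"BBBB", b"CCCC"]

# Drive 0 drops offline unnoticed and keeps its old chunk (it is now stale).
stale_chunk = old_data[0]

# The degraded array keeps taking writes. Drives 1 and 2 and the parity
# drive stay current; the new content of chunk 0 exists only via parity.
new_data = [b"aaaa", b"bbbb", b"cccc"]
new_parity = xor_blocks(new_data)

# Yesterday drive 1 fails. The stale drive 0 is forced online and a rebuild
# is run onto drive 1 from drive 0 (stale), drive 2, and parity.
rebuilt_chunk = xor_blocks([stale_chunk, new_data[2], new_parity])

# What drive 1 really held before the rebuild overwrote it.
true_chunk = new_data[1]

# What a recovery could have computed for chunk 0 had drive 1 been left alone.
recoverable_chunk0 = xor_blocks([new_data[1], new_data[2], new_parity])

print("rebuilt matches real data:", rebuilt_chunk == true_chunk)
print("chunk 0 was recoverable:  ", recoverable_chunk0 == new_data[0])
```

The arithmetic is consistent from the controller's point of view, which is why it reports a healthy array, yet the chunk it wrote is not the data that was there.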

The moral to the story: be incredibly careful about auditing backups before you go screwing around with the RAID controller. Restore those backups to a DIFFERENT set of drives, and make sure your backups are complete and current. Having established that the backups are great and complete, now go ahead and start working to get the physical array back online.

Need Help Now?
We Can Help with RAID Rebuild Problems

Talk to an expert about your RAID errors and getting your data back. Get a no-hassle consultation today!

How To Rebuild a RAID 5 Without Losing Data

Ideally, an IT professional will be notified of the degraded state from monitoring software, like Nagios, or from the RAID controller. As discussed previously, the logical state of the union should be good before attempting a rebuild – that is, the unit may be degraded, but there should be full data access and functionality.

Once again, if data is inaccessible or the volume isn’t online, a rebuild will never help. In fact, it will potentially be catastrophic. All a rebuild does is take the logical state of affairs and make it redundant. If the logical state of affairs is bad, that is something we don’t want redundant.

After confirming that everything is working from the end user’s perspective, the IT professional pulls the failed drive and replaces it. A RAID rebuild then restores redundancy to this known-good logical state. Some RAID controllers are configured to rebuild automatically when they detect the new healthy drive. Others require an operator to run proprietary commands on the RAID card, sometimes through a GUI. The RAID controller reads all the information from the surviving drives, runs an XOR parity calculation, and writes the results to the new healthy drive. At the end of this process, the array is no longer degraded and you are back to redundancy, back in position to have another drive go down safely.
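The XOR step of a healthy rebuild can be sketched in a few lines. This is a simplified model, not any controller's actual implementation. The rotating-parity layout in `build_array` and the helper names are assumptions made for illustration, and real arrays use vendor-specific layouts.

```python
def xor_chunks(chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def build_array(data_chunks, n_drives):
    """Lay chunks across n_drives, one rotating parity chunk per stripe.

    Assumes len(data_chunks) is a multiple of n_drives - 1.
    """
    per_stripe = n_drives - 1
    drives = [[] for _ in range(n_drives)]
    for start in range(0, len(data_chunks), per_stripe):
        stripe = data_chunks[start:start + per_stripe]
        parity_drive = (start // per_stripe) % n_drives
        remaining = iter(stripe)
        for d in range(n_drives):
            if d == parity_drive:
                drives[d].append(xor_chunks(stripe))
            else:
                drives[d].append(next(remaining))
    return drives


def rebuild_drive(drives, replaced):
    """Recompute every chunk of one drive from the chunks on all the others."""
    survivors = [drive for i, drive in enumerate(drives) if i != replaced]
    return [xor_chunks(row) for row in zip(*survivors)]


data = [bytes([n]) * 4 for n in range(9)]   # nine tiny 4-byte chunks
drives = build_array(data, n_drives=4)      # three stripes on four drives
new_drive = rebuild_drive(drives, replaced=2)
print(new_drive == drives[2])               # prints True
```

Note that the rebuild never looks at what the chunks mean. It reproduces whatever is consistent with the other members, which is only useful when those members are current.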

Can You Recover RAID 5 Data from a Single Drive?

With only one drive of a RAID 5, you will not recover any significant amount of data. This is because most important files are significantly bigger than the stripe size of the array. A four-megabyte picture will be broken up into 64 pieces of 64 kilobytes each, and in a 4-drive RAID 5 array, each drive will contain roughly 25% of those pieces. We need to analyze and copy each drive so we can determine the overall geometry of the array and which drive set is optimal, but we must have all pieces of the puzzle.
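A rough sketch of that arithmetic follows. It assumes the file is stored contiguously, starts on a stripe boundary, and that parity rotates one drive per stripe. All three are simplifying assumptions, and `data_chunks_per_drive` is a hypothetical helper.

```python
FILE_SIZE = 4 * 1024 * 1024   # four-megabyte picture
CHUNK_SIZE = 64 * 1024        # 64-kilobyte stripe unit
N_DRIVES = 4


def data_chunks_per_drive(n_chunks, n_drives):
    """Count how many of a file's data chunks land on each drive."""
    counts = [0] * n_drives
    per_stripe = n_drives - 1                  # one chunk per stripe is parity
    for chunk in range(n_chunks):
        stripe = chunk // per_stripe
        parity_drive = stripe % n_drives       # assumed rotation
        data_drives = [d for d in range(n_drives) if d != parity_drive]
        counts[data_drives[chunk % per_stripe]] += 1
    return counts


n_chunks = FILE_SIZE // CHUNK_SIZE
counts = data_chunks_per_drive(n_chunks, N_DRIVES)
print("chunks in file:", n_chunks)
for drive, count in enumerate(counts):
    print(f"drive {drive}: {count} chunks ({count / n_chunks:.0%} of the file)")
```

Any single drive holds only scattered 64-kilobyte fragments of the picture, never a contiguous run long enough to be useful.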

If RAID-5 Is So Great, Then Why Does Gillware See So Many of Them Come into Its Data Recovery Lab?

RAID 5 offers, for a small reduction in overall storage, one disk worth of data redundancy. RAID 5 arrays will need at least 3 drives but will commonly consist of up to 8. The main idea is that if one drive fails, the RAID controller will remove it from the group but there is no immediate impact to the data or the business. In this degraded state, you can still boot, access data, and write new data. It sounds like a well-maintained RAID 5 array should never crash.
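The storage trade-off is simple arithmetic: one drive's worth of capacity goes to parity, whatever the array size. A small sketch, with hypothetical function names and drive sizes chosen only for illustration:

```python
def raid5_usable_capacity(n_drives, drive_size_tb):
    """Usable capacity of a RAID 5 built from equally sized drives."""
    if n_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (n_drives - 1) * drive_size_tb


def raid5_parity_share(n_drives):
    """Fraction of raw capacity spent on parity."""
    if n_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return 1 / n_drives


for n in (3, 4, 8):
    usable = raid5_usable_capacity(n, drive_size_tb=2)
    share = raid5_parity_share(n)
    print(f"{n} x 2 TB drives: {usable} TB usable, {share:.1%} spent on parity")
```

The parity share shrinks as drives are added, but the array still tolerates only one failed member.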

But even well-looked-after RAID arrays can need a trip to a data recovery lab – sometimes even RAID recovery software technology can’t cut it. Here are a few of the reasons why:

1. RAID 5 Is Not a Logical Backup

Even with the redundancy provided by RAID 5, there is no second copy of your data anywhere unless you fastidiously back up the data using a data backup service. If a human makes a mistake and deletes a bunch of data, RAID 5 cannot help you. If a malicious human or process infects the contents of a RAID 5 with ransomware, RAID 5 cannot help you. If you fail to make this distinction between RAID and backup, you can get burned by all manner of logical problems and end up calling a professional RAID data recovery lab.

2. Two Drives Fail in a RAID 5

Multiple drive failure is the most common cause for needing a RAID 5 data recovery. We are commonly called on to recover data from a RAID 5 configuration with 2 failed drives. Sometimes two or more drives will fail simultaneously because of sudden power loss or a power surge. More often one of the drives has failed historically and we regard its data as stale. Depending on how long a drive has been stale it will have limited usefulness in a RAID 5 data recovery effort.

Regardless of how we got here, with only n-2 drives in the volume, the missing pieces cannot be recalculated from XOR parity, and you are left with only what sits on the surviving drives. In a three- or four-drive array, that is 50% or less of the RAID 5 volume’s binary.

The volume’s binary content is striped across the drives and the stripe size is typically measured in kilobytes. So, you don’t have access to 50% of your files, you only have access to 50% of the binary of each file.

Worse, you’ll only have access to 50% of the file definitions. And this is only if you can even get the RAID controller to serve up this partial RAID–which is a terrible idea because file system consistency checkers will run around destroying data that seems to them to be corrupted but is simply incomplete.
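The 50% figure for a four-drive array can be checked with a short count. This sketch assumes parity rotates one drive per stripe and counts only data chunks that can be read directly. With two members missing, every stripe lacks two chunks, so parity cannot fill in either one. The function name and layout are illustrative assumptions.

```python
def readable_data_fraction(n_drives, missing):
    """Share of data chunks still readable with the given drives missing.

    Counts one full parity rotation (n_drives stripes). Intended for two
    or more missing drives, where parity reconstruction is impossible.
    """
    readable = total = 0
    for stripe in range(n_drives):
        parity_drive = stripe % n_drives       # assumed rotation
        for drive in range(n_drives):
            if drive == parity_drive:
                continue                       # parity chunk, not file data
            total += 1
            if drive not in missing:
                readable += 1
    return readable / total


print(readable_data_fraction(4, missing={0, 1}))   # four-drive array
print(readable_data_fraction(3, missing={0, 1}))   # three-drive array
```

Under this layout the share works out to (n-2)/n. The readable chunks are interleaved with the missing ones at stripe-size granularity, which is why the loss hits every large file rather than half of the files.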

Want your RAID data back? Get in touch with Gillware Data Recovery Services today!

Gillware Data Recovery