VMWare ESXi Data Recovery Case Study: 4-Drive RAID-5 Array

In this case, our client had four Western Digital WD6000BKHG-18A29V0 hard drives arranged in a RAID-5 array. On that array, they had stored several virtual machines. Data recovery from a virtual hard disk or virtual machine can be considered a data recovery case within a data recovery case. This case was a RAID-5 VMWare ESXi data recovery turducken.

The client’s RAID-5 array had been set up with four hard drives. After one drive failed, it was replaced by another hard drive. The RAID controller successfully integrated the new drive into the array. But soon after the drive was replaced, the new drive failed.

This RAID-5 array’s owner was unfortunate enough that another of the 600 gigabyte hard drives failed almost immediately after that. Regardless of how many hard drives you have in a RAID-5 array, it can only tolerate a single drive failure at a time. A second failure before the array has finished rebuilding will cause the entire array to stop functioning.

The owner of this RAID-5 array first had it brought to a local computer repair shop. Upon receiving it, the computer repair technician sent the array to one of our data recovery competitors. Our competitor charged several thousand dollars, recovered a portion of the user’s data, then demanded an even larger sum of money on top of that to recover the rest of the data. This was unacceptable to the end user.

VMWare ESXi Data Recovery Case Study: Multiple Drive Failures in 4-Drive Western Digital WD6000BKHG RAID Array
RAID Level: 5
Total RAID Capacity: 1.8 TB
Operating System: Windows
Situation: Two hard drives in RAID array failed in quick succession
Data Recovered: Contents of VMWare ESXi VMDK files
Binary Read: 99.9%
Gillware Data Recovery Case Rating: 9

RAID-5: A Brief Overview

A RAID array is a system of interconnected hard drives which behave as a single storage space. There are many different ways to connect the drives in a RAID array. Some levels provide increased storage capacity and read/write efficiency. Some provide a limited amount of fault tolerance. Some provide a little of both. One such level of RAID that offers both increased storage capacity and fault tolerance is RAID-5.

RAID-5 is a rather popular type of RAID array. It’s one of the levels our RAID data recovery engineers see most frequently in our lab. Like other forms of RAID arrays that offer increased storage capacity, it breaks the data written to it into “stripes” and spreads these stripes across the disks in the array. The defining feature of RAID-5, though, is that it also creates stripes of “parity” data.

This parity data, in total, uses up the capacity equivalent to one drive in the array. The data is spread across all of the drives in the array. It is not a 1:1 backup of all of the data written to the array (which could not be fit onto a single drive). Rather, it is a special sort of data that the RAID controller can use to recreate any missing data by way of XOR logic.
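To make the XOR idea concrete, here is a minimal Python sketch. The block contents and the `xor_blocks` helper are made up for illustration, and a real RAID controller works on much larger blocks in hardware or firmware. The principle is the same, though: parity is the XOR of the data blocks in a row, and XORing the surviving blocks with the parity gives back whichever single block is missing.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks that would sit on three different drives (made-up contents).
data = [b"ESXi", b"VMDK", b"RAID"]

# The parity block is the XOR of all the data blocks in the row.
parity = xor_blocks(data)

# Pretend the drive holding data[1] has failed.
# XOR the surviving data blocks with the parity block to get it back.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'VMDK'
```

Note that this only works when exactly one block in the row is missing. With two blocks gone, a single parity block does not carry enough information to separate them.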

Because of these parity stripes, a RAID-5 array can keep functioning if one of the hard drives in the array fails. To briefly illustrate how parity works, imagine a simple, hypothetical three-drive RAID-5 array:

RAID-5 Parity Chart

Each horizontal row in this chart makes up a single rotation through all of the disks in the array. At any point in this rotation, one of the drives in the array acts as the parity drive. Which drive contains the parity data changes during each rotation.

If Drive 0 fails, you lose stripes 1 and 3, but the parity stripes on Drive 1 and Drive 2, combined with the surviving data stripes, reconstruct the data that had been on Drive 0. Likewise, if Drive 1 fails, you lose stripes 2 and 5, but you have the parity data on Drives 0 and 2 to fill in the blanks. If Drive 0 and Drive 2 both fail, though, there isn’t enough information left on Drive 1 to recreate the missing data on both of the failed drives.
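The rotation described above can be sketched in a few lines of Python. The layout rule used here (parity starting on the last drive and moving one drive to the left each row) is an assumption chosen because it matches the stripe numbers in the example. Real controllers use several different rotation schemes, and the function names are hypothetical.

```python
def raid5_layout(num_drives, num_rows):
    """Return rows of cells: an int for a data stripe, 'P' for a parity stripe."""
    rows, next_stripe = [], 1
    for row in range(num_rows):
        # Assumed rotation: parity starts on the last drive and moves left.
        parity_drive = (num_drives - 1 - row) % num_drives
        cells = []
        for drive in range(num_drives):
            if drive == parity_drive:
                cells.append("P")
            else:
                cells.append(next_stripe)
                next_stripe += 1
        rows.append(cells)
    return rows

def lost_stripes(layout, failed_drive):
    """Data stripes that lived on the failed drive."""
    return [row[failed_drive] for row in layout if row[failed_drive] != "P"]

def recoverable(failed_drives):
    """Every row loses one cell per failed drive, and XOR can only
    fill in one missing cell per row."""
    return len(set(failed_drives)) <= 1

layout = raid5_layout(num_drives=3, num_rows=3)
for row in layout:
    print(row)

print(lost_stripes(layout, 0))  # [1, 3]
print(lost_stripes(layout, 1))  # [2, 5]
print(recoverable([0]))         # True
print(recoverable([0, 2]))      # False
```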

The first step in any RAID data recovery case is for our cleanroom engineers to assess the conditions of the failed hard drives. We make repairs as necessary and try to image the failed drives as close to 100% as possible. In this case, our engineers were able to get a 99.9% binary read on the failed drives. The next step is for our logical RAID data recovery engineers to piece together the array in the proper order. When the RAID puzzle has been solved, the data contained within the array can be recovered just as in any other data recovery case.

But this data recovery case wasn’t going to be like any other. The most critical data the client was looking to recover from the failed RAID-5 array was a collection of virtual machines.

Virtualization: A Brief Overview

Virtualization is a method of making an “image” of an entire data storage device and containing it in a single file. For example, a physical CD can be made into an ISO file and used on a computer in lieu of the physical disc.

Entire hard drives can be virtualized and contained in hard disk drive image files. These virtual hard drives can hold everything a real physical computer would, complete with an operating system. When they are run as if they were actual computers, they are known as “virtual machines”. VMware virtual machines keep their virtual hard drives in files with the VMDK file type.

Unlike simpler virtual hard disks, virtual machines must be created and run on the “host” machine by a hypervisor. In this data recovery case, the user of the RAID-5 array had used the ESXi bare-metal hypervisor. The ESXi hypervisor created and ran the virtual machines, which were stored on the RAID array.

One of the useful features of virtual machines is the ability to take a “snapshot” of the virtual machine at any point in time and keep it on hand. In case of any risky operations going poorly, the user can go back to one of their snapshots of the virtual machine and have everything in working order again relatively quickly.

When a snapshot of a virtual machine is created, the hypervisor creates a “delta” file. Any subsequent changes to the virtual machine are stored in the delta file. The base VMDK file for the machine is unchanged. If any of those changes negatively impact the machine, the delta file can simply be discarded, and the virtual machine returns to the state it was in when the snapshot was taken.
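A toy model may help picture how a base image and its delta files fit together. In the Python sketch below, the base disk is a list of blocks and each delta is a dictionary holding only the blocks that changed after a snapshot. This is purely an illustration: real VMware delta files are sparse disk files with their own on-disk structures, and the names `read_block` and `flatten` are hypothetical.

```python
def read_block(base, deltas, index):
    """Read one block through a snapshot chain, checking the newest delta first."""
    for delta in reversed(deltas):
        if index in delta:
            return delta[index]
    return base[index]

def flatten(base, deltas):
    """Merge every delta into a copy of the base image, oldest delta first."""
    merged = list(base)
    for delta in deltas:
        for index, block in delta.items():
            merged[index] = block
    return merged

base = [b"boot", b"data", b"logs"]      # untouched base image
delta1 = {1: b"DATA"}                   # changes made after the first snapshot
delta2 = {1: b"Data", 2: b"LOGS"}       # changes made after the second snapshot

print(read_block(base, [delta1, delta2], 1))  # b'Data'
print(read_block(base, [delta1], 2))          # b'logs' (second delta discarded)
print(flatten(base, [delta1, delta2]))        # [b'boot', b'Data', b'LOGS']
```

The base list is never modified, which mirrors the point above: discarding a delta is enough to go back to an earlier state.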

Virtual machines are extremely useful tools for IT departments, but are also fairly complex. VMware ESXi data recovery can be a very complicated procedure. After the physical device containing the virtual environment has been worked on, our engineers must take the virtual hard drive and work on it separately.

The RAID-5 VMWare ESXi Data Recovery Process

Greg Andrzejewski, our Director of Research and Development, handled both the reconstruction of the RAID-5 array and the VMWare ESXi data recovery process. Figuring out the RAID array’s particular geometry is quite a puzzle. Our RAID data recovery engineers must comb through the metadata on each drive in order to develop an idea of where its parity data lives and in which order its stripes are arranged.
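One general property of RAID-5 that can help with this kind of puzzle: in any row, the data stripes and the parity stripe XOR together to all zeros, no matter which drive holds the parity. The Python sketch below uses that property as a sanity check on a set of drive images. It is an illustration only, not a description of the tools or method used in this case, and the tiny in-memory “images” and the function name are made up.

```python
def parity_consistent_fraction(images, row_size):
    """Fraction of rows in which all member images XOR to zero."""
    length = min(len(image) for image in images)
    total_rows = length // row_size
    if total_rows == 0:
        return 0.0
    clean_rows = 0
    for row in range(total_rows):
        start = row * row_size
        acc = bytearray(row_size)
        for image in images:
            for i, byte in enumerate(image[start:start + row_size]):
                acc[i] ^= byte
        if not any(acc):
            clean_rows += 1
    return clean_rows / total_rows

# Tiny stand-ins for drive images (hypothetical contents).
drive_a = bytes(range(1, 17))
drive_b = bytes(range(17, 33))
drive_c = bytes(a ^ b for a, b in zip(drive_a, drive_b))  # in sync with a and b
stale = bytes([0xFF] * 16)                                # out-of-date member

print(parity_consistent_fraction([drive_a, drive_b, drive_c], row_size=4))  # 1.0
print(parity_consistent_fraction([drive_a, drive_b, stale], row_size=4))    # 0.0
```

A set of images that belong together scores close to 1.0, while a set containing a stale or foreign drive scores much lower. Regions that are blank on every drive pass trivially, so a check like this is a hint rather than proof.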

Virtual machine snapshots
A screenshot of some of the user’s virtual machines and their snapshots. Greg’s VMWare ESXi data recovery process involved taking these snapshots and flattening them.

In this case, Greg was able to use the metadata from all five of the drives that had been part of the RAID array, including the first drive that had failed and been replaced, to rebuild the array with the best possible results. This is why it is important for clients with failed RAID arrays to send us every drive that has ever been connected to the array for data recovery, and not just the ones that were connected to it at the time of its failure.

After putting the RAID-5 array back together, the next step for Greg was to recover the data from the user’s virtual machines. This user had created three virtual hard disks. One was a boot drive, containing the operating system.

The other two had been combined into a single volume using the Windows Logical Disk Manager. Windows Logical Disk Manager can stick two or more hard drive partitions together end-to-end. This creates a single spanned volume, which can be expanded as needed by adding more hard drives. Or, in this case, more virtual hard disks. These latter two virtual hard disk drives contained the important data the client needed recovered.
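Joining partitions end-to-end means that any position in the combined volume maps to exactly one of the underlying disks. A minimal Python sketch of that mapping follows. The extent sizes are made-up numbers, the `locate` function is hypothetical, and the sketch ignores the on-disk database Windows uses to record how the pieces fit together.

```python
def locate(offset, extent_sizes):
    """Map a byte offset in a spanned volume to (extent index, offset inside it)."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    for index, size in enumerate(extent_sizes):
        if offset < size:
            return index, offset
        offset -= size
    raise ValueError("offset is past the end of the volume")

# Two virtual disks joined end-to-end (sizes in bytes, chosen for illustration).
extent_sizes = [500, 300]

print(locate(499, extent_sizes))  # (0, 499)  last byte of the first disk
print(locate(500, extent_sizes))  # (1, 0)    first byte of the second disk
print(locate(799, extent_sizes))  # (1, 299)  last byte of the volume
```

This is also why both virtual disks are needed to recover such a volume: files can start on one disk and continue on the other.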

To recover the data from these virtual machines, Greg had to flatten the delta files into the user’s VMDK disk images. Then he had to put the images onto physical hard disk drives. But first, Greg took the status maps from the RAID array he’d reconstructed and cross-referenced them with the VMDK files. This helped him determine exactly which parts of the virtual machines were unread due to the damage to the failed hard drives in the array. This was so that Greg could be absolutely certain he was giving our client the best possible results for this VMWare ESXi data recovery case.
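Conceptually, this cross-referencing is a matter of intersecting two lists of byte ranges: the ranges of the array that could not be read, and the ranges occupied by a given file. The Python sketch below shows the idea with made-up numbers. The format of a real status map and the name `unread_overlap` are assumptions for illustration, not a description of our tools.

```python
def unread_overlap(file_extents, unread_ranges):
    """Return the parts of a file's extents that fall inside unread ranges.

    All ranges are half-open (start, end) byte offsets on the rebuilt array.
    """
    hits = []
    for f_start, f_end in file_extents:
        for u_start, u_end in unread_ranges:
            start, end = max(f_start, u_start), min(f_end, u_end)
            if start < end:
                hits.append((start, end))
    return sorted(hits)

# Where one VMDK file lives on the array (hypothetical extents).
vmdk_extents = [(0, 1000), (5000, 6000)]
# Ranges the status map marks as unread (hypothetical).
unread_ranges = [(900, 1100), (5500, 5510), (7000, 7100)]

damaged = unread_overlap(vmdk_extents, unread_ranges)
print(damaged)                                       # [(900, 1000), (5500, 5510)]
print(sum(end - start for start, end in damaged))    # 110
```

Unread ranges that fall outside every file of interest, like the last one here, do not affect the recovered data at all.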

Afterward, Greg was able to go through the virtual machines using our data recovery software HOMBRE, just like any other normal data recovery case. But the road to reach that point was a long and winding one.

VMWare ESXi Data Recovery Results
Some of the contents of the user’s virtual machines after Greg’s VMWare ESXi data recovery efforts, as they appear in HOMBRE.

After all of Greg’s hard work, we were able to rebuild this RAID-5 array and recover the vast majority of the data from the virtual machines contained therein. Both were complicated procedures in their own right. This was all done for far less than our competitor was demanding. We returned the recovered data to the RAID array’s owner in a timely fashion and rated this VMWare ESXi data recovery case a 9 on our ten-point scale.

Here at Gillware, we have incredibly skilled data recovery engineers capable of recovering data from both failed RAID arrays and virtual environments. If you are experiencing a failure of your RAID array, virtual machine, or both, you can trust engineers like Greg to do everything in their power to reunite you with your critical files.

Will Ascenzo

Will is the lead blogger, copywriter, and copy editor for Gillware Data Recovery and Digital Forensics, and a staunch advocate against the abuse of innocent semicolons.
