VMware ESXi VMFS Recovery Case Study: RAID-5 Failure

In this data recovery case study, the client had a failed RAID-5 array in their server. The array consisted of four enterprise-grade Seagate Constellation hard drives, providing 12 terabytes of usable storage space. The client could still mount their RAID array, but it would become inaccessible almost immediately afterward. The array used the VMFS filesystem to hold the client’s ESXi 5.0 virtual machine datastores, and none of the data in these datastores could be retrieved. The client brought the failed RAID-5 array to Gillware Data Recovery for our VMFS recovery services.

VMFS Recovery Case Study: RAID-5 Failure
Drive Model: Seagate Constellation ES.3 SED ST4000NM0043
Total Capacity: 12 TB
File System: VMFS (for VMware ESXi datastores)
Situation: RAID became inaccessible shortly after mounting
Type of Data Recovered: ESXi 5.0 datastores
Binary Read: 99.9%
Gillware Data Recovery Case Rating: 9

How a RAID-5 Fails

RAID-5 is a type of RAID with one hard drive’s worth of fault tolerance. This means that no matter how many hard drives make up the array, you won’t lose any data if one hard drive fails. RAID-5 manages its fault tolerance by taking advantage of XOR calculations. By creating special XOR parity data whenever you write data to the array, a RAID-5 array can use that data to fill in the blanks if any single hard drive goes missing.

For example, if you have a four-drive RAID-5 array, as this client did, one out of every four blocks of data will hold this parity data, while the other three hold the blocks filled with the data you create. If one of the three drives containing your original data goes belly-up, the RAID controller performs parity calculations on the remaining two data blocks and the parity block. With these calculations, the controller can reconstruct the data that used to be on the now-absent block. XOR parity is a smart way to pack data redundancy into as small a space as possible. No matter how many drives you have in a RAID-5 array, you only need one out of every n blocks (where n is the number of drives in the array) to contain the parity data.
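The parity calculation described above can be sketched in a few lines of Python. This is a minimal illustration for a single stripe, not the controller's actual implementation. The helper name xor_blocks and the four-byte block contents are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a four-drive array: three data blocks plus one parity block.
# The block contents are arbitrary sample bytes (an assumption for this sketch).
data = [b"VMFS", b"ESXi", b"RAID"]
parity = xor_blocks(data)

# Simulate losing the drive that held the second data block.
survivors = [data[0], data[2], parity]

# XOR-ing everything that is left yields the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

The same XOR also shows the limit of the scheme: with two blocks missing, the surviving blocks only give you the XOR of the two missing blocks, not either block on its own.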

The downside of the XOR parity RAID-5 uses is that no matter how many hard drives you put in the array, you can only fill in one blank space. And so RAID-5 arrays fail when two or more hard drives fail. Two hard drives will rarely fail at the same moment. More often, a second drive fails while the array is still running without the first. That is what happened in this VMFS data recovery case study: the client lost their RAID-5 array because a second hard drive failed before they could replace the first failed hard drive.

Salvaging Data from a RAID-5 Failure

RAID-5 arrays fail when two or more drives in the array become unusable. A server’s RAID controller constantly monitors the health and performance of every drive inside it. Some controllers have very low tolerances for a hard drive’s failures or lapses in performance, and will take a hard drive offline if it so much as coughs or sniffles.

The first drive in a RAID-5 to fail is considered “stale”, because none of the information on its platters has been altered or updated since the controller took it offline. Depending on when the second drive in a RAID-5 server fails, the data on the first drive can be minutes, days, weeks, or months out of date compared to the rest of the drives.

Upon analyzing the four hard drives in this client’s RAID array for VMFS recovery, our engineers discovered that the stale hard drive was actually healthy. A single slip-up had sent the client’s server into panic mode and convinced it to take this healthy drive down. Nobody had noticed that the RAID array had been operating minus one drive until the second drive failed.

The second drive, unlike the first, had suffered a legitimate hardware failure. After replacing its failed read/write heads in our cleanroom data recovery lab, our engineers recovered 99.9% of the data from its platters. The platters were very slightly damaged. Only around 2,000 sectors out of the billions of sectors on the four-terabyte Seagate Constellation’s platters were unreadable.

With the failed hard drive repaired and imaged by our cleanroom engineers, the RAID-5 array went over to our RAID recovery experts for the next stage of VMFS recovery.

VMFS Recovery

To access this failed RAID array’s VMFS filesystem, our RAID recovery technicians first had to rebuild the array. To do this, we needed the write-blocked images of the two healthy drives and the last drive to fail. Although the first drive to go offline was also physically healthy, its data was stale, and introducing it into the rebuilt RAID would have been catastrophic.
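To see why a stale drive is dangerous, consider a single stripe that was rewritten after the first drive dropped offline. The sketch below is a simplified illustration with made-up block contents and a hypothetical xor_blocks helper. It assumes the degraded array recorded the new write only in the parity block, which is how XOR parity keeps a missing member's data recoverable.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# The stripe as it stood while all four drives were online (sample bytes).
d0_stale, d1, d2 = b"AAAA", b"BBBB", b"CCCC"

# Drive 0 drops offline. Later the block that lived on drive 0 is rewritten.
# With drive 0 absent, the change can only be reflected in the parity block.
d0_current = b"aaaa"
parity = xor_blocks([d0_current, d1, d2])

# Wrong: treat the stale drive 0 as current and use it to rebuild drive 1.
wrong_d1 = xor_blocks([d0_stale, d2, parity])

# Right: leave the stale drive out and rebuild its block from current members.
right_d0 = xor_blocks([d1, d2, parity])

print(wrong_d1 == d1)          # False: the rebuilt block is corrupted
print(right_d0 == d0_current)  # True: the up-to-date block is recovered
```

Mixing the stale block into the calculation silently produces wrong data rather than an error, which is why the stale drive has to be identified and excluded before the rebuild.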

By analyzing the metadata on each drive, we could piece the three current hard drives together properly. The array’s VMFS filesystem contained the datastores for the client’s ESXi 5.0 virtual machines. In order to make sure that the bad sectors hadn’t affected the client’s critical files, we mounted each virtual machine to explore its contents.
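Piecing the drives together "properly" means knowing the drive order and how the parity block rotates from stripe to stripe. The sketch below shows one common rotation, usually called left-symmetric. It is an assumption for illustration only: the source does not say which layout this client's controller used, and the function name locate is hypothetical.

```python
def locate(logical_block, n_drives=4):
    """Map a logical data block to (stripe, data_drive, parity_drive).

    Assumes a left-symmetric RAID-5 layout: parity starts on the last
    drive and moves one drive to the left on each successive stripe,
    and data blocks start on the drive just after the parity drive.
    """
    data_per_stripe = n_drives - 1
    stripe, index = divmod(logical_block, data_per_stripe)
    parity_drive = (n_drives - 1) - (stripe % n_drives)
    data_drive = (parity_drive + 1 + index) % n_drives
    return stripe, data_drive, parity_drive

# Print where the first eight logical blocks of a four-drive array would live.
for block in range(8):
    stripe, data_drive, parity_drive = locate(block)
    print(f"block {block}: stripe {stripe}, drive {data_drive}, parity on drive {parity_drive}")
```

If the assumed drive order or rotation is wrong, the reassembled volume reads as scrambled data, so a candidate layout is normally checked against recognizable filesystem structures before it is trusted.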

Our technicians found no corruption affecting the vast majority of the client’s critical files. This VMFS recovery case turned out to be a success. We rated the case a 9 on our ten-point case rating scale.

Will Ascenzo