The client in this case had a Dell EqualLogic PS4100 SAN filled with 24 enterprise-grade 300 GB Seagate hard drives, combined in a RAID-50 array with two hot spares. RAID-50, in general, has good fault tolerance. Even after four drives in this array had failed, it still worked. But there was one problem: the drive failures had caused its performance to tank dramatically. The client was able to pull most of their VMware ESXi virtual machines off of the SAN. But by the time the client reached the last VM, the SAN’s performance had dropped even further. It was still running, but its transfer speed maxed out at a whopping eight kilobytes per second. The client hoped that our Dell EqualLogic PS4100 data recovery experts could retrieve their last VM.
Dell EqualLogic PS4100 Data Recovery Case Study: 24-Drive RAID-50
Drive Model: 300 GB Seagate Savvio 10K.3
Operating System: Windows Server
Situation: After 4 drives failed, RAID performance dramatically decreased and the SAN became unusable
Type of Data Recovered: ESXi virtual machines
Binary Read: 100%
Gillware Data Recovery Case Rating: 10
RAID-50 is a nested RAID array. It combines RAID-5 and RAID-0 together, much like how RAID-10 combines RAID-1 and RAID-0. RAID-50 takes RAID-5’s fairly good fault tolerance and RAID-0’s utter lack thereof and combines them into an array that usually has very good fault tolerance.
Here is a quick explanation of how RAID-10 works. In RAID-10, you pair your drives up into mirrored RAID-1 sets. Then you stripe those mirrored sets together into a RAID-0 array. RAID-10 has excellent fault tolerance, but only if the right drives fail. If both drives in the same mirrored pair fail, the entire array crashes. Another drawback of RAID-10 is that you only get half of the total capacity of all your drives. We don’t like RAID-10 very much.
RAID-50 is a bit like RAID-10. But it replaces the “1” with a “5”. In a RAID-50, you have several RAID-5 arrays, striped together like individual drives in a RAID-0 array. RAID-5 is a fairly fault-tolerant array. Thanks to the use of parity bits, you can lose one drive in the array and keep going. If you have several RAID-5 arrays, you can lose one drive from each array without any of them failing. By striping these individual RAID-5 arrays together using RAID-0, you end up with a RAID-50.
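To make the parity idea concrete, here is a minimal Python sketch of how a single RAID-5 stripe can survive the loss of one drive. It assumes simple XOR parity over equal-sized blocks; the helper name `xor_blocks` and the sample block contents are made up for illustration, and a real controller also rotates parity across drives and works at a much lower level than this.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks in one stripe of a hypothetical 4-drive RAID-5.
data = [b"ESXi", b"VMDK", b"data"]

# The fourth drive in this stripe holds the parity block.
parity = xor_blocks(data)

# Simulate losing the drive that held data[1].
# XORing the surviving data blocks with the parity block rebuilds it.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Lose a second block from the same stripe and there is no longer enough information to solve for either one, which is why a lone RAID-5 cannot survive two drive failures.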
Your RAID-50 can lose as many drives as there are RAID-5 sub-arrays beneath the RAID-0 level, as long as only one drive from each sub-array fails. If two drives from one sub-array fail, you’re toast. So RAID-50, like RAID-10, can fail if two drives fail. It’s more fault-tolerant than RAID-5 (which absolutely will fail if two drives fail). But it’s only about as fault-tolerant as RAID-10 (which can withstand multiple drive failures, but might fail if two drives fail). In terms of storage capacity, though, RAID-50 beats RAID-10.
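The "one drive per sub-array" rule can be sketched in a few lines of Python. This is an illustration only: the function name `raid50_survives` is hypothetical, and the assumption that drives are numbered from 0 and assigned to sub-arrays in contiguous blocks is ours, not something a particular controller guarantees.

```python
def raid50_survives(failed_drives, num_groups, drives_per_group):
    """Return True if no RAID-5 sub-array has lost more than one drive.

    Assumption: drives are numbered 0..N-1 and drive d belongs to
    sub-array d // drives_per_group.
    """
    losses = [0] * num_groups
    for drive in set(failed_drives):
        losses[drive // drives_per_group] += 1
    return all(count <= 1 for count in losses)

# Four failures spread across four different sub-arrays: still running.
spread_out = raid50_survives([0, 6, 12, 18], num_groups=4, drives_per_group=6)

# Only two failures, but both in the first sub-array: the array is lost.
same_group = raid50_survives([0, 1], num_groups=4, drives_per_group=6)

assert spread_out is True
assert same_group is False
```

The number of failed drives matters less than where they land, which is exactly what makes RAID-50 and RAID-10 so similar in this respect.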
In this Dell EqualLogic PS4100 data recovery case, the client’s 24-drive RAID-50 had been divided into four 6-drive RAID-5 arrays, allowing up to four drives to fail as long as no two of them belonged to the same sub-array.
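For a rough sense of the capacity trade-off, the arithmetic for this layout can be sketched as follows. These are nominal figures only: the function names are hypothetical, the calculation takes the four 6-drive layout at face value, and it ignores hot spares, formatting overhead, and anything else the SAN reserves for itself.

```python
def raid50_usable_gb(num_groups, drives_per_group, drive_gb):
    """Each RAID-5 sub-array gives up one drive's worth of space to parity."""
    return num_groups * (drives_per_group - 1) * drive_gb

def raid10_usable_gb(total_drives, drive_gb):
    """Mirroring leaves half of the drives' combined capacity usable."""
    return (total_drives // 2) * drive_gb

# Nominal figures for 24 x 300 GB drives.
raid50_gb = raid50_usable_gb(num_groups=4, drives_per_group=6, drive_gb=300)
raid10_gb = raid10_usable_gb(total_drives=24, drive_gb=300)

print(f"RAID-50: {raid50_gb} GB usable")  # 4 * 5 * 300 = 6000
print(f"RAID-10: {raid10_gb} GB usable")  # 12 * 300 = 3600
```

With the same 24 drives, the RAID-50 layout spends four drives on parity where RAID-10 would spend twelve on mirroring.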
We approached this Dell EqualLogic data recovery case differently from a normal SAN or server data recovery case. Unlike in most of our cases, the SAN was still operational, even though it was operating extremely poorly. You can see in the screenshot of the log that the unit was reporting “Unrecoverable read media error” messages on various drives. We had the client send the original Dell EqualLogic server setup along with their drives, which we normally wouldn’t require. Instead of imaging each drive and manually reconstructing the 24-drive array, our engineers would focus only on the problematic drives.
We connected the Dell EqualLogic SAN to a Windows machine that would act as an administrator and talk the SAN into mounting. We also connected the SAN to one of our Linux machines. This Linux machine would, if all went well, act as an iSCSI initiator. This SAN used the iSCSI protocol to divvy up its volume into multiple logical unit numbers, or LUNs. Each LUN was an iSCSI target, which had to be connected to an initiator in order to pull data off of it. First, we tested the array to confirm the client’s low I/O speeds. It hadn’t just been a fluke of their setup: we saw 8 KB/s transfer speeds as well. Once we’d confirmed that and gotten our client’s approval of our price quote, we got to work on the Dell EqualLogic PS4100 data recovery.
Our engineers’ analysis showed that of the four failed drives, the real troublemaker was Disk 16. For some reason, its failure had been entirely responsible for the server’s abysmal performance. Neither of the hot spares had properly engaged when the four drives failed. When Disk 16 failed, the rebuild process began automatically (which is what hot spares are for). However, a rebuild I/O failure was preventing the rebuild process from continuing, leaving the array in a state of limbo. It was a bit like stuffing a potato in a car’s exhaust pipe. This had caused the extremely poor I/O speeds from the array.
We made a perfect image of Disk 16 and replaced it in the SAN. Now the SAN was up and running. Our technicians checked the I/O speeds and were pleased to see transfer speeds of over 40 megabytes per second. We used our intelligent, fault-tolerant forensic imaging tool HOMBRE to copy the contents of the client’s SAN. Within a few hours, we had transferred the client’s last virtual machine off of the SAN.
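To put those two speeds in perspective, here is a back-of-the-envelope calculation in Python. The VM size is a made-up figure (500 GB), since the actual size isn't stated here, and the sketch assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and a perfectly steady transfer rate.

```python
def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a constant rate."""
    return size_bytes / bytes_per_second / 3600

# Hypothetical VM size, for illustration only.
VM_SIZE = 500 * 10**9        # 500 GB

SLOW_RATE = 8 * 10**3        # 8 KB/s, the degraded array
FAST_RATE = 40 * 10**6       # 40 MB/s, after dealing with Disk 16

slow = transfer_hours(VM_SIZE, SLOW_RATE)
fast = transfer_hours(VM_SIZE, FAST_RATE)

print(f"At 8 KB/s:  {slow / 24:.0f} days")
print(f"At 40 MB/s: {fast:.1f} hours")
print(f"Speedup:    {FAST_RATE // SLOW_RATE}x")
```

Whatever the real size of the VM was, the ratio between the two rates is the same: 40 MB/s is 5,000 times faster than 8 KB/s.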
The next step in this Dell EqualLogic PS4100 data recovery process was to make sure the ESXi virtual machine, and its contents, were safe and sound. Once the disk image was completed, we set it up as an NFS share and mounted the client’s critical VMDK file on our ESXi server. Our ESXi data recovery technicians explored the virtual machine’s contents and found everything to be in working order. There were no signs of corruption amid the client’s most critical files, which all appeared to work perfectly. We sent the client a list of the contents of their ESXi virtual machine so they could verify that all of their critical data had been salvaged.
After the client paid the bill for our data recovery efforts, we extracted their virtual machine to a password-protected external hard drive and sent it to them, along with their original drives and server equipment. The case turned out to be a great success. We rated this Dell EqualLogic PS4100 data recovery case study a 10 on our ten-point case rating scale.