HP ProLiant Data Recovery Case Study: Yellow Means Caution
Server Model: HP ProLiant DL360 G5 Server
RAID Level: RAID-5
Drive Model: Seagate Savvio 10K.5 9TE066-150
Total Capacity: 600 GB
Operating System: Windows Server 2008
Situation: Yellow status lights for all three drives; SQL server inaccessible
Type of Data Recovered: SQL database and Clarity profile
Binary Read: 99.99%
Gillware Data Recovery Case Rating: 9+
In this HP ProLiant data recovery case, the client came to us after a server crash. The SQL server with the client’s database and Clarity profile had become inaccessible, and as their IT department soon found, all three of the hard drives in their RAID-5 array seemed to have failed. The client needed their data recovered and their server up and running again. Fortunately, they chose just the right data recovery company to solve their problem. Gillware’s RAID server data recovery experts were on the case.
On the road, green means “go”, red means “stop”, and yellow means “caution”. Server designers use a similar language of lights to communicate the health of your server’s hard drives to you. Many server models, such as the HP ProLiant DL360, have two LEDs on each hard drive bay: a green LED that indicates whether the drive is online, and a second LED that can light amber (yellow) or blue. These two lights shine and blink in tandem to indicate the online and fault status of each hard drive.
If the amber light is steady or regularly flashing, it usually indicates that the drive in question has failed, or is about to fail (depending on the behavior of the green light). In this HP ProLiant data recovery case, all three of the client’s drives had a steady amber light with no greens. This meant that the RAID controller had detected a critical fault in all three drives and had taken all of them offline.
Servers typically connect hard drives together using fault-tolerant methods, such as RAID-5. A RAID-5 array has one drive’s worth of fault tolerance, meaning one hard drive can fall offline without jeopardizing the client’s data. Servers are meant to run continuously, so they constantly monitor the health of the hard drives in the array. If the RAID controller senses that one drive is behaving oddly or about to fail, it will kick the drive offline and let the RAID array’s fault tolerance on the remaining two drives fill in for it. The array keeps accepting writes without the offline drive, so the longer that drive stays offline, the more out-of-date, or “stale”, its data becomes.
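To make that one-drive fault tolerance concrete, here is a minimal sketch of the XOR parity idea behind RAID-5 in general. It is not our recovery tooling, and the block contents and the `xor_blocks` helper are made up for illustration. In a three-drive array, each stripe holds two data blocks and one parity block, and any one of the three can be rebuilt from the other two.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe of a three-drive RAID-5 array: two data blocks plus parity.
# (Illustrative contents only; real blocks are far larger.)
data_a = b"SQL page"
data_b = b"Clarity!"
parity = xor_blocks(data_a, data_b)

# If the drive holding data_b drops offline, the two survivors rebuild it.
rebuilt_b = xor_blocks(data_a, parity)

# The same trick works for whichever single block is missing.
rebuilt_a = xor_blocks(data_b, parity)
```

Lose two of the three blocks, though, and there is nothing left to XOR against, which is why a second dropped drive takes the whole array down.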
Some RAID controllers are very sensitive, and because hard drives in a server need to run continuously, the controller will kick off a hard drive if it feels the drive might start failing intermittently. This means the server might kick off an otherwise healthy drive because it hiccups. Since a RAID-5 array can only handle one drive loss, if the controller does this twice, the server crashes. Unfortunately for our client, this happened to not one but all three drives in the array. That had not been a good day for our client.
Our data recovery engineers inspected the hard drives and found that of the three, two drives were mostly healthy. The RAID controller had seen the drives sneeze and, in its infinite wisdom, decided to take them offline. The remaining drive had suffered a minor failure of its read/write heads. By carefully and skillfully using our fault-tolerant data recovery tools, our engineers could create a near-perfect disk image of the failed hard drive.
After imaging the failed hard drives, one of our RAID data recovery experts could piece together the RAID array. With custom RAID controller emulation tools, our RAID technicians can reverse-engineer the way the drives in the array are arranged and piece them together properly by examining the metadata on each drive. Piecing the data together properly produces valid data. Piecing it together improperly leaves you with Picasso’s Guernica. The metadata told our RAID engineer Cody that the data on the drives was cut into 64-kilobyte blocks and striped together in a particular “left-synchronous” pattern.
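As a rough illustration of what “arranging the drives” means, the sketch below maps a byte offset in the logical volume to a drive and an offset on that drive. It assumes “left-synchronous” refers to the layout often called left-symmetric (parity starts on the last drive and moves one drive to the left on each row, with data starting just after the parity block and wrapping around). It also assumes the data area starts at offset 0 on every drive. Real controllers reserve space for their own metadata, so the `locate` function and its parameters are hypothetical and not a description of the HP controller’s actual on-disk format.

```python
STRIPE_UNIT = 64 * 1024  # 64-kilobyte blocks, as in this case


def locate(logical_offset: int, n_disks: int = 3, unit: int = STRIPE_UNIT):
    """Return (data_disk, disk_offset, parity_disk) for a logical byte offset.

    Assumes a left-symmetric RAID-5 layout with no metadata offset.
    """
    block, within = divmod(logical_offset, unit)  # which data block, and where in it
    row, k = divmod(block, n_disks - 1)           # stripe row, and k-th data block in the row
    parity_disk = (n_disks - 1) - (row % n_disks)  # parity rotates right-to-left
    data_disk = (parity_disk + 1 + k) % n_disks    # data follows parity, wrapping around
    disk_offset = row * unit + within
    return data_disk, disk_offset, parity_disk


# Walk the first six data blocks of a three-drive array.
layout = [locate(i * STRIPE_UNIT) for i in range(6)]
```

Under these assumptions, the first six data blocks land on drives 0, 1, 2, 0, 1, 2 while parity sits on drives 2, 1, 0 for the first three rows. Get the block size or rotation wrong and every file larger than one block comes out scrambled.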
After arranging the drives properly, our RAID recovery technician Cody could examine the data for any signs of corruption. Of the three hard drives in the client’s ProLiant server, two had been fully imaged, but one had a handful of bad sectors that prevented a complete recovery. There was no way to tell in advance which parts of the client’s data the bad sectors had affected. For a SQL server data recovery procedure, testing the data was absolutely crucial, even with only a few bad sectors. Even a little corruption in just the wrong location can compromise a SQL database.
Our examinations of the recovered data showed that there was very little file corruption as a result of the damage to the client’s RAID-5 array. The client’s most important data, their SQL database, functioned perfectly. We rated this HP ProLiant data recovery case a high 9 on our ten-point case rating scale.