Most RAID devices were designed and built for data reliability. They work so well that many businesses and organizations rely on these storage devices to maintain the data for an entire network. So, when they fail, it really hurts.
Imagine being the IT manager of a school district that is using a Seagate 8TB BlackArmor NAS 440 in RAID-5 to protect the network’s data. Classes have wrapped up, summer is in the air, and a whole year’s worth of memories is just about ready to go to the printer and become the school’s yearbook.
But all of a sudden, the data is gone. It’s a stomach-turning, cold-sweat experience. Joshua Trisler, lead technician for Augusta Public Schools, tells it best in his own words:
Augusta, Kansas is a small, suburban community outside of Wichita. It is the classic Midwestern small town. Many of the residents look back with fond memories of growing up here. Something happened, though, to threaten the memories of the students in our public school district at the end of the 2010-2011 school year. Let me explain.
Our yearbook class, consisting of about 15 students, had worked for a sum of hundreds of hours over the past 9 months on the quintessential memory-recorder. They had sweated over page layout designs and wavered over choosing just the right pictures to represent the Augusta High School. Classes had just been let out for the summer and the yearbook teacher was within one day of being finished with the publication and sending it off to the printing company. It would then be ready for distribution to students by the start of the next school year.
Our district had purchased and maintained a top-of-the-line backup system – the 8 Terabyte Seagate BlackArmor NAS (Network-Attached Storage). This device would provide plenty of storage for the yearbook team. A week earlier our High School had been hit with a bad power outage. Everything came back up and was running fine – or so we thought. I entered our server room to check things over one morning and discovered a warning light on the front panel of our backup device. After checking things out, I discovered that one of the four drives in the RAID 5 array had failed, causing the array to become degraded. Well, not a big problem. That’s why we chose a RAID 5 configuration – you can lose one hard drive and still keep running. I restarted the device – maybe it had a bad case of the hiccups. When it came back online, the web interface said it was recovering itself. Great! The restart must have brought the hard drive back up and now it was repairing itself automatically. I like it when things are easy. I left and returned a few hours later to give the device time to work.
When I returned, warning lights started flashing in my head. The device had stopped the recovery process and the array status no longer said “Degraded”; now it said “Failed”. Do you know that feeling where time slows to a crawl and your peripheral vision gets fuzzy? That’s what was happening to me. I tried to access the files – nothing. I tried again – nothing. I was not looking forward to telling the yearbook teacher about this.
As I expected, the teacher took the news pretty hard. She was choking back tears but I tried to reassure her that all hope was not lost. A few months earlier we had used the services of Gillware Inc. to recover the data from a server using a RAID 1 array and they had successfully recovered every file. Their knowledgeable and friendly staff has always been so helpful and taken the time to keep the lines of communication open.
With the successful recovery of our yearbook data, we will be able to go to the printing company with enough time to get the yearbook into the hands of waiting students. Their memories are safe.
The heartbreak of data loss is what we confront every day, and it is always a great feeling to reunite our clients with what they’ve lost, whether it’s family photos, key business records or a manuscript. In this case, the stakes mattered not only to our client, but to an entire high school’s students.
The case was similar to other RAID-5 recoveries. As Joshua noted, one drive of a RAID-5 array can die and the system will limp along and run in a degraded mode. It’s when the second drive has problems that the whole system turns into a brick.
We found that one drive had serious issues. Its read/write heads needed repairs in a cleanroom. These tiny conductive loops that read and write binary data on the magnetic substrate of the platters are amazing feats of engineering. They are suspended above the hard drive’s platters by a tiny cushion of air – technically an air bearing – which can be as thin as 3 nanometers. That’s only about eight times larger than the diameter of an oxygen molecule. Another drive had some degradation of the magnetic material that holds data. The remaining two drives looked bad to the RAID controller but could be read by specialized equipment.
After replacing the damaged read/write heads and calibrating them to read the platters, we created full binary copies of all the drives in the system. We examined each drive image independently in a hex editor, which displays the raw bytes, to determine how the data was divided, or striped, among the drives and in what order. Each RAID controller is different, and it’s a logic puzzle to determine how the data was being handled and what the file structure was before the system failed.
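To give a feel for the striping puzzle, here is a minimal Python sketch of how one common RAID 5 arrangement maps logical data chunks onto drives. It assumes the "left-symmetric" layout that Linux software RAID uses by default; that is purely an illustrative assumption, not a statement about how this particular NAS laid out its data. The function names are hypothetical.

```python
def parity_disk(stripe: int, n_disks: int = 4) -> int:
    """Disk that holds parity for a stripe (assumed left-symmetric layout)."""
    return (n_disks - 1) - (stripe % n_disks)


def locate_chunk(chunk: int, n_disks: int = 4) -> tuple[int, int]:
    """Return (disk index, stripe number) for a logical data chunk."""
    data_per_stripe = n_disks - 1          # one chunk per stripe is parity
    stripe, k = divmod(chunk, data_per_stripe)
    # Data chunks start on the disk after the parity disk and wrap around.
    disk = (parity_disk(stripe, n_disks) + 1 + k) % n_disks
    return disk, stripe


def layout_table(n_stripes: int, n_disks: int = 4) -> list[list[str]]:
    """Build a stripe-by-disk grid of labels such as 'D4' or 'P'."""
    table = [["?"] * n_disks for _ in range(n_stripes)]
    for stripe in range(n_stripes):
        table[stripe][parity_disk(stripe, n_disks)] = "P"
    for chunk in range(n_stripes * (n_disks - 1)):
        disk, stripe = locate_chunk(chunk, n_disks)
        table[stripe][disk] = f"D{chunk}"
    return table


if __name__ == "__main__":
    for row in layout_table(4):
        print(" ".join(f"{cell:>4}" for cell in row))
```

In a real recovery the chunk size, drive order and parity rotation are all unknowns that have to be inferred from the drive images, so a mapping like this is a hypothesis to test rather than a given.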
It was crucial at this point to determine which drive failed first. RAID systems use a clever calculation to store their redundant data. It’s called the “exclusive or” (XOR) binary operator. You might intuitively expect that if you had four disks full of data, you’d need another four disks to have a redundant copy. But the XOR operator lets the data on four disks be protected by just one additional disk’s worth of parity information. This efficiency of storage space, though, comes at a cost: complexity.
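The "exclusive or" (XOR) trick is easier to see with a few bytes than with whole disks. The following is a minimal Python sketch, using made-up four-byte blocks in place of real drive contents: parity is the XOR of the data blocks, and any single missing block is the XOR of everything that survives. The helper name `xor_blocks` is our own.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks standing in for one stripe on a four-drive array.
d0 = b"YEAR"
d1 = b"BOOK"
d2 = b"2011"

# The fourth block in the stripe is the parity of the other three.
parity = xor_blocks(d0, d1, d2)

# Pretend the drive holding d1 has died: XOR the survivors with the parity.
rebuilt = xor_blocks(d0, d2, parity)

if __name__ == "__main__":
    print(rebuilt == d1)
```

This works because XOR-ing a value with itself cancels it out, so whatever is missing is the only thing left standing.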
In a system in which four disks were backed up by four disks, if you lost one, you could just look at its copy. In a RAID 5 system, when you lose one disk, you have to run logical calculations comparing the remaining disks to determine the binary content of the failed drive. And these calculations give you old data or worse if you have the wrong diagnosis about which drive failed first.
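Why the failure order matters can be shown with a toy stripe. This Python sketch assumes a simplified scenario of our own invention: one drive drops out, the array keeps running degraded, and a later write lands on a block that the dropped drive can no longer record, so only the parity reflects it. The block contents and drive names are illustrative.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# The stripe before anything went wrong: three data blocks.
d0, d1_old, d2 = b"AAAA", b"old!", b"CCCC"

# Drive 1 drops out first. While the array runs degraded, new data is
# written to drive 1's block. The controller cannot touch drive 1, so the
# change is captured only in the parity block.
d1_new = b"new!"
parity = xor_blocks(d0, d1_new, d2)

# What the lab images contain afterwards: drive 1 still holds stale bytes.
images = {"drive0": d0, "drive1": d1_old, "drive2": d2, "parity": parity}

# Correct diagnosis: drive 1 failed first, so ignore it and rebuild it.
good = xor_blocks(images["drive0"], images["drive2"], images["parity"])

# Wrong diagnosis: trust drive 1 and rebuild drive 2 from the others.
bad = xor_blocks(images["drive0"], images["drive1"], images["parity"])

if __name__ == "__main__":
    print("rebuilt drive 1 is current:", good == d1_new)
    print("rebuilt drive 2 is intact: ", bad == d2)
```

With the right diagnosis the current data comes back. With the wrong one, drive 1 yields old data when read directly, and the block rebuilt from it matches nothing that was ever written.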
Only after all this analysis, and a correct diagnosis of the drive failure order, could we begin to write the code that would rebuild this Linux-based file system. We then tested our hypothesis by checking the integrity of a large recent file and proceeded to reassemble all the pieces in the puzzle into one contiguous physical volume.
From the delicate repairs of a cleanroom engineer to the logical analysis of a code writer, recovering data is always a team effort. And though we deal with thousands of cases a year, the thrill of recovering what once seemed lost is as strong as ever.
Soon those yearbooks will be ready to be passed around and signed. Here’s our kickoff: