In Gillware’s latest blog series, “Data Recovery 101”, our bloggers take a closer look at each of the different components of a hard drive and explain how they work, how they fail and how we recover the data from each failure situation. In this post, our CEO Brian Gill explores the drive’s firmware or hard drive operating system.
What is hard drive/SSD firmware?
Firmware is the storage device’s operating system. Just as you might run Windows on your computer and iOS or Android on your phone, a complicated device that needs to store and organize billions of bits on platters or NAND needs an operating system of its own. I personally prefer the term hard drive operating system or HDD O/S instead of firmware. Either is accurate, but the term firmware tends to have a connotation that it’s nothing special or unique. One would expect the firmware for multiple electronic devices coming off a manufacturing line to be identical, but this is not the case in the world of storage.
The firmware on a spinning disk will have all the compiled application code for doing everything the drive needs to do. This baseline firmware will vary slightly from O/S revision to O/S revision. Hard drive manufacturers are always making tweaks to this code for increased performance, security and reliability. Some manufacturers will produce hundreds of versions of their base firmware in a calendar year. For any particular drive-line, like the Western Digital Blue desktop series, they may have ten or twenty revisions a year.
HDD manufacturers will sometimes create custom firmware for different computer companies; Apple, for example, likes to have its own firmware. They will also code different drive behavior for drives intended for enterprise data centers, consumer desktops, consumer DVR units, etc. A consumer drive like a WD Green will spin down its platters and park the heads during inactivity, as opposed to a WD Enterprise drive that will keep spinning until a RAID controller tells it to spin down. These behaviors are defined in the firmware.
The firmware zone is also where a lot of the drive’s unique calibrations, defect lists, zone tables, unique translation (addressing) information, performance logs and SMART attributes are stored.
How does firmware fail?
Firmware can become corrupted and require repair. Even though a manufacturer will typically keep at least one backup copy of this special set of data, unrecoverable corruptions can occur.
Ironically, I believe the majority of these corruptions are directly attributed to the very mechanisms that exist to prolong a drive’s lifespan and warn you of imminent failure.
All modern drives implement SMART (Self-Monitoring, Analysis and Reporting Technology). These drives pay attention to their own behavior and performance, and when it starts deviating from the norm, they record that information in log files and SMART tables.
During a drive’s lifetime, sectors go bad. The first time the drive tries to read a sector after it has failed, that sector needs to be put on a list of sectors we’d like to relocate. The drive can’t relocate it right away because the sector is corrupted and the drive does not know what data used to live there. If that sector happened to be in the middle of a payroll database, and the drive just handed back a bunch of zeros instead of giving a UNC (uncorrectable) error, you might pay an employee $1000000 instead of $1000. But at the next write opportunity, the sector will get remapped to a healthy sector in a reserve area. It’s not a great situation when you try to load that database and the operating system says it cannot be loaded because of sector errors, but it is better than pretending everything is fine.
The information about which sectors are pending reallocation, and which have been successfully reallocated (and where), lives on the platters in the firmware zone. Also in the firmware zone are the performance logs, events and, in turn, the SMART attributes.
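The pending and reallocated lists described above can be sketched as a toy model. This is only an illustration of the bookkeeping, not how any real drive's firmware is written; the class name `ToyDrive` and all of its methods are hypothetical.

```python
class ToyDrive:
    """Toy model of pending-sector and reallocation bookkeeping (illustrative only)."""

    def __init__(self, user_sectors, spare_sectors):
        # Sectors 0..user_sectors-1 hold user data; the rest form the reserve area.
        self.data = {lba: b"" for lba in range(user_sectors)}
        self.spares = list(range(user_sectors, user_sectors + spare_sectors))
        self.spare_data = {}   # contents of reserve sectors that are in use
        self.failed = set()    # sectors that have physically gone bad
        self.pending = set()   # bad sectors discovered by a read, awaiting a write
        self.remap = {}        # original sector -> reserve sector

    def fail_sector(self, lba):
        """Simulate a sector going bad; the drive has not noticed yet."""
        self.failed.add(lba)

    def read(self, lba):
        if lba in self.remap:
            return self.spare_data[self.remap[lba]]
        if lba in self.failed:
            # The old contents are unknown, so report an error rather than
            # returning made-up data, and remember the sector for later.
            self.pending.add(lba)
            raise OSError(f"UNC: uncorrectable read error at sector {lba}")
        return self.data[lba]

    def write(self, lba, payload):
        if lba in self.pending:
            # A write supplies fresh data, so the sector can now be relocated.
            if not self.spares:
                raise OSError("reserve area exhausted")
            self.remap[lba] = self.spares.pop(0)
            self.pending.discard(lba)
        if lba in self.remap:
            self.spare_data[self.remap[lba]] = payload
        else:
            # Simplification: a bad sector that was never read is not detected here.
            self.data[lba] = payload
```

In this sketch, reading a failed sector raises an error and adds it to `pending`; the next write to that sector moves it into `remap`, after which reads succeed again from the reserve area. In a real drive, both lists would have to be saved back to the firmware zone every time they change.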
How do corruptions occur?
So let’s imagine a scenario where a headstack is in the early stage of failure. It’s taking multiple read attempts to successfully read data, those read events are having unacceptable latency, and lots of sectors need to be added to the growth defect list. The drive needs to use those same heads to save this performance and sector information to the platters! So, one can easily understand how failing heads might write a bunch of gibberish to the firmware zone.
Let’s imagine another scenario where a drive is in the middle of doing a bunch of this sector reallocation and subsequently a bunch of performance bookkeeping in the SMART tables. The end user is experiencing I/O lag on the drive and is getting frustrated. The frustrated user decides to do a therapeutic shutdown and cold reboot. The operating system notifies the drive that it wants to perform a shutdown. The drive replies “gimme a minute I’m in the middle of some bookkeeping” so the O/S blocks the event temporarily and is going to wait until the drive tells it “cool I’m done, go ahead and shut down”. The human is now having their blood pressure raised as even the shutdown is taking 30 seconds! And they perform a hard shutdown or just yank the power cord rather than wait, while the drive was right in the middle of altering its operating system. Once again, it isn’t difficult to comprehend how the HDD O/S can be adversely affected.
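To make the power-loss scenario concrete, here is a minimal sketch of a bookkeeping record that is only partly written before the power goes away. The record layout (a 4-byte length, a CRC32, then the payload) and every function name are assumptions made for illustration; real drives use their own proprietary module formats.

```python
import zlib


def pack_module(payload: bytes) -> bytes:
    """Build a record: 4-byte length, 4-byte CRC32, then the payload (hypothetical layout)."""
    header = len(payload).to_bytes(4, "little") + zlib.crc32(payload).to_bytes(4, "little")
    return header + payload


def unpack_module(blob: bytes):
    """Return the payload if the length and checksum agree, otherwise None."""
    if len(blob) < 8:
        return None
    length = int.from_bytes(blob[:4], "little")
    crc = int.from_bytes(blob[4:8], "little")
    payload = blob[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload


def interrupted_write(old_blob: bytes, new_blob: bytes, bytes_written: int) -> bytes:
    """Simulate losing power after only bytes_written bytes of new_blob reached the media."""
    return new_blob[:bytes_written] + old_blob[bytes_written:]


def load_module(primary: bytes, backup: bytes):
    """Try the primary copy first, then fall back to the backup copy."""
    for name, blob in (("primary", primary), ("backup", backup)):
        payload = unpack_module(blob)
        if payload is not None:
            return name, payload
    return None, None
```

If the old record is `pack_module(b"defects: 10, 11")` and the new one is `pack_module(b"defects: 10, 11, 12, 13")`, cutting the write off after 12 bytes leaves a mix of new header and old payload that fails validation. With an intact backup copy, `load_module` can still recover the older list; if both copies are torn, nothing usable is left, which is roughly the situation a recovery lab is asked to repair.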
How does Gillware recover data from cases with firmware corruption?
When these firmware areas are corrupted, they need to be repaired or the drive cannot boot itself, just as when Windows has corrupted O/S files and cannot boot, you need to grab an O/S disk and troubleshoot. There are many approaches for performing this analysis and repair. Here at Gillware we build our firmware library every single day and attempt to back up the firmware on every drive that enters our doors as part of our standard process. There are tools you can buy to perform the basic operations of reading/writing firmware, the most popular being the PC-3000 toolkit.
Typically the drive will not detect correctly in the BIOS. Instead of this vintage drive detecting as a 6E040L0 model with 40 GB of capacity, a common firmware corruption will cause that drive to detect with 0 GB capacity and as N40P. A more modern example is the Intel 8MB bug. When the firmware has a corruption, instead of this SSD detecting as an Intel 320 series with 160GB of capacity, it will detect as BAD_CTX with an error code and 8MB of capacity.
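A first-pass triage of these symptoms can be sketched in a few lines. This is a hedged illustration only: the two markers come straight from the examples above, while the `DriveIdentity` type, the `firmware_warnings` function and the 16 MiB threshold are assumptions invented for this sketch, and reading the identity from real hardware is out of scope here.

```python
from dataclasses import dataclass

# Markers taken from the two examples in this post; a real list would be far longer.
SUSPECT_MODEL_MARKERS = ("N40P", "BAD_CTX")

# Arbitrary illustrative threshold: 16 MiB or less is implausibly small for a drive.
MIN_PLAUSIBLE_BYTES = 16 * 1024 * 1024


@dataclass(frozen=True)
class DriveIdentity:
    model: str            # model string the drive reports to the host
    capacity_bytes: int   # capacity the drive reports to the host


def firmware_warnings(identity: DriveIdentity) -> list:
    """Return human-readable reasons to suspect firmware corruption (may be empty)."""
    warnings = []
    for marker in SUSPECT_MODEL_MARKERS:
        if marker in identity.model:
            warnings.append(f"model string contains {marker!r}")
    if identity.capacity_bytes == 0:
        warnings.append("reported capacity is 0 bytes")
    elif identity.capacity_bytes <= MIN_PLAUSIBLE_BYTES:
        warnings.append("reported capacity is implausibly small")
    return warnings
```

A healthy identity such as `DriveIdentity("6E040L0", 40 * 10**9)` produces an empty list, while `DriveIdentity("N40P", 0)` produces two warnings. An empty list does not prove the firmware is healthy; it only means none of these particular symptoms showed up.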
Ten percent of the cases we see here at Gillware have healthy internals and electronics, and the user data area (partition tables, file system metadata, binary user data) is healthy as well. They show up here only because they have a firmware bug. Ten percent may not seem like a lot, but there are other uses for firmware manipulation besides repair, and as a company that has spent millions of dollars to increase success rates, I’ll tell you the ability to recover data on an extra ten percent of cases is huge.
The challenge of reverse engineering complicated electronic devices, and applying that knowledge to get people out of jams, is actually fun for a certain type of engineer. It would be a lot more fun if the stakes and anxiety weren’t so high for all our clients. If you’ve got a computer engineering and programming background, and dedicate about three years of your life to it, you’ll get pretty decent at it. A scientific background and ten thousand hours will make you a master of this very odd specialty. I’d estimate there are fewer than 300 humans worldwide who have put in these 10,000+ hours. I’ve met about 12 of them, and none of us want to do it very much anymore, but it’s very difficult to get away from it entirely.
Knowledge of the storage device’s operating system is of paramount importance to any company serious about data recovery. It is when you use it in conjunction with other skills like electrical or physical rework that the applied knowledge is truly useful.