RAID (Redundant Array of Independent Disks) technology is used extensively in business IT systems, and it is fair to say that the overwhelming majority of business servers use RAID technology for data integrity and system performance. RAID is found in Network Attached Storage (NAS) devices and Storage Area Networks (SAN), and the use of RAID has also grown rapidly in the home consumer market.
According to this Dell Support Article, “when one or more drives in a RAID array contain data errors, and another drive in the array is no longer an active member of the array due to a drive failure, foreign configuration, drive removal, or any other reason, this creates a condition known as a Double Fault.” It is the double fault that triggers the puncture feature of the Dell PERC RAID platforms.
Is RAID Still Used?
There are many types of RAID configurations available. Some of the most popular are RAID 0, 1, 5, 6, and 10. RAID0, sometimes called disk striping, requires two or more hard drives and is used for high-speed performance. The data blocks are split across the disks, which allows for fast access times (low latency) and high throughput, but there is no redundancy of any kind: if a single disk fails or returns a data error, the affected data cannot be reconstructed.
RAID1, sometimes called mirroring, is built for data protection, and the data blocks have a 1:1 copy on a neighboring hard drive. Essentially there are two copies of the data, one per hard drive. This allows for an entire disk to fail without losing any data!
Business and enterprise users typically focus on RAID5, RAID6, RAID10, and RAID50. RAID6 gives good performance and allows for two disk failures, and RAID5 is almost identical, but can only accommodate one hard drive failure.
RAID10 offers strong performance and is guaranteed to survive one disk failure; it can survive more, provided that no mirrored pair loses both of its disks. RAID50 stripes data across several RAID5 sets, combining good performance with good data protection. Unfortunately, RAID10 and RAID50 are expensive as they require a large pool of hard disks. There are many other RAID configurations out there, but these are by far the most popular.
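To make the trade-offs between these levels concrete, here is a minimal sketch that computes usable capacity and the number of disk failures each level is guaranteed to survive. The function name `raid_summary` and its return keys are illustrative assumptions, not part of any vendor tool, and RAID50 is left out because its figures depend on how many RAID5 sets it spans.

```python
def raid_summary(level, disks, disk_tb):
    """Return usable capacity (TB) and guaranteed disk-failure tolerance.

    Hypothetical helper for illustration; assumes all disks are the same size.
    """
    # level: (minimum disks, data-disk count, guaranteed failures tolerated)
    rules = {
        "RAID0": (2, lambda n: n, lambda n: 0),
        "RAID1": (2, lambda n: 1, lambda n: n - 1),
        "RAID5": (3, lambda n: n - 1, lambda n: 1),
        "RAID6": (4, lambda n: n - 2, lambda n: 2),
        # RAID10 may survive more than one failure, but only one is guaranteed.
        "RAID10": (4, lambda n: n // 2, lambda n: 1),
    }
    if level not in rules:
        raise ValueError(f"unsupported level: {level}")
    minimum, data_disks, tolerated = rules[level]
    if disks < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    if level == "RAID10" and disks % 2:
        raise ValueError("RAID10 needs an even number of disks")
    return {
        "usable_tb": data_disks(disks) * disk_tb,
        "guaranteed_failures": tolerated(disks),
    }


for level, count in [("RAID0", 4), ("RAID5", 4), ("RAID6", 4), ("RAID10", 4)]:
    print(level, raid_summary(level, count, 2.0))
```

With four 2 TB disks, for example, RAID5 yields 6 TB usable with one tolerated failure, while RAID6 and RAID10 both yield 4 TB.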
RAID products are a reliable solution and proven technology to manage multiple hard drives. However, RAID arrays are not immune to data errors and on occasion, despite the built-in protective measures, a RAID array can fail. When a RAID array does break, it is usually a catastrophic failure and a system administrator’s nightmare. Data loss is almost inevitable, and it is highly advisable to have a robust backup strategy in place. RAID is not a substitute for a backup; instead, it should complement an existing backup strategy.
Gillware’s data recovery specialists have decades of experience in reuniting users with their data. It’s what we do! Many of our skilled data recovery experts started when Gillware was founded in 2004, and they have already put in tens of thousands of hours of hard work into our data recovery services; they truly are experts in the field, and we are proud to have them on board at Gillware.
We specialize in complex RAID failures, and our R&D teams have invested several years into creating recovery techniques for multiple RAID solutions. If you are experiencing problems with your RAID configuration, we highly recommend you stop using the system to prevent further damage or faults from occurring; this includes refraining from the use of data recovery software.
Contact Gillware about our RAID Recovery Services. Our expert RAID recovery engineers are available for data recovery from RAID arrays, even from the most severe punctured RAID scenarios. We are standing by ready to help you with a RAID recovery.
Need Help Now?
We Can Help with RAID Puncture Errors
Talk to an expert about your RAID puncture errors and getting your data back. Get a no-hassle consultation today!
What Is a RAID Array Puncture?
A punctured RAID is a data error protection feature found only in Dell storage controllers, specifically in Dell PowerEdge RAID Controllers (PERC). PowerEdge is Dell’s enterprise-class server family, designed for intensive business workloads, and PERC is the family of RAID controllers built for it. PERC controllers are found in regular Dell servers with locally attached storage, and also in external storage area networks (SAN).
The RAID controller, typically a hardware device, is used to manage the hard disk drives (HDD) or solid-state drives (SSD) found in a server or storage system. It allows the logical presentation of disks to the operating system. A multiple disk array can be used to create a logical volume which in turn can be presented to the server as a single hard drive.
The RAID controller firmware manages the hard drive configurations using RAID technology, and system administrators use the PERC controller to control the type of RAID used, the redundancy, performance, and also for running maintenance tasks, such as replacing the hard drive after a failure.
A punctured RAID is a deliberate feature of the PERC controller, designed to allow the controller to restore the redundancy of an array despite the loss of data. The event is typically caused by a double fault condition. In plain English, the RAID controller firmware can rebuild an array that has suffered a double fault, even though the state of the array should not allow it and most other manufacturers would fail the array or place it in a degraded state. This in turn can result in a data error, a major drop in performance, or significant data loss.
How Do You Know If RAID Arrays Are Failing?
A punctured RAID, although rare, is caused when bad blocks are copied to a multiple disk RAID configuration. If a server has a failed disk (amber light), an engineer would normally contact Dell support to replace the disk, and the engineer will pull the disk from the server/storage and replace it with the new disk. The controller will rebuild the disk using the data and parity pieces held on the remaining disks.
However, if one or more of those pieces has a block error, which is often preceded by a predictive failure warning on that disk, it can result in the newly replaced disk having bad blocks copied to it during the RAID array rebuild process – essentially destroying the array.
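The rebuild described above can be illustrated with a small sketch, assuming a single-parity (RAID5-style) stripe in which parity is the byte-wise XOR of the data blocks. The names `xor_blocks` and `rebuild_block` are hypothetical, and an unreadable block is modelled simply as `None`; real controller firmware is considerably more involved.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_block(surviving_blocks):
    """Rebuild the missing block of one stripe from the surviving members.

    Returns None when any surviving block is unreadable: the stripe has a
    double fault, and there is nothing valid to rebuild from.
    """
    if any(block is None for block in surviving_blocks):
        return None
    return xor_blocks(surviving_blocks)


# A stripe with two data blocks and one parity block.
d0, d1 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks([d0, d1])

# Single fault: the disk holding d0 was replaced, d1 and parity are readable.
print(rebuild_block([d1, parity]) == d0)

# Double fault: d0 is gone and d1 sits on a bad block.
print(rebuild_block([None, parity]))
```

The second case is the situation in which a controller must either fail the rebuild or, as PERC does, mark the affected stripe and carry on.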
This might seem like an unlikely scenario, but it can happen to anyone, and it does happen. It usually occurs when a system administrator has not taken the time to read the PERC controller log, or if an automated monitoring tool hasn’t picked up the faults.
The controller logs are the first place to start when troubleshooting puncture faults and problems. Each issue is given a unique HEX code by Dell. Modern storage systems will dial home automatically to Dell to report a fault, but regular drive checks should still be completed to catch any problems that automated reporting misses.
The error log will likely contain these errors during the rebuild process:
Error: Bad Block - Priority Medium: EV_REC Medium Error - Patrol read found an uncorrectable medium error - Disk1
Error: Predictive Failure - Priority: Predictive : Predictive Failure Disk2
In reality, what is happening in this situation is that you have two disks with bad blocks, and to stop another disk from having the bad blocks copied to disk, the controller intercepts the fault and the Dell Puncture feature kicks in to stop any further failure from happening.
The PERC controller detects and predicts the disk errors in the Logical Block Addresses (LBA) of the array.
Error: Puncturing bad block on disk1 at address X
Error: Puncturing bad block on disk2 at address X
The punctured RAID enables the storage system/server to keep running and serving data, but it is more of a fail-safe than a fix. You may also see error correction events in the controller logs or get warnings about the drive, something similar to ‘replacement drive read error’. It is difficult to advise the exact read error fault from the controller logs as they vary between the make, model, and firmware revisions of the controller.
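Because the wording of these messages varies between controllers, any automated check has to be adapted to the logs you actually see. The following is a rough sketch that sorts exported log lines into three buckets by keyword. The sample lines are the illustrative ones quoted in this article, not verbatim controller output, and the pattern list and the `classify_log_lines` name are assumptions.

```python
import re

# Checked in priority order, so a "Puncturing bad block" line is reported
# as a puncture rather than as a plain bad block.
PATTERNS = [
    ("puncture", re.compile(r"punctur", re.IGNORECASE)),
    ("predictive_failure", re.compile(r"predictive failure", re.IGNORECASE)),
    ("medium_error", re.compile(r"medium error|bad block", re.IGNORECASE)),
]


def classify_log_lines(lines):
    """Group log lines by the first keyword pattern they match."""
    hits = {name: [] for name, _ in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS:
            if pattern.search(line):
                hits[name].append(line.strip())
                break
    return hits


sample = [
    "Error: Bad Block - Priority Medium: EV_REC Medium Error - "
    "Patrol read found an uncorrectable medium error - Disk1",
    "Error: Predictive Failure - Priority: Predictive : Predictive Failure Disk2",
    "Error: Puncturing bad block on disk1 at address X",
]
for category, matched in classify_log_lines(sample).items():
    print(category, len(matched))
```

A script like this is only a first filter; any hit in the puncture bucket still calls for reading the full log around that event.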
The puncture can prevent total data loss, but you will continue to get a fault warning in the controller logs. Eventually, the punctured RAID array will need to be destroyed and rebuilt to completely fix the original fault that triggered the original puncture.
The major benefit of the puncture feature is that the rebuild process can be done in a controlled manner while at the same time protecting from data loss. Storage experts will need to transfer data between the punctured array and a new LUN. The best thing to do is make an immediate backup, dump the disks/array, rebuild, and restore from the backup (if possible). This is a relatively simple process, and only takes a matter of minutes on modern SAN devices.
RAID Array Data Errors
The puncture data error will take place in one of three array locations.
You may get lucky and hit blank space on the array stripe, something that contains no data – as there is no data where the puncture is, there is no significant impact. If the operating system attempts to write to the punctured location, the write will fail, and the server will simply write to another location.
The puncture may hit a stripe containing an insignificant file, such as a text file or unused temporary file. If the file remains dormant and is never accessed, no errors will be written to the log. However, a full system backup will likely fail when it tries to read the punctured file.
If you are unlucky, the puncture may hit data that is accessed regularly, such as system files, page files, or databases. The severity can vary, but these punctures are the most likely to cause an outage.
It does not matter which type of punctured array you are experiencing; the only way to permanently fix the issue is to destroy and recreate the array.
Gillware has been partnered with Dell for many years, and we have a great relationship with them. A punctured array is a feature exclusive to Dell products, and the feature has saved many of our customers from losing critical business information.
Unfortunately, some customers do not have a maintenance contract for Dell products, so once the manufacturer’s warranty has lapsed, the customer is left with the bill. If a RAID fault is not identified by the user, it can result in a production outage of critical data.
If you are in this difficult position, please don’t panic; Gillware is standing by to assist.
How Does Gillware Recover RAID Data?
The customer needs to get in touch with the Gillware customer team. There are several ways to do this; you can call us at 877-624-7206, or email [email protected], or you can log a support request via our website.
As RAID data errors usually require a complex fix, you will most likely be called by our data recovery experts to discuss your case. Our experts may have some additional questions about the failed RAID, such as model and make, a description of the fault, and if you are able to, we will ask you to upload the controller logs.
Our team will next provide a shipping label to mail the RAID device for free. Alternatively, if you are in the Madison area, or if you are close to our newly opened Detroit Offices, you can arrange to drop off the device in person.
Upon arrival, the device is cataloged and our experts will perform an initial assessment of the issue. During these initial diagnostic checks, Gillware will attempt to confirm the root cause of the issue. The RAID recovery experts at Gillware have extensive knowledge of all of the popular models of SAN, NAS, and server hardware from all of the major manufacturers, including brands such as Synology, Dell, HP, IBM, XenServer, SnapServer, Buffalo, Drobo, and FreeNAS, to name but a few.
An Example of How Bad Block Data Errors Can Destroy an Array
For this example, consider a typical server with some locally attached storage running a Windows Server 2019 Operating System with a few core applications installed. The local disks are configured using RAID, and the data is either mirrored or striped with parity to create redundancy. It might be a RAID1, RAID5, or RAID10 configuration.
The user is experiencing an “Operating System Not Found” error or may see some other specific errors relating to a damaged data array. The user received the Dell Puncture warning a few weeks ago, but no action was taken. Eventually, the RAID configuration failed, resulting in zero bytes of storage available.
This type of issue manifests itself as a data error on the RAID configuration, but because no action was taken, the data error blocks were copied around the RAID drives, resulting in the “Error: Puncturing bad block on diskX at addressX”. If no further action is taken, the array will fail and data loss will occur in a matter of days, weeks, or months.
This scenario may seem unlikely to occur, but it absolutely can and does happen. Storage systems and servers will happily operate 24/7 in a data center, but if alerting or SNMP traps are incorrectly configured, the array might be triggering regular errors, and unfortunately, these are not picked up, resulting in a failed system over a period of time. This problem is compounded if regular data center checks are not completed by the technical teams.
How Gillware Data Recovery Experts Fix This Issue
Gillware can reverse-engineer the array configuration and write a custom emulated RAID controller from the metadata recovered from the original hard drives. Emulation creates a software (virtual) controller that can attach the data blocks together into a readable format. Our team analyzes what files are recoverable and we make changes to the configuration if bits of data are still missing.
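To give a feel for what a software (virtual) controller has to do, here is a simplified sketch of RAID5 emulation. It is not Gillware’s tooling: it assumes a left-symmetric parity rotation, a stripe unit of exactly one block, and disk images held in memory as `bytes` (with `None` for a missing member). The function names are hypothetical. Real recoveries must first work out the actual disk order, stripe size, and rotation from the metadata.

```python
from functools import reduce
from operator import xor


def locate_block(logical_block, num_disks):
    """Map a logical block to (data disk, row, parity disk), left-symmetric."""
    row, index_in_row = divmod(logical_block, num_disks - 1)
    parity_disk = (num_disks - 1) - (row % num_disks)
    data_disk = (parity_disk + 1 + index_in_row) % num_disks
    return data_disk, row, parity_disk


def read_block(disks, logical_block, block_size):
    """Read one logical block, reconstructing it if its disk is missing."""
    data_disk, row, _ = locate_block(logical_block, len(disks))
    start, end = row * block_size, (row + 1) * block_size
    if disks[data_disk] is not None:
        return disks[data_disk][start:end]
    # The wanted disk is missing: XOR the same row from every other member.
    others = [d for i, d in enumerate(disks) if i != data_disk]
    if any(d is None for d in others):
        raise ValueError("more than one member missing; cannot reconstruct")
    return bytes(reduce(xor, col) for col in zip(*(d[start:end] for d in others)))


# Three one-byte-per-block disk images holding the logical data b"ABCD".
disks = [b"AD", b"B\x07", b"\x03C"]
print(b"".join(read_block(disks, n, 1) for n in range(4)))

# The same read with the first disk missing.
degraded = [None, b"B\x07", b"\x03C"]
print(b"".join(read_block(degraded, n, 1) for n in range(4)))
```

Both reads return the same four bytes, which is the whole point of emulation: once the layout is known, the data can be presented as a single readable volume without the original controller.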
Our in-house proprietary data recovery software, HOMBRE, will mount the virtual controller and present the metadata in a readable format to an alternative Gillware server, which will enable our engineers to copy the data from the damaged disk array. The data recovery software is used over an ultra-fast network connection to speed up the process and reduce the risk of failure.
Once we have made the recovered files safe, we can run further checks to determine the root cause of the failure. We have vast reserves of spare parts and we can swap components that appear to be failed to help pinpoint exactly what has happened. In most circumstances, we recommend that our RAID recovery clients replace the original hardware. We then arrange to securely transport the recovered files back to the customer.
How to Mitigate a RAID Puncture and Data Errors
In the grand scheme of things, data errors only happen in a very small percentage of storage systems, especially when you consider the enormous number of RAID devices operating around the globe. The Dell hardware in particular is robust and reliable, and the majority of new devices dial home automatically anyway.
Problems occur over time, and it’s important to have a number of safeguards in place. Remember that RAID is a form of protection; it is not a backup solution. Gillware highly recommends our customers have a tried and tested backup solution in place. Backups can be taken at the storage level, or as images of each individual machine attached to the storage.
If your Dell array does develop any puncture faults, then in the majority of circumstances, the backups will still be usable. This is because backups are a copy of the source data, and it is unlikely that bad blocks will follow the backup.
Bad blocks don’t always mean your backups are also bad. If you haven’t experienced any performance problems or damaged files, then your backups should still be complete enough to finish a restore. To test, take your most recent backup and examine your most important data. If it’s still intact, you likely have a good backup.
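One way to make that spot check more systematic is to restore the backup to separate hardware and compare it, file by file, against a reference copy you trust. The sketch below does this with SHA-256 checksums from the Python standard library. The function names and the idea of a reference directory are assumptions for illustration; they do not describe any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(reference_root, restored_root):
    """List files that are missing from, or differ in, the restored copy."""
    reference_root, restored_root = Path(reference_root), Path(restored_root)
    problems = []
    for reference in sorted(reference_root.rglob("*")):
        if not reference.is_file():
            continue
        relative = reference.relative_to(reference_root)
        restored = restored_root / relative
        if not restored.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif sha256_of(reference) != sha256_of(restored):
            problems.append((relative.as_posix(), "mismatch"))
    return problems
```

An empty list from `compare_trees` means every reference file was found in the restore with identical contents. It does not prove the backup is current, so the backup date still needs checking separately.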
Ensure that your controller card is configured to alert upon errors. This can either be a regular daily check to view the latest controller log warning messages (manually), or you can create automated alerts to be emailed to a monitored inbox or a helpdesk ticketing application to be expedited. Follow this up with regular visual inspections of the equipment; amber lights are dead easy to spot in a data center, and some storage controllers will emit an audible error for media errors.
Another remedial action that can improve the health of your storage solution is to update the drivers and firmware on controllers, hard drives, backplanes, and other devices such as canisters. Unfortunately, if firmware updates are being performed on a local server, these updates may require downtime, but if you are using a Storage Area Network (SAN), these updates can be done with the system still live; first complete path A, followed by path B.
If you suspect that your system is suffering from a puncture, try performing routine consistency checks on the storage device. A consistency check verifies the correctness of data in a redundant array. In a system that uses parity, checking consistency involves computing the data on one physical disk and comparing the results to the contents of the parity physical disk. The consistency check can be stopped and started on demand.
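For a parity array, the comparison a consistency check performs can be sketched as follows, assuming single parity computed as the byte-wise XOR of the data blocks in each stripe. The function names are hypothetical and the stripes are tiny in-memory examples; a real check runs inside the controller against the physical disks.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def find_inconsistent_stripes(stripes):
    """Return the indexes of stripes whose stored parity does not match.

    Each stripe is a (data_blocks, stored_parity) pair.
    """
    return [
        index
        for index, (data_blocks, stored_parity) in enumerate(stripes)
        if xor_blocks(data_blocks) != stored_parity
    ]


stripes = [
    ([b"\x01\x02", b"\x10\x20"], b"\x11\x22"),  # parity matches the data
    ([b"\x01\x02", b"\x10\x20"], b"\x11\x23"),  # one parity byte is wrong
]
print(find_inconsistent_stripes(stripes))
```

Note that a mismatch only shows that data and parity disagree; it does not say which of the two is correct.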
Review the logs collected by the PERC controller. You can install a software tool such as PercCLI or MegaCLI to manage the controllers, drives, and so on; these tools also give you granular control over the logs.
Determine if your system has a hardware error, usually indicated on the server’s light path diagnostics panel. This is a panel that indicates hardware faults using LEDs against a corresponding hardware device. If you suspect a hardware fault, further diagnostic tools can be used to validate a hardware issue – download links can be provided by your hardware vendor.
Can I Fix a RAID Puncture Myself?
Yes, it is possible to fix the punctures yourself, but serious consideration needs to be taken over the data. Fixing a puncture will destroy the data, so unless you have a backup, your data will be gone forever. If you need this data, you need to speak to a Data Recovery Specialist like Gillware. Please understand that continuing to work on an array puncture or drives with a known data error can cause you more problems at the drive block level.
You can always reach out to Dell technical support; you will find a support tag attached to your server or storage device. Simply call the Dell support team for further advice. However, as you are reading about recovering data from a punctured array, it’s quite likely that your technical support contract with Dell has lapsed. This is not an uncommon scenario, as enterprise support is very expensive, and the cost usually increases as the storage gets older.
This is where Gillware can help. Let our team of Data Recovery Experts recover your data, and once you have the data secured, you can proceed with destroying the array. Dell provides a useful support document here that explains the next steps.
Warning: Following these steps will result in the loss of all data on the array. Please ensure you are prepared to restore from backup or using the data recovered by Gillware. Use caution so that following these steps does not impact any other arrays.
1. As this procedure will cause permanent loss of all data on this physical hardware, you must first verify your backups are 100% valid, current, and complete on different storage equipment. Do not wipe this array and hope the backup was configured properly and is current. Don’t fall victim to the seeming convenience of re-purposing this hardware because it’s currently available and it would be a handy and speedy thing to do. If you follow these steps and later realize the backup is 3 months old or the backup wasn’t configured to capture the accounting team’s share, it will be too late to turn around. Test the backups on different hardware and be 100% sure there is no need for any of the data on this array.
2. Any drives that were showing as failed in the server logs or showing amber lights on the RAID controller should be replaced. If the remaining drives are more than 5 years old, their failure may also be imminent, and it’s recommended to set up all-new modern drives. These failed drives should not be repurposed to hold any data, and you must wait to RMA them until you confirm you have fully restored from a backup successfully.
Once you have confirmed you understand and have completed steps #1 and #2, you may proceed with the following steps:
- Discard Preserved Cache (if it exists)
- Clear foreign configurations (if any)
- Delete the array
- Recreate the array as desired
- Perform a Full Initialization of the array (not a Fast Initialization)
- Perform a Check Consistency on the array
Source: Dell KB000139251
The Next Steps…
We hope this page has been informative and helped you diagnose any RAID puncture issue you may be facing. Our experts are standing by to get your data back. Our RAID recovery service team has a very high success rate in what is one of the most complex fixes we undertake at Gillware.
Using our state-of-the-art engineering techniques, Gillware has been able to achieve significant cost savings that we gladly pass along to our customers. The best example of Gillware’s cost-controlling methods is the use of flexible ISO-5 certified horizontal flow class-100 workstations instead of traditional cleanrooms. Hard-walled, class-100 cleanrooms can cost millions to construct and are extremely expensive to maintain. Gillware’s clean flow benches are a fraction of the cost of a traditional cleanroom and allow growth and scalability.
Since we perform all of our data recoveries in-house, Gillware has chosen to invest in producing our own recovery tools that address the many different RAID array failures we encounter. Each year, several new RAID models are released, each with its own new type of hard drive firmware, controller firmware, canister firmware, and all with different mechanical and electrical designs.
All these developments can result in new RAID failures that require different methods to recover data. Gillware can react quickly to new failures by working with our team of electrical, mechanical, and software engineers to develop our own proprietary recovery tools.
Want your RAID data back? Get in touch with Gillware Data Recovery Services today!