VMDK Repair: Soft vs Hard Corruption and Why vmkfstools Can Make It Worse



Most of the searches that bring people to this topic start with a VM that won’t power on, a screen full of error messages, and a sinking feeling: “File system specific implementation of LookupAndOpen[...] failed.” “DISKLIB-LINK : "myvm.vmdk" : failed to open (The system cannot find the file specified).” “The parent virtual disk has been modified since the child was created.” “Could not open the disk.” The natural next move, and the one almost every blog post on the internet will tell you to make, is to reach for vmkfstools or VOMA and try to repair the file.

Sometimes that works. Often it doesn’t. And in the specific scenario that brings the most VMDK corruption cases to a professional recovery lab, running a repair tool will take a situation where the data is fully intact and write over it permanently. This guide is for IT professionals and virtualization admins who are looking at a corrupt VMDK and want to understand what’s actually broken before deciding how to act. We’ll walk through how VMDK files are actually structured, what the most common error messages really mean, why hardware-level problems are mistaken for file-level problems more often than people realize, and the specific scenarios where a repair attempt converts a recoverable case into a destroyed one.

What a VMDK file actually is (and why “the VMDK file” is usually not one file)

A VMDK isn’t a single file. For a typical ESXi virtual machine with no snapshots, a single virtual disk consists of two files on the datastore:

vmname.vmdk — the descriptor file. A small (typically under 1 KB) plain-text file that describes the layout of the virtual disk. It contains a content ID (CID), a parent CID if the disk is part of a snapshot chain, the disk geometry, the SCSI adapter type, the extent map pointing at the data file, and metadata fields collectively known as the DDB (disk data base).
vmname-flat.vmdk — the data file. The actual content of the virtual disk, stored as raw blocks. For a 200 GB virtual disk, this is a 200 GB file (thick-provisioned) or a sparse file that grows on demand (thin-provisioned).
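To make the descriptor layout concrete, here is a minimal Python sketch that splits a descriptor's text into its header fields, extent lines, and DDB entries. The sample text is illustrative rather than copied from a real VM, and parse_descriptor and its regular expressions are hypothetical names invented for this example, not part of any VMware tooling. It only parses text and never touches a datastore.

```python
import re

# Illustrative descriptor text; every value here is made up for the example.
SAMPLE_DESCRIPTOR = '''\
# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"

# Extent description
RW 419430400 VMFS "vmname-flat.vmdk"

# The Disk Data Base
#DDB
ddb.adapterType = "lsilogic"
ddb.geometry.cylinders = "26108"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
'''

# An extent line: access mode, size in 512-byte sectors, extent type, file name.
EXTENT_RE = re.compile(r'^(RW|RDONLY|NOACCESS)\s+(\d+)\s+(\S+)\s+"([^"]+)"')
# A key=value line, with or without quotes around the value.
KEYVAL_RE = re.compile(r'^([A-Za-z0-9_.]+)\s*=\s*"?([^"]*)"?\s*$')


def parse_descriptor(text):
    """Split descriptor text into header fields, extents, and DDB entries."""
    header, ddb, extents = {}, {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = EXTENT_RE.match(line)
        if match:
            access, sectors, kind, filename = match.groups()
            extents.append({"access": access, "sectors": int(sectors),
                            "type": kind, "file": filename})
            continue
        match = KEYVAL_RE.match(line)
        if match:
            key, value = match.groups()
            target = ddb if key.startswith("ddb.") else header
            target[key] = value
    return {"header": header, "extents": extents, "ddb": ddb}
```

Calling parse_descriptor(SAMPLE_DESCRIPTOR) returns the CID and parentCID under "header" and the flat file's name and sector count under "extents". Comparing that sector count (times 512) against the actual size of the flat file is a cheap, read-only sanity check.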

Once snapshots get involved, the picture gets considerably more complex. Each snapshot adds at least two more files:

vmname-000001.vmdk — the snapshot’s descriptor file, with its own CID and a parentCID that points back to the parent disk.
vmname-000001-delta.vmdk (older format) or vmname-000001-sesparse.vmdk (newer SEsparse format on VMFS-5/6) — the snapshot data file, which holds the blocks that have changed since the snapshot was taken.

A VM with three snapshots will have a chain like: base disk → snapshot 1 → snapshot 2 → snapshot 3. Each child references its parent through the parentCID field in the descriptor. The running VM reads from the entire chain in order, with newer blocks shadowing older ones. Surrounding all of this you’ll also find the .vmx configuration file, the .vmsd snapshot manifest, .vmem memory state, and various log files.
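The parent/child linkage can be sketched as a simple walk down the chain. This is a toy model, assuming the CID values have already been read out of each descriptor; check_chain and the CID values are hypothetical, and ffffffff is used as the "no parent" marker a base disk conventionally carries.

```python
NO_PARENT = "ffffffff"  # conventional parentCID for a base disk with no parent


def check_chain(chain):
    """Return a list of CID mismatches in a chain ordered base disk first.

    Each entry is a dict with 'name', 'cid', and 'parent_cid' keys, for example
    values read from the descriptor files by hand or with a parser.
    """
    problems = []
    expected_parent = NO_PARENT
    for disk in chain:
        if disk["parent_cid"].lower() != expected_parent.lower():
            problems.append(
                f'{disk["name"]}: parentCID {disk["parent_cid"]} '
                f"does not match expected {expected_parent}"
            )
        # The next disk in the chain must point at this disk's CID.
        expected_parent = disk["cid"]
    return problems


# Hypothetical values for illustration only.
healthy = [
    {"name": "vmname.vmdk", "cid": "a1b2c3d4", "parent_cid": "ffffffff"},
    {"name": "vmname-000001.vmdk", "cid": "0e0e0e0e", "parent_cid": "a1b2c3d4"},
    {"name": "vmname-000002.vmdk", "cid": "12345678", "parent_cid": "0e0e0e0e"},
]
# Same chain, but the base disk is presenting a different (older) CID.
rolled_back = [dict(healthy[0], cid="99999999")] + healthy[1:]
```

check_chain(healthy) returns an empty list, while check_chain(rolled_back) flags vmname-000001.vmdk. Note that the check only says the link is broken; it says nothing about whether the descriptor or the storage beneath it is at fault.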

What gets called “VMDK corruption” can therefore mean a dozen different things: a damaged descriptor file, a missing parent in the snapshot chain, a CID that doesn’t match what its child expects, a flat file with internally inconsistent extents, a sesparse file with a broken header, or — most commonly — a VMDK that looks damaged but is actually fine, sitting on top of a storage layer that’s lying to it.

That last category is where most cases that end up at a recovery lab actually live, and it’s the one most repair guides on the internet quietly skip over.

The error messages people actually see (and what they really mean)

A short tour of the most common error strings, in roughly the order they show up in support forum threads:

“File system specific implementation of LookupAndOpen[…] failed” — A common ESXi 6.5+ error when a VM fails to power on. The host couldn’t open one of the VMDK files in the chain. The cause is almost always at the descriptor or chain level rather than the flat data.

“DISKLIB-LINK : "myvm.vmdk" : failed to open (The system cannot find the file specified)” and the corresponding DISKLIB-CHAIN line in vmware.log. Together these tell you exactly which file in the chain ESXi tried to open and couldn’t find. On a healthy datastore, that means the file is genuinely missing. On a sick datastore, the file is there but the storage layer isn’t returning it consistently.

“The parent virtual disk has been modified since the child was created” — A CID-mismatch error. The parentCID stored in a snapshot’s descriptor doesn’t match the actual CID of the parent disk. This happens when someone manually edits a descriptor, when a snapshot consolidation is interrupted, or — relevant for us in a moment — when the storage layer is delivering blocks from a stale moment in time, so the parent disk’s CID has effectively rolled backward.

“Could not open the disk” / “Failed to start the virtual machine” — Generic high-level errors. The detail is in vmware.log.

Purple Screen of Death (PSOD) on the ESXi host itself. This is a host crash, and the real story is in the host’s vmkernel logs (which survive the reboot only if logging goes to persistent storage) and in the core dump (if a dump partition or dump file is configured). PSOD events that involve VMFS or storage stack panics often correlate with underlying hardware-level RAID issues that surface as VMDK corruption on subsequent boots.

VERR_NOT_SUPPORTED and VBOX_E_IPRT_ERROR — VirtualBox variants of the same family of problems. Almost always a damaged descriptor embedded inside a monolithicSparse VMDK header.

“Could not get the storage format of the medium” — VirtualBox/Workstation. Same family. Header damage at the start of the file.

Reading the error message is genuinely useful because it tells you which file in the chain is the problem. But it tells you almost nothing about why that file is in the state it’s in — and that’s where most repair attempts go wrong.

Soft corruption vs hard corruption

We use a distinction internally at Gillware that we’ve found useful: the difference between “soft” and “hard” VMDK corruption.

Hard corruption means the file is legitimately damaged. The bytes inside the VMDK have been altered, deleted, or overwritten. This happens when:

A ransomware actor encrypts the file (or parts of it) in place.
A storage system truncates the file because it ran out of space mid-write.
Internal metadata gets physically corrupted by, say, bad sectors on the underlying disk that happen to fall on descriptor blocks.
A snapshot consolidation crashes partway through, leaving a file with internally inconsistent extents.

In hard corruption, the bits on disk are genuinely wrong. Recovery depends on what was damaged, what was preserved, and whether good backups exist for the lost portions.

Soft corruption is different. In soft corruption, the file looks damaged — checksums fail, descriptors don’t validate, ESXi throws CID mismatch errors, VOMA reports inconsistencies — but the bytes that should be there are still on the underlying physical media. The problem is that the layer beneath VMFS is delivering the wrong bytes to ESXi. The VMDK isn’t corrupt; the storage system is misconfigured, and ESXi is being shown an inconsistent view of the data.

The reason this distinction matters so much: running vmkfstools -x repair, VOMA in fix mode (voma -m vmfs -f fix), or any other write-back repair tool on a soft-corruption case will convert it into a hard-corruption case in seconds. The repair tool is doing exactly what it was designed to do: it sees inconsistent metadata, and it writes consistent metadata in its place. But the input it’s working from is wrong, so what it writes is also wrong. And once those writes commit to the underlying storage, the original data they overwrote is gone.

The single most common cause of soft corruption — by a wide margin in the cases we see — is a RAID configuration problem on the storage hosting the datastore.

The stale drive scenario: how a six-month-old failure destroys a VM today

Figure: a diagnostic top-down view. The VMDK error at the top is a symptom; the actual cause is a stale drive at the bottom of the storage stack.

Here’s the canonical case. Walk through it carefully because it’s the one that traps the most people.

A server has a five-drive RAID 5 array on a Dell PERC (or any LSI/MegaRAID lineage controller). Six months ago, one drive failed. For whatever reason — no spare on hand, no monitoring alert raised, “we’ll get to it later” — that drive was never replaced. The array continued running in degraded mode. The failed drive stayed physically in the chassis. ESXi continued running. The VMs kept working. Nothing was obviously wrong.

Then, last Tuesday, there was a power event. The whole server rebooted. When it came back up, the PERC controller saw all five drives again — including the one that “failed” six months ago, which is actually still alive enough to identify itself to the controller. The problem is that this drive has been offline for six months. Its on-disk DDF (Disk Data Format) metadata — the headers that tell the controller what stripe of what virtual disk it belongs to — reflects the state of the array as of six months ago. The other four drives’ DDF metadata reflects last Tuesday. The controller sees the mismatch and presents the stale drive’s configuration as foreign, with the BIOS or iDRAC prompting the operator: “Foreign configuration found on adapter. Press ‘F’ to import or ‘C’ to clear.”

This is where the disaster usually happens. The on-call engineer, looking at a server that won’t bring up the datastore and seeing the foreign-config prompt, doesn’t know which drive is the stale one. The controller doesn’t make it obvious. There’s a timestamp in the controller logs if you know where to look, but in the heat of a 3 a.m. incident, most people don’t. The instinct is to import the foreign configuration to get the array back. So they hit F.

What just happened: the controller wrote the stale drive’s old configuration into NVRAM, and is now assembling the array from a mix of four drives in their current state and one drive frozen six months in the past. Every read that lands on the stale drive returns six-month-old data. Parity calculations across stripes now use a mix of fresh and stale blocks. The VMFS metadata layer reads bytes that don’t make sense. ESXi sees CID mismatches, descriptor errors, broken snapshot chains. The VMDKs look corrupt.

The VMDKs are not corrupt. The data is sitting intact on the four good drives, and is also sitting on the stale drive in the form it existed six months ago. The array just needs to be reassembled with the right drives in the right state. But to ESXi, every diagnostic looks like file-level corruption. Dell’s own knowledge base acknowledges this danger: “Importing a single-disk foreign configuration into an active array may cause data corruption.” That warning exists for exactly this scenario.

The catastrophic next step: the engineer, looking at corrupt-looking VMDKs, runs vmkfstools -x repair against them. This commits writes. The writes are based on the wrong view of the data — the view the stale drive is poisoning. Now the corrupted-looking metadata becomes corrupted in fact, written across stripes that may include the very blocks where the good data lived. The case has just been converted from soft to hard. Even if the original RAID configuration is later figured out, the data that was overwritten during the “repair” is gone.
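The arithmetic behind that conversion can be shown on a single toy stripe. The sketch below assumes plain XOR parity over four data blocks, with the stale drive holding one of the data blocks; real controllers rotate parity and work on much larger stripes, and all names and byte values here are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe as it stood just before the reboot: four data blocks plus parity.
current_data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(current_data)

STALE_MEMBER = 1        # position of the drive that dropped out months ago
stale_block = b"bbbb"   # what that drive still holds from back then

# The surviving drives alone can still regenerate the missing member's block.
survivors = [b for i, b in enumerate(current_data) if i != STALE_MEMBER]
recovered = xor_blocks(survivors + [parity])

# After the foreign import, reads of that member return the stale block,
# and the stripe no longer agrees with its own parity.
presented = list(current_data)
presented[STALE_MEMBER] = stale_block
parity_consistent = xor_blocks(presented) == parity

# A write-back repair that trusts the presented view recomputes parity from it.
parity_after_repair = xor_blocks(presented)
recovered_after_repair = xor_blocks(survivors + [parity_after_repair])
```

Before the repair, recovered equals the current block for the missing member, which is why the case is still soft. After parity is rewritten from the poisoned view, the same reconstruction yields the stale block instead, and nothing left on the array encodes the current one.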

Punctured arrays: when a rebuild fails midway through

The punctured array is a close relative of the stale-drive scenario, and one that Dell has a specific term for. A RAID 5 or RAID 6 array has a failed drive. The operator inserts a replacement, and the rebuild starts. The rebuild reads every sector of every surviving drive, recalculates parity, and writes to the new drive. This takes hours on a multi-terabyte array.

Eight percent of the way in, something happens — a second drive throws errors mid-rebuild, a power event hits, the controller resets, anything. The rebuild aborts. Now the array contains two distinct epochs:

The first 8% has been rebuilt to a new state (the new drive contains valid reconstructed data for the early portion of the array).
The remaining 92% is still in its pre-rebuild degraded state.

The PERC controller’s lifecycle log may mark this state as “punctured.” A punctured array is internally inconsistent in a way that can’t be undone by the controller — there’s no clean state to roll back to. Some controllers will try to come back online and present the array as functional, which is when VMDKs sitting on top of that array start failing in unpredictable ways. Reads from blocks in the rebuilt 8% return one thing; reads from blocks in the unrebuilt 92% return something else; and the parity between the two regions doesn’t agree.
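A rough way to picture the two epochs is to ask which region a given byte range falls in. The sketch below assumes the rebuild progressed linearly from the start of the array, which is a simplification; the function names are hypothetical and the percentage would have to come from the controller's own logs.

```python
def puncture_point(array_size_bytes, rebuilt_percent):
    """Byte offset where the rebuilt region ends, assuming a linear rebuild."""
    return array_size_bytes * rebuilt_percent // 100


def epochs_touched(start, length, array_size_bytes, rebuilt_percent):
    """Return which epochs the byte range [start, start + length) overlaps."""
    boundary = puncture_point(array_size_bytes, rebuilt_percent)
    touched = set()
    if start < boundary:
        touched.add("rebuilt")
    if start + length > boundary:
        touched.add("pre-rebuild")
    return touched
```

A file whose extents return both "rebuilt" and "pre-rebuild" straddles the boundary, which is one way a single VMDK can be readable in some places and nonsensical in others.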

In a properly handled recovery of a punctured array, an engineer has to identify the puncture point, treat the two regions as separate logical views, and reconstruct the original pre-rebuild state by sampling both regions and merging them. None of that is something vmkfstools can do. None of it is something ESXi knows is happening. From ESXi’s perspective, the array is fine but the VMDKs are full of errors.

Other ways VMDKs go wrong

Not every corrupt VMDK is a misdiagnosed RAID problem. The cases where the file really is damaged generally fall into a few categories:

Snapshot chain breakage. Someone manually deleted a delta file, or a consolidation crashed, leaving the chain incomplete. Errors typically look like CID mismatch or “parent virtual disk has been modified.” If the parent flat file is still intact and the chain is just missing intermediate descriptors, the descriptors can usually be reconstructed.

Descriptor file damage. The small text descriptor got truncated or corrupted independently of the flat file. VMware’s own knowledge base walks through how to rebuild a missing descriptor for a base disk, and the process works when the flat file itself is intact. The risk is that the SCSI adapter type and a few other DDB fields have to be guessed correctly, or the VM won’t boot even after the descriptor is rebuilt.

Ransomware encryption. A real, hard corruption case. The attacker has encrypted the VMDK in place, often targeting the headers and the first several gigabytes of the flat file to make recovery look impossible without paying. Depending on the ransomware family, partial recovery is sometimes possible from unencrypted regions of the flat file or from VMFS-level snapshots the attacker may have missed.

VMFS-level damage. VMFS itself can develop corruption — usually around heartbeat regions or volume metadata, sometimes from controller bugs, sometimes from power events during write-back cache flushes. VOMA can diagnose VMFS issues, but its repair option carries the same risks as vmkfstools: if you’re running it on top of a misconfigured array, the repair commits writes based on a wrong view of the data.

Truncation. The datastore filled up mid-write, or someone manually deleted a portion of a flat file thinking it was safe to do so. Some of the data is gone, period; what remains can usually be salvaged.

When vmkfstools and VOMA are safe to use

The condition under which vmkfstools -x repair and VOMA’s fix mode (-f fix) are safe is the condition most blog posts assume without saying: the underlying storage is delivering the same bytes today that it was delivering when the VMDK was written. If there’s any reason to believe your storage isn’t doing that (the RAID array beneath the datastore is in a degraded, foreign-config, post-rebuild, or punctured state), repair tools can do irreversible damage. The repair commits writes; the writes are based on what ESXi can read; what ESXi can read depends on the storage layer being trustworthy.

A reasonable safety checklist before running any write-back repair:

Confirm the array state at the controller level. Check Storage Manager > Virtual Disks in iDRAC / OpenManage. Confirm no foreign configurations, no degraded virtual disks, no failed members, no recent rebuild events in the lifecycle log.
Check the vmkernel logs for storage-layer errors. I/O errors, path failures, abort tasks, or SCSI sense data preceding the VMDK errors all point to hardware rather than file-level problems.
Image first, repair second. If the data matters, image the underlying storage (or at least the VMFS extent containing the affected VMs) before running anything that writes. On a Dell host with an iDRAC, this might mean copying the raw device that backs the VMFS datastore to external storage over SSH, using dd or a similar block-level tool.
Don’t accept foreign configuration imports without confirming epoch. If iDRAC or PERC BIOS is prompting you to import a foreign config and you don’t know exactly which drive went stale when, stop. The wrong choice here is the most common path to permanent data loss in the entire ESXi-on-Dell stack.
Don’t try to bring up the VM “just to see” if a repair worked. Powering on a VM on a misassembled array triggers writes — the guest OS will start journaling, the host will update VMFS heartbeats, snapshot consolidations may auto-run. Every one of those writes is a chance to overwrite recoverable data.
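The log check in the list above can be roughed out as a keyword scan over a copy of the vmkernel log. This is a sketch, not a diagnostic tool: the patterns are assumptions based on the indicators named in the checklist, the sample lines are invented for the example, and real logs will need the patterns tuned.

```python
import re

# Assumed indicator patterns; tune these against real logs from your own hosts.
STORAGE_PATTERNS = {
    "scsi_sense": re.compile(r"Valid sense data", re.I),
    "io_error": re.compile(r"I/O error", re.I),
    "abort": re.compile(r"\babort", re.I),
    "path_failure": re.compile(r"path .* (dead|down|failed)", re.I),
}

# Hypothetical log lines, written for this example only.
SAMPLE_LINES = [
    "cpu3: ScsiDeviceIO: Cmd 0x28 to dev naa.example failed "
    "H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0",
    "cpu1: example: heartbeat ok",
    "cpu2: example: path vmhba0:C0:T0:L0 is dead",
]


def scan_log(lines):
    """Return (line_number, label, text) for lines matching a storage pattern."""
    hits = []
    for number, line in enumerate(lines, 1):
        for label, pattern in STORAGE_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
                break  # one label per line is enough for triage
    return hits
```

What matters is ordering: storage-layer hits that precede the first VMDK error in time point toward hardware, so the scan is best run on a log copied off the host and read alongside the timestamps.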

When professional recovery is the right call

The case for stopping and getting professional help on a VMDK corruption case isn’t symmetrical with the case for trying repair tools. If the data is replaceable from backups, vmkfstools is a reasonable thing to try — worst case, you go to the backups. If the data isn’t replaceable, the calculus changes completely. The wrong repair attempt can convert a recoverable case into an unrecoverable one in seconds, and the kinds of recoverable cases this happens to (soft corruption sitting on misconfigured RAID) are exactly the cases where the file looks the most damaged.

Professional virtual machine recovery on cases like these doesn’t start with the VMDK at all. It starts at the lowest layer — the individual physical drives — and works upward. Each drive is imaged independently with hardware write-blockers in place. The original drives are never written to. From the drive images, the RAID geometry is reconstructed offline, often by sampling stripes and identifying which drive’s metadata corresponds to which epoch. Once the array is virtually assembled in the correct state, the VMFS layer is parsed, the VMDK files are extracted, and the snapshot chain is reconstituted. Only at that point does anyone look at the VMDK files themselves — and by that point, the question is usually whether they need any repair at all, not whether to risk repairing the live array.
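The "image independently, never write to the original" step can be approximated in software as a chunked, read-only copy that records a hash of what was read. This is a minimal sketch under generous assumptions: image_readonly is a hypothetical helper, it has no handling for unreadable sectors, and opening a source read-only in software is not a substitute for a hardware write-blocker.

```python
import hashlib


def image_readonly(source_path, image_path, chunk_size=1024 * 1024):
    """Copy source to a new image file and return (bytes_copied, sha256_hex).

    The source is opened read-only. The image is opened with mode "xb", so the
    call fails rather than overwriting an existing image.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "xb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of source
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()
```

Recording the hash at imaging time gives every later step something to verify against, so all reconstruction work can happen on copies whose integrity is checkable.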

Frequently asked questions

Should I run vmkfstools -x repair if my VM won’t power on?

Only if you either have a verified recent backup or have confirmed that the underlying storage is in a clean, non-degraded, non-foreign-config state. If the data is critical and there’s any doubt about the storage layer’s health, image first. The repair commits writes, and if it’s working from a wrong view of the data, those writes can be catastrophic.

What does “The parent virtual disk has been modified since the child was created” actually mean?

ESXi compares the parentCID field in a snapshot’s descriptor against the actual CID of the parent disk. When they don’t match, you get this error. It can mean someone manually edited the parent (rare), it can mean a snapshot consolidation was interrupted (common), or it can mean the storage layer is presenting a stale version of the parent so its CID has effectively rolled backward (more common than people realize, and the dangerous one to “repair”).

Can a corrupt VMDK be recovered without backups?

In many cases, yes — but it depends entirely on which type of corruption you’re dealing with. Soft corruption from RAID misconfiguration is usually fully recoverable if no write-back repair has been attempted. Hard corruption from ransomware, truncation, or physical media damage depends on what was damaged and whether unaffected regions hold enough of the data to be useful. The single most important factor in recoverability isn’t the original failure mode — it’s whether anything has been written to the array since the failure.

Why does ESXi sometimes show VMDK errors when the real problem is hardware?

VMFS is a thin layer over block storage, and a VMDK is a structured file inside that layer. Both layers depend on the storage beneath them returning consistent, repeatable reads. When that assumption breaks — a stale drive in a RAID array, a punctured rebuild, a flaky HBA, a failing SAS expander — the reads ESXi gets back become inconsistent. The errors surface at the file layer because that’s the layer doing the consistency checking, but the actual fault is much lower down.

What’s the difference between vmkfstools and VOMA?

vmkfstools -x repair operates on a specific VMDK file. It checks and tries to fix the descriptor and the file’s internal consistency. VOMA (vSphere On-disk Metadata Analyzer) operates at the VMFS volume level, checking and optionally repairing the file system that hosts the VMDKs. Both can write changes if invoked with their repair flags, and both carry the same risk profile when run on a misassembled storage stack.

Can I just copy the -flat.vmdk to a new datastore and recreate the descriptor?

For a base disk with no snapshots, this can work if the flat file is intact. VMware’s own documentation walks through the descriptor recreation steps. The trap is when there are snapshots in the chain — the new flat file will have a new CID, and the existing delta files’ parentCID values will no longer match it, producing exactly the “parent virtual disk has been modified” error. Snapshot chains generally need to be handled as units, not as individual files.

How do I know if my array is in a state that’s safe to run repair tools on?

Check the controller. On Dell PERC, look at iDRAC’s Storage section: every physical disk should be in “Online” state, every virtual disk should be “Online” (not “Degraded” or “Foreign” or “Offline”), and there should be no recent rebuild events in the lifecycle log that didn’t complete cleanly. If anything looks off at the controller level, treat the VMDK errors as a symptom of the controller-level problem, not a separate file-level issue to be repaired.
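That check can be written down as a small gate. The sketch below assumes the state strings have been transcribed by hand from the controller's management interface; storage_blockers is a hypothetical helper, and it does not query iDRAC or any other API.

```python
def storage_blockers(physical_disks, virtual_disks, unclean_rebuilds=0):
    """Return reasons not to run a write-back repair; empty means none found.

    physical_disks and virtual_disks map a name to the state string the
    controller reports. unclean_rebuilds counts rebuild events in the
    lifecycle log that did not complete cleanly.
    """
    reasons = []
    for name, state in physical_disks.items():
        if state != "Online":
            reasons.append(f"physical disk {name} is {state}")
    for name, state in virtual_disks.items():
        if state != "Online":
            reasons.append(f"virtual disk {name} is {state}")
    if unclean_rebuilds:
        reasons.append(
            f"{unclean_rebuilds} rebuild event(s) did not complete cleanly"
        )
    return reasons
```

An empty result means only that nothing in the transcribed states argues against a repair. It does not prove the array is healthy, so the image-first rule still applies when the data is not replaceable.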

The bottom line

The instinct when a VMDK won’t open is to repair the file. Sometimes that instinct is right. But the cases that bring the most VMDK corruption traffic to a professional recovery lab are exactly the cases where running a repair tool will destroy the data — because the file isn’t really the problem. The storage stack underneath the file is, and the repair tool will dutifully write a consistent VMDK on top of an inconsistent array, overwriting the very data that could otherwise have been recovered cleanly.

Before reaching for vmkfstools, VOMA, or any other write-back repair tool: confirm the storage layer is healthy. If it isn’t, stop. If the data matters and the storage layer’s health is in any doubt, image first.

If you’re looking at a VMDK that won’t open and the data on it is important enough that you can’t afford to make the situation worse, Gillware offers a free consultation and scoping call for virtual machine recovery. Our experts are a phone call away at 877-624-7206.
