Microsoft Volume Shadow Copy causing index file to consume entire drive: Cause and Workaround.

Posted by inucune@reddit | sysadmin | 49 comments

After a full year of investigation with Micro$oft and another impacted vendor, Micro$oft has informed us that they will not be fixing the bug below, and will also not release any official documentation. As such, I will provide what technical information I can here to save some poor soul a year of pain.

I will only be referring to the vendor as such. They will be spared a direct name-and-shame (this time) given that they were also not aware of this issue when they made the decisions they did, and have been provided a technical breakdown of this impact as well.

This issue has been observed in our environment on Windows Server 2008 through Windows Server 2019.

The Setup:

Our Antivirus software began leveraging Volume Shadow Copy (VSS) to take a snapshot of all drives (usually 2) on all servers every 4 hours. The vendor's intent with these snapshots was to provide a rollback feature in the event of a cryptolocker event. I have not been provided any disaster recovery literature utilizing this feature for our environment, but that does not mean it doesn't exist outside my scope.

The Problem:

My team responds to automated alerts for disk space exhaustion. These can also result in an on-call being notified as a drive filling can result in a larger cascade failure across our environment. We noticed an uptick in calls, and after investigating one of the impacted machines, we noticed a discrepancy: while the drive was reported by Windows as full, Spacemonger and wintree showed the space as available. A quick file copy test showed that the space was indeed unavailable to write into.
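The discrepancy described above can be roughly quantified by comparing what the volume reports as used against what a directory walk can actually see. This is a minimal sketch, assuming Python is available on the affected box; the function names are illustrative only. A walk that cannot open System Volume Information will undercount, and that undercount is the gap being measured. Hard links and sparse files can also skew the total.

```python
import os
import shutil


def visible_bytes(root):
    """Sum the sizes of every file a directory walk from root can reach."""
    total = 0
    # onerror swallows folders we cannot open (e.g. System Volume Information)
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or access was denied; skip it
    return total


def unaccounted_bytes(root):
    """Bytes the volume reports as used that the walk could not attribute to files."""
    used = shutil.disk_usage(root).used
    return max(0, used - visible_bytes(root))
```

Calling `unaccounted_bytes("F:\\")` from a normal admin session and getting a figure close to the size of the drive would match the symptom in this post.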

The first machine was recovered with a reboot. An investigation ticket was raised after the second machine was found with this behavior and placed in my queue, and I tapped a coworker to tag along for the ticket as a second set of eyes and because they were also interested in it.

The Investigation:

My teammate was investigating an impacted machine with me, and found that running chkdsk [drive letter] /v and waiting 10 minutes caused all the space to return. This confused both of us as this command shouldn't change anything, only display information. This quickly became our triage path moving forward: run the check disk command, wait 10 minutes, reboot if it didn't recover.
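The "wait 10 minutes, reboot if it didn't recover" step can be sketched as a polling loop. This is only an illustration of the triage path, not something we ran in production: the 30-second interval and the `needed` threshold are assumptions, and `chkdsk [drive letter] /v` is still started by hand beforehand.

```python
import shutil
import time


def wait_for_recovery(free_bytes, needed, timeout_s=600, interval_s=30, sleep=time.sleep):
    """Poll free space until it reaches `needed` bytes or the timeout passes.

    Returns True if the space came back, False if a reboot is the next step.
    """
    waited = 0
    while True:
        if free_bytes() >= needed:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


def free_on(path):
    """Build a probe that reports the free bytes on the volume holding path."""
    return lambda: shutil.disk_usage(path).free
```

For example, `wait_for_recovery(free_on("F:\\"), needed=10 * 1024**3)` returns False after roughly ten minutes if less than 10 GiB ever becomes free.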

Running Spacemonger as SYSTEM displayed accurate System Volume Information file sizes and drive state, allowing us to quickly identify the footprint moving forward.

One of our impacted machines did next to nothing, acting as a relay for some web traffic. It had ~1GB of actual data on a 60GB F: drive, and would fill every 3 weeks. This box quickly became our main investigation machine. Because it was a virtual machine, we were able to take snapshots and even full dumps to convert into Windows debug files.

I traced the activity of this box down to a hidden system file in the System Volume Information folder, but it was only identified by a GUID. I would later identify this as a system index file. Further investigation with WinDbg showed these as being Volume Shadow Copy files. The only 'service' on our investigation machine that used Volume Shadow Copy was our antivirus, which took snapshots every 4 hours. It wasn't long before I had the vendor engaged.

This same week, this failure occurred on a database server. Rather than running the check disk, the tech attempted to extend the drive. This resulted in a corrupted drive that had to be restored from backup, and suddenly there was great interest in our investigation. This quickly resulted in both Vendor and Micro$oft being on investigation calls. There was much arguing and passing the blame: Microsoft claimed Vendor was not using Volume Shadow Copy properly and that was resulting in the failure. Vendor pushed back that there was no literature or behavior to indicate they were causing this issue. Eventually I managed to get both entities to recreate the failure in their respective labs.

The Failure Chain:

As snapshots are created and removed, VSS tracks the changes in an ‘index’ file.
This index file is a hidden system file located in the System Volume Information folder, and does not have a proper file name, only a GUID (system identifier). This file is usually ~3KB under normal operation.
Other file system operations are also tracked in the index file.
Per Microsoft, the maximum number of snapshots that can be tracked in this index file is 512 (since last reboot).
Once this 512 count has been exceeded in the index, null data begins to write to the index file at a rate of ~10KB/s.
This write will continue until all available drive space is consumed by the index file.
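Since the index file is normally only a few KB, one way to spot a runaway is to list unusually large files in the System Volume Information folder. This is a minimal sketch, assuming it runs as SYSTEM (otherwise the folder cannot be read) and using an arbitrary 1 MiB threshold of my own choosing. Legitimate shadow copy storage files live in the same folder and can be large too, so a hit is a lead to investigate, not a diagnosis.

```python
import os


def oversized_files(folder, limit_bytes=1024 * 1024):
    """Return (name, size) pairs for files in folder larger than limit_bytes, biggest first."""
    hits = []
    with os.scandir(folder) as entries:
        for entry in entries:
            try:
                if entry.is_file(follow_symlinks=False):
                    size = entry.stat(follow_symlinks=False).st_size
                    if size > limit_bytes:
                        hits.append((entry.name, size))
            except OSError:
                continue  # entry disappeared or could not be read
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Usage would look like `oversized_files(r"F:\System Volume Information")`, then comparing the GUID-named results over time to see which one keeps growing.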

On some of our servers (such as databases), shadow copies for both C: and F: are routed to F:. This cuts the time to failure, because two drives' worth of snapshots, plus those from any other application using Volume Shadow Copy, quickly exhaust the 512 figure.
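As a rough worked example of why routing two drives to one halves the runway: the sketch below assumes every snapshot counts once toward the 512 figure, that nothing else on the box uses VSS, and that there is no reboot in between. Real machines will differ, since other VSS consumers shorten the time further.

```python
SNAPSHOT_LIMIT = 512  # figure quoted by Microsoft in this post


def days_until_limit(interval_hours, volumes_routed, limit=SNAPSHOT_LIMIT):
    """Days of uptime before `limit` snapshots have been created against one storage volume."""
    per_day = (24 / interval_hours) * volumes_routed
    return limit / per_day


one = days_until_limit(4, 1)  # one volume snapshotted every 4 hours
two = days_until_limit(4, 2)  # C: and F: both routed to F:
print(f"{one:.1f} days vs {two:.1f} days")  # prints "85.3 days vs 42.7 days"
```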

Kick in the teeth:

Micro$oft confirmed they had internal documentation of this issue, but declined both to fix it and to release any official documentation concerning it. Micro$oft confirmed many times during the investigation and during the resolution that we are not in any way misconfiguring Volume Shadow Copy, and that there is no expectation for our configuration to not work as intended.

Vendor has also taken our finding back to their internal teams, and I hope will be adjusting their practices and internal literature.

Resolution:

Our internal team, given the above information, has elected to disable the snapshot feature. I am providing this post in hopes to save someone else out there the headache this all has been.

The_Koplin@reddit

Thank you for the write-up. However, it seems that relying on shadow copy to protect systems is the fault here. I get why: it's enticing and saves on development costs, but it doesn't protect you the way you think. For example, WannaCry will wipe the VSS. Ryuk will stop VSS then wipe. Conti will resize shadow storage smaller. LockBit will attack VSS across an entire network. Clop will kill backup and AV software on endpoints directly.

Antivirus uses an MS tool in a way that was not intended/unexpected. (MS point of view)
AV software testing should have found this prior to implementing method. (Proper QA would/should have caught this)

You imply you have VMs (extending disks); use the VM snapshot or the storage array's version. Don't rely on in-guest services for protection in a VM. Use tools to copy this data out, which also saves on loading the servers down.

Our system takes a VM/volume snapshot and copies it to tape every 6 hours or so. This way nothing inside the VM has to handle anything. The tapes are ejected from the library to prevent wiping/crypto. You can use 'immutable' storage options depending on the tools. (The key is not to tie these systems into the same authentication structures and to shield them from lateral movement: VLANs, cloud hosts, etc.)

Plenty of data storage systems offer copy-on-write / data protection options. Did you consider that if a virus got to ring 0 on the OS, VSS would just be encrypted or deleted along with everything else?

inucune@reddit (OP)

All good points. My team doesn't get to make these decisions. In the event of a crypto event, I'd probably just blow away the machine, build a base OS, and then restore the last known-good copy of a backup over the top.

LetSufficient5139@reddit

Your team however does get to manage the systems, and this could be avoided by limiting space shadow copies can consume.

This is on you.

inucune@reddit (OP)

The index file's space consumption is not managed by VSS and does not respect the space consumption restrictions. When the index file exceeds the 512 number, it enters undefined behavior and expands to fill the drive. This is not something we can directly control, as this index file also performs other system-critical functions. Our best option was to disable the rollback feature of the AV.

The_Koplin@reddit

That's pretty much our plan of action. 'Nuke it from orbit, it's the only way to be sure!'

And? if only there was a way to limit space shadow copies consumed.....

Kurgan_IT@reddit

If I get it right, the issue is that you cannot make and then destroy more than 512 shadow copies. Not make and keep them (which of course would exhaust resources) but make and then destroy and then make again. Like if you could not make and then delete a file more than 512 times.

That is correct: 512 snapshots created since last reboot on a single drive.

InterdictorCompellor@reddit

Over a decade ago I worked in backup and shadow storage filling up for no reason was a call we got all the time. That 512 limit is a crazy thing to learn all these years later.

Due_Peak_6428@reddit

sounds like a problem for the antivirus people to fix

They don't get a free pass, but they also didn't know about (or test for) this behavior. I have other major concerns with their behavior, but in this case I can't blame them.

Idk why you're entertaining it though. Big waste of your time. This ain't your problem to fix. It's your anti virus people.

It is my job in this company to run down this type of issue to resolution.

I would have chosen a better product a long time ago

Loudergood@reddit

Some people are stuck with Windows for legacy reasons.

Just get rid of the AVG product. Windows defender does everything

This is what sysadmins do, usually.

No. Sounds like a MS limitation, undocumented because "reasons". I should be able to make (and then destroy) as many snapshots as I like. Or, I should have documentation that says "don't do this".

TheFluffiestRedditor@reddit

512 is a low number of snapshots. That the process crashes out into crazy land afterwards is idiotic.

MaskedPotato999@reddit

Hello, an antivirus client relying on VSS and creating snapshots every 4 hours is an insane design choice. Drop it asap.

"512 snapshots per reboot should be enough for everybody"

Slopya Nadella

May I suggest that you reboot the server every 511 snapshots? /s

I read that and winced hard. Especially the catastrophic failure that occurs afterwards.

NeedAColdBeerHere@reddit

If it gets to the point where ransomware is able to execute on your servers it’s likely too late anyway. Many know to target and delete shadow copies.

I would be curious who this AV vendor is so I can add them to my list of vendors not to use.

The AV vendor makes a backup of that snapshot, if I get it right. It's not counting on the snapshot to be the backup in itself.

No, they're relying on the snapshot. I didn't say this was a bright idea, just what they're doing.

Ok, so it's indeed mostly useless. Maybe there are some situations where it makes sense. For example, if the ransomware is running on a workstation and it's encrypting data from an SMB share, then it cannot disable VSS on the server itself, because it has no access to that setting via SMB. In this specific case it makes sense. But... does this specific case still exist in today's ransomware world?

Frothyleet@reddit

I'm confused then, because lots of backup tools leverage VSS to snapshot data without having this problem. It sounds to me like they are just rawdogging the shadow copy API in the same way as someone using the built-in-but-no-longer-commonly-used "file rollback" functionality.

Slap a marketing term on it and increase your subscription fee.

ender-_@reddit

It's useful when ransomware runs on a client computer – it'll encrypt the fileshare where user has permissions, but won't be able to touch the shadow copy.

military_thicket@reddit

This is a solid technical writeup, and I appreciate you documenting the failure chain so clearly because that 512 snapshot limit turning into unbounded null writes is the kind of thing that should be public knowledge. The fact that Microsoft has internal docs on this but won't release them or fix it is frustrating, but at least now people searching for mysterious disk exhaustion with tools showing available space will find this.

Money_Literature_152@reddit

lol..

Ferretau@reddit

I think it points to M$ using the on-prem software as a path to their cloud, and the sooner they can push everyone onto their cloud solutions, the quicker they will retire all on-prem products. And in the interim they allow bugs like this to remain in the product.

Smith6612@reddit

I've seen this or something similar occur with SentinelOne too. If you normally have System Restore or Shadow Copy disabled on your systems, or if you didn't manually specify the storage space VSS is allowed to use, then your disk will fill up.

The fix for this is to basically do the following in an admin terminal shell on the box: vssadmin Resize ShadowStorage /For=C: /On=C: /MaxSize=15%

Replace C: with the problem volume and adjust the MaxSize value to your liking.

Windows will evict the storage (and the Antivirus should obey) as long as you specify this.
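If this needs to be pushed to several boxes, the command above can be wrapped so it is reviewed before it runs. This is a minimal sketch, assuming Python is present and the shell is elevated; `resize_command` and `apply_limit` are hypothetical helper names, and only the `vssadmin Resize ShadowStorage` syntax quoted above is used. Note that, per the OP elsewhere in this thread, the index file growth does not respect this limit, so this only bounds ordinary shadow storage.

```python
import subprocess


def resize_command(for_volume, on_volume, max_size):
    """Build the vssadmin argument list for the resize command quoted above."""
    return [
        "vssadmin", "Resize", "ShadowStorage",
        f"/For={for_volume}", f"/On={on_volume}", f"/MaxSize={max_size}",
    ]


def apply_limit(for_volume, on_volume, max_size, dry_run=True):
    """Print the command, and run it only when dry_run is False (needs an elevated shell)."""
    cmd = resize_command(for_volume, on_volume, max_size)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

With the default `dry_run=True`, `apply_limit("C:", "C:", "15%")` only prints the command so it can be checked first.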

FortLee2000@reddit

vssadmin list shadowstorage

If it shows Maximum VSS Shadow Copy as UNBOUNDED, you are asking for trouble. Gotta limit it properly.
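Checking for this across many servers can be scripted by scanning the command's text output. This is a rough sketch under assumptions: it expects English output containing a "For volume:" line followed by a "Maximum ..." line per association, which may not match every Windows version or locale, and `unbounded_volumes` is a name made up for this example.

```python
def unbounded_volumes(vssadmin_output):
    """Return the 'For volume' values whose maximum storage is reported as UNBOUNDED."""
    flagged = []
    current = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        if line.lower().startswith("for volume:"):
            # Remember which association the following lines belong to
            current = line.split(":", 1)[1].strip()
        elif line.lower().startswith("maximum") and "UNBOUNDED" in line.upper():
            if current is not None:
                flagged.append(current)
    return flagged
```

The input would come from something like `subprocess.run(["vssadmin", "list", "shadowstorage"], capture_output=True, text=True).stdout` in an elevated session.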

Academic-Detail-4348@reddit

I agree with both of your pointers and I have done the exact configurations when deploying enterprise backup solutions. The issue here is that this knowledge belongs in the backup realm and not server ops, and would require proper investigation, as OP did.

Metalcastr@reddit

When I was a Windows admin, the issue we were seeing was VSS randomly deciding to not delete old copies. We had to clean them up manually.

MekanicalPirate@reddit

Wow, great find and thank you for the commitment to see it through. Especially since Microsoft isn't doing anything about it.

1) Why wouldn't you name your vendor? It's not really a name-and-shame; you've laid out the responsibility fairly, and people should know about this feature in case they are already a customer.

2) If I'm understanding the problem correctly, I don't understand how it's not affecting the entire customer base of your A/V vendor (and how it wasn't picked up in their testing).

What are you doing differently that is causing you to encounter the problem, and not all of their other customers?

2 drives worth of VSS routed to 1 drive, causing the 512 figure to be reached quicker.

ludlology@reddit

It's SentinelOne right? I've been fighting S1 VSS gremlins for a long minute.

Looks like a duck, Quacks like a duck....

lol yeah. no reason to hide the vendor

also ducks are way better than S1 is

headcrap@reddit

Interesting read. Thanks for posting the info specific to the failure chain, which is the real meat of your post tbh. Friggin index file..

As for the rest.. indeed the resolution was the right choice imo.. VSS as a measure to counter an outbreak is useless today. If you were going to do that, run your backup software in a CDP mode instead and just use that to recover with. Vendor needs to reconsider their strategy in 2026.

No-Description2794@reddit

Man, maybe that is what happened with a customer of ours, where the server froze with "no storage" to the point that barely anything worked, but a restart fixed it.
There is a backup and an EDR solution there, and the backup uses VSS. I didn't count, but maybe we reached the magical 512-snapshot number without restarts?

I had another customer with a 7TB VM that crashed 2 times in 2 years, and the techs alleged that it was a VSS problem that locked up all available storage and crashed the VM, but we didn't get to know more about it, aside from the request to do the restore from backup (we only provide backup services there).

Mindestiny@reddit

I'm not reading the wall of text, sorry, but I will say that we've observed the same even on desktop versions of Windows 10 and 11.

Shadow copy will randomly ignore the default space limitations and snapshot until the whole drive is full, then keep trying. We've just taken to deploying a PowerShell script to all endpoints via Intune that auto-remediates it whenever it crops up; it's a simple registry toggle.

philrandal@reddit

You tease!

What's the registry value before and after? Any services need restarting?

I don't remember exactly what the key is (we solved for this years ago, it's an old one), but here's the PowerShell that remediates it in our environment on desktop endpoints:

vssadmin resize shadowstorage /for=C: /on=C: /maxsize=15GB

Just using vssadmin to forcibly reset the VSS max disk size to where you want it should auto-cleanup all the superfluous orphaned bullshit that's not correctly reporting in disk usage. You'll want to test and tweak to make sure whatever is going on with your backup software plays nice and isn't like... purging your critical backups. Again these are endpoints in our environment that have experienced this and not servers so we don't really care if VSS snapshots are nuked as the risk is very, very minimal for us.

fdeyso@reddit

We had the same even when shadow copy was disabled on the drives. The “fix” I found was: enable shadow copy, then disable shadow copy, and then it cleared up the occupied space. I managed to catch them before it caused issues, with one exception, but luckily it was a test box.

Civil_Inspection579@reddit

This is one of those nightmare infrastructure bugs that sounds fake until you spend days chasing it. The part where Windows reports the drive as full while disk analyzers show free space is especially brutal because it completely breaks normal troubleshooting assumptions.