Backup recovery testing best practice
Posted by bluecopp3r@reddit | sysadmin
Greetings all,
I am seeking insight into how you approach backup recovery testing, specifically for VMs and guest files on VMs. My org is ISO9001 certified, and a recent internal audit highlighted that once per quarter backup verification, as stated in the backup policy, was insufficient.
How are you structuring your backup verification process? I'd also like to have an idea of the size of your org and IT team.
kvorythix@reddit
the only real test is a restore drill that matches your ugly day, not just a green backup job. if you can't restore fast and clean, the backup doesn't mean much
chickibumbum_byomde@reddit
Quarterly is usually too little because it doesn't reflect real recovery readiness. I have set up weekly and daily backups (rotational, of course, to save space). Most teams move toward smaller, more frequent tests, e.g. automated VM restore checks weekly or monthly and occasional full recovery drills.
The key is making it routine and repeatable, not a big manual event once a quarter. And the elephant in the room: proper monitoring. It has saved me lots of sleepless nights. Every single backup job is monitored, so if something happens I know about it right away and don't get double trouble. Currently using Checkmk, can't really complain.
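To put a rough number on the quarterly-versus-weekly difference, here is a minimal Python sketch that measures how long a backup can sit before the next verification run looks at it. The function name `worst_case_lag` and the sample dates are made up for illustration; real timestamps would come from your backup product's job history.

```python
import bisect
from datetime import datetime, timedelta
from typing import List


def worst_case_lag(backups: List[datetime],
                   verifications: List[datetime],
                   now: datetime) -> timedelta:
    """Longest time any backup waited for the next verification run.

    Backups with no later verification are measured against `now`.
    """
    checks = sorted(verifications)
    worst = timedelta(0)
    for taken in backups:
        # First verification run at or after the moment the backup was taken.
        i = bisect.bisect_left(checks, taken)
        verified_at = checks[i] if i < len(checks) else now
        worst = max(worst, verified_at - taken)
    return worst


# Illustrative data only: one backup per day for Q1 2024.
start = datetime(2024, 1, 1)
daily_backups = [start + timedelta(days=d) for d in range(91)]
quarter_end_check = [datetime(2024, 4, 1)]
weekly_checks = [datetime(2024, 1, 8) + timedelta(weeks=w) for w in range(13)]

print(worst_case_lag(daily_backups, quarter_end_check, datetime(2024, 4, 1)))
print(worst_case_lag(daily_backups, weekly_checks, datetime(2024, 4, 1)))
```

With this sample data the first call covers a single end-of-quarter check and the second a weekly one, so the two printed values show how much the detection window shrinks.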
enfier@reddit
Have you checked whether your backup solution can do this automatically according to a policy?
bluecopp3r@reddit (OP)
Yes, the software does it, but I haven't gotten around to building out the mechanism for that. Veeam SureBackup.
enfier@reddit
So enable it, schedule it, then modify the policy to match your schedule and you are done. This is compliance, not best practices. If you are getting dinged on the policy and you can close the gap with an automated policy, then just do it and move on.
bluecopp3r@reddit (OP)
Yeah, it does take a bit of config to set up the lab env and the proxy between the backup server and the lab env to deploy the backup data and run the verification tests. But yes, it would be good to finally automate this process.
Firefox005@reddit
What did your internal audit flag as the reason it wasn't sufficient? Because ISO9001 really just holds you accountable to your own processes and documentation. So like was there an event where backups or restoring was broken for an entire quarter and the verification process did or would have caught it?
Backup verification is like chasing the dragon, there is always something you missed or more you could be doing.
Personally I automate it. For file verification I have a script that writes a file with random data to a guest, triggers a backup, waits for it to complete, runs a restore, then checks that the file is there and the hash matches. You can do the same for other services like databases. But like I said, until you are literally testing every single backup and every file/service inside each of those backups (and didn't miss any!) there is always a chance something could be missing.
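The write, back up, restore, compare-hash loop described above could look roughly like the Python sketch below. This is not Firefox005's actual script. `run_backup` and `run_restore` are hypothetical callables you would wire to your own backup product (for example via its CLI or API), and `run_backup` is assumed to block until the job has finished. The `demo` function uses plain file copies as stand-ins so the sketch runs on its own.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_canary(directory: Path, size: int = 4096) -> Path:
    """Create a uniquely named file filled with random bytes."""
    canary = directory / f"canary-{os.urandom(4).hex()}.bin"
    canary.write_bytes(os.urandom(size))
    return canary


def verify_round_trip(source_dir: Path,
                      run_backup: Callable[[Path], None],
                      run_restore: Callable[[Path], Path]) -> bool:
    """Write a canary, back it up, restore it, and compare hashes."""
    canary = write_canary(source_dir)
    expected = sha256_of(canary)
    run_backup(source_dir)           # hypothetical hook: blocks until the job is done
    restored = run_restore(canary)   # hypothetical hook: returns the restored copy's path
    return restored.exists() and sha256_of(restored) == expected


def demo() -> bool:
    """Stand-in run where 'backup' and 'restore' are plain file copies."""
    with tempfile.TemporaryDirectory() as tmp:
        source, vault, target = (Path(tmp) / n for n in ("source", "vault", "restored"))
        for d in (source, vault, target):
            d.mkdir()

        def run_backup(src: Path) -> None:
            shutil.copytree(src, vault, dirs_exist_ok=True)

        def run_restore(original: Path) -> Path:
            return Path(shutil.copy2(vault / original.name, target / original.name))

        return verify_round_trip(source, run_backup, run_restore)


print("round trip ok:", demo())
```

Because the canary is new random data on every run, a pass means this particular backup and restore cycle worked, not just that an old, known-good file is still restorable.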
bluecopp3r@reddit (OP)
"This delay could affect the organisation’s ability to detect and resolve data loss or corruption in a timely manner." The delay here is between the backup being taken and the verification, which we do once per quarter.
Firefox005@reddit
Riiiight, so just navel gazing.
It's also kind of the wrong thing to look at. Any data backup product worth anything these days isn't going to suffer from data loss or data corruption; the built-in redundancies and checksumming are more than enough to basically make that a non-event.
Testing restores, at least in my opinion, is more about preventing ossification and making sure you aren't writing junk data to your backups (different from data corruption or loss, i.e. the CBT bugs from VMware, which I wouldn't classify as data loss or data corruption but just invalid data). Once a quarter is IMO more than enough to 'exercise' those mechanisms to make sure that they are still working as expected. Plus, it has been my experience that 'testing' systems like that can sometimes give a false sense of security if the wrong things are tested (for instance, always using the same file or VM for a restore; it could be the only one actually working).
I mean, as an example, a life-critical safety system like a fire alarm is typically only tested annually and visually inspected quarterly. But I digress. Like so many things, it lives in the realm of 'it depends'. If the business is dead set on doing test restores more often, well, it's not my money or time.
bluecopp3r@reddit (OP)
Thank you very much for your input. Yes, I do try to randomize the files I check from the file server backup.
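One way to make that randomization repeatable for an auditor is to draw the sample from a seeded generator and record the seed alongside the test result. This is only a sketch: `pick_restore_sample` and the example paths are invented for illustration, and the candidate list would come from however you enumerate files in the backup.

```python
import random
from typing import List, Sequence


def pick_restore_sample(candidates: Sequence[str], count: int, seed: int) -> List[str]:
    """Pick up to `count` distinct files to restore-test.

    The same seed and candidate set always give the same sample, so a
    recorded seed lets someone else reproduce exactly which files were checked.
    """
    rng = random.Random(seed)
    pool = sorted(set(candidates))  # stable order, duplicates removed
    return rng.sample(pool, min(count, len(pool)))


# Hypothetical paths, for illustration only.
files = [f"/shares/finance/report-{n:03d}.xlsx" for n in range(200)]
print(pick_restore_sample(files, count=5, seed=20240401))
```

Using a date-based seed (as in the example call) gives a different sample each run while keeping every run reproducible.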
TxJprs@reddit
I have 12 restore test scenarios and run one per month.
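A one-scenario-per-month rotation like this is easy to drive from a scheduler. Below is a minimal Python sketch; the scenario names are placeholders, since the actual twelve scenarios are not listed here, and `scenario_for` is an invented helper name.

```python
from datetime import date
from typing import List

# Placeholder names; substitute your own restore scenarios.
SCENARIOS: List[str] = [f"scenario-{n:02d}" for n in range(1, 13)]


def scenario_for(day: date, scenarios: List[str] = SCENARIOS) -> str:
    """Return the restore scenario due in the month that `day` falls in.

    With twelve scenarios each calendar month maps to exactly one; with
    fewer, the list simply wraps around.
    """
    if not scenarios:
        raise ValueError("at least one scenario is required")
    return scenarios[(day.month - 1) % len(scenarios)]


print(scenario_for(date.today()))
```

Keying the rotation to the calendar month rather than a counter means a skipped month does not shift every later scenario.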
bluecopp3r@reddit (OP)
OK, thank you very much for that info. Appreciate it.