How do you actually validate that your DR plan still works?
Posted by eudo69@reddit | sysadmin | View on Reddit | 25 comments
We run a fairly standard AWS setup: RDS, Lambda, S3, a few ECS services. We have backups configured, read replicas, the usual. But last month someone asked "when was the last time we actually tested a restore?" and nobody had a good answer.
I started looking at it more seriously and realized the gap isn't config, it's evidence. We know backups exist but we don't know if the recovery path still works end-to-end. The runbook references a replica that was decommissioned 3 months ago. The RTO estimate is a guess from 2024.
I ended up building a tool to automate this analysis (open source, read-only scanner) but I'm curious how other teams handle this:
- Do you track evidence that DR mechanisms were actually tested?
- How do you catch runbook drift when infra changes?
- Do you reason about DR at the service level or per-resource?
Kardinal@reddit
Serious question. How does everyone here do this for their SaaS platforms? We have our own way of doing it, but it has some gaps and I'm genuinely curious how others deal with that.
eudo69@reddit (OP)
I introduced a tool to my team that I made fully OSS. It's working well and doesn't guess at anything. Give it a try :)
https://github.com/mehdi-arfaoui/Stronghold
Kumorigoe@reddit
Sorry, it seems this comment or thread has violated a sub-reddit rule and has been removed by a moderator.
Do Not Conduct Marketing Operations Within This Community.
Your content may be better suited for our companion sub-reddit: /r/SysAdminBlogs
If you wish to appeal this action please don't hesitate to message the moderation team.
binkbankb0nk@reddit
Depending on your environment, you can even automate it every day or week.
This works if you use Veeam https://www.veeam.com/products/veeam-data-platform/orchestration-governance-compliance.html
I have no idea if there is already something native for your workloads or if Veeam works for those cloud workloads, as I haven't done any AWS, but there should be something like this available.
Your DR site should have separate backups, separate ISP connections, separate location, region, etc. So ideally you can spin it all up in an "offline" state and validate it at a set interval.
eudo69@reddit (OP)
Veeam's orchestration is great for scheduled failover validation on-prem. For AWS-native workloads it doesn't really apply, the recovery mechanisms are different (RDS snapshots, S3 CRR, Lambda redeployment, etc.).
The tool I'm building covers that AWS-native gap: it scans the actual AWS APIs, maps services and dependencies, and evaluates whether the recovery path is viable.
It's at github.com/mehdi-arfaoui/Stronghold if you want to take a look, u/binkbankb0nk.
alpentrekr@reddit
Was going to ask if you'd share more about your automated solution. Thanks!
Helpjuice@reddit
The only way to validate that it works is to actually execute it, including switching over fully, to see whether it works end to end. If you cannot do that, your plan has not been fully validated.
eudo69@reddit (OP)
Fully agree that a real failover is the gold standard. Nothing replaces actually cutting over and proving it end-to-end.
The question is what happens between those exercises. Most teams do a full failover once or twice a year. That leaves 350+ days where infra changes silently invalidate the plan, a replica gets decommissioned, a backup policy changes, a new service gets added with no DR coverage at all.
You can't do a full failover every week. But you can continuously validate that the preconditions for a successful failover still hold: backup mechanisms are still in place, dependencies haven't changed, the runbook still references real resources, and the recovery path you proved in January hasn't been broken by a February deploy.
A full failover proves "it worked on that day." Continuous validation proves "the things that made it work are still true today."
Both are necessary. Neither alone is sufficient.
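As a rough sketch of what "precondition validation" can look like (the resource shape here is hypothetical, not the tool's actual data model):

```python
# Sketch: validate DR preconditions against a snapshot of the environment.
# The resource records are hypothetical; a real scanner would populate
# them from the AWS APIs (read-only).
def check_preconditions(service):
    """Return a list of reasons the recovery path may no longer be viable."""
    failures = []
    if not service.get("backups_enabled"):
        failures.append("backups disabled")
    if service.get("backup_retention_days", 0) < 7:
        failures.append("retention below 7 days")
    # Every resource the runbook depends on must still exist.
    for dep in service.get("depends_on", []):
        if dep not in service.get("known_resources", set()):
            failures.append(f"runbook references missing resource: {dep}")
    return failures

orders_db = {
    "backups_enabled": True,
    "backup_retention_days": 14,
    "depends_on": ["orders-replica"],
    "known_resources": {"orders-db"},  # the replica was decommissioned
}
print(check_preconditions(orders_db))
# prints ['runbook references missing resource: orders-replica']
```

Checks like this are cheap enough to run on every deploy, which is the point: they catch the February change that breaks the January failover.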
Helpjuice@reddit
These misses are due to operational hygiene issues. If something new is being created or taken down, it should be done in both primary and secondary locations. A replica should be replicating data to the DR site if hot, syncing to a store if warm, or archived and ready to be brought up from cold storage if it is a cold site. These functions can and should be fully automated; only human error and laziness explain it when they are not. Want to review and enforce this in real time? Use existing capabilities within AWS, including deploying only through CloudFormation, with no more hand-crafted anything. If it gets deployed, it goes through CI/CD, and nobody gets access to mess around with anything on production systems.
When done right you can and should have your data setup in the appropriate setups and if needed store ready files in S3, Glacier for cold storage, etc. but there is no reason why this cannot be automated unless there is a broken process in the mix preventing this from being properly setup.
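The "nothing hand-crafted" rule can itself be checked mechanically. A minimal sketch, with hypothetical ARNs, of flagging resources that exist in the account but aren't managed by any CloudFormation stack (in practice the two inventories would come from the AWS APIs, e.g. a resource inventory vs. stack resource listings):

```python
def find_unmanaged(deployed_arns, cfn_managed_arns):
    """Return resources deployed in the account that no
    CloudFormation stack accounts for (i.e. hand-crafted drift)."""
    return sorted(set(deployed_arns) - set(cfn_managed_arns))

# Hypothetical inventories for illustration.
deployed = [
    "arn:aws:rds:us-east-1:123456789012:db:orders",
    "arn:aws:rds:us-east-1:123456789012:db:orders-replica",
    "arn:aws:s3:::orders-backups",
]
managed = [
    "arn:aws:rds:us-east-1:123456789012:db:orders",
    "arn:aws:s3:::orders-backups",
]

print(find_unmanaged(deployed, managed))
# prints ['arn:aws:rds:us-east-1:123456789012:db:orders-replica']
```

Anything that shows up in the diff was either created by hand or orphaned by a stack deletion, and either way it's invisible to your IaC-driven DR story.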
Claidheamhmor@reddit
Financial institution. We fail over to our alternate data centre every year, and run from there for a week.
KStieers@reddit
We do a section of our critical stack every quarter and the full thing every year, so everything gets looked at twice...
Recorded, with notes, in the automated ticket that pokes us to do it.
eudo69@reddit (OP)
Wow, solid cadence!
How do you decide what goes in which quarterly slice? By criticality, by blast radius, or just rotating through the stack?
KStieers@reddit
Just rotate through the stack.
DJDoubleDave@reddit
I have a recurring task in my ticketing system that pops up quarterly. We have to actually perform the DR process. We record in the ticket when it was, what steps were taken, and how long the RTO is.
Then when auditors ask for evidence, we just print out that ticket. We also know for sure that it works.
eudo69@reddit (OP)
that's more disciplined than most teams I've talked to. The ticket trail is smart for audit too
just curious do you ever find that the actual RTO drifts between quarters?
DJDoubleDave@reddit
It does drift a bit, yeah. I do have a problem where the backup testing isn't necessarily top priority for everyone, and they get distracted by time sensitive tickets happening at the same time, which slows the RTO. I don't have a great fix for that yet. My team is too small to have the people doing the test ignore potential production issues.
I just try to note that in this case the restore was delayed an hour because the DBA had to do some other thing.
eudo69@reddit (OP)
Small teams can't dedicate someone full-time to a DR exercise while production is on fire, yeah.
One thing that helped me think about this: separating the validation into two layers. The stuff you can check without human involvement (do the backup mechanisms still exist, are the dependencies still wired correctly, does the runbook still match reality), that can run automated, even in CI.
The actual hands-on restore test still needs people, but at least you walk into it knowing the preconditions are still valid instead of discovering mid-exercise that a replica was removed last month.
Doesn't fix the staffing problem but it shrinks the surface area of what the humans need to validate...
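A toy sketch of the automated layer, checking a runbook's resource references against a live inventory (the naming pattern and identifiers are hypothetical; a real check would match your own naming scheme):

```python
import re

def runbook_drift(runbook_text, live_resources):
    """Flag resource identifiers mentioned in the runbook that no
    longer exist in the live environment."""
    # Hypothetical convention: identifiers end in -db, -replica, or -bucket.
    referenced = set(
        re.findall(r"\b[a-z][a-z0-9-]*-(?:db|replica|bucket)\b", runbook_text)
    )
    return sorted(referenced - set(live_resources))

runbook = """
1. Promote orders-replica in us-west-2.
2. Restore latest snapshot of orders-db.
3. Verify objects in reports-bucket.
"""
live = {"orders-db", "reports-bucket"}  # replica was removed last month

print(runbook_drift(runbook, live))
# prints ['orders-replica']
```

Run that in CI and the stale step gets flagged the week the replica disappears, not mid-exercise a quarter later.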
thenew3@reddit
We test DR copies once a year (more often if we have time or are asked to by the Security team). Boot up critical VMs, test access to them and data integrity. We get end users involved in the test so they can sign off on it (auditors like to see non-IT folks sign off on these tests).
eudo69@reddit (OP)
This is solid practice, getting end users to sign off is something most teams skip.
The gap I kept hitting was the time between tests: 364 days where the runbook could drift without anyone noticing. A replica gets decommissioned, a backup retention policy changes, and suddenly the plan you tested this March doesn't match the live environment anymore...
thenew3@reddit
Yeah we also use the DR test opportunity to double check and update our DR documentation/plan if necessary.
eudo69@reddit (OP)
yup!! the tricky part is catching what changed between tests... infra moves fast and the doc doesn't update itself.
We found that even a month after a DR test, 2-3 runbook steps were already stale because someone rotated a replica or changed a retention policy.
Do you track those doc updates manually or do you have something that flags when the plan diverges from reality?
Nakenochny@reddit
You need to break things (in a controlled way). Or do tabletops.
shelfside1234@reddit
1 - Test and evidence, screenshots are your friend
2 - Tie the test plans to your cmdb; ensure the plan is updated every time a new CI is flagged in the plan
3 - Service level
Now, what are you hoping to sell us?
TheOneThatIsNotKnown@reddit
You don’t have backups if they haven’t been tested.
You should go through your DR plan and fail over once a year.
That's the proof it actually works, and you'll find outdated SOPs and your real RTO.
eudo69@reddit (OP)
100% agree. The challenge I found is that most tooling treats "backup exists" and "backup was tested" as the same thing.
They're not.