How do you actually validate that your DR plan still works?

Posted by eudo69@reddit | sysadmin

We run a fairly standard AWS setup: RDS, Lambda, S3, and a few ECS services. We have backups configured, read replicas, the usual. But last month someone asked, "When was the last time we actually tested a restore?" and nobody had a good answer.

I started looking into it more seriously and realized the gap isn't config, it's evidence. We know backups exist, but we don't know whether the recovery path still works end to end. The runbook references a replica that was decommissioned 3 months ago. The RTO estimate is a guess from 2024.
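To make the runbook drift problem concrete, here is a minimal sketch of the kind of check that would have caught the decommissioned replica. It is not the tool mentioned in this post. It assumes a hypothetical convention where the runbook writes database identifiers as `rds:<identifier>`, and it assumes the set of live identifiers has already been collected (for RDS, that could come from boto3's `describe_db_instances`). The resource names are made up for illustration.

```python
import re

# Hypothetical convention: the runbook marks RDS instances as "rds:<identifier>".
RESOURCE_REF = re.compile(r"\brds:([a-z][a-z0-9-]*)")


def find_stale_references(runbook_text, live_identifiers):
    """Return identifiers the runbook mentions that no longer exist in live infra."""
    referenced = set(RESOURCE_REF.findall(runbook_text))
    return sorted(referenced - set(live_identifiers))


# Made-up runbook excerpt and inventory, for illustration only.
runbook = """
1. Promote rds:orders-db-replica-2 to primary.
2. Repoint the application at rds:orders-db.
"""
live = {"orders-db", "orders-db-replica-3"}

stale = find_stale_references(runbook, live)
print(stale)  # ['orders-db-replica-2']
```

Running something like this on a schedule, or whenever infra changes are merged, would turn "the runbook is probably fine" into a dated pass/fail record, which is the evidence that is missing today.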

I ended up building a tool to automate this analysis (an open source, read-only scanner), but I'm curious how other teams handle this:

- Do you track evidence that DR mechanisms were actually tested?
- How do you catch runbook drift when infra changes?
- Do you reason about DR at the service level or per-resource?