How to make a Disaster Recovery Plan when (almost) all services are managed by external parties?

Posted by Twaasar@reddit | sysadmin | 23 comments

Hello,

I have to make a Disaster Recovery Plan (DRP) for a small logistics company, but my problem is that almost all the services in use are managed by external parties. (Examples are the websites used by departments such as HR or finance, most of which serve one specific function.)

We have a little control over some services, for example the Office suite. If we have problems with that, it goes to the IT department first; if they can't solve it, an external company fixes it.

The goal of the DRP is "What to do when (access to) data is lost".

I don't know how to handle this in the DRP. My current idea is to write something like "If service XYZ is not available or not working correctly, then contact mail@xyz.abc or phonenumber."

Also, for a few services, only the IT department is allowed to contact the provider.

But this way my DRP will look like a contact list.

[-]

iamoldbutididit@reddit

I'm open to being corrected, but you're not making a DRP. You've been asked to document things for data owners that should have been done before they signed the contract(s).

Where a DRP covers minimizing downtime and data loss during a major event, a Business Continuity Plan considers what actions should be taken should one specific application or business function fail. There are entire career paths surrounding information assurance, risk management, and business continuity, which consider things such as:

Risk assessment: Identify threats and their impact on operations.

Critical systems inventory: Know which systems and data are essential.

Recovery objectives: Define RTO (Recovery Time Objective) — how fast you must recover, and RPO (Recovery Point Objective) — how much data loss is acceptable.

Backup and recovery procedures: How data will be restored and systems brought back online.

Roles and responsibilities: Assign who does what during recovery.

Testing and updating: Regularly test and refine the plan to ensure it actually works.
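To make the two recovery objectives concrete, here is a minimal sketch of how they could be checked against a single incident. The timestamps, the 4-hour RPO and the 8-hour RTO are made-up example values, not figures from this thread, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data written since the last good backup fits within the RPO."""
    return failure - last_backup <= rpo


def meets_rto(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if the service was back online within the RTO."""
    return restored - failure <= rto


# Hypothetical incident: nightly backup at 02:00, failure at 14:00, restored at 17:30.
last_backup = datetime(2025, 1, 10, 2, 0)
failure = datetime(2025, 1, 10, 14, 0)
restored = datetime(2025, 1, 10, 17, 30)

rpo_ok = meets_rpo(last_backup, failure, rpo=timedelta(hours=4))  # 12 h of data at risk
rto_ok = meets_rto(failure, restored, rto=timedelta(hours=8))     # 3.5 h of downtime
print(rpo_ok, rto_ok)  # prints: False True
```

In this example the downtime is acceptable but a nightly backup cannot satisfy a 4-hour RPO, which is the kind of gap the objectives are meant to expose.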

You're asking for people to provide you with specific examples of how to make something that, in reality, doesn't exist. And, if you do find a standalone DRP, understand that while it may be enough to pass a customer requirement, it will not survive an audit.

A DRP is not a standalone document. It has dependencies that rely on business-defined objectives and risk acceptance. It has linkages to KRIs (key risk indicators), policies, procedures, and standards.

If you've been told to make a DRP because no one understands what a DRP is, and because no one has the time to make a proper one, put in as much effort as possible to satisfy the requirement. But realize that you can't make a DRP on your own and that any capable auditor can sniff out the deficiencies and lack of sufficient controls in a heartbeat.

[-]

Twaasar@reddit (OP)

Hello,

What can I do in my situation? I have a concise asset management document that includes all the services used per department. It includes an RTO for each service, based on the estimate of each department's manager. It doesn't include RPO.

The DRP also has to satisfy NIS2 and ISO 27001.

[-]

iamoldbutididit@reddit

This sounds less like a sysadmin thing and more like a "we have all these requirements and no one knows what they are doing" thing.

If this is what you are doing now, you must find out why your company wants it. Is it to survive an audit, or is it a customer requirement? That will give you an understanding and, hopefully, the authority to actually create the things you are missing. Throwing out standards such as 27001 (data security and privacy) and NIS2 (cybersecurity risk controls), and saying you just need a DRP in order to achieve compliance, is crazy.

Does your manager understand what a DRP is? Do they know how it ties into risk, compliance, policy, and document and quality management? They may have been in a previous company that had a DRP, but that doesn't mean they knew how it worked or that it was correct.

A BCP and DRP broadly fall under the scope of risk management. Risk management has four stages in its lifecycle: asset/risk identification, risk assessment, risk response and mitigation, and risk control monitoring and reporting. Working with each data owner, you identify the risks to their assets and data, and then work with the organization to determine the overall risk appetite. Any risk that exceeds the appetite has to be managed using an appropriate risk response. The controls implemented to address risk are then monitored to provide assurance that the risk stays within the accepted limits. Each identified risk is tracked via a risk register. Each risk in a register has a corresponding entry in a BCP, which documents what to do when that risk occurs. A BCP is built in partnership with the data owner, so that both the owner and operations understand the RTO and RPO.
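A rough sketch of what a risk register entry and the appetite check could look like. The likelihood-times-impact score on a 1 to 5 scale is a common convention that I'm assuming here, not something prescribed above; the appetite threshold, the example risks and the `BCP-xx` references are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    asset: str
    description: str
    owner: str        # the data owner the entry was agreed with
    likelihood: int   # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    bcp_entry: str    # reference to the matching BCP entry

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


RISK_APPETITE = 8  # hypothetical threshold agreed with the organization

register = [
    Risk("HR portal", "Vendor outage blocks payroll run", "HR manager", 3, 4, "BCP-07"),
    Risk("Office suite", "Tenant lockout", "IT lead", 2, 3, "BCP-02"),
]


def needs_response(risks: list, appetite: int) -> list:
    """Return the risks whose score exceeds the agreed appetite."""
    return [r for r in risks if r.score > appetite]


over = needs_response(register, RISK_APPETITE)
for risk in over:
    print(f"{risk.asset}: score {risk.score}, see {risk.bcp_entry}")
```

Only the risks returned by `needs_response` need a documented response; the rest are accepted and simply monitored.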

A DRP utilises the data from the BCP and attempts to address how the business responds to major events such as earthquakes and tornadoes by answering the question, "If the whole business goes up in flames, what are the critical functions that need to be online for operations to continue?"

I've terribly simplified things, but for a sysadmin to be told to "Just make a DRP" is a fool's errand. As I've previously said, you can create a document that a customer might accept, but a risk management program is what is required to pass any of the audits you mentioned.

[-]

theoriginalharbinger@reddit

"If service XYZ is not available or not working correctly then contact mail@xyz.abc or phonenumber."

AKA checkbox BCDR. It exists, yes, but is completely unusable in an emergency.

I was at Hartsfield-Atlanta when it got shut down due to a really long power outage (was also a vendor, but it wasn't our fault). One of the great faults revealed in the analysis was that their BCDR had a bunch of contact info... in the computers... that had no power. They thus couldn't even begin to implement BCDR.

Getting people off planes with no power? Easy, just roll out a staircase. How tall of a staircase? Errr... that data was stored by the jetway operations teams, who also had no power, and thus could not look it up. No hard copies were available.

Even if your data is managed by third parties, it's not really a DR plan unless you account for a disaster. And this isn't hard, but it does require some thought.

"Access to data is lost" - did you lose it? If you lost it, was it because of something faulty like a bad upload of new data (SFDC breaks a lot due to this)? Did the vendor lose it?

In case of the former, then the DRP should be "Recover from backups." If the vendor lost it, it should be "Find another engine that can consume this data and switch." If you lost access because somebody with a backhoe discovered your Internet pipe to your remote warehouse, then it should be "Run off a Cradlepoint or Starlink" or "Use hard copies."

Just having contact info doesn't help. The playbook should cover the likely things. Even if it is "Contact vendor" - who is empowered to make judgment calls about how data is recovered? That needs to be in the playbook too, up to and including "If Senior Person X cannot respond within Y hours, then Junior Person Z will need to reference the criteria below for this vendor to determine whether to do a recovery."
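As a sketch, the playbook idea could be captured as structured entries: a scenario, the agreed action, and who decides if the first person doesn't answer. The 2-hour escalation window is an invented value for "Y hours", and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class PlaybookEntry:
    scenario: str
    action: str
    decision_maker: str
    fallback_decision_maker: str
    escalate_after: timedelta  # the "Y hours" from the rule above


def who_decides(entry: PlaybookEntry, waited: timedelta) -> str:
    """Return who may authorise the recovery after waiting `waited` for a response."""
    if waited >= entry.escalate_after:
        return entry.fallback_decision_maker
    return entry.decision_maker


two_hours = timedelta(hours=2)  # hypothetical escalation window
playbook = [
    PlaybookEntry("We broke the data (bad upload)", "Recover from backups",
                  "Senior Person X", "Junior Person Z", two_hours),
    PlaybookEntry("Vendor lost the data", "Switch to another engine that can consume the data",
                  "Senior Person X", "Junior Person Z", two_hours),
    PlaybookEntry("Site lost its Internet link", "Run off Cradlepoint/Starlink or use hard copies",
                  "Senior Person X", "Junior Person Z", two_hours),
]

vendor_loss = playbook[1]
print(who_decides(vendor_loss, timedelta(hours=1)))  # prints: Senior Person X
print(who_decides(vendor_loss, timedelta(hours=3)))  # prints: Junior Person Z
```

Whatever form it takes, the same content also needs to exist on paper, for the reason given in the airport story.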

[-]

gumbrilla@reddit

Well the first bit is not about what you have, but what are the scenarios..

Identify potential threats and their potential impact on business operations

So, get the usual ones down. Fire, Plague, Meteorite, Earthquake (if relevant), Volcano (if relevant), Flood, Hurricane, a work crew with a JCB, Russians going anchor fishing for cables, Locusts..

Then they basically boil down to: We've lost our office/town/region. We've lost our systems. We've lost our people. We've lost whatever.

Then what's the plan.. so - it's a major incident. You need an Exec available. You need someone running comms. Someone leading the DR, actually min 2, 12 hours a day each (I know from experience).. basically it's your DR team - a collection of roles.

You need your plan to list contacts, systems (including their contacts, especially account managers and their bosses), a list of pre-agreed priorities with the business (it can be adjusted mid DR, but you just need something to kick off), who can declare a DR.. what else.. external communication methods (assuming office stuff is down).

I've worked in places where we all drove around with the DR plan in the boot of our cars. Had the plan and all the numbers, the phone tree.. everything updated monthly.

Now in terms of third parties.. OK let's assume the disaster happens to them.. what do you do.. well, probably just follow them on Twitter as you're a small company.. and pray. I guess - what are your options for alternatives.. paper based if all else fails. Maybe you can make escrow agreements with some. Or figure TOO BIG TO FAIL with MS/AWS..

In reality data is the key.. I think at a minimum, you independently back up your cloud service data. We do that with GitHub, Confluence, Jira, MS Teams, Email.. and maybe you look at standing up local stuff if required..

[-]

Twaasar@reddit (OP)

In my DRP I wasn't covering such big disasters. It's about small outages, like if the portal of some service isn't working or if something isn't working after an update.

Maybe write about a fallback method.

This is the current table of contents:

Introduction
Scope
Main Objectives
Staff
Application Profile
Inventory Profile
Emergency Procedures
Communication Plan
Testing the Disaster Recovery Plan
Overview of Changes

[-]

gumbrilla@reddit

Ah.. I would call this either a Major Incident Plan.. or a P1, or an average Tuesday. If I looked at a P1 response plan that called itself a DR plan.. then I'd be super confused.

A Disaster is a large-scale, catastrophic event that overwhelms a community's ability to cope.

A major incident plan is a structured process for managing high-priority incidents that cause significant business disruption. It focuses on rapidly restoring service by quickly assembling a dedicated team to identify and resolve the issue, while also ensuring clear and consistent communication to stakeholders. The plan sets out the specific criteria for classifying a major incident, defines the roles and responsibilities of the team, and details the communication strategy to minimize negative impact.

A P1 response plan is for a critical incident with a significant business impact, requiring an immediate, all-hands-on-deck response. The plan involves immediate acknowledgment, assigning a dedicated incident manager and specialized team, quickly assessing impact and gathering information, and working to restore service as fast as possible. Key actions include coordinating a rapid response, potentially invoking major incident procedures, and performing a post-incident review.

I'd say you are more in a P1 territory.. so identify your critical systems that qualify for P1's, secure the data, figure out a plan on what to do with that data and how much it will cost. It might be going to a competitor, hosting locally.. then have the standard bits - comms channels, bridge line, pagers (well the modern equivalent).. and a nice report at the end of it.. but you seem to have most of that..

[-]

Twaasar@reddit (OP)

You are right, thank you. I realise now that I'm making an "Incident Response Plan", I will talk about this with my manager later during my next meeting.

Do you have any tips for me?

Because the goal is to have a DRP for the NIS2 requirement.

I don't know, and don't think, that what I'm doing now is correct.

[-]

smokerates@reddit

Maybe do some research yourself, before asking others to do all the legwork for you?

Maybe I'm old, but jeepers... tried nothing and all out of ideas.

[-]

Twaasar@reddit (OP)

Alright, thank you very much for the help

[-]

gumbrilla@reddit

That was someone else, not me. I don't mind. Sometimes I do, today I do not.

[-]

smokerates@reddit

I hope this helps!

I realise now that I'm making an "Incident Response Plan"

But you still keep the original post up, so people can answer to your non-existent problem. But then you change the scope, and ask for help again. In a comment.

Kid you are golden, you got a bright future ahead.

[-]

gumbrilla@reddit

Oooh.. you've got NIS2 requirement.. ooh.. OK.

At the top level is the Business Continuity Plan, now that should be Organisation wide.. so not on you, but it's an important component.. as the IT DR plan(s) are a component of that whole thing.. and you should account for the scenarios they contemplate..

But yeah.. the stuff that you are coming up with would likely be the output of a Business Impact Assessment.. and separate DR plans - related to what's deemed critical
https://cfpa-e.eu/nis2-how-to-maintain-operations-if-a-crisis-strikes/

So.. cripes, I've been looking, I was hoping it's a bit like say something like SOC2, where we happily point at a supplier also with SOC2 Type 2 and say - good enough. But NIS2 really requires you to get your hands dirty.. Can't just point at MS or AWS and say - it's all good.. by the look of it.. it's not.

So I would suggest that you do generate DR plans for each of the major systems.. (assuming it's not a data centre), then figure out how you are going to get the data, and then what you're going to do with that data should it be required, as it looks like you might have to prove it.

You know.. I'd be amazed if that many companies are able to do this well, but yeah, I see there are audit requirements..

[-]

neckbeard404@reddit

Start with a list of phone numbers / account numbers.

[-]

Twaasar@reddit (OP)

Hello, I currently have an almost complete list of the email addresses and phone numbers.

[-]

Cormacolinde@reddit

You need to define your scope first. Are external services fully covered? Are they written in as external dependencies? What kind of threat are you planning for?

[-]

Twaasar@reddit (OP)

Thank you for your comment. I already have a scope (I do have to specify it more).

The current scope in short:

"It specifically focuses on the recovery of data, access to data, and critical digital systems in the event of outages, failures, or loss."

[-]

Cormacolinde@reddit

Then you should focus on data format, residency, resiliency and recoverability.

For external services you especially need to address format and recoverability. Can you export the data? In what format? Can you access it after? How?
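One way to turn those questions into something you can track per service is a small checklist record. This is only a sketch: the field names, the gap wording and the two example services are hypothetical, not taken from any standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExternalService:
    name: str
    can_export: bool                 # Can you export the data?
    export_format: Optional[str]     # In what format?
    readable_without_vendor: bool    # Can you access it after, without the vendor?
    independent_backup: bool         # Do you hold your own copy?


def gaps(service: ExternalService) -> List[str]:
    """List the open recoverability questions for one service."""
    found = []
    if not service.can_export:
        found.append("no data export")
    elif service.export_format is None:
        found.append("export format unknown")
    if not service.readable_without_vendor:
        found.append("export not usable without the vendor")
    if not service.independent_backup:
        found.append("no independent backup")
    return found


# Hypothetical entries for illustration only.
finance = ExternalService("Finance portal", True, "CSV", True, False)
hr = ExternalService("HR portal", False, None, False, False)

for svc in (finance, hr):
    print(svc.name, "->", gaps(svc) or "no gaps")
```

Each gap in the output is a question to put to the vendor or a task for the plan.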

[-]

noideabutitwillbeok@reddit

I document support contact info (emails, numbers, etc) but also document phone & fax numbers, websites, etc that the end users can reach out to if need be. My DRP documents include end user instructions that have been vetted by them. You have to plan for maybe your normal staff not being around (or able to function) and others having to step in.

[-]

NiiWiiCamo@reddit

This is the start, but what if that company no longer exists as of five minutes ago? That can lead to a technical solution, or just an organizational "run with the status quo and evaluate current needs and possibilities".

Which for cloud services might not be enough, so another valid response might be "we are totally dependent on this provider / service, so we will send everybody home if an outage lasts more than 4 hours."
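Written down as a rule, that kind of response is just a threshold. A minimal sketch, assuming the 4-hour limit mentioned above; the `has_fallback` flag and the wording of the returned actions are my own illustration.

```python
from datetime import timedelta

SEND_HOME_AFTER = timedelta(hours=4)  # the 4-hour limit from the example above


def outage_response(outage: timedelta, has_fallback: bool) -> str:
    """Pick the organizational response for an outage of a fully external service."""
    if has_fallback:
        # Assumption: a documented manual or alternative procedure exists.
        return "switch to fallback procedure"
    if outage > SEND_HOME_AFTER:
        return "send staff home"
    return "wait and keep staff informed"


print(outage_response(timedelta(hours=1), has_fallback=False))
print(outage_response(timedelta(hours=5), has_fallback=False))
print(outage_response(timedelta(hours=5), has_fallback=True))
```

The point is that the threshold and the person allowed to apply it are agreed in advance, not decided during the outage.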

[-]

Twaasar@reddit (OP)

Do you maybe have an example of how I could write this in the DRP document?

We have our own servers that are hosted and managed by an external party, but we don't restore backups. I do have the SLA for this one.

[-]

22OpDmtBRdOiM@reddit

Ensure contracts (SLA, penalties, contact info) are in place. Ask the providers about their disaster recovery plans.

/u/NiiWiiCamo also has a good start. You kinda still want a backup of your data because there are things you can't prepare for. (Service provider lying about backups, like backups are in the same DC and a fire hits it)

Also, talk about the impact (data loss, downtime, financial cost) to your supervisors and get a common understanding about the current situation and threats.

[-]

Twaasar@reddit (OP)

Do you maybe have an example of how I could write this in the DRP document?