How to start planning for disaster recovery

January 4, 2023

Table of Contents

The value of recovery planning
Looking back
The worst-case scenario
Planning for recovery
Invest today for the future

There is a famous quote I often think about at 3am on a Sunday morning as I am working with a client to recover from a large-scale cybersecurity incident: “Fail to prepare, prepare to fail.” It is painfully obvious which clients have done some disaster recovery pre-planning and which haven’t.

In the majority of incident response work, recovery costs as much as identification, containment, and eradication put together. However, the cost in terms of people, stress, and reputational damage is unfathomable.

I have witnessed people walking out of jobs in anger, heard language thrown across a room that would make a sailor blush, and once saw the CEO of a law firm punch their own CFO – all because of the stress of trying to get a business back up and running after a major cybersecurity breach.

A lot of companies see the cloud as a way out of this. Let Azure, AWS, Google, or Rackspace deal with it! To some extent this can work, as cloud providers come with backup and recovery benefits, but there is a risk to having all your eggs in one basket. Companies – even the very smallest – need to think of the bigger picture and plan for recovery after a range of different scenarios, from how to recover a single file deleted by a careless user all the way to a full network rebuild. Most importantly, any recovery plan can’t just be about the “technical stuff”: HR, PR and Legal, as well as the IT team, all have a role to play.

The value of recovery planning

In the cybersecurity world this is known as disaster recovery planning, crisis management, or backup and recovery policy. But regardless of the name, it all boils down to pre-incident planning that creates a tested and robust process for the recovery of an IT network and, ultimately, a return to business-as-normal. While my world is centered around post-breach recovery, companies need to factor in a non-hacker related response. What happens if your data center has a power cut, what happens if geo-political circumstances change, and what happens if your cloud provider crushes all your eggs?

At the risk of sounding like a 90-year-old first discovering the internet, everything is connected these days. This means your disaster recovery plan won’t just save your business (or job), but will ensure that the hundreds, thousands, even millions of customers who rely on your business every day can sleep soundly.

An article in the Financial Times recently stated that “Bank of England research in 2020 found that more than 65 per cent of UK-based banks and insurers relied on just four cloud services.” If just one of those providers went offline, would every UK-based bank have its own backup and recovery plan? The EU doesn’t think so and is planning on introducing new regulations that would force banks to factor this into their recovery plans and prove how quickly they could recover from a cyberattack.

Looking back

If we go back to pre-WannaCry, “backup and recovery,” as it was then widely known, was straightforward. Companies only had to worry about physical system crashes, so if you had 100 servers delivering your business, you just backed them all up (ideally off-site) and then, when something went wrong, you bought a matching server and recovered. Simple.

The risk of everything failing at once due to physical damage was very slim. Yes, sometimes data centers went bust, but that’s why you had dual setups in different data centers run by totally separate companies. Some large organizations had more advanced approaches, with three copies of digital data spread across multiple locations. However, that was not the norm for a lot of SMEs.

Cybersecurity wasn’t really a massive factor in backup and recovery. Some destructive worm-like viruses had been let loose in the past: SQL Slammer in 2003, Sasser in 2004, and Mydoom (sort of a worm) in the same year. Still, most viruses on the internet were designed to cause as much trouble as possible but not to bring down whole companies. Ransomware was also still in its infancy and was seen as a small risk because it only ever affected one or two Windows machines in an organization and was rare in the business world.

Cyber gangs with the infrastructure to encrypt thousands of systems at once didn’t exist so no one worried about it. As cloud services started to grow in popularity the future started to look rosy. I had clients who used to talk about completely migrating to a cloud world, outsourcing all but the upkeep of end user laptops. It was a blissfully innocent time.

Then in 2017 everything changed as WannaCry ripped through organizations across the world and did a lot of damage. Companies had entire IT estates wiped out, hospitals were running ancient versions of Windows and struggled to keep the lights on, and IT teams around the world lost months and billions of dollars to recovery. It was an eye-opening experience for many companies, and CEOs around the world started to ask the same question: “How quickly can we recover from a similar attack?”

It’s not an exaggeration to say that the IT security industry and the dark world of cybercrime were irreversibly changed after WannaCry. Cybercriminals found out that they could infiltrate a company and spread ransomware to hundreds or even thousands of devices within hours. With this came huge rewards running into the millions and, in a perfect storm, this was facilitated by the new utopian world of cryptocurrency.

This situation prompted some large companies to start thinking about recovery and pre-planning. Since 2017, I have been a part of hundreds of engagements where networks needed to be totally rebuilt or totally recovered from backups, and in some horrific cases, recovered from tape backups. Prior to WannaCry, I can’t think of any recovery process that was that impactful. It’s always the same thing that causes issues: time, or rather a lack of time.

The worst-case scenario

It’s impossible to work out how long it will take you to rebuild a network without a lot of pre-planning. I know because I’ve tried to do it. In the last 12 months, I’ve worked with five clients who had no recovery plan and relied completely on tape backups. In one instance, just getting the hundreds of terabytes of data off tapes took over a month, and only then could the recovery work start. In another instance, the client had no extra storage space to recover tape backups to, so we had to go to a well-known server manufacturer and panic-buy a lot of equipment, the cost of which ran over £250,000. And all of this happened before even one system had been rebuilt.
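To get a feel for why pulling data off tape can eat weeks, here is a minimal back-of-the-envelope sketch. Every figure in it is an assumption for illustration (the data volume, the per-drive read speed, the number of drives, and the `efficiency` factor that stands in for tape swaps, seeks, and retries); none of them come from the engagements described above, so plug in the numbers for your own kit.

```python
def tape_restore_days(data_tb: float, drive_mb_per_s: float,
                      drives: int, efficiency: float = 0.5) -> float:
    """Rough number of days needed to read `data_tb` terabytes off tape.

    All inputs are estimates you supply yourself:
    - data_tb: data to restore, in decimal terabytes
    - drive_mb_per_s: sustained read speed of one tape drive, in MB/s
    - drives: number of drives reading in parallel
    - efficiency: fraction of time the drives are actually streaming
      (a hypothetical fudge factor for swaps, seeks, and retries)
    """
    if data_tb < 0 or drive_mb_per_s <= 0 or drives < 1:
        raise ValueError("sizes and speeds must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    total_mb = data_tb * 1_000_000                 # decimal TB to MB
    effective_mb_per_s = drive_mb_per_s * drives * efficiency
    seconds = total_mb / effective_mb_per_s
    return seconds / 86_400                        # seconds in a day


if __name__ == "__main__":
    # Hypothetical example: 300 TB, two drives at 150 MB/s, streaming half the time.
    days = tape_restore_days(data_tb=300, drive_mb_per_s=150, drives=2)
    print(f"Estimated time to read the tapes: {days:.1f} days")
```

The point of the exercise is not precision. It is that the answer comes out in days or weeks rather than hours, and that is before a single system has been rebuilt.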

This would be stressful at the best of times, but throw in a board of directors screaming at you for updates and the pressure can get bad enough to affect people’s health. In every instance, the second systems went down so did customer delivery. Angry clients took to social media. None of my most recent examples had any major societal impact, but apply the same situation to a bank or local authority, or look at the recent Irish health service ransomware attack, and the lack of recovery planning has real-life ramifications.

Planning for recovery

So, what should companies be doing to try and avoid issues? If the current proposed EU laws are anything to go by, then financial services companies may not have a choice but to act. In my mind that can only be a positive thing. Eventually cloud computing will be the norm, and it will bring with it a whole new set of issues. Businesses are always slow to react to new changes in technology, so a little regulatory nudge in the right direction is a good thing. However, businesses in other sectors shouldn’t wait and see if they have to. It is an unarguable fact that pre-planning for disaster recovery, whether it be for cyber-attacks or otherwise, can save you a lot of money.

There are a million questions that you will need to ask yourself when planning for disaster recovery as this isn’t a quick and easy task, and I suspect that is why not many people do it. A good starting point is an assessment of how long it would take you to completely rebuild your network. Simply imagine the worst situation: everything is down and all you have is your backups. What do you do, what order do you do it in, what are your priorities, and how long will it take? In most situations it will be very costly to run this in real time, but it’s totally fine to work it all out theoretically.
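One way to work the rebuild out theoretically is to list each system, a rough restore time, and what has to be up before it can start, then let a few lines of code give you an order and two time estimates. This is only a sketch: the system names, hours, and dependencies below are made up for illustration, and the "parallel" figure assumes you have enough people and hardware to restore independent systems at the same time, which is rarely true in the middle of an incident.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: name -> (estimated restore hours, systems needed first)
SYSTEMS = {
    "storage": (24, []),
    "active_directory": (8, ["storage"]),
    "dns_dhcp": (4, ["active_directory"]),
    "email": (12, ["active_directory", "dns_dhcp"]),
    "finance_app": (16, ["active_directory", "storage"]),
}


def rebuild_plan(systems):
    """Return (restore order, serial hours, best-case parallel hours)."""
    # TopologicalSorter expects each node mapped to its prerequisites.
    graph = {name: set(deps) for name, (_, deps) in systems.items()}
    order = list(TopologicalSorter(graph).static_order())

    # Earliest finish time per system if independent work runs in parallel.
    finish = {}
    for name in order:
        hours, deps = systems[name]
        start = max((finish[dep] for dep in deps), default=0)
        finish[name] = start + hours

    serial_hours = sum(hours for hours, _ in systems.values())
    return order, serial_hours, max(finish.values())


if __name__ == "__main__":
    order, serial_hours, parallel_hours = rebuild_plan(SYSTEMS)
    print("Restore order:", " -> ".join(order))
    print(f"One team, one system at a time: {serial_hours} hours")
    print(f"Best case with parallel teams:  {parallel_hours} hours")
```

The real figure will usually land somewhere between the two numbers. A circular dependency makes `TopologicalSorter` raise a `CycleError`, which is a useful discovery to make on paper rather than at 3am.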

Look at the technology that you have to power your recovery: what backup solution do you have? The security of your backups is very important, but that is a whole other thing. Assuming that they are safe and not destroyed by attackers, how long will it take to start the restoration process? If you are cloud-based and running a mixture of virtual systems and cloud apps, how do the two interact? Now you need to talk to your other colleagues: what will legal, HR, PR, even building managers be doing while the IT team is knee-deep in tape backups? This will be the start of your company’s disaster recovery plan.

Once you have a detailed plan and an estimate of how long it will take, simply write it all down in an easy-to-understand document. Roles and responsibilities should be clear and accepted, communication paths should be defined (who, when, and how), and everyone involved should read and accept the plan before it is put into use. Now everyone can discuss if the time frame for complete disaster recovery is acceptable. If it’s too long then you can start looking at why that is: people, process, or technology. Eventually you will hit on the winning formula of risk vs. cost.
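The risk vs. cost discussion is easier when the options sit side by side with numbers attached. The sketch below is one simple way to frame it, not a standard method: it assumes you can put a rough yearly cost on each recovery setup, a rough cost per hour of downtime, and a rough chance of a major incident in any given year. The option names and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_cost: float      # yearly spend on this recovery setup
    recovery_hours: float   # estimated time for a complete recovery


def expected_annual_cost(option: Option, downtime_cost_per_hour: float,
                         incident_probability: float) -> float:
    """Yearly spend plus the downtime loss weighted by how likely an incident is."""
    downtime_loss = option.recovery_hours * downtime_cost_per_hour
    return option.annual_cost + incident_probability * downtime_loss


def cheapest_acceptable(options, max_hours, downtime_cost_per_hour,
                        incident_probability):
    """Pick the lowest expected-cost option that recovers within max_hours."""
    acceptable = [o for o in options if o.recovery_hours <= max_hours]
    if not acceptable:
        return None  # nothing meets the agreed time frame; revisit the plan
    return min(acceptable, key=lambda o: expected_annual_cost(
        o, downtime_cost_per_hour, incident_probability))


if __name__ == "__main__":
    # Hypothetical options and figures, in whatever currency you budget in.
    options = [
        Option("tape_only", annual_cost=5_000, recovery_hours=720),
        Option("disk_backups", annual_cost=40_000, recovery_hours=96),
        Option("warm_standby", annual_cost=150_000, recovery_hours=8),
    ]
    choice = cheapest_acceptable(options, max_hours=120,
                                 downtime_cost_per_hour=2_000,
                                 incident_probability=0.1)
    print("Best fit:", choice.name if choice else "none within the time frame")
```

If the function returns `None`, that is the signal to go back and look at people, process, or technology until at least one option fits the time frame everyone agreed on.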

Now that you have the “break glass in case of emergency” solution, start working backwards. What do you do if only one data center goes down, what do you do if one cloud provider goes down, what if DNS, DHCP, AD…and finally, what if Sara in accounts accidentally deletes a spreadsheet? Don’t stress too much about this because you can’t think of every scenario. However, covering the most common scenarios and looking at your attack footprint and risk profile can be a big help.

Invest today for the future

This level of pre-planning shouldn’t be the reserve of large banks and institutions, as even the smallest of businesses can be heavily affected by a range of outages. There is even an argument to be had that the smaller the business, the less likely it is to recover from the financial and reputational damage that long-term downtime can have. The cost doesn’t have to be prohibitive either. Yes, getting in a consultancy firm to do it all for you can be great, but it also adds tens of thousands of pounds onto the cost. At the end of the day, who knows your network better than you?

Even with pre-planning things will still go wrong and stress levels will creep up – it’s unavoidable. However, trust me when I say that with no pre-planning it will be a lot worse, and no one needs to see their boss in a fist fight.
