Full-scale testing, which means completely shutting down your primary site and failing over to a recovery site, is impractical for most organizations. According to a recent survey, it is also the least effective form of testing.

The challenges with simulation, parallel and full-scale testing

Simulation testing involves bringing recovery facilities and systems online to validate startup procedures and standby systems. Parallel testing takes this a step further by including restoring from backups and validating production-level functionality. Both methodologies can be executed without impacting your production environment, but still require a commitment of time, money, and resources.

Full-scale testing adds the risk of service interruption if the recovery site cannot be brought online. Unless you are running parallel production data centers, it is too risky and impractical for most organizations.

However, the biggest issue with these methodologies is their focus on technology. Companies usually struggle with the people and process side of DR, and those factors are inherently overlooked in technology-based testing. Processes for tasks such as assessing the impact, recalling backups, and coordinating recovery activities are not validated.

Why tabletop testing is so much more effective

Tabletop testing gets the technology out of the room — and out of your focus — so you can concentrate on people and processes, and for the entire event, not just your failover procedures. Specifically, tabletop testing is a paper-based exercise where your Emergency Response Team (ERT) maps out the tasks that should happen at each stage in a disaster, from discovery to notifying staff to the technical steps to execute the recovery.

It’s during these walkthroughs that you discover half of your ERT doesn’t know where your DR command center is located, or that critical recovery information is kept in a locked cabinet in the CIO’s office, or key staff would be required for so many separate tasks that they would need to be in 10 places at once.
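The last of those gaps can be checked on paper by listing each task from the walkthrough with its owner and time window, then flagging owners whose tasks overlap. The sketch below is only an illustration: the task list, the staff names, and the `find_overcommitted` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tabletop output: (task, owner, start_minute, end_minute)
TASKS = [
    ("Assess impact", "dana", 0, 30),
    ("Notify staff", "dana", 15, 45),
    ("Recall backups", "lee", 30, 90),
    ("Coordinate recovery", "dana", 60, 120),
]

def find_overcommitted(tasks):
    """Return {owner: [(task_a, task_b), ...]} for tasks whose time windows overlap."""
    by_owner = defaultdict(list)
    for name, owner, start, end in tasks:
        by_owner[owner].append((name, start, end))

    conflicts = {}
    for owner, items in by_owner.items():
        clashes = [
            (a[0], b[0])
            for a, b in combinations(items, 2)
            if a[1] < b[2] and b[1] < a[2]  # the two windows overlap
        ]
        if clashes:
            conflicts[owner] = clashes
    return conflicts

print(find_overcommitted(TASKS))
```

Anyone who appears in the result is expected to be in two places at once, which is a cue to cross-train or reassign tasks before a real event.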

Tabletop testing also makes it easier to play out a wider range of scenarios compared to technology-based testing. Walk through relatively minor events, such as an individual key server failing, or major disasters that take down your entire primary site.  Similarly, play out what-if scenarios, such as what happens if key staff members are not available or disk backups have been corrupted.

Parallel testing, by contrast, runs under ideal conditions: the technician restoring backups is not dealing with data corruption, and any necessary documentation is readily available (not locked in an office that you can no longer access). The focus is on “does the technology work” and not the hundred other things that can go wrong during a recovery. Tabletop testing reveals the people and process gaps that are otherwise difficult to identify until you are actually in a DR scenario.

Focus on unit testing to validate standby systems

Unit testing was second only to tabletop testing in overall importance to DRP success. In this context, unit testing means validating standby systems as your environment changes, ideally as part of your change management procedures. The recovery site goes through the same release procedure as the primary site, including unit testing affected systems, to ensure that standby systems stay in sync with your primary systems.
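As a rough sketch of what such a sync check might look like, assume each site can report an inventory of component versions. The inventories, the component names, and the `find_drift` helper below are hypothetical and stand in for whatever your change management tooling provides.

```python
def find_drift(primary, standby):
    """Compare two {component: version} inventories and report any differences."""
    drift = {}
    for component in sorted(set(primary) | set(standby)):
        p = primary.get(component)
        s = standby.get(component)
        if p != s:
            # None means the component is missing at that site.
            drift[component] = (p, s)
    return drift

# Hypothetical inventories for illustration only.
primary = {"web": "2.4.1", "db": "11.3", "queue": "3.8"}
standby = {"web": "2.4.1", "db": "11.2"}

for component, (p, s) in find_drift(primary, standby).items():
    print(f"{component}: primary={p} standby={s}")
```

Running a check like this as a release step would surface a standby system that missed an upgrade before it matters.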

Unlike simulation, parallel or full-scale testing, there is no pretense that unit testing is validating your DRP. It is validating the technology, and that’s all, so it provides a good complement to tabletop testing.

Conclusion

Is it important to validate standby equipment? Yes, but if that’s the focus of your DR testing, you aren’t truly validating your DRP. Use simulation or parallel testing to validate your recovery site and standby systems, and unit testing as your environment changes for ongoing validation — but make annual tabletop testing your primary methodology for practicing and verifying end-to-end DR procedures.


Business Continuity Planning (BCP) “by the book” means starting with a Risk Assessment to identify the types of incidents and risks you need to mitigate. Makes sense, right? How do you guard against something you haven’t identified?

There are two problems with that approach:

  1. Unless you are a fortune teller, odds are you won’t think of every incident that might occur. If you think of 20 risks, it will be the 21st that gets you.
  2. If you take risk assessment to an extreme level to try to guard against that unforeseen 21st risk, you can very quickly get into unrealistic and cartoonish scenarios – meteors, swarms of locusts, and maybe even an alien invasion.

A much more efficient and practical approach is to focus on what your organization requires to be resilient and recover from service interruptions, regardless of the specific type of incident. Continuity requirements can be boiled down to the following:

  • Alternative locations (DR hot-site, command center, alternate office location for business workers, etc.).
  • Redundancy in both technology and people. On the people side, this can be accomplished through cross-training, mentoring, and so on. It doesn’t mean having two people doing the same job.
  • Documented and accessible knowledge base, including standard operating procedures (SOPs).

To develop and document a BCP, you will of course need more detail to spell out who does what and how. To identify those details, define categories of service interruptions rather than specific incidents, and use those categories as the basis for documenting recovery procedures. Service interruptions can be grouped into the following categories:

  • Your building is not accessible. Could be due to a swarm of locusts, a chemical spill, or a fire in the building next door. Doesn’t matter what is causing the incident. The net effect is that staff can’t get into the building.
  • Your building is gone or severely damaged (e.g., from a natural disaster, fire, roof collapse, or even a meteor).
  • Hardware or software failure.
  • Power outage.
  • Network failure.

Let’s examine the “Building is not accessible” scenario in more detail. In this scenario, your equipment is operational. Your recovery procedure is really about people and the ability to remotely access your infrastructure. For example, customer service staff might require an alternate office facility while knowledge-based workers might be able to work from home. Whether the incident is a chemical spill or a swarm of locusts really doesn’t matter.
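The idea of planning by category rather than by cause can be sketched as a simple lookup. The five categories come from the list above; the cause-to-category table, the procedure steps, and the `recovery_steps` helper are hypothetical placeholders, not a complete plan.

```python
from enum import Enum

class Interruption(Enum):
    BUILDING_INACCESSIBLE = "building not accessible"
    BUILDING_LOST = "building gone or severely damaged"
    HARDWARE_SOFTWARE_FAILURE = "hardware or software failure"
    POWER_OUTAGE = "power outage"
    NETWORK_FAILURE = "network failure"

# Hypothetical mapping from a reported cause to its category.
CAUSE_TO_CATEGORY = {
    "chemical spill": Interruption.BUILDING_INACCESSIBLE,
    "swarm of locusts": Interruption.BUILDING_INACCESSIBLE,
    "fire next door": Interruption.BUILDING_INACCESSIBLE,
    "roof collapse": Interruption.BUILDING_LOST,
}

# Placeholder procedure for one category; the others are left undocumented here.
PROCEDURES = {
    Interruption.BUILDING_INACCESSIBLE: [
        "Move customer service staff to the alternate office facility",
        "Have knowledge workers work from home via remote access",
    ],
}

def recovery_steps(cause):
    """Look up the procedure by category, so the specific cause never matters."""
    category = CAUSE_TO_CATEGORY[cause]
    return PROCEDURES.get(category, [])

print(recovery_steps("chemical spill") == recovery_steps("swarm of locusts"))
```

Two very different causes resolve to the same procedure, and a category that returns an empty list points to a recovery procedure that still needs to be written.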

The type of risk assessment that can be useful is exploring the risk of equipment failure and the impact of that failure, and then planning technology enhancements accordingly. For example, in an online catalog application, components such as the Message Queuing servers would be critical due to the risk of data loss. That would be a prime candidate for adding redundancy. However, the goal here is to improve availability and resiliency (again, regardless of the cause of failure).
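To get a feel for what redundancy buys, a common back-of-the-envelope model treats redundant instances as failing independently, so the service is down only when every instance is down. The sketch below uses that model. The 99% single-server availability is an illustrative assumption, not a measured figure, and real failures are often correlated, so treat the results as an upper bound.

```python
def combined_availability(single, copies):
    """Availability of `copies` independent redundant instances, where any one suffices."""
    return 1 - (1 - single) ** copies

# Assumed 99% availability for one message queuing server (illustrative only).
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions the largest gain comes from the second instance, which is why critical single points of failure are the first candidates for redundancy.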

Now if your data center is next door to a nuclear reactor, you don’t need a risk assessment to understand that having an alternate facility in a geographically distant location should be high on your list of priorities. And if your building does get hit by a meteor, you’ll be covered for that too. However, if there’s an alien invasion, all bets are off.


If movies and TV are to be believed, the Zombie Apocalypse is inevitable. What will separate the survivors from the victims is how you handle the unrelenting hordes.

Check out this short video for our zombie-kicking advice on how to keep your business running during these dark times:

So how can you keep your business (and your brain) safe?

  1. Knowledge is power. Identify an infected co-worker, know how to protect yourself, stay safe.
  2. Create a strong group dynamic. Identify key roles and fill them with the right people.
  3. Understand the different types of assailants. Knowing your enemy is key to your safety.

It’s all in the video.  Save yourself… watch it now!
