Full-scale testing — completely shutting down your primary site and failing over to a recovery site — is impractical for most organizations, and it’s actually the least effective form of testing, based on a recent survey:
The challenges with simulation, parallel and full-scale testing
Simulation testing involves bringing recovery facilities and systems online to validate startup procedures and standby systems. Parallel testing goes a step further, restoring from backups and validating production-level functionality. Both methodologies can be executed without impacting your production environment, but both still require a commitment of time, money, and resources.
Full-scale testing adds the risk of service interruption if the recovery site cannot be brought online. Unless you are running parallel production data centers, it is too risky and impractical for most organizations.
However, the biggest issue with the above methodologies is their focus on technology. Where companies usually struggle with DR is with people and processes, and those factors are largely overlooked in technology-based testing. Processes for tasks such as assessing the impact, recalling backups, and coordinating recovery activities go unvalidated.
Why tabletop testing is so much more effective
Tabletop testing gets the technology out of the room — and out of your focus — so you can concentrate on people and processes, and on the entire event rather than just your failover procedures. Specifically, tabletop testing is a paper-based exercise in which your Emergency Response Team (ERT) maps out the tasks that should happen at each stage of a disaster, from discovery, to notifying staff, to the technical steps of executing the recovery.
It’s during these walkthroughs that you discover that half of your ERT doesn’t know where your DR command center is located, that critical recovery information is kept in a locked cabinet in the CIO’s office, or that key staff are assigned so many separate tasks that they would need to be in 10 places at once.
Tabletop testing also makes it easier to play out a wider range of scenarios compared to technology-based testing. Walk through relatively minor events, such as an individual key server failing, or major disasters that take down your entire primary site. Similarly, play out what-if scenarios, such as what happens if key staff members are not available or disk backups have been corrupted.
With parallel testing, you can be sure the technician restoring backups is not dealing with data corruption and that any necessary documentation is readily available (not locked in an office you can no longer access); the focus is on “does the technology work,” not the hundred other things that can go wrong during a recovery. Tabletop testing reveals the people and process gaps that are otherwise so difficult to identify until you are actually in a DR scenario.
Focus on unit testing to validate standby systems
Unit testing was second only to tabletop testing in overall importance to DRP success. In this context, unit testing means validating standby systems as your environment changes, ideally as part of your change management procedures. The recovery site goes through the same release procedure as the primary site, including unit testing affected systems, to ensure that standby systems stay in sync with your primary systems.
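As a loose illustration of this kind of ongoing validation, a release pipeline could compare a manifest of component versions deployed at the primary site against the recovery site and flag any drift. The manifest format and the `check_drift` helper below are hypothetical, a minimal sketch rather than any particular tool:

```python
def check_drift(primary: dict, standby: dict) -> list:
    """Return human-readable drift findings between two site manifests.

    Each manifest maps a component name to its deployed version string.
    """
    findings = []
    for component, version in primary.items():
        standby_version = standby.get(component)
        if standby_version is None:
            findings.append(f"{component}: missing on standby")
        elif standby_version != version:
            findings.append(
                f"{component}: primary={version}, standby={standby_version}"
            )
    # Components deployed only at the recovery site are also drift.
    for component in standby:
        if component not in primary:
            findings.append(f"{component}: present only on standby")
    return findings

# Example: the app server was upgraded on the primary but not the standby,
# and a new queue service was never deployed to the recovery site.
primary = {"app-server": "2.4.1", "db": "14.9", "queue": "3.12"}
standby = {"app-server": "2.3.0", "db": "14.9"}
for finding in check_drift(primary, standby):
    print(finding)
```

Running a check like this as a gate in the same release procedure that updates the primary site is one way to make standby drift visible before a disaster does.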
Unlike simulation, parallel, or full-scale testing, unit testing makes no pretense of validating your DRP. It validates the technology, and only the technology, which makes it a good complement to tabletop testing.
Is it important to validate standby equipment? Yes, but if that’s the focus of your DR testing, you aren’t truly validating your DRP. Use simulation or parallel testing to validate your recovery site and standby systems, and unit testing for ongoing validation as your environment changes — but make annual tabletop testing your primary methodology for practicing and verifying end-to-end DR procedures.