Full-scale testing, which means completely shutting down your primary site and failing over to a recovery site, is impractical for most organizations. According to a recent survey, it is also the least effective form of testing.

The challenges with simulation, parallel and full-scale testing

Simulation testing involves bringing recovery facilities and systems online to validate startup procedures and standby systems. Parallel testing takes this a step further by including restoring from backups and validating production-level functionality. Both methodologies can be executed without impacting your production environment, but still require a commitment of time, money, and resources.

Full-scale testing adds the risk of service interruption if the recovery site cannot be brought online. Unless you are running parallel production data centers, it is too risky and impractical for most organizations.

However, the biggest issue with the above methodologies is the focus on technology. Where companies usually struggle with DR is with people and processes, and those factors are inherently overlooked in technology-based testing. Processes for tasks such as assessing the impact, recalling backups, and coordinating recovery activities are not validated.

Why tabletop testing is so much more effective

Tabletop testing gets the technology out of the room — and out of your focus — so you can concentrate on people and processes, and for the entire event, not just your failover procedures. Specifically, tabletop testing is a paper-based exercise where your Emergency Response Team (ERT) maps out the tasks that should happen at each stage in a disaster, from discovery to notifying staff to the technical steps to execute the recovery.

It’s during these walkthroughs that you discover half of your ERT doesn’t know where your DR command center is located, or that critical recovery information is kept in a locked cabinet in the CIO’s office, or key staff would be required for so many separate tasks that they would need to be in 10 places at once.
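If the walkthrough is captured as a simple list of tasks, the "10 places at once" problem can be spotted mechanically. The following is a minimal sketch, assuming the ERT records each task as a (stage, task, owner) row; the names, stages, and the `max_parallel` threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tabletop output: (stage, task, owner) rows captured during the walkthrough.
TASKS = [
    ("discovery", "Assess impact", "Alice"),
    ("discovery", "Notify ERT", "Alice"),
    ("discovery", "Open command center", "Bob"),
    ("recovery", "Recall backups", "Alice"),
    ("recovery", "Restore database", "Alice"),
    ("recovery", "Restore app servers", "Alice"),
    ("recovery", "Update customers", "Carol"),
]


def overcommitted(tasks, max_parallel=2):
    """Return {(stage, owner): [task, ...]} for owners with more than
    max_parallel tasks in a single stage (threshold is an assumption)."""
    load = defaultdict(list)
    for stage, task, owner in tasks:
        load[(stage, owner)].append(task)
    return {key: names for key, names in load.items() if len(names) > max_parallel}


if __name__ == "__main__":
    for (stage, owner), names in overcommitted(TASKS).items():
        print(f"{owner} has {len(names)} tasks during {stage}: {', '.join(names)}")
```

A spreadsheet filter does the same job; the point is only that the gap becomes visible once the tasks are written down per stage and per person.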

Tabletop testing also makes it easier to play out a wider range of scenarios than technology-based testing does. Walk through relatively minor events, such as an individual key server failing, or major disasters that take down your entire primary site. Similarly, play out what-if scenarios, such as what happens if key staff members are not available or disk backups have been corrupted.

With parallel testing, the technician restoring backups is not dealing with corrupted data, and any necessary documentation is readily available (not locked in an office that you can no longer access). The focus is on “does the technology work” and not the hundred other things that can go wrong during a recovery. Tabletop testing reveals the people and process gaps that are otherwise so difficult to identify until you are actually in a DR scenario.

Focus on unit testing to validate standby systems

Unit testing was second only to tabletop testing in overall importance to DRP success. In this context, unit testing means validating standby systems as your environment changes, ideally as part of your change management procedures. The recovery site goes through the same release procedure as the primary site, including unit testing affected systems, to ensure that standby systems stay in sync with your primary systems.
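One way to picture "staying in sync" is a drift check run as part of each change. This is a sketch only, assuming you can export a system-to-version inventory for each site; the system names, version strings, and the `standby_drift` helper are hypothetical and not tied to any particular tool.

```python
# Hypothetical inventories, e.g. exported from change records for each site.
primary = {"web": "2.4.1", "db": "13.2", "queue": "3.9.0"}
standby = {"web": "2.4.1", "db": "13.1"}


def standby_drift(primary, standby):
    """Return {system: (primary_version, standby_version)} for every system
    whose standby version differs from, or is missing relative to, primary."""
    drift = {}
    for system, version in primary.items():
        standby_version = standby.get(system)  # None: not present at the recovery site
        if standby_version != version:
            drift[system] = (version, standby_version)
    return drift


if __name__ == "__main__":
    for system, (want, have) in standby_drift(primary, standby).items():
        print(f"{system}: primary={want}, standby={have or 'missing'}")
```

A check like this only confirms that versions match. Unit testing the affected standby systems is still what validates that they work.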

Unlike simulation, parallel or full-scale testing, there is no pretense that unit testing is validating your DRP. It is validating the technology, and that’s all, so it provides a good complement to tabletop testing.

Conclusion

Is it important to validate standby equipment? Yes, but if that’s the focus of your DR testing, you aren’t truly validating your DRP. Use simulation or parallel testing to validate your recovery site and standby systems, and unit testing as your environment changes for ongoing validation — but make annual tabletop testing your primary methodology for practicing and verifying end-to-end DR procedures.


Business Continuity Planning (BCP) “by the book” means starting with a Risk Assessment to identify the types of incidents and risks you need to mitigate. Makes sense, right? How do you guard against something you haven’t identified?

There are two problems with that approach:

  1. Unless you are a fortune teller, odds are you won’t think of every incident that might occur. If you think of 20 risks, it will be the 21st that gets you.
  2. If you take risk assessment to an extreme level to try to guard against that unforeseen 21st risk, you can very quickly get into unrealistic and cartoonish scenarios – meteors, swarms of locusts, and maybe even an alien invasion.

A much more efficient and practical approach is to focus on what your organization requires to be resilient and recover from service interruptions, regardless of the specific type of incident. Continuity requirements can be boiled down to the following:

  • Alternative locations (DR hot-site, command center, alternate office location for business workers, etc.).
  • Redundancy in both technology and people. On the people side, this can be accomplished through cross-training, mentoring, and so on. It doesn’t mean having two people doing the same job.
  • Documented and accessible knowledge base, including standard operating procedures (SOPs).

To develop and document a BCP, you will of course need more detail to spell out the who and the how. To identify those details, define categories of service interruptions rather than specific incidents, and use those categories as the basis for documenting recovery procedures. Service interruptions can be grouped into the following categories:

  • Your building is not accessible. Could be due to a swarm of locusts, a chemical spill, or a fire in the building next door. Doesn’t matter what is causing the incident. The net effect is that staff can’t get into the building.
  • Your building is gone or severely damaged (e.g., from a natural disaster, fire, roof collapse, or even a meteor).
  • Hardware or software failure.
  • Power outage.
  • Network failure.

Let’s examine the “Building is not accessible” scenario in more detail. In this scenario, your equipment is operational. Your recovery procedure is really about people and the ability to remotely access your infrastructure. For example, customer service staff might require an alternate office facility while knowledge-based workers might be able to work from home. Whether the incident is a chemical spill or a swarm of locusts really doesn’t matter.
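As a rough sketch of that idea, the categories above can serve as the lookup key for recovery procedures, so that very different incidents resolve to the same plan. The incident names, the `procedure_for` helper, and the placeholder text are illustrative assumptions; only the first procedure is drawn from the scenario described here.

```python
from enum import Enum


class Interruption(Enum):
    BUILDING_INACCESSIBLE = "Building is not accessible"
    BUILDING_LOST = "Building is gone or severely damaged"
    HARDWARE_SOFTWARE_FAILURE = "Hardware or software failure"
    POWER_OUTAGE = "Power outage"
    NETWORK_FAILURE = "Network failure"


# Illustrative mapping: many incidents collapse into a handful of categories.
INCIDENT_CATEGORY = {
    "chemical spill": Interruption.BUILDING_INACCESSIBLE,
    "swarm of locusts": Interruption.BUILDING_INACCESSIBLE,
    "fire next door": Interruption.BUILDING_INACCESSIBLE,
    "roof collapse": Interruption.BUILDING_LOST,
    "meteor strike": Interruption.BUILDING_LOST,
}

# Procedures are documented per category, not per incident.
PROCEDURES = {
    Interruption.BUILDING_INACCESSIBLE: [
        "Move customer service staff to the alternate office facility",
        "Have knowledge-based workers work from home",
        "Confirm remote access to the infrastructure",
    ],
}


def procedure_for(incident):
    """Look up the category for an incident and return (category, steps)."""
    category = INCIDENT_CATEGORY[incident]
    return category, PROCEDURES.get(category, ["(procedure not yet documented)"])


if __name__ == "__main__":
    category, steps = procedure_for("swarm of locusts")
    print(category.value)
    for step in steps:
        print(" -", step)
```

Adding a new incident type means adding one line to the mapping, while the recovery procedure itself stays untouched.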

The type of risk assessment that can be useful is exploring the risk of equipment failure and the impact of that failure, and then planning technology enhancements accordingly. For example, in an online catalog application, components such as the Message Queuing servers would be critical due to the risk of data loss. That would be a prime candidate for adding redundancy. However, the goal here is to improve availability and resiliency (again, regardless of the cause of failure).

Now if your data center is next door to a nuclear reactor, you don’t need a risk assessment to understand that having an alternate facility in a geographically distant location should be high on your list of priorities. And if your building does get hit by a meteor, you’ll be covered for that too. However, if there’s an alien invasion, all bets are off.


Disaster recovery planning (DRP) is often seen as the IT equivalent of a high-cost insurance policy that may never be redeemed, so making the business case for a DRP is difficult at the best of times. For IT departments moving ahead with DRP projects, knowing when to defer business continuity planning (BCP) will keep the DR work on budget and on track.

DRP vs. BCP: Who Owns What?

Info-Tech fields many queries from clients asking who is responsible for continuity and recovery. Strictly speaking, best practices dictate that IT is accountable for DRP, while the business at large assumes responsibility for BCP. Confusion over ownership arises when neither IT nor the business understands what DRP and BCP actually are, or where the differences between the two lie.

At the heart of any recovery or continuity initiative is the ability of the enterprise to identify, quantify, and mitigate risk. A fundamental awareness of risk, combined with a razor-sharp knowledge of how risk impacts critical business processes, is key to developing a comprehensive disaster recovery or business continuity plan. When risk and business impact are misinterpreted or miscommunicated, many problems arise:

  • Lack of unified incident response across the organization.
  • Failure to achieve consensus on standardized recovery processes.
  • Incomplete or non-existent risk assessments, assumptions, and objectives.
  • Insufficient communication plans to coordinate recovery/continuity efforts.
  • Inability to recover data and applications.

Our Recommendations

  1. Make the DRP/BCP distinction crystal clear to executives. Explain that the business must assume the business roles. Otherwise, IT budget and efforts will become overburdened with BCP tasks that may exceed IT’s expertise and/or authority to execute. This can expose both the DRP and the BCP to risk of failure caused by:
    1. Deferring a strategic decision to a tactical level within the company.
    2. A deep-seated misunderstanding of the differences between business/IT risk.
    3. Reliance on technology alone when thinking about business continuity.
    4. Lack of coherent assumptions between the business and IT.
    5. Diverting resources from one project to feed the other.
  2. Convince the business to get involved. Even without BCP work, DRP can be a daunting task. IT will have its hands full, so IT leaders must demonstrate to the business why DRP is standalone, and also why IT should have little to no role in BCP. Make the business aware of the costs of technology downtime, thereby establishing the case that IT should focus its efforts on DRP alone.
    1. $50,000/hour is lost in productivity (for organizations with ~1,000 workers) resulting from business and transactional disruption due to disaster.
    2. $600,000/hour is lost in revenue for large enterprise applications such as ERP, SCM, CRM, and so on.
    3. 40% of enterprises do not have a working DRP in place, while only 22% of these companies have conducted full tests.
    4. Most enterprises fail to meet stated recovery time objectives (RTO) and recovery point objectives (RPO).
  3. If working on BCP, know the key success factors to eliminate rework. When executive ownership of BCP is lacking, IT is forced to assume most, if not all, of the work associated with it. This will involve developing greater knowledge of risk as it relates to corporate objectives, incident management, crisis management, and incident communications between IT, Finance, HR, and other business units. Also, take a combined approach towards the BCP and DRP projects to eliminate rework.
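To make the downtime figures in recommendation 2 concrete for executives, a back-of-the-envelope calculation can help. This sketch assumes, purely for illustration, that both hourly figures apply to the same organization and can simply be added; the outage duration and the `downtime_cost` helper are hypothetical.

```python
# Hourly figures as cited in recommendation 2 (US dollars).
PRODUCTIVITY_LOSS_PER_HOUR = 50_000   # organizations with ~1,000 workers
REVENUE_LOSS_PER_HOUR = 600_000       # large enterprise applications (ERP, SCM, CRM)


def downtime_cost(hours, include_revenue=True):
    """Estimate the cost of an outage lasting `hours`.

    Assumption: the two hourly figures are additive for one organization.
    Set include_revenue=False to count lost productivity only.
    """
    hourly = PRODUCTIVITY_LOSS_PER_HOUR
    if include_revenue:
        hourly += REVENUE_LOSS_PER_HOUR
    return hours * hourly


if __name__ == "__main__":
    print(f"4-hour outage, productivity only: ${downtime_cost(4, include_revenue=False):,}")
    print(f"4-hour outage, with revenue loss: ${downtime_cost(4):,}")
```

Under those assumptions a four-hour outage costs $200,000 in productivity alone and $2,600,000 once lost revenue is included. Substitute your own hourly figures before presenting numbers like these.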

IT is on the hook for disaster recovery plans (DRP), yet IT professionals are often forced to bring their DRP skills to bear on business continuity planning (BCP). Even though IT can act as a facilitator for BCP, you should be able to show leadership why and how BCP is a separate issue from tech-oriented DRP.
