Full-scale testing — completely shutting down your primary site and failing over to a recovery site — is impractical for most organizations, and it's actually the least effective form of testing, based on a recent survey.

The challenges with simulation, parallel and full-scale testing

Simulation testing involves bringing recovery facilities and systems online to validate startup procedures and standby systems. Parallel testing takes this a step further by including restoring from backups and validating production-level functionality. Both methodologies can be executed without impacting your production environment, but still require a commitment of time, money, and resources.

Full-scale testing adds the risk of service interruption if the recovery site cannot be brought online. Unless you are running parallel production data centers, it is too risky and impractical for most organizations.

However, the biggest issue with the above methodologies is the focus on technology. Where companies usually struggle with DR is with people and processes, and those factors are inherently overlooked in technology-based testing. Processes for tasks such as assessing the impact, recalling backups, and coordinating recovery activities are not validated.

Why tabletop testing is so much more effective

Tabletop testing gets the technology out of the room — and out of your focus — so you can concentrate on people and processes, and on the entire event, not just your failover procedures. Specifically, tabletop testing is a paper-based exercise in which your Emergency Response Team (ERT) maps out the tasks that should happen at each stage of a disaster, from discovery to notifying staff to the technical steps to execute the recovery.

It’s during these walkthroughs that you discover half of your ERT doesn’t know where your DR command center is located, or that critical recovery information is kept in a locked cabinet in the CIO’s office, or key staff would be required for so many separate tasks that they would need to be in 10 places at once.

Tabletop testing also makes it easier to play out a wider range of scenarios compared to technology-based testing. Walk through relatively minor events, such as an individual key server failing, or major disasters that take down your entire primary site. Similarly, play out what-if scenarios, such as what happens if key staff members are not available or disk backups have been corrupted.

In a parallel test, you can be sure the technician restoring backups is not dealing with data corruption and that any necessary documentation is readily available (not locked in an office you can no longer access); the focus is on "does the technology work," not on the hundred other things that can go wrong during a real recovery. Tabletop testing reveals the people and process gaps that are otherwise so difficult to identify until you are actually in a DR scenario.

Focus on unit testing to validate standby systems

Unit testing was second only to tabletop testing in overall importance to DRP success. In this context, unit testing means validating standby systems as your environment changes, ideally as part of your change management procedures. The recovery site goes through the same release procedure as the primary site, including unit testing affected systems, to ensure that standby systems stay in sync with your primary systems.
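For illustration, here is a minimal sketch of what such a change management gate might look like, assuming you can query what is deployed at each site: compare the primary site's versions against the standby site's and hold the release if they have drifted. The host names, components, and inventory data below are hypothetical placeholders; in practice you would pull them from your CMDB, configuration management tool, or release pipeline.

```python
# Minimal sketch of a standby-sync check run as a change management gate.
# Host names, components, and the inventory data are hypothetical placeholders.

SAMPLE_INVENTORY = {
    "app01.prod": {"order-service": "2.4.1", "openssl": "3.0.13"},
    "app01.dr":   {"order-service": "2.3.9", "openssl": "3.0.13"},
}


def get_deployed_versions(host):
    """Placeholder lookup; replace with a query against your real inventory."""
    return SAMPLE_INVENTORY[host]


def check_standby_sync(primary, standby):
    """Return components whose versions differ between primary and standby."""
    prod = get_deployed_versions(primary)
    dr = get_deployed_versions(standby)
    return [
        f"{component}: primary={version}, standby={dr.get(component)}"
        for component, version in prod.items()
        if dr.get(component) != version
    ]


if __name__ == "__main__":
    issues = check_standby_sync("app01.prod", "app01.dr")
    if issues:
        print("Standby drift detected; hold the release until it is resolved:")
        for line in issues:
            print("  -", line)
    else:
        print("Standby matches primary for this release.")
```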

Unlike simulation, parallel or full-scale testing, there is no pretense that unit testing is validating your DRP. It is validating the technology, and that’s all, so it provides a good complement to tabletop testing.

Conclusion

Is it important to validate standby equipment? Yes, but if that's the focus of your DR testing, you aren't truly validating your DRP. Use simulation or parallel testing to validate your recovery site and standby systems, and unit testing for ongoing validation as your environment changes — but make annual tabletop testing your primary methodology for practicing and verifying end-to-end DR procedures.


By Frank Trovato, a Research Analyst specializing in mainframe technology and mission critical systems for Info-Tech Research Group

Traditional disaster recovery (DR) challenges include lack of testing and inadequate technology or facilities. However, a much more common issue is simply not recognizing when to invoke DR procedures for less obvious disasters. For example, an Info-Tech survey found that local software and hardware issues were the most common cause of unacceptable downtime – not power outages, network failures, or natural disasters.

Think about what this means. With server virtualization and increasingly sophisticated backup technology, why are so many organizations unable to recover within an acceptable timeframe from relatively non-destructive incidents?

In the above survey, “unacceptable downtime” was defined as downtime that extends beyond Recovery Time Objectives (RTOs) set by the business.

Below are a few of the reasons why relatively minor incidents lead to unacceptable downtime:

1)    Invoking DR procedures is rarely cost-free. For example, if you have a DR hot site managed by a vendor, there is often a surcharge to execute the failover. At the very least, regardless of your DR procedures, there is a productivity cost (e.g. IT staff stopping their normal work to execute the recovery, and potentially interrupting other services as part of the recovery procedure). As a result, management may wait too long to pull the trigger and declare a disaster.

2)    IT departments often have a hero culture, and this can also lead to overconfidence in the ability to resolve an issue without having to invoke DR procedures.

3)    Service management and DR are often managed as separate, disconnected processes, with no clear escalation path from service management to DR.

Organizations that are most successful in overcoming these challenges treat their DRP as an extension of their service management procedures. They have strict timelines and criteria for when to move from service management to disaster recovery, and incorporate this into their escalation rules.

Consider this scenario:

1)    Performance begins to degrade on the back-end transaction server supporting an online ordering system. At this point, end users are not experiencing much delay, but a transaction backlog is building up. It is a critical system, so the incident is assigned a high severity rating and appropriate staff are assigned to investigate and resolve the issue.

2)    The business has defined an RTO of two hours based on a business impact analysis. Executing the recovery procedures (e.g. bringing the standby system online) takes one hour. This leaves IT with one hour to troubleshoot before the RTO is compromised. Although the system is not down yet, the issue is severe enough that it should be treated as if it were.

3)    Performance continues to degrade, but at the one-hour mark the developers working on the problem believe they have identified the root cause and just need 20 more minutes to fix it.

How many times have you seen the developer get that extra time? And how often has the manager come back after 20 minutes to find out the issue is still not resolved and the developer now needs “just five more minutes”? The whole time, performance continues to degrade and the online ordering system is essentially stalled – no orders coming in, customers are experiencing severe delays, and so on.

The above example is also why companies go through a business impact analysis, even an informal one, to determine recovery time objectives and criteria for declaring a disaster, so that the company is not left hanging while an IT hero tries to resolve an outage. Integrating DR thinking into service management procedures enables IT to keep the bigger picture of service continuity and business impact in mind, and minimizes the chances of failing to meet the availability/downtime guidelines set by the business.
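To make the escalation rule concrete, here is a minimal sketch that treats the RTO as a hard budget: the time available for troubleshooting is the RTO minus the time needed to execute the failover. The numbers mirror the scenario above (two-hour RTO, one-hour failover); the incident fields and timestamps are illustrative, not a real service management schema.

```python
# Minimal sketch of an RTO-driven escalation rule:
#   troubleshooting budget = RTO - time needed to execute the failover.
# Fields and timestamps are illustrative placeholders.

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    system: str
    started: datetime
    rto: timedelta            # recovery time objective set by the business
    failover_time: timedelta  # time needed to bring the standby system online


def troubleshooting_budget(incident: Incident) -> timedelta:
    """Time IT can spend troubleshooting before DR must be invoked."""
    return incident.rto - incident.failover_time


def must_declare_disaster(incident: Incident, now: datetime) -> bool:
    """True once the troubleshooting window has been used up."""
    return now - incident.started >= troubleshooting_budget(incident)


if __name__ == "__main__":
    incident = Incident(
        system="online ordering back end",
        started=datetime(2024, 1, 15, 9, 0),
        rto=timedelta(hours=2),
        failover_time=timedelta(hours=1),
    )
    # At the one-hour mark, "just 20 more minutes" would already push the
    # recovery past the RTO, so the rule says fail over now.
    now = datetime(2024, 1, 15, 10, 0)
    print("Troubleshooting budget:", troubleshooting_budget(incident))
    print("Declare disaster now?", must_declare_disaster(incident, now))
```

The point is not the code itself but that the threshold is calculated in advance and written into the escalation rules, rather than negotiated in the middle of an outage.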

In summary:

  • Extend your severity definitions to identify potential disaster scenarios.
  • Define escalation rules that account for the time required to prepare for and execute DR procedures.
  • Don't give in to the IT hero mentality. When your troubleshooting time is up, fail over to your standby system so that business operations can continue. Then work on resolving the root cause of the incident.

For more information, see Info-Tech’s solution set Bridge the Gap between Service Management and Disaster Recovery.


By Frank Trovato, a Research Analyst specializing in mainframe technology and mission critical systems for Info-Tech Research Group

Downtime is more likely to be caused by human error or process issues, yet organizations often focus primarily on technology redundancy. An Info-Tech survey found that adding more layers of redundancy (e.g. going to N+2) does not have nearly the same impact as addressing people and process issues. Organizations thinking about investing tens or hundreds of thousands of dollars into increasing redundancy should first take a look at their people and processes.

For example, the same survey found that having secondary resources in place for mission critical systems was a strong indicator of success in meeting availability objectives. Having secondary resources does not mean paying two people to do the same job, but rather sharing knowledge through a mentorship program or cross-training so the organization is not overly dependent on specific individuals. You need backup people just as much as you need redundant servers.

As far as processes are concerned, don't assume staff are already following good processes, or that normal processes for production systems are rigorous enough for mission critical systems. The higher investment and risk in mission critical systems demand a higher level of attention. For example, a U.S. bank recently discovered its development team was not consistently using source control for mission critical code. Processes must be documented and managed.

On the technology side, when end-to-end redundancy is not possible due to budget limitations, prioritize investments based on risk and impact analysis. That means doing your homework: clearly identifying which systems are mission critical, which systems they depend on (and are therefore also mission critical), what the impact to the business is, and where the single points of failure are.
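As a rough sketch of that homework, the example below marks everything a mission critical system depends on as mission critical too, and then flags any of those components without a redundant peer as a single point of failure. The dependency map, system names, and redundancy list are illustrative placeholders, not a recommended tool.

```python
# Illustrative sketch: derive the full set of mission critical components from
# a dependency map, then flag single points of failure. All names are placeholders.

# system -> systems/components it depends on
DEPENDENCIES = {
    "online-ordering": ["order-db", "payment-gateway", "core-switch"],
    "order-db": ["san-array", "core-switch"],
    "payment-gateway": ["core-switch"],
    "reporting": ["order-db"],
}

MISSION_CRITICAL = {"online-ordering"}        # defined by the business
REDUNDANT = {"payment-gateway", "san-array"}  # components that have a standby/peer


def critical_closure(roots, deps):
    """Return the roots plus everything they transitively depend on."""
    critical, stack = set(roots), list(roots)
    while stack:
        for dep in deps.get(stack.pop(), []):
            if dep not in critical:
                critical.add(dep)
                stack.append(dep)
    return critical


if __name__ == "__main__":
    critical = critical_closure(MISSION_CRITICAL, DEPENDENCIES)
    spofs = sorted(c for c in critical if c not in REDUNDANT)
    print("Mission critical (including dependencies):", sorted(critical))
    print("Single points of failure to prioritize:", spofs)
```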

While technology investments may need to wait, there is no reason to delay addressing the equally (if not more) important people and process aspects of high availability. Simply purchasing and installing more-advanced hardware and software will not deliver four or five nines of availability (99.99% or 99.999% uptime). For more on aligning people, processes, and technology to deliver high availability, see Info-Tech's solution set, Maximize Availability for Mission Critical Systems.


Keep your DRP from becoming expensive shelfware.

  • The enterprise has a DR plan, but has invested nothing in ensuring that it is actionable.
  • Without trained resources, and the validation that only testing provides, the plan is destined for failure.
  • This solution set will help clients ensure that, when required, the enterprise can put its DR plan into action.

Our Advice

Critical Insight

  • DR operations are not just an IT responsibility; DR teams must be staffed from across the enterprise and role redundancy is essential.
  • 40% of enterprises with DR plans never test them. Without testing, staff is never trained and problems are not discovered.

Impact and Result

  • Upon completion of the work outlined in this Solution Set, you will have ensured that the enterprise is fully prepared to respond to a disaster.

Get to Action

1. Get a crash course on staffing and executing DR plans
Be able to educate the rest of IT and the business on what’s involved in operating the DR capabilities.

Storyboard: Make Disaster Recovery Actionable
Bring the DRP to Life: Make the Plan Actionable

2. Build a DR Team
Identify the skills, roles and responsibilities required to operate the enterprise DR capability.

DR Team Build Sheet

3. Test DR capabilities
Make sure that the organization understands the plan and can put it into action should a disaster occur.

DRP Test Worksheet
DRP Test Schedule Worksheet
