Why DRPs Fail – The Answer May Surprise You
June 20, 2012
By Frank Trovato, a Research Analyst specializing in mainframe technology and mission critical systems for Info-Tech Research Group
Traditional explanations for disaster recovery plan (DRP) failures include lack of testing and inadequate technology or facilities. However, a much more common issue is simply not recognizing when to invoke disaster recovery (DR) procedures for less obvious disasters. For example, an Info-Tech survey found that local software and hardware issues were the most common cause of unacceptable downtime – not power outages, network failures, or natural disasters.
Think about what this means. With server virtualization and backup technology that continues to become more and more sophisticated, why are so many organizations unable to recover within an acceptable timeframe from relatively non-destructive incidents?
Below are a few of the reasons why relatively minor incidents lead to unacceptable downtime:
1) Invoking DR procedures is rarely a zero-cost procedure. For example, if you have a DR hot site managed by a vendor, there is often a surcharge to execute the failover. At the very least, regardless of your DR procedures, there is a productivity cost (e.g. IT staff stopping their normal work to execute the recovery, and potentially interrupting other services as part of the recovery procedure). As a result, management may wait too long to pull the trigger and declare a disaster.
2) IT departments often have a hero culture, and this can also lead to overconfidence in the ability to resolve an issue without having to invoke DR procedures.
3) Service management and DR are managed as separate processes, with no clear escalation path from service management to DR.
Organizations that are most successful in overcoming these challenges treat their DRP as an extension of their service management procedures. They have strict timelines and criteria for when to move from service management to disaster recovery, and incorporate this into their escalation rules.
Consider this scenario:
1) Performance begins to degrade on the back-end transaction server supporting an online ordering system. At this point, end users are not experiencing much delay, but a transaction backlog is building up. Because it is a critical system, the incident is assigned a high severity rating and appropriate staff are assigned to investigate and resolve the issue.
2) The business has defined a Recovery Time Objective (RTO) of two hours based on a business impact analysis. Executing the recovery procedures (e.g., bringing the standby system online) takes one hour. This leaves IT with one hour to troubleshoot before the RTO is compromised. Although the system is not down yet, the issue is severe enough that it should be treated as if it were.
3) Performance continues to degrade, but at the one-hour mark the developers working on the problem believe they have identified the root cause and need just 20 more minutes to fix it.
How many times have you seen the developer get that extra time? And how often has the manager come back after 20 minutes to find out the issue is still not resolved and the developer now needs “just five more minutes”? The whole time, performance continues to degrade and the online ordering system is essentially stalled – no orders coming in, customers are experiencing severe delays, and so on.
The above example is also why companies go through a business impact analysis, even an informal one, to determine recovery time objectives and criteria for declaring a disaster, so that the company is not left hanging while an IT hero tries to resolve an outage. Integrating DR thinking into service management procedures enables IT to keep the bigger picture of service continuity and business impact in mind, and minimizes the chances of failing to meet the availability/downtime guidelines set by the business. To put this into practice:
- Extend your severity definitions to identify potential disaster scenarios.
- Define escalation rules that account for the time required to prepare for and execute a DR failover.
- Don’t give in to the IT hero mentality. When your troubleshooting time is up, failover to your standby system so that business operations can continue. Then work on resolving the root cause of the incident.
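The escalation arithmetic behind these rules can be sketched in code. Below is a minimal illustration, assuming hypothetical values (the two-hour RTO and one-hour failover time from the scenario above; the safety margin and all names are my own illustration, not from the article): the key point is that the deadline for declaring a disaster is not the RTO itself, but the RTO minus the time the failover takes.

```python
from datetime import datetime, timedelta

# Illustrative values; real numbers come from your business impact
# analysis (BIA) and from timing your tested recovery procedures.
RTO = timedelta(hours=2)                 # maximum tolerable downtime
FAILOVER_DURATION = timedelta(hours=1)   # measured time to bring the standby online
SAFETY_MARGIN = timedelta(minutes=5)     # buffer for the declaration itself

def must_declare_disaster(incident_start: datetime, now: datetime) -> bool:
    """Return True once troubleshooting time is exhausted.

    The failover must *finish* by the RTO, so the decision point is
    the RTO minus the failover duration (less a small safety margin).
    """
    declare_by = incident_start + RTO - FAILOVER_DURATION - SAFETY_MARGIN
    return now >= declare_by

# Example: incident opened at 09:00 with the values above, so the
# declaration deadline is 09:55 regardless of how close a fix feels.
start = datetime(2012, 6, 20, 9, 0)
print(must_declare_disaster(start, datetime(2012, 6, 20, 9, 30)))   # still troubleshooting
print(must_declare_disaster(start, datetime(2012, 6, 20, 10, 0)))   # "20 more minutes" is too late
```

Encoding the rule this way removes the judgment call from the heat of the moment: the "just five more minutes" conversation never happens, because the deadline was agreed to before the incident started.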
For more information, see Info-Tech’s solution set Bridge the Gap between Service Management and Disaster Recovery.