A difficulty in analyzing the mid-market, or mid-range, in any area of information technology is that the mid-range often doesn’t behave as one uniform category. Earlier this year, for example, we subdivided the mid-range storage landscape into two. Now we’re doing the same for this year’s backup software Vendor Landscapes (VLs).

We’re calling the two VLs homogeneous and heterogeneous backup.

  1. Homogeneous backup focuses on vendors that provide backup primarily for Windows and Linux systems. It is homogeneous in that these are all industry-standard x86 systems. Customers are typically at the small to mid-sized end of the SME spectrum. Champions: [withheld].
  2. Heterogeneous backup focuses on products, typically in the mid-sized to enterprise space, that support a range of architectures. While x86 remains a critical component here, we are also looking at support for proprietary UNIX systems up to and including mainframes. Champions: [withheld].

As with our unified storage array landscapes, we find that solutions in the small to mid-range come from multiple antecedents and tend to overlap in terms of market coverage. In storage, for example, you have traditional enterprise solutions, typically based on Fibre Channel networking, that have come down-market to the mid-range. Then there are the iSCSI and NAS players that started at the smaller end and grew up-market.

Similarly, in backup, there are products that began in larger heterogeneous enterprises as far back as the 1980s, and there are more recent entrants that catered to the smaller, primarily Windows-based, end of the market. When x86 servers became a data center staple, the former big iron titles expanded their reach, and the former Windows backup titles expanded their capacity. Now they’re all playing in the mid-range, and there is considerable overlap for potential mid-range customers.

[Figure: fig1-BU-software]

Treating the mid-range as one market can be problematic for product differentiation, particularly for vendors that have multiple product offerings. In storage, if Dell is a leader, is it for Dell EqualLogic, is it for Dell Compellent, or is it for both? In backup, is Symantec being evaluated for Backup Exec or for NetBackup?

On the other hand, there are vendors that have one product whose sweet spot is precisely in the middle of the mid-range, right in that overlap zone of small-to-mid and mid-to-large. CommVault is such a vendor and product in the mid-range backup space. Its lineage is in Windows backup, but it has grown up to take on the enterprise titles at the larger end.

[Figure: fig2-BU-software]

We hope having two backup VLs rather than one will improve the clarity of our industry view. If that isn’t enough, we’ve recently published a third VL on virtual infrastructure backup. Effective backup of virtual machines is becoming critical as more server infrastructure is virtualized. In addition to the big system/little system predecessors to modern backup, there is also a group of players that come from a pure-play virtual backup realm (Veeam, Vizioncore, PhdVirtual).

For more information, please see:

 


Full-scale testing — completely shutting down your primary site and failing over to a recovery site — is impractical for most organizations, and a recent survey found it to be the least effective form of testing.

The challenges with simulation, parallel and full-scale testing

Simulation testing involves bringing recovery facilities and systems online to validate startup procedures and standby systems. Parallel testing takes this a step further by including restoring from backups and validating production-level functionality. Both methodologies can be executed without impacting your production environment, but still require a commitment of time, money, and resources.

Full-scale testing adds the risk of service interruption if the recovery site cannot be brought online. Unless you are running parallel production data centers, it is too risky and impractical for most organizations.

However, the biggest issue with the above methodologies is the focus on technology. Where companies usually struggle with DR is with people and processes, and those factors are inherently overlooked in technology-based testing. Processes for tasks such as assessing the impact, recalling backups, and coordinating recovery activities are not validated.

Why tabletop testing is so much more effective

Tabletop testing gets the technology out of the room — and out of your focus — so you can concentrate on people and processes across the entire event, not just your failover procedures. Specifically, tabletop testing is a paper-based exercise where your Emergency Response Team (ERT) maps out the tasks that should happen at each stage in a disaster, from discovery to notifying staff to the technical steps to execute the recovery.

It’s during these walkthroughs that you discover half of your ERT doesn’t know where your DR command center is located, or that critical recovery information is kept in a locked cabinet in the CIO’s office, or key staff would be required for so many separate tasks that they would need to be in 10 places at once.

Tabletop testing also makes it easier to play out a wider range of scenarios compared to technology-based testing. Walk through relatively minor events, such as an individual key server failing, or major disasters that take down your entire primary site. Similarly, play out what-if scenarios, such as what happens if key staff members are not available or disk backups have been corrupted.

With parallel testing, you can be sure that the technician restoring backups is not dealing with data corruption and that any necessary documentation is readily available (not locked in an office you can no longer access); the focus is on “does the technology work,” not on the hundred other things that can go wrong during a recovery. Tabletop testing reveals those people and process gaps that are otherwise so difficult to identify until you are actually in a DR scenario.

Focus on unit testing to validate standby systems

Unit testing was second only to tabletop testing in overall importance to DRP success. In this context, unit testing means validating standby systems as your environment changes, ideally as part of your change management procedures. The recovery site goes through the same release procedure as the primary site, including unit testing affected systems, to ensure that standby systems stay in sync with your primary systems.
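As a purely illustrative sketch of that idea, the Python snippet below compares component-to-version manifests for a primary and a standby system and reports any drift. The component names and manifests are hypothetical; in a real pipeline they would come from your configuration management or deployment tooling rather than being hard-coded.

```python
# Illustrative sketch only: one way a release pipeline could flag primary/standby
# drift. The manifests are hypothetical; real ones would come from your
# configuration management or deployment tooling.

def find_drift(primary: dict[str, str], standby: dict[str, str]) -> list[str]:
    """Return human-readable differences between two {component: version} maps."""
    issues = []
    for component, version in primary.items():
        if component not in standby:
            issues.append(f"{component} is missing on the standby")
        elif standby[component] != version:
            issues.append(
                f"{component} version mismatch: primary={version}, "
                f"standby={standby[component]}"
            )
    for component in standby.keys() - primary.keys():
        issues.append(f"{component} is on the standby but not the primary")
    return issues


if __name__ == "__main__":
    primary = {"order-service": "2.4.1", "db-schema": "118", "openssl": "3.0.13"}
    standby = {"order-service": "2.3.9", "db-schema": "118"}
    for issue in find_drift(primary, standby):
        print("DRIFT:", issue)
```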

Unlike simulation, parallel or full-scale testing, there is no pretense that unit testing is validating your DRP. It is validating the technology, and that’s all, so it provides a good complement to tabletop testing.

Conclusion

Is it important to validate standby equipment? Yes, but if that’s the focus of your DR testing, you aren’t truly validating your DRP. Use simulation or parallel testing to validate your recovery site and standby systems, and unit testing as your environment changes for ongoing validation — but make annual tabletop testing your primary methodology for practicing and verifying end-to-end DR procedures.


By Frank Trovato, a Research Analyst specializing in mainframe technology and mission critical systems for Info-Tech Research Group

Traditional challenges include lack of testing and inadequate technology or facilities. However, a much more common issue is simply not recognizing when to invoke disaster recovery (DR) procedures for less obvious disasters. For example, an Info-Tech survey found that local software and hardware issues were the most common cause of unacceptable downtime – not power outages, network failures, or natural disasters.

Think about what this means. With server virtualization and backup technology becoming ever more sophisticated, why are so many organizations unable to recover within an acceptable timeframe from relatively non-destructive incidents?

In the above survey, “unacceptable downtime” was defined as downtime that extends beyond Recovery Time Objectives (RTOs) set by the business.

Below are a few of the reasons why relatively minor incidents lead to unacceptable downtime:

1)    Invoking DR procedures is rarely cost-free. For example, if you have a DR hot site managed by a vendor, there is often a surcharge to execute the failover. At the very least, regardless of your DR procedures, there is a productivity cost (e.g. IT staff stopping their normal work to execute the recovery, and potentially interrupting other services as part of the recovery procedure). As a result, management may wait too long to pull the trigger and declare a disaster.

2)    IT departments often have a hero culture, and this can also lead to overconfidence in the ability to resolve an issue without having to invoke DR procedures.

3)    Service management and DR are managed as distinct, separate processes. There is no clear escalation path from service management to DR.

Organizations that are most successful in overcoming these challenges treat their DRP as an extension of their service management procedures. They have strict timelines and criteria for when to move from service management to disaster recovery, and incorporate this into their escalation rules.

Consider this scenario:

1)    Performance begins to degrade on the back-end transaction server supporting an online ordering system. At this point, end users are not experiencing much delay, but a transaction backlog is building up. It is a critical system, so it is assigned a high severity rating and appropriate staff are assigned to investigate and resolve the issue.

2)    The business has defined a Recovery Time Objective (RTO) of two hours based on a business impact analysis. Executing the recovery procedures (e.g. bringing the standby system online) takes one hour. This leaves IT with one hour to troubleshoot before the RTO is compromised (this arithmetic is sketched just after the scenario). Although the system is not down yet, the issue is severe enough that it should be treated as if it were.

3)    Performance continues to degrade, but at the one-hour mark the developers working on the problem believe they have identified the cause and just need 20 more minutes to fix it.
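That one-hour troubleshooting window comes straight from RTO arithmetic, which is simple enough to bake directly into your escalation rules. Here is a minimal sketch using the two-hour RTO and one-hour recovery time from the scenario; the timestamp and function name are illustrative.

```python
# Minimal sketch of the escalation arithmetic from the scenario above. The
# two-hour RTO and one-hour recovery time come from the example; the timestamp
# and function name are illustrative.

from datetime import datetime, timedelta

RTO = timedelta(hours=2)                 # business-defined maximum tolerable downtime
RECOVERY_EXECUTION = timedelta(hours=1)  # time needed to bring the standby online


def failover_deadline(incident_start: datetime) -> datetime:
    """Latest moment to stop troubleshooting and declare a disaster without blowing the RTO."""
    troubleshooting_window = RTO - RECOVERY_EXECUTION
    return incident_start + troubleshooting_window


if __name__ == "__main__":
    start = datetime(2012, 6, 1, 9, 0)   # hypothetical incident start time
    print(f"Failover must begin no later than {failover_deadline(start):%H:%M}.")
```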

How many times have you seen the developer get that extra time? And how often has the manager come back after 20 minutes to find out the issue is still not resolved and the developer now needs “just five more minutes”? The whole time, performance continues to degrade and the online ordering system is essentially stalled – no orders coming in, customers are experiencing severe delays, and so on.

The above example is also why companies go through a business impact analysis, even if it’s an informal one, to determine recovery time objectives and criteria for declaring a disaster, so that the company is not left hanging while an IT hero tries to resolve an outage. Integrating DR thinking into service management procedures enables IT to keep the bigger picture of service continuity and business impact in mind, and minimizes the chances of failing to meet the availability/downtime guidelines set by the business.

In summary:

  • Extend your severity definitions to identify potential disaster scenarios.
  • Define escalation rules that account for the time required to prepare for and execute DR procedures.
  • Don’t give in to the IT hero mentality. When your troubleshooting time is up, failover to your standby system so that business operations can continue. Then work on resolving the root cause of the incident.

For more information, see Info-Tech’s solution set Bridge the Gap between Service Management and Disaster Recovery.


By Frank Trovato, a Research Analyst specializing in mainframe technology and mission critical systems for Info-Tech Research Group

Downtime is more likely to be caused by human error or process issues, yet organizations often focus primarily on technology redundancy. An Info-Tech survey found that adding more layers of redundancy (e.g. going to N+2) does not have nearly the same impact as addressing people and process issues. Organizations thinking about investing tens or hundreds of thousands of dollars into increasing redundancy should first take a look at their people and processes.
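A back-of-the-envelope calculation shows why. The numbers below are made up purely for illustration (they are not survey data), but they capture the pattern: once people and process failures impose their own ceiling, going from two redundant nodes to three barely moves overall availability.

```python
# Purely illustrative arithmetic -- the failure rates are invented, not survey data.
# Hardware redundancy multiplies out quickly, but overall availability is roughly
# capped by the people/process factor, so extra nodes add little.

def redundant_availability(single_node: float, nodes: int) -> float:
    """Availability of N parallel nodes, assuming independent hardware failures."""
    return 1 - (1 - single_node) ** nodes

hardware_single = 0.99    # hypothetical availability of one node
process_factor = 0.995    # hypothetical ceiling imposed by people/process errors

for nodes in (2, 3):      # roughly an N+1 vs. N+2 comparison
    hw = redundant_availability(hardware_single, nodes)
    overall = hw * process_factor
    print(f"{nodes} nodes: hardware {hw:.6f}, combined with process errors {overall:.6f}")
```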

For example, the same survey found that having secondary resources in place for mission critical systems was a strong indicator of success in meeting availability objectives. Secondary resources do not mean paying two people to do the same job, but rather sharing knowledge through a mentorship program or cross-training so the organization is not overly dependent on specific individuals. You need backup people just as much as you need redundant servers.

As far as processes are concerned, don’t assume staff are already following good processes, or that normal processes for production systems are rigorous enough for mission critical systems. The higher level of investment and risk in mission critical systems demands a higher level of attention. For example, a U.S. bank recently discovered its development team was not consistently using source control for mission critical code. Processes must be documented and managed.

On the technology side, when end-to-end redundancy is not possible due to budget limitations, prioritize investments based on risk and impact analysis. That means doing your homework in terms of clearly identifying which systems are mission critical, what are their dependencies (and therefore also mission critical), what is the impact to the business, and where are the single points of failure.
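One simple way to act on the “dependencies are also mission critical” point is to expand your list of mission critical systems through their dependency graph. The sketch below does exactly that; the systems and dependencies are hypothetical, and in practice they would come from your CMDB or architecture documentation.

```python
# Sketch only: expand a set of mission critical systems to include everything they
# transitively depend on. The dependency map is hypothetical; a real one would come
# from your CMDB or architecture documentation.

from collections import deque

DEPENDENCIES = {
    "online-ordering": ["transaction-db", "auth-service"],
    "transaction-db": ["san-array"],
    "auth-service": ["ldap"],
    "reporting": ["transaction-db"],
}


def mission_critical_closure(critical: set[str]) -> set[str]:
    """Return the given systems plus all systems they transitively depend on."""
    result = set(critical)
    queue = deque(critical)
    while queue:
        system = queue.popleft()
        for dep in DEPENDENCIES.get(system, []):
            if dep not in result:
                result.add(dep)
                queue.append(dep)
    return result


print(sorted(mission_critical_closure({"online-ordering"})))
# -> ['auth-service', 'ldap', 'online-ordering', 'san-array', 'transaction-db']
```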

In the meantime, while technology investments may need to be delayed, there is no reason to delay addressing the equally (if not more) important people and process aspects of high availability. Simply purchasing and installing more-advanced hardware and software will not deliver four-nines or five-nines availability. For more on aligning people, processes, and technology to deliver high availability, see Info-Tech’s solution set, Maximize Availability for Mission Critical Systems.


Can your IT department articulate its ability to withstand and recover from a disaster? Knowing IT’s existing ability to withstand and recover from disaster provides a baseline from which all future disaster recovery (DR) enhancements and/or downgrades can be made.

However, without understanding where the business’ needs begin and end, IT will be blindly assembling disaster recovery objectives. The organization will either waste money on unneeded DR or won’t be fully prepared for disasters.

Business buy-in is not as elusive as you might imagine, but here are some tips just in case:

  • Many organizations have found that simply explaining DR’s relevance to the business and the company’s survivability goes a long way in generating buy-in.
  • If you have trouble getting buy-in from the business group, try focusing on one key individual. If you can win over a business leader and have them champion DR to the rest of the departments, then the process should be much smoother.

One of our clients, a consulting company, went so far as to place an executive from the business side of the organization in charge of the DR initiative in order to get buy-in for the project from both IT and the business. Due to his connections with other business stakeholders and the relevance of the project to IT, the executive was able to collect input from both sides and build the organization’s DR capabilities to the satisfaction of all involved.

Milestones on the Path to Understanding

You can’t know which direction your organization should head in until you know where it stands. Ask yourself:

  1. What is IT currently doing? Are there multiple data centers? How often is data backed up? What are the general practices around storing data and fixing technology problems? Whether IT realizes it or not, aspects of DR might already be incorporated into their standard operating procedures.
  2. How do these practices translate into measurable statistics? Once IT recognizes what’s being done, it becomes a matter of recording how effective those practices are.

Recovery objectives are a useful metric for determining effectiveness. They set the level of your organization’s DR capability. The Recovery Time Objective (RTO) is the amount of time an organization can afford to have its systems down (e.g. the organization’s systems can be down no longer than one hour). The Recovery Point Objective (RPO) is the point in time beyond which an organization cannot afford to lose information (e.g. the organization can afford to lose 24 hours of data/processing).

RTOs and RPOs vary depending on the needs of the organization and the criticality of the system or data they apply to; they can range from less than an hour to more than a week. Off-site backup does not result in RTOs and RPOs of zero. Unless data is streamed to redundant facilities and simultaneously processed, outages and data loss can still occur.
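To make that concrete, here is a small, purely illustrative check (all of the numbers are hypothetical) of whether a nightly off-site backup scheme actually meets a 24-hour RPO and an 8-hour RTO. Worst-case data loss is roughly the backup interval plus the time it takes to get the copy off-site, and worst-case downtime includes retrieving and restoring that copy.

```python
# Purely illustrative numbers: test whether a nightly off-site backup scheme meets
# a 24-hour RPO and an 8-hour RTO in the worst case.

from datetime import timedelta

rpo = timedelta(hours=24)              # tolerable data loss
rto = timedelta(hours=8)               # tolerable downtime

backup_interval = timedelta(hours=24)  # nightly backup
offsite_lag = timedelta(hours=4)       # time before the copy is safely off-site
retrieve_and_restore = timedelta(hours=10)

worst_case_data_loss = backup_interval + offsite_lag
worst_case_downtime = retrieve_and_restore

print(f"Worst-case data loss {worst_case_data_loss} vs RPO {rpo}: "
      f"{'OK' if worst_case_data_loss <= rpo else 'RPO missed'}")
print(f"Worst-case downtime {worst_case_downtime} vs RTO {rto}: "
      f"{'OK' if worst_case_downtime <= rto else 'RTO missed'}")
```

In this illustrative example both objectives are missed, which is exactly the point: off-site backup alone rarely gets you anywhere near zero.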

Knowing what recovery infrastructure and systems are in place is the first step in understanding how your organization can improve recovery times. If you know what you currently have, then it’s much easier to identify what you still need. A review of your organization’s resources may also identify what can be cut, thereby saving your organization some unnecessary expense.

You can save time and money by properly scoping your disaster recovery capabilities prior to creating the actual DR plan. Properly scoping DR will prevent overspending and ensure a good return on investment.
