Determine the Right Level of Business Continuity

Business Continuity

Business continuity (BC) and disaster recovery (DR) are important to any organization, and it pays to be proactive before a disaster strikes. If you don’t have a plan, it’s time to develop one. If you do have a plan, it might be time to examine whether it’s current and meets today’s needs and demands.

Organizations should consider three primary aspects of BC: high availability (HA), continuous operations and DR. The first, HA, keeps the environment running even if a single component fails. Individual servers and storage systems often have redundant components, designed so that there is no single point of failure. Server and storage systems can also be clustered, so that the failure of any one system doesn’t bring down the environment. Continuous operations is the ability to perform routine operations without a planned shutdown. This includes taking online backups of databases, adding or upgrading hardware capacity, and upgrading firmware or software on running systems. Lastly, DR is the ability to recover an environment after an unplanned outage occurs. Recovery may take place at a different location, or at the same location after the problems have been resolved.

For many organizations, the first two are handled through purchasing hardware and software and configuring them to follow best practices. But the third one, DR, seems to elude many companies. An NTT Communications survey of organizations (bit.ly/2zKvlOh) found that 50 percent of respondents don’t have a documented DR plan. And half of those surveyed reported using data backup as their only DR plan. Additionally, 55 percent of respondents aren’t testing their recovery plans regularly, and 23 percent have never conducted testing.

Look at Disaster Effects

Shocked? You should be; disasters happen. Focusing on a disaster’s effects, rather than its causes, helps BC/DR planners find a solution more quickly. All effects generally fall into one of four categories: workforce shortage, loss of technology, loss of facilities or failure in the supply chain. This article discusses loss of technology and facilities, and what organizations can do to prepare ahead of a disaster.

Ideally, operations should continue or resume quickly after a disaster. Failure to do so can mean a loss of business and trust. The recovery methodology should be reliable and predictable, and its cost should be manageable. Critical personal and business data should be protected and secure throughout the process. A typical recovery involves a bottom-up approach with the following steps:

  1. Identify the location. When an outage occurs, management should assess a course of action and decide where the recovery will occur. This could be another location you own, a third-party facility or a cloud service provider.
  2. Recover the data. Do you know what data you had at the time of the disaster? What data can you recover? How will you handle the rest?
  3. Re-host applications. The applications need to be restarted, possibly on bare metal servers that are different from what you had at the primary location. The recovery may include OSes and device drivers unique to the new equipment. The use of VMs might help hide some of those differences, but now isn’t the time to figure out a physical-to-virtual deployment.
  4. Human success factors. Don’t focus only on the technological aspects. Also look at what I call my “five Cs”: command and control; communication and collaboration; telephone and network connectivity; contingency; and counseling. Your staff may be scattered, with some at the primary location, some at the DR facility and others somewhere else altogether.

Two metrics are used to measure DR: recovery point objective (RPO) and recovery time objective (RTO). The technology employed determines the RPO, which is the time from when the data was last backed up to the time the disaster happened. Backing up to tape once per day represents a 24-hour RPO. Mirroring data to flash and disk located at the disaster facility reduces this to seconds.
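
The RPO arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample timestamps are assumptions made for this example, not part of any product or standard.

```python
from datetime import datetime, timedelta


def data_loss_window(last_copy: datetime, disaster: datetime) -> timedelta:
    """Time between the last good copy of the data and the disaster."""
    if disaster < last_copy:
        raise ValueError("disaster time precedes the last copy")
    return disaster - last_copy


def worst_case_rpo(copy_interval: timedelta) -> timedelta:
    """Worst case: the disaster strikes just before the next scheduled copy."""
    return copy_interval


# Hypothetical example: nightly tape backup at 1:00, disaster at 19:30.
last_backup = datetime(2018, 1, 10, 1, 0)
disaster = datetime(2018, 1, 10, 19, 30)
loss = data_loss_window(last_backup, disaster)

# A once-per-day backup schedule gives a 24-hour worst case.
daily_rpo = worst_case_rpo(timedelta(days=1))
```

In this sample the data written during the 18.5 hours before the disaster would be lost, which is within the 24-hour worst case for a daily backup.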

The automation employed determines the RTO, which is the time from when the disaster happened to the time your business process is operational again. This includes the time for management to assess the situation, recover the data, re-host the applications and correct any partial or incomplete transactions. Depending on how manual or automated your recovery is, this can be measured in days, hours or minutes.
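
One simple way to reason about RTO is to add up the phases listed above. The following Python sketch assumes the phases run one after another; the phase names and durations are hypothetical figures chosen for illustration, not measurements.

```python
from datetime import timedelta
from typing import Mapping


def estimate_rto(phases: Mapping[str, timedelta]) -> timedelta:
    """Sum the duration of each recovery phase, assuming they run in sequence."""
    return sum(phases.values(), timedelta())


# Hypothetical durations for a largely manual recovery.
manual_recovery = {
    "assess_situation": timedelta(hours=4),
    "recover_data": timedelta(hours=24),
    "rehost_applications": timedelta(hours=12),
    "correct_transactions": timedelta(hours=2),
}

rto = estimate_rto(manual_recovery)
```

With these sample figures the estimate comes to 42 hours. Shortening any one phase through automation reduces the total directly.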

Plan of Action

While the recovery is bottom-up, the BC/DR plan itself should be developed top-down.

Focus first on the business process as a unit of recovery. Let’s take payroll as an example. Payroll involves three applications: gathering the hours each employee worked, performing some business logic such as calculating tax withholdings, and then printing checks or sending funds electronically via direct deposit. It does no good to only recover one or two of those applications; you need all three to run payroll.
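
The idea of a business process as the unit of recovery can be expressed as a set check: the process counts as recovered only when every application it depends on is running. A minimal sketch, with application names invented for this example:

```python
from typing import List, Set

# Hypothetical names for the three payroll applications described above.
PAYROLL_APPS = {"time_collection", "payroll_calculation", "payment_disbursement"}


def process_is_recovered(required: Set[str], recovered: Set[str]) -> bool:
    """A process is usable only if all of its required applications are recovered."""
    return required <= recovered


def missing_apps(required: Set[str], recovered: Set[str]) -> List[str]:
    """List the applications that still block the business process."""
    return sorted(required - recovered)


partially_recovered = {"time_collection", "payroll_calculation"}
payroll_ready = process_is_recovered(PAYROLL_APPS, partially_recovered)
still_needed = missing_apps(PAYROLL_APPS, partially_recovered)
```

Here two of the three applications are back, yet payroll still cannot run because the payment step is missing.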

For each business process, prioritize its importance after a disaster. Be pragmatic: Use categories like gold, silver and bronze to rank each and assign a desired RTO for each category (e.g., gold business processes need to be operational in four hours, silver in 48 hours and bronze within two weeks).
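
The gold, silver and bronze example above maps naturally onto a small lookup table. In this sketch the RTO targets come from the example in the text, while the business processes and their category assignments are hypothetical.

```python
from datetime import timedelta
from typing import List, Mapping

# RTO targets taken from the example categories in the text.
RTO_TARGETS = {
    "gold": timedelta(hours=4),
    "silver": timedelta(hours=48),
    "bronze": timedelta(weeks=2),
}

# Hypothetical assignment of business processes to categories.
PROCESS_CATEGORY = {
    "payroll": "silver",
    "order_entry": "gold",
    "internal_newsletter": "bronze",
}


def recovery_order(process_category: Mapping[str, str]) -> List[str]:
    """Order processes by their RTO target, shortest first, then by name."""
    return sorted(
        process_category,
        key=lambda name: (RTO_TARGETS[process_category[name]], name),
    )


def meets_target(category: str, actual_recovery_time: timedelta) -> bool:
    """Check a measured recovery time against the category's RTO target."""
    return actual_recovery_time <= RTO_TARGETS[category]


order = recovery_order(PROCESS_CATEGORY)
```

A table like this also gives a DR test something concrete to measure against: a gold process that took five hours to recover in a test has missed its target.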

Identify the applications and data required to support each business process and the needed server, storage and network infrastructure. Can these run in a cloud? Do you have what you need at your designated DR facility?

IBM conference attendees in 1983 documented a standard set of DR levels. The business continuity tiers were ranked from “least expensive, longest time to recover” to “most expensive, fastest recovery,” and have stood the test of time (see Figure 1). Over three decades later, these are still the standards used for BC/DR planning:

BC Tier 1: Restore from offline media, most often tape
Typically, backup tapes are created and periodically shipped to an offsite storage facility. Depending on how often this happens, the organization must be prepared to accept several days to weeks of data loss. Recovery will require someone to pick up these tapes and take them to the recovery location.

BC Tier 2: Tapes in a hot site
Similar to Tier 1, but with the offline media shipped to the designated recovery location. A “hot site” implies servers, storage and network equipment are already in place to perform recovery from these backup tapes.

BC Tier 3: Electronic vaulting
Rather than shipping physical media, data is transmitted electronically and stored on tapes or other media at the recovery location. This eliminates many of the problems associated with boxing up backup tapes, providing them to couriers and having them received at the recovery site.

BC Tier 4: Point-in-time copies
Flash and disk storage data can be copied at specific points in time (e.g., every four hours) and electronically sent to the recovery facility. When a disaster occurs, the point-in-time copies may need to be cleaned up for partial or incomplete transactions that were in play at the time the copies were made.

BC Tier 5: Application/database integration
For some business processes, transactional integrity is critical. By integrating with specific application or database capabilities, BC Tier 5 provides for little or no data loss. This can be accomplished by sending transactions to secondary copies of the databases at the recovery locations. This is sometimes called log shipping or database shadowing. Unfortunately, this only addresses the subset of data and applications that support this, leaving all unstructured data to be recovered some other way.

BC Tier 6: Real-time continuous data replication
Often referred to as mirroring, BC Tier 6 eliminates dependencies on application or database capabilities and covers all data, both structured and unstructured. Synchronous replication means that the data at the recovery location is identical to the data at the primary location, with an RPO of zero. Asynchronous means the data at the recovery location may be a few seconds or minutes behind.

BC Tier 7: End-to-end resiliency orchestration
Tier 6 only addresses the data on storage devices. BC Tier 7 adds complete orchestration to address the server and network components of the infrastructure as well. Servers are booted up, applications are started and networks are redirected to match the new location.

All of these tiers are available today with existing technologies. Organizations that have implemented BC Tier 7 can recover from a sitewide disaster in less than 30 minutes.
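
The seven tiers can be summarized as a small catalog. The Python sketch below is a simplification for illustration only: the tier names come from the list above, but the three capability flags and the helper function are assumptions of this example, and a real tier choice would also weigh cost, RPO and RTO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    number: int
    name: str
    electronic_transfer: bool      # data reaches the recovery site without couriers
    continuous_replication: bool   # storage is mirrored in real time
    orchestration: bool            # servers and networks are recovered automatically


# Ordered from least expensive and slowest to most expensive and fastest.
TIERS = [
    Tier(1, "Restore from offline media", False, False, False),
    Tier(2, "Tapes in a hot site", False, False, False),
    Tier(3, "Electronic vaulting", True, False, False),
    Tier(4, "Point-in-time copies", True, False, False),
    Tier(5, "Application/database integration", True, False, False),
    Tier(6, "Real-time continuous data replication", True, True, False),
    Tier(7, "End-to-end resiliency orchestration", True, True, True),
]


def lowest_tier_with(
    electronic_transfer: bool = False,
    continuous_replication: bool = False,
    orchestration: bool = False,
) -> Optional[Tier]:
    """Return the lowest-numbered tier that offers every requested capability."""
    for tier in TIERS:
        if (
            (tier.electronic_transfer or not electronic_transfer)
            and (tier.continuous_replication or not continuous_replication)
            and (tier.orchestration or not orchestration)
        ):
            return tier
    return None


mirrored = lowest_tier_with(continuous_replication=True)
```

Because the list is ordered by tier number, the first match is also the least expensive tier that satisfies the request under this simplified model.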

Keep Testing Current

All of the people involved in the preparation and recovery of information should understand their roles and responsibilities. Management should ensure all procedures are documented, and employees are trained to handle a disaster.

Consider testing your BC/DR plan at least once a year. Some companies test more frequently. Some testing can be done on-premises, at the designated recovery site or at a third location. Even a walk-through, where all parties get together in a conference room and talk through what steps they would take, is a good start.

DR is a business solution, not just a technology one. While this article focused on the effects of losing technology and facilities, your BC/DR plan should also consider the other effects: workforce shortages and supply chain disruption.

Tony Pearson is a Master Inventor and senior IT architect with IBM System Storage.


