
Craft an HA/DR Strategy to Minimize Data Loss

IBM Champion Shawn Bodily shares steps to minimize downtime.


Image by Charles Williams

In our 24-7 digital economy, even brief outages of the enterprise computing platform can be alarmingly expensive. In a recent survey of more than 1,000 businesses worldwide, 98% of respondents said that downtime costs them at least $100,000 per hour; just over one-third of respondents put the cost per hour at between $1 million and $5 million. These figures represent real business costs, excluding penalties and litigation. 

Enterprises recognize the need to put high availability (HA) and disaster recovery (DR) strategies in place. The challenge is how to build strategies that will function effectively when trouble strikes.

HA Versus DR

A common misconception is that HA and DR are the same thing. The two terms are related but distinct. HA refers to the ability of the application environment to remain operational to an agreed-upon standard.

DR refers to strategies put in place to maintain HA even in the face of extreme events (e.g., fire, flooding, earthquake, hurricanes, tornadoes, etc.). This involves maintaining multiple geographically separated facilities mirrored such that if one is offline, the other can take over operations within some specified time frame. An HA strategy won’t necessarily protect against disaster. Conversely, a DR strategy may not necessarily ensure HA. The two strategies need to be distinct but also complementary.

A second frequent misconception involves the cause of downtime. It’s easy to assume that an issue is most likely to arise as a result of hardware failure or disaster. In reality, the vast majority of outages stem from less dramatic events. “The most common cause is probably human-made,” says Shawn Bodily, IBM Champion and senior consulting IT specialist at Clear Technologies. “Human error, OS issues and software problems are probably the biggest causes of downtime.” That point underscores the importance of an HA/DR strategy. Hurricanes, tornadoes and earthquakes may hit only once a century. Human error is available in inexhaustible volumes. 

Setting Goals

Building HA and DR strategies begins with understanding the impact of downtime on the business. Depending on the applications and processes interrupted, the impact of downtime ranges from lost revenues to lost business opportunity to lost customers. Once the business's per-minute or per-hour cost of downtime has been established, that information can be used to determine the level of availability required.

Two other factors help determine the type of HA/DR solution required: recovery time objective (RTO) and recovery point objective (RPO). RTO refers to the maximum amount of time that can pass before the system is back up and running. RPO refers to the point in time to which data must be recoverable, in other words, the maximum amount of recent data, measured in time, that the business can afford to lose.

Is any data loss acceptable? The answer to that question is generally no, which imposes strict requirements for replicating data. RTO and RPO may be dictated by the end customer, the industry and/or governing bodies over them.
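To make the two objectives concrete, here is a minimal sketch of how an outage or recovery drill might be scored against them. The class name, function name and timestamps are hypothetical, chosen only for illustration; they do not come from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


def evaluate_outage(objectives, last_replicated_at, failed_at, restored_at):
    """Compare one outage (real or drill) against the RTO and RPO."""
    # Data written after the last replicated point is lost on failover.
    data_loss_window = failed_at - last_replicated_at
    # Time the service was unavailable.
    recovery_time = restored_at - failed_at
    return {
        "data_loss_window": data_loss_window,
        "recovery_time": recovery_time,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": recovery_time <= objectives.rto,
    }


if __name__ == "__main__":
    # Hypothetical objectives and drill timestamps.
    objectives = RecoveryObjectives(rto=timedelta(minutes=30), rpo=timedelta(0))
    result = evaluate_outage(
        objectives,
        last_replicated_at=datetime(2024, 1, 1, 11, 58),
        failed_at=datetime(2024, 1, 1, 12, 0),
        restored_at=datetime(2024, 1, 1, 12, 20),
    )
    print(result)
```

In this made-up drill the service returns within the RTO, but two minutes of data are missing, so an RPO of zero is not met. That gap is what drives the strict replication requirements discussed above.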

Finally, RTO and RPO get mapped to availability levels. The most commonly cited level of availability is “five nines,” or 99.999%. That corresponds to roughly five minutes and 15 seconds of downtime or less per year. At the next two levels down, four nines and three nines, the allowable downtime grows to roughly 53 minutes and nearly nine hours per year, respectively, and the cost of that downtime adds up quickly. That said, a cost is also associated with achieving the higher levels of availability. “You can take it to the nth degree of redundancy and resiliency, and there are usually additional returns for doing so, but there's also a point of diminishing returns,” says Bodily.
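The downtime allowed by a given availability level is simple arithmetic: the unavailable fraction multiplied by the length of a year. The sketch below assumes a 365-day year; the function names are illustrative only.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap, 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in seconds, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes and whole seconds."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_seconds_per_year(target)
        print(f"{target}% -> {format_duration(budget)} per year")
```

Multiplying the resulting budget by the business's own per-hour cost of downtime gives a rough ceiling on what each additional "nine" is worth paying for.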

Layers of Resiliency

An effective HA/DR solution should demonstrate the following three layers of resiliency: application data resiliency, application infrastructure resiliency and application state resiliency. Application data resiliency refers to replicating data to ensure operations can resume after an outage within the required RPO. Application infrastructure resiliency refers to recreating the application environment at a standby node to enable operations to restart within the desired RTO. Application state resiliency means that the operations start on the standby node in the cluster in the same state the application was in at the time of outage. 

One common HA architecture used to provide this resiliency consists of individual systems (nodes) that are networked together into clusters. Workloads running on a primary node can fail over to a secondary (standby) node. Standby nodes can be configured to provide various levels of resiliency (see Figure 1, below). The trade-offs are cost and complexity.

For organizations that cannot tolerate downtime, running multiple instances of an application with a hot standby or active-active architecture can be effective. The secondary node would be located either within the data center or at a different physical location within the campus cluster. “If one instance goes down, it’s just not providing that service,” Bodily says. “There are other instances, other application environments still providing the same capabilities.” 
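The failover behavior described above can be sketched in a few lines. This is a deliberately simplified illustration with hypothetical class names: it routes each request to the first healthy node in priority order and leaves out the heartbeats, quorum and fencing that real cluster software relies on.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class Cluster:
    """Routes work to the first healthy node, in priority order."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def active_node(self):
        for node in self.nodes:
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")

    def handle(self, request):
        node = self.active_node()
        return f"{node.name} handled {request}"


if __name__ == "__main__":
    primary, standby = Node("primary"), Node("standby")
    cluster = Cluster([primary, standby])
    print(cluster.handle("order-1"))  # served by the primary
    primary.healthy = False           # simulate an outage on the primary
    print(cluster.handle("order-2"))  # served by the standby
```

While the standby is serving requests, the primary can be taken aside and diagnosed without further interruption to users.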

Data resiliency can be achieved through many approaches. Modern databases typically have inherent redundancy through mirroring capabilities. In the aforementioned scenario, the instances of an application might all draw on the same database. If that’s insufficient, the organization can install a second database on separate equipment in a different facility and mirror the data between locations. In the event of a failure, or even an event involving one entire building, the hot standby node is still running at the second location, complete with a full copy of data.

The solution pays dividends in both minimizing downtime and troubleshooting. While the application is running uninterrupted on the secondary node, staff can troubleshoot the primary node to determine the problem's root cause.

Making a DR Plan

DR has much in common with HA strategies, the main differences being that the secondary node has a separate backup database and that it’s located in a geographically removed location from the primary node. One of the most fundamental decisions of DR is where to place the second facility. If a company is located on the Louisiana coast and has a data center backup 25 miles away, there’s a strong likelihood that a hurricane hitting the primary facility will impact the backup center in some manner. It’s essential to choose an independent site with reliable infrastructure and connectivity between the two.


The next decision hinges on RPO. An RPO of zero requires synchronous replication from the primary database to the secondary node. Enterprises often have three options for data replication: through their enterprise storage subsystem, through the application/database or through the OS. All three typically have some sort of data replication functionality. Synchronous replication orchestrated by the application is OS agnostic, while replication orchestrated by the OS is both application and storage agnostic. Finally, storage replication is often OS and application agnostic. The most appropriate choice depends upon the needs of the enterprise, staff expertise and RTO/RPO.
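The link between an RPO of zero and synchronous replication can be shown with a toy model. In the sketch below, which uses hypothetical in-memory classes rather than any real database or storage API, a synchronous primary copies each write to the replica before acknowledging it, while an asynchronous primary acknowledges first and replicates later.

```python
class Replica:
    def __init__(self):
        self.records = []

    def apply(self, record):
        self.records.append(record)


class Primary:
    def __init__(self, replica, synchronous):
        self.replica = replica
        self.synchronous = synchronous
        self.records = []
        self.pending = []  # writes acknowledged but not yet replicated

    def write(self, record):
        self.records.append(record)
        if self.synchronous:
            self.replica.apply(record)   # replicate before acknowledging
        else:
            self.pending.append(record)  # replicate later, on flush()
        return "ack"

    def flush(self):
        """Ship queued writes to the replica (asynchronous mode only)."""
        while self.pending:
            self.replica.apply(self.pending.pop(0))


def records_lost_on_failure(primary):
    """Acknowledged writes the replica would be missing if the primary died now."""
    return len(primary.records) - len(primary.replica.records)


if __name__ == "__main__":
    for mode in (True, False):
        primary = Primary(Replica(), synchronous=mode)
        for i in range(3):
            primary.write(f"txn-{i}")
        print("synchronous" if mode else "asynchronous",
              "-> writes at risk:", records_lost_on_failure(primary))
```

The synchronous primary never has acknowledged writes at risk. The price, which this model does not capture, is that every write must wait for the round trip to the second site.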

Some databases offer a function called log shipping that streamlines data replication. Instead of mirroring the data transaction by transaction, the database automatically backs up the transaction log files from the primary server and restores them on the standby node. Log shipping is more efficient than mirroring in terms of I/O. It’s an economical approach that works with dissimilar hardware. It’s reliable and easy to maintain while being OS and application agnostic.
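The mechanics of log shipping can be illustrated with a small in-memory model. The classes and function below are hypothetical and stand in for a real database's log backup, copy and restore jobs; they are a sketch of the idea, not any vendor's implementation.

```python
class PrimaryDatabase:
    def __init__(self):
        self.data = {}
        self.current_log = []   # operations since the last log backup
        self.log_backups = []   # closed log segments awaiting shipment

    def set(self, key, value):
        self.data[key] = value
        self.current_log.append((key, value))

    def backup_log(self):
        """Close the current log segment so it can be shipped."""
        if self.current_log:
            self.log_backups.append(self.current_log)
            self.current_log = []


class StandbyDatabase:
    def __init__(self):
        self.data = {}

    def restore(self, segment):
        """Replay one shipped log segment in its original order."""
        for key, value in segment:
            self.data[key] = value


def ship_logs(primary, standby):
    """Copy every pending log backup to the standby and replay it in order."""
    shipped = 0
    while primary.log_backups:
        standby.restore(primary.log_backups.pop(0))
        shipped += 1
    return shipped


if __name__ == "__main__":
    primary, standby = PrimaryDatabase(), StandbyDatabase()
    primary.set("a", 1)
    primary.set("b", 2)
    primary.backup_log()   # segment closed and ready to ship
    primary.set("c", 3)    # still in the open log, not yet shipped
    ship_logs(primary, standby)
    print(standby.data)
```

In this model the standby trails the primary by whatever sits in the open log, so the interval between log backups bounds the data at risk and should be chosen with the RPO in mind.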

Get Started With an HA/DR Plan

Implementations vary, but here are a few essential steps to building an effective HA/DR strategy:

  1. Determine recovery time objective (RTO) and recovery point objective (RPO)
  2. Calculate the cost of downtime and the impact it would have on the business 
  3. Audit the system including applications, OSes and storage solutions. What do they offer for high availability capabilities?
  4. Check the infrastructure and determine what’s required to support operations 
  5. Choose appropriate locations for geographically dispersed backup data centers

Staffing Solutions for HA/DR

One important factor often overlooked is the issue of staffing, particularly if IT staff need to evacuate or attend to family and property concerns. Consider staffing strategies to ensure that the necessary expertise will be available at the backup data center. Conversely, craft a strategy bearing in mind that your experts may not always be available during a local outage or major event.

It’s impossible to overstate the importance of having an HA/DR plan. Don’t make decisions based on a natural disaster with a low probability of occurrence. “If a company is doing this just to protect against a hurricane and they decide not to do it, they may have a problem,” says Bodily. “The hurricane probably won’t be responsible for their demise. Their demise is probably going to be caused by something done inadvertently in-house.”

HA/DR solutions provide protection against business losses. Calculate the cost of downtime, then build a solution that delivers the availability required. At the same time, don’t forget that availability reaches a point of diminishing returns. Find the best balance. “It comes down to price, performance and availability,” says Bodily. “As with many things, you can usually have any two out of the three that you want. The important thing is that you find the best strategy for the enterprise, execute it and maintain it.”


