Blog

The Many Names and Faces of Disaster Recovery

July 16, 2018 by Cali Thompson

When discussing disaster recovery, people often throw out a variety of words and terms to describe their strategy. Sometimes, these terms are used interchangeably, even when they mean very different things. In this post, we’ll explore these terms and their usage so you can go into the planning process well-informed.

Disaster Recovery:

This is a term that has been making the rounds since the mid- to late-seventies. Although the meaning has evolved slightly over time, the disaster recovery process generally focuses on preventing loss from natural and man-made disasters, such as floods, tornadoes, hazardous material spills, IT bugs, or bio-terrorism. Many times, a company’s disaster recovery plan is to duplicate their bare metal infrastructure to create geographic redundancy.

Recovery Time Objective (RTO):

As you build your disaster recovery strategy, you must make two crucial determinations. First, figure out how much time you can afford to wait while your infrastructure works to get back up and running after a disaster. This number will be your RTO. Some businesses can only survive without a specific IT system for a few minutes. Others can tolerate a wait of an hour, a day, or a week. It all depends on the objectives of your business.

Recovery Point Objective (RPO):

The second determination an organization must make as they discuss disaster recovery is how much tolerance they have for losing data. For example, if your system goes down, can your business still operate if the data you recover is a week old? Perhaps you can only tolerate a data loss of a few days or hours. This figure will be your RPO.

IT Resilience:

This term measures an organization’s ability to adapt to both planned and unplanned failures, along with their capacity to maintain high availability. Maintaining IT resilience is unique from traditional disaster recovery in that it also encompasses planned events, such as cloud migrations, datacenter consolidations, and maintenance.

Load Balancing:

To gain IT resilience and keep applications highly available, companies must engage in load balancing, which is the practice of building an infrastructure that can distribute, manage, and shift workload traffic evenly across servers and data centers. With load balancing, a downed server is no concern because there are several other servers ready to pick up the slack.

Streaming giant Netflix often tests the load balancing ability of their network with a proprietary program called Chaos Monkey. Using this tool, they ensure that their infrastructure can sustain random failures by purposefully creating breakdowns throughout their environment. This is a great example for companies to follow. Ask yourself: What would happen if someone turned off my server or DDOSed my website? Would everything come crashing to a halt if an employee accidentally deleted a crucial file?

Backup:

Backups are just one piece of the disaster recovery puzzle. Imagine if you took a snapshot of your entire workload and replicated it on a separate server or disc—that is a backup. With backups, you always have a point-in-time copy of your workload to revert back to if something happened to your environment; however, anytime you must revert to a backup, anything created or changed between the time the last snapshot was taken and the time the disaster occurred will be lost.

Failover Cluster:

Another piece of the disaster recovery puzzle, failover clusters are groups of independent servers (often called nodes) that work together to increase the availability and scalability of clustered applications. Connected through networking and software, these servers “failover,” or begin working, when one or more nodes fail.

Which type of failover server you choose depends on how crucial the system is, along with the RPO and RTO objectives of the disaster recovery plan. Failover servers are classified as follows:

Cold Standby: Receives data backups from the production system; is installed and configured only if production fails.
Warm Standy: Receives backups from production and is up and running at all times; in the case of a failure, the processes and subsystems are started on the warm standby to take over the production role.
Hot Standby: This configuration is up and running with up-to-date data and processes that are always ready; however, a hot standby will not process requests unless the production server fails.

Replication:

This term represents the process of copying one server’s application and database systems to another server as part of a disaster recovery plan. Sometimes, this means replacing schedule backups. In fact, replication happens closer to real-time than traditional backups, and therefore can typically yield an adherence to shorter RPO and RTO.

Replication can happen three different ways:

Physical server to physical server
Physical server to virtual server
Virtual server to virtual server

Database Mirroring:

As with backups and replication, database mirroring involves copying a set of data on two different pieces of hardware; however, with database mirroring, both copies run simultaneously. Anytime an update, insertion, or deletion is made on the principal database, it is also made on the mirror database so that your backup is always current.

Journaling:

In the process of journaling, you create a log of every transaction that occurs within a backup or mirrored database. These logs are sometimes moved to another database for processing so that there is a warm standby failover configuration of the database.

At the end of the day, what you really need is business continuity.

A well-formed business continuity plan will use all of these methods to ensure your organization can overcome serious incidents or disasters. Going beyond availability, business continuity plans determine how your business will continue to run at times of trouble. Can your business survive a systems failure? Can it survive a situation where your offices burn down? How quickly can you access your mission-critical data and mission-critical applications? How will people access your mission-critical applications while your primary servers are down? Do you need VPNs so employees can work from home or from a temporary space? Have you tested and retested your business continuity plan to ensure you can actually recover? Does your plan follow all relevant guidelines and regulations?

The right mix of solutions will depend on the way your business operates, the goals you’re trying to achieve, and your RPO and RTO targets. In the end, the resilience of any IT infrastructure or business comes down to planning, design, and budget. With the right partner to provide disaster recovery and business continuity management services, you can come up with a smart plan that proactively factors in all risk, TCO goals, and availability objectives.

To start planning your own battle-tested IT disaster recovery plan and business continuity strategy—and ensure that your business is ready for anything—contact one of our experts for a free risk assessment today.