Anatomy of an outage: Understanding the lifecycle of downtime

Thu, 19th Apr 2018

By Eric Vaughn, Chief Revenue Officer, Neverfail

No one likes dealing with an outage.

Outages are bad for nearly every aspect of a business, from employee morale to productivity, competitiveness, profitability, and reputation.

It's a common problem, though, and there are lots of noisy claims on the market from many different vendors. You'll hear everything from “Recover in minutes” to “Avoid downtime” to “Zero data loss”.

Figuring out which claims are bogus and which ones make sense for YOUR operation is daunting. Each organization has unique requirements and, since budgets are never unlimited, IT managers must do more than simply implement gold-plated protection for every application and data repository.

What exactly is an outage?

Before you can decide what the optimum recovery time objectives (RTOs) and recovery point objectives (RPOs) are for your business, and select the right disaster recovery (DR) and business continuity (BC) services, you need to understand what constitutes an “outage.” It's not as obvious as it sounds.

To understand the concept clearly, let's break it down into four components:

Awareness
Resolution
Failover
Recovery

An outage includes all four stages, and the time to deal with an outage completely is the time required to get through all of them. The failover itself is usually the shortest part of any outage, and, as a result, you'll often hear vendors focus almost exclusively on their short recovery times. Fine, as far as it goes.

But ignoring the other three stages will get you into hot water. When you're making commitments to the business about how short any potential outages will be, it pays to be realistic.

Stage 1: Awareness

The first stage is often the longest: figuring out that you actually have an outage.

IT often finds out about a problem when users start calling to complain that they can't work. At this point, the outage has been underway for some time and is already having an impact on the business.

Working out what is really going on (is it an outage or user error, for example?) can often take an hour or more. Once you've confirmed that you're dealing with an outage, you can move immediately to Stage 2.

Stage 2: Resolution

Now you must triage the system and make some decisions. Is it something you can fix quickly, or do you need to fail over to backup systems?

Sometimes what looks like a system outage is highly localized. For example, a virus-ridden laptop might be causing problems for one person, or even a group of people, yet look like a problem at the server level.

This type of local problem is still an outage as far as your people are concerned, of course, but you can usually deal with it quickly and without affecting the rest of the organization.

Allow at least another half hour to confirm that you're looking at a system-level issue that requires a failover.

Stage 3: Failover

After you've confirmed there really is an outage and determined that there is no quick fix, you have to fail over to your backup systems.

Depending on how well your backup systems are equipped and configured, this can take anywhere from a few seconds to over an hour.

When you fail over, you make some key assumptions in the hope of a successful recovery later. We all know what they say about “assume”, so it's worth taking the time to make sure these are more like certainties than assumptions:

With unplanned outages, you must ensure that your data is complete and uncorrupted, that hardware and infrastructure are available, and that the internal resources are on hand to complete the failover process successfully.

If any of these assumptions is not true, your ability to recover in Stage 4 will be compromised, if recovery is even possible.

One of your biggest risk factors is right here. You must take the actions necessary to ensure that all the elements – people, equipment, communications, network connectivity, and so on – are in place and working so an unplanned failover will always result in a successful recovery.

Stage 4: Recovery

We're not finished! The clock is still ticking on your outage when the recovery stage starts.

At this point, you have to restore services to your production servers or site and reinstitute your normal BC/DR practices. If you put it off, you're putting your operation at severe risk: what if another outage occurs while you're still running on your backup systems?

This stage can easily consume several hours, since you want to be absolutely certain that all production systems are working as expected, and that your DR/BC systems are ready to handle it when the next outage occurs.

Most solutions require at least a little downtime during the failover process to re-synchronize data and to establish user connections to the backup environment.

The same process has to occur again, but in reverse, to bring your production systems back online. Essentially, every IT outage produces two potential business outages — two periods of time when employees can't do their work.

Be sure that you're taking that cold, hard fact into account in your planning – and your presentations to management.
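One way to keep the whole lifecycle in view when planning is simply to add up all four stages. The following is a minimal Python sketch of that arithmetic. The durations are illustrative assumptions loosely based on the ranges mentioned above, not measurements, and the names `StageEstimate` and `total_outage_minutes` are hypothetical; substitute your own estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageEstimate:
    """Estimated duration of one outage stage, in minutes."""
    name: str
    minutes: float


def total_outage_minutes(stages):
    """Return the end-to-end outage time: the sum of every stage."""
    return sum(stage.minutes for stage in stages)


# Illustrative assumptions only; replace with figures from your own environment.
stages = [
    StageEstimate("Awareness", 60),    # "an hour or more"
    StageEstimate("Resolution", 30),   # "another half hour minimum"
    StageEstimate("Failover", 10),     # anywhere from seconds to over an hour
    StageEstimate("Recovery", 180),    # "several hours"
]

total = total_outage_minutes(stages)
failover = stages[2].minutes

print(f"Total outage: {total:.0f} minutes")
print(f"Failover share of total: {failover / total:.0%}")
```

With these assumed numbers, the failover accounts for 10 of the 280 minutes, which is why a quoted failover time on its own says little about how long the business is actually affected.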

How much do outages really cost?

The Ponemon Institute has conducted several studies of the cost of data center outages, with the most recent one published in 2016.

According to their latest study, the average cost of a data center outage has steadily increased from $505,502 in 2010 to $740,357 in 2016 (reported as a 38% increase). It is reasonable to assume outages are even more expensive today.

To calculate costs, the Institute looked at these primary factors:

Damage to mission-critical data
Impact of downtime on organizational productivity
Damages to equipment and other assets
Cost to detect and remediate systems and core business processes
Legal and regulatory impact, including litigation defense cost
Lost confidence and trust among key stakeholders
Diminishment of marketplace brand and reputation

Ponemon found that the cost of downtime varies widely, depending on industry and many other factors, but the average is about $9,000 per minute.

Another way to look at this number: 99.9% uptime may sound good, but it translates into about 44 minutes of downtime per month, which at $9,000 per minute costs the average business almost $400,000 a month. Adding another 9 to that service level agreement (SLA), making it 99.99%, cuts the downtime to about 4 minutes per month, which represents a huge savings and provides firm justification for investing in an appropriate level of protection.
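The arithmetic behind those figures is easy to reproduce for any uptime target. The sketch below is a minimal Python example; it assumes an average month of about 43,830 minutes (365.25 days divided by 12) and uses the roughly $9,000 per minute average cited above. The function names are hypothetical, and your own cost per minute may differ considerably.

```python
# Average month length in minutes: 365.25 days / 12 months (an assumption).
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12

# Approximate average cost of downtime cited above, in US dollars per minute.
COST_PER_MINUTE = 9_000


def monthly_downtime_minutes(uptime_percent):
    """Minutes of downtime per average month permitted by an uptime percentage."""
    return MINUTES_PER_MONTH * (100 - uptime_percent) / 100


def monthly_downtime_cost(uptime_percent, cost_per_minute=COST_PER_MINUTE):
    """Estimated monthly cost of the downtime permitted by an uptime percentage."""
    return monthly_downtime_minutes(uptime_percent) * cost_per_minute


for uptime in (99.9, 99.99):
    minutes = monthly_downtime_minutes(uptime)
    cost = monthly_downtime_cost(uptime)
    print(f"{uptime}% uptime: {minutes:.1f} min/month, about ${cost:,.0f}")
```

Passing your own `cost_per_minute` to `monthly_downtime_cost` gives a figure you can take to management alongside the price of each additional 9.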

Putting it all together

When you understand the anatomy of an outage and all of its stages, you can plan effectively and make good decisions. You can also develop proposals to take to management that they will understand in business terms.

Don't be fooled by a “failover in minutes” pitch. It's easy to avoid overpromising and underdelivering when you understand all the implications - and the costs - associated with each stage of an outage.

Finding the right partner

The market for backup and recovery services is crowded, and you will find no shortage of potential vendors. The key is finding one that not only has suitable technology and can deliver what you need at a competitive price, but also has people you can depend on when the time comes.

Find a partner who has a large number of clients who are similar to your business. Talk to some of their engineering people, not just an account rep. Can you communicate easily with them?

Are they listening and truly understanding your priorities and unique challenges?

Once you've selected a partner, work with them to develop a plan that will fit your organization's unique set of requirements and put together an implementation schedule that won't disrupt your operations.

The whole process, from initial consultation through to first tests and then to completed deployment, can take less than a month — with the right partner.