Equinix: How to escape a cloud outage unscathed

Fri, 24th Mar 2017

Media coverage of recent cloud outages has pinned the blame on the cloud service providers (CSPs), but CSPs aren't the only ones on the hook here.

Enterprises are responsible for their own disaster planning and recovery procedures, whether they are deploying IT solutions in their own data center, managing their IT with one or more CSPs, or utilizing a hybrid architecture.

During the recent outages, enterprises without plans and those whose plans didn't live up to their expectations experienced some embarrassing failures.

And even if blame didn't publicly fall on them, the consequences of the outage surely did. The good news was the news we didn't hear – from those who had a plan that worked, and who continued processing, even though one component of their IT infrastructure had failed.

How can companies make sure they get that kind of no news/good news during a cloud outage? Diligent planning and an interconnection-first IT architecture can certainly help. Read more about an Interconnection Oriented Architecture (IOA) strategy.

Planning to escape disaster

Planning for disasters involves many layers of business, facilities and IT management, as well as experts in systems software, virtualization technology, application frameworks, application development, database management, storage and networking.

As the infrastructure and application architecture change, these plans must change too. And as enterprises incorporate the Internet of Things (IoT), big data and other recent trends in IT, they have to extend and enhance their disaster recovery (DR) plans accordingly.

Some rules for robust DR

Most IT solutions today are distributed across multiple systems, multiple operating systems and, increasingly, multiple clouds. Even the systems that communicate with one another do so over multiple types of network media.

The distributed systems include servers in one or more data centers, storage that is equally distributed and, now, access-point devices such as smartphones, tablets, laptops and desktop systems.

So in this environment, what elements and practices do reliable DR and business continuity (BC) plans have in common?

In his article, “Business Continuity And Disaster Recovery Best Practices From The Availability Trenches”, Christian Burns McBeth presented six rules that are worthy of consideration:

1. Maintain a full copy of your mission-critical data outside your production region.
2. Test your BC/DR plan in a realistic way to ensure it actually works.
3. Ensure production changes are properly reflected in the BC/DR plan.
4. Have a plan that is consistent and accessible, even in the event of a major disaster.
5. Always have several people fully trained on the BC/DR plan, preferably some of whom are outside the production region.
6. Remember Murphy's Law: “Whatever can go wrong, will go wrong.”

For example, if an outage at a cloud service provider's cloud storage service causes an application outage that is visible to customers, it's clear that rule No. 1, and likely rule No. 2, weren't honored in the enterprise DR plan. If a visible outage is caused by a change in the production environment, such as a software update to a router, a storage server or even one or more of the production application servers, it's clear that rule No. 3 wasn't followed.
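
The first rule in the list lends itself to an automated check. What follows is a minimal sketch in Python, assuming a hypothetical inventory that records where each mission-critical dataset runs in production and where its copies live. The `Dataset` class, the region names and the `datasets_missing_offsite_copy` function are illustrative assumptions, not part of McBeth's article or of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """Hypothetical inventory record for one mission-critical dataset."""
    name: str
    production_region: str
    replica_regions: List[str] = field(default_factory=list)


def datasets_missing_offsite_copy(datasets: List[Dataset]) -> List[str]:
    """Return the names of datasets with no copy outside their production region."""
    return [
        d.name
        for d in datasets
        if not any(region != d.production_region for region in d.replica_regions)
    ]


# Illustrative data only: the region names are made up for this sketch.
inventory = [
    Dataset("orders", "us-east", ["us-east", "eu-west"]),
    Dataset("billing", "us-east", ["us-east"]),
    Dataset("sessions", "us-east"),
]

print(datasets_missing_offsite_copy(inventory))  # ['billing', 'sessions']
```

A check like this only covers the first rule. It says nothing about whether the copies can actually be restored, which is what the realistic testing in the second rule is for.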

The role of direct, secure, close interconnection

Another way companies can be sure they're prepared for cloud outages is by deploying an IOA strategy. An IOA strategy is a proven and repeatable architectural framework that directly and securely connects people, locations, clouds and data.

This interconnection-first approach emphasizes deploying IT in distributed locations, close to dense ecosystems of network and cloud partners and end users at the digital edge.

That means your organization can be interconnected to multiple cloud and network providers in multiple locations over redundant, high-performance, low-latency connections and provide fast, seamless failover options when things go wrong.

Unlike traditional IT architectures, which are siloed and centralized, an IOA is distributed and dynamic. It allows for faster adjustments and reactions in the event of a cloud or network outage, and it simplifies the deployment of DR and BC environments, such as an active-active high-availability environment.

Say a cloud-based application fails. An IOA strategy makes it easier for you to put the plans, partners and IT in place to route around that failure to the same application on a different cloud.

The same is true of network connections and compute and storage systems. A failure of any of these things shouldn't bring the house of cards crashing down. With an IOA strategy, it won't.
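
At the application level, routing around a failed cloud could look something like the sketch below. It is a simplified Python illustration under stated assumptions, not a description of how an IOA is implemented: the endpoint names, the `call_with_failover` helper and the `fake_send` stand-in are all hypothetical, and a real deployment would also need health checks, timeouts and data kept in sync across clouds.

```python
from typing import Callable, List


def call_with_failover(endpoints: List[str], send: Callable[[str], str]) -> str:
    """Try each endpoint in order and return the first successful response."""
    failed = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError:
            # This deployment is unreachable, so move on to the next one.
            failed.append(endpoint)
    raise RuntimeError(f"all endpoints failed: {failed}")


def fake_send(endpoint: str) -> str:
    """Stand-in for a real request; pretends the first cloud is down."""
    if endpoint == "cloud-a.example":
        raise ConnectionError("cloud-a is down")
    return f"handled by {endpoint}"


result = call_with_failover(["cloud-a.example", "cloud-b.example"], fake_send)
print(result)  # handled by cloud-b.example
```

The ordered list of endpoints is the part that depends on planning: the second entry only helps if the same application is already deployed and reachable on a different cloud.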

A strong interconnection-first strategy is a mission-critical component of a robust DR/BC strategy and of keeping customers satisfied. For a step-by-step guide on how to implement a resilient IT infrastructure using an IOA, read the IOA Playbook.

And as World Backup Day approaches, the general principle behind the redundancy enabled by an IOA strategy is worth spotlighting.

Whether it involves cloud storage, an external hard drive, or some other method or medium, this March 31 “holiday” is a good excuse for individuals and businesses to prevent problems before they happen and do some big-time file backup.

Article by Jim Poole, Equinix Blog Network