Story image

Azure outage postmortem: Microsoft reveals what happened and why

12 Sep 18

Since last week headlines around the world have been painted with headlines shouting about the disruption to Microsoft’s services after a severe weather event knocked one of its data centers offline.

Essentially, the cause was blamed on high storms in the Texas area that resulted in power swells and ultimately ended in the temporary demise of one of the company’s South Central US data centres in San Antonio.

In a recent blog post, Microsoft Azure DevOps director of engineering Buck Hodges has released a ‘postmortem’ of what went down, why it happened, and what the company is doing to prevent similar incidents in the future.

“First, I want to apologize for the very long VSTS [now called Azure DevOps] outage for our customers hosted in the affected region and the impact it had on customers globally,” says Hodges.

“This incident was unprecedented for us. It was the longest outage for VSTS customers in our seven-year history. I've talked to customers through Twitter, email, and by phone whose teams lost a day or more of productivity. We let our customers down. It was a painful experience, and for that I apologize.”

The Azure status report reveals the data center switched from utility power to generator power following the power swells caused by the lightning, however, the mechanical cooling systems were also a victim of the power swells despite having surge suppressors in place.

While the data center was able to continue operating for a period of time, temperatures soon exceed safe operational thresholds which initiated an automated shutdown. While this blackout is an initiative to preserve infrastructure and data integrity, in this case temperatures rose so quickly that some hardware was damaged before it could be shut down.

Many asked why didn’t VSTS simply fail over to a different region.

We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS),” says Hodges.

“This enables us to replicate data within the same geography while respecting data sovereignty.Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.

“Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed up DBs, once the backups were restored, would have resulting in data loss due to the latency of the backups.”

Hodges says the team is now in the process of making a number of changes based on the learnings from the outage, including:

  1. In supported geographies, move services into regions with Azure Availability Zones to be resilient to data center failures within a region.
  2. Explore possible solutions for asynchronous replication across regions
  3. Regularly exercise fail over across regions for VSTS services using our own organization.
  4. Add redundancy for our internal tooling to be available in more than one region.
  5. Fixed the regression in Dashboards where failed calls to Marketplace made Dashboards unavailable.
  6. Review circuit breakers for service-to-service calls to ensure correct scoping (surfaced in the calls to the User service)
  7. Review gaps in our current fault injection testing exposed by this incident.

“I apologize again for the very long disruption from this incident,” concludes Hodges.

HPE extends cloud-based AI tool InfoSight to servers
HPE asserts it is a big deal as the system can drive down operating costs, plug disruptive performance gaps, and free up time to allow IT staff to innovate.
Digital Realty opens new AU data centre – and announces another one
On the day that Digital Realty cut the ribbon for its new Sydney data centre, it revealed that it will soon begin developing another one.
'Public cloud is not a panacea' - 91% of IT leaders want hybrid
Nutanix research suggests cloud interoperability and app mobility outrank cost and security for primary hybrid cloud benefits.
Altaro introduces WAN-optimised replication for VMs
"WAN-optimised replication allows businesses to continue working in the case of damage to on-premise servers."
DDN part of data mining mission on Mars
DataDirect Networks (DDN) today announced that it will be playing a role in one of NASA’s most critical missions.
Opinion: Data centre management can learn from the Navy
While a nuclear submarine may seem like a completely different beast from a data centre, the similarities in how they should be managed are striking and many.
14 milestones Workday has achieved in 2018
We look into the key achievements of business software vendor Workday this year
HPE building new supercomputer with €38m price tag
It will be installed at the High Performance Computing Center of the University of Stuttgart and will be the world's fastest for industrial production.