Azure outage postmortem: Microsoft reveals what happened and why

12 Sep 18

Since last week, headlines around the world have shouted about the disruption to Microsoft’s services after a severe weather event knocked one of its data centers offline.

Essentially, the outage was blamed on severe storms in the Texas area that caused power swells and ultimately took down one of the company’s South Central US data centers in San Antonio.

In a recent blog post, Microsoft Azure DevOps director of engineering Buck Hodges released a ‘postmortem’ covering what went down, why it happened, and what the company is doing to prevent similar incidents in the future.

“First, I want to apologize for the very long VSTS [now called Azure DevOps] outage for our customers hosted in the affected region and the impact it had on customers globally,” says Hodges.

“This incident was unprecedented for us. It was the longest outage for VSTS customers in our seven-year history. I've talked to customers through Twitter, email, and by phone whose teams lost a day or more of productivity. We let our customers down. It was a painful experience, and for that I apologize.”

The Azure status report reveals that the data center switched from utility power to generator power following the power swells caused by lightning. However, the mechanical cooling systems also fell victim to the swells, despite having surge suppressors in place.

While the data center was able to continue operating for a period of time, temperatures soon exceeded safe operational thresholds, which initiated an automated shutdown. Although this shutdown is intended to preserve infrastructure and data integrity, in this case temperatures rose so quickly that some hardware was damaged before it could be shut down.
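As an illustration of this kind of safeguard, below is a minimal, hypothetical sketch in Python of an automated thermal-shutdown loop. The threshold, polling interval, and the read_temperature and shutdown_rack helpers are all assumptions for illustration; they are not Microsoft's actual implementation.

```python
import random
import time

SAFE_THRESHOLD_C = 35.0  # hypothetical safe operating limit (Celsius)
CHECK_INTERVAL_S = 1.0   # hypothetical polling interval (seconds)

def read_temperature() -> float:
    # Stand-in for a real sensor read: simulate a rising temperature.
    read_temperature.current += random.uniform(0.0, 2.0)
    return read_temperature.current

read_temperature.current = 25.0  # simulated starting temperature

def shutdown_rack() -> None:
    # Stand-in for an orderly hardware shutdown that preserves data integrity.
    print("Threshold exceeded: initiating automated shutdown")

def monitor() -> None:
    # Poll the sensor and shut down once the safe limit is crossed. The
    # incident shows this approach's weakness: if temperature rises faster
    # than the loop (or hardware) can react, damage occurs before shutdown.
    while True:
        if read_temperature() > SAFE_THRESHOLD_C:
            shutdown_rack()
            break
        time.sleep(CHECK_INTERVAL_S)

if __name__ == "__main__":
    monitor()
```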

Many asked why VSTS didn’t simply fail over to a different region.

“We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS),” says Hodges.

“This enables us to replicate data within the same geography while respecting data sovereignty. Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.

“Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable, since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed-up DBs, once the backups were restored, would have resulted in data loss due to the latency of the backups.”
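For context on the second option: a geo-redundant storage account configured for read access (RA-GRS) exposes a read-only secondary endpoint of the form <account>-secondary.blob.core.windows.net. The sketch below uses the azure-storage-blob Python SDK with hypothetical account and container names to show what reading from that secondary copy looks like; it is an illustration of the mechanism, not VSTS’s actual code.

```python
from azure.storage.blob import BlobServiceClient

ACCOUNT = "vstsdata"           # hypothetical storage account name
CONTAINER = "build-artifacts"  # hypothetical container name
SAS_TOKEN = "<sas-token>"      # placeholder credential

# RA-GRS exposes a read-only replica at the "-secondary" endpoint.
secondary = BlobServiceClient(
    account_url=f"https://{ACCOUNT}-secondary.blob.core.windows.net",
    credential=SAS_TOKEN,
)

container = secondary.get_container_client(CONTAINER)
for blob in container.list_blobs():
    print(blob.name)  # reads succeed against the secondary copy
```

Any write against this endpoint is rejected, which is precisely the degradation Hodges describes: code can be read but not checked in, and build output cannot be saved.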

Hodges says the team is now in the process of making a number of changes based on the learnings from the outage, including:

  1. In supported geographies, move services into regions with Azure Availability Zones to be resilient to data center failures within a region.
  2. Explore possible solutions for asynchronous replication across regions.
  3. Regularly exercise fail over across regions for VSTS services using our own organization.
  4. Add redundancy for our internal tooling to be available in more than one region.
  5. Fix the regression in Dashboards where failed calls to Marketplace made Dashboards unavailable.
  6. Review circuit breakers for service-to-service calls to ensure correct scoping (an issue surfaced in the calls to the User service); see the sketch after this list.
  7. Review gaps in our current fault injection testing exposed by this incident.
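Item 6 refers to the circuit breaker pattern, in which a caller stops invoking a failing downstream service for a cooldown period instead of letting those failures cascade, as the failed Marketplace calls did to Dashboards. A minimal sketch of the idea, with hypothetical thresholds:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures,
    fail fast for a cooldown period instead of calling the sick service."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures    # hypothetical trip threshold
        self.reset_after_s = reset_after_s  # hypothetical cooldown window
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # "Open" state: short-circuit without touching the service.
                raise RuntimeError("circuit open: skipping downstream call")
            self.failures = 0  # "half-open": allow one trial call through
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```

Scoping is the crucial detail here: a breaker wrapped around one dependency (Marketplace, say) must trip only that dependency’s calls, not take the calling service down with it.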

“I apologize again for the very long disruption from this incident,” concludes Hodges.
