Microsoft releases details on last week’s big Azure outage, during which servers were damaged but no data was lost

A look inside a Microsoft data center in Cheyenne, Wyo. (Microsoft Photo)

A severe lightning storm in the San Antonio area last week not only disrupted the power supply to Microsoft Azure’s data center in the region but also knocked the cooling systems offline, damaging “a significant amount” of equipment.

Microsoft Azure’s South Central US data center region suffered a prolonged outage last week, and the company has now released details explaining to customers what happened. The issues affected anyone with workloads in that data center, as well as customers around the world who were using Active Directory and Visual Studio Team Services, and took more than 24 hours to resolve completely.

Lightning storms are a fact of life in Texas, but this was a big, slow-moving storm, shattering rainfall records for San Antonio by more than seven inches. In the middle of the night local time, “lightning caused electrical activity on the utility supply,” Microsoft said. This created a voltage swell that tripped part of the data center onto backup generator power, but somehow also overwhelmed and shut down the cooling systems for that part of the data center.

After the servers, the cooling equipment is the most important part of a data center. Thousands of servers, storage devices, and pieces of networking equipment emit a lot of heat, and can quickly overheat a confined area without an active cooling system.

Cooling towers sit atop a Google data center in The Dalles, Ore. (GeekWire Photo / Tom Krazit)

In this case, the equipment began to shut down as rising temperatures were detected but “temperatures increased so quickly in parts of the datacenter that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units,” Microsoft said in its report.

Engineers decided to prioritize preserving customer data instead of moving customers over to another data center, which could have resulted in the loss of some data “due to the asynchronous nature of geo replication,” Microsoft said.
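
To see why failing over can lose data under asynchronous replication, consider a rough sketch. This is a toy model, not Microsoft's implementation: the `Region` and `AsyncReplicatedStore` classes, the `pending` queue, and the numbers in `demo` are all hypothetical and chosen only for illustration. The primary acknowledges each write right away and ships it to the secondary later, so a forced failover abandons whatever is still waiting in the queue.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A toy storage region holding an ordered log of committed writes."""
    name: str
    log: list = field(default_factory=list)


class AsyncReplicatedStore:
    """Hypothetical model: the primary acknowledges writes immediately,
    and the secondary catches up some time later."""

    def __init__(self, primary: Region, secondary: Region):
        self.primary = primary
        self.secondary = secondary
        self.pending = []  # writes acknowledged but not yet shipped

    def write(self, record: str) -> None:
        # The client is acknowledged as soon as the primary commits.
        self.primary.log.append(record)
        self.pending.append(record)

    def replicate(self, max_records: int) -> None:
        # Ship up to max_records queued writes to the secondary, in order.
        batch = self.pending[:max_records]
        self.pending = self.pending[max_records:]
        self.secondary.log.extend(batch)

    def fail_over(self) -> list:
        # Promote the secondary. Anything still queued never arrived,
        # so those acknowledged writes are lost.
        lost, self.pending = self.pending, []
        self.primary, self.secondary = self.secondary, self.primary
        return lost


def demo() -> list:
    store = AsyncReplicatedStore(Region("primary"), Region("secondary"))
    for i in range(5):
        store.write(f"write-{i}")
    store.replicate(max_records=3)  # replication lags two writes behind
    return store.fail_over()        # forced failover before the lag clears


print(demo())
```

In this sketch the two writes that had not yet been replicated are the ones that vanish on failover. Waiting for the damaged region to recover, as Microsoft's engineers chose to do, trades a longer outage for keeping those writes.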

Credit again to Microsoft for providing this level of detail; customers deserve nothing less, and understanding the cause of outages helps everyone get better. It’s still not clear why the cooling system, which the company said was equipped with surge suppressors, was defeated by the voltage swell. But this won’t be the last time lightning strikes the area around a data center, unless you count Project Natick.

Rival Amazon Web Services couldn’t resist a subtle jab, however, highlighting the availability zones it offers cloud customers to mitigate the risk of a problem at a single data center within a region as Hurricane Florence bears down on the Carolinas. AWS operates its largest region, U.S. East, in Northern Virginia, which could see some action from the hurricane.

“Common points of failure, such as generators, UPS units, and air conditioning, are not shared across Availability Zones,” AWS said in a blog post Tuesday. “Electrical power systems are designed to be fully redundant and can be maintained without impacting operations.”

Microsoft has only recently started to roll out availability zones within cloud regions around the world. Only one region in the U.S. — Central US — offers availability zones, and “the physical separation of Availability Zones within a region protects applications and data from datacenter failures,” according to Microsoft Azure documentation. Ten bucks says availability zones are coming to the South Central US region sooner rather than later.


