The company said on its Azure status page that high-energy storms had hit southern Texas on the morning of 4 September, close to Microsoft Azure’s South Central US region, resulting in voltage fluctuations both up and down.
These fluctuations affected the data centre's cooling systems, which shut down, damaging hardware that then needed to be replaced. A decision was made to attempt data recovery rather than fail over to another data centre, which caused a cascading impact on services outside the region.
In the South Central US region, storage servers began to shut down from about 2.30am Pacific Time on 4 September (7.30pm AEST 4 September). A huge number of services were affected, and though the vast majority of the effects were mitigated by 4am Pacific Time (9pm AEST 4 September), full mitigation did not take effect until 1.40am Pacific Time on 7 September (6.40pm AEST 7 September).
Microsoft apologised to those affected and said it would investigate the following, which it deems to be the biggest contributing factors to the incident:
- "A detailed forensic analysis of the impacted data centre hardware and systems, in addition to a thorough review of the data centre recovery procedures.
- "A review with every internal service to identify dependencies on the Azure Service Manager API. We are exploring migration options to move these services from ASM to the newer ARM architecture.
- "An evaluation of the future hardware design of storage scale units to increase resiliency to environmental factors. In addition, for scenarios in which impact is unavoidable, we are determining software changes to automate and accelerate recovery."