 cloud failure


The Microsoft Azure Outage Shows the Harsh Reality of Cloud Failures

WIRED

The second major cloud outage in less than two weeks, Azure's downtime highlights the "brittleness" of a digital ecosystem that depends on a few companies never making mistakes. Microsoft's Azure cloud platform, its widely used 365 services, Xbox, and Minecraft started suffering outages at roughly noon Eastern time on Wednesday, the result of what Microsoft said was "an inadvertent configuration change." The incident, which marks the second major cloud provider outage in less than two weeks, highlights the instability of an internet built largely on infrastructure run by a few tech giants. Microsoft's problems originated specifically in Azure's Front Door content delivery network and emerged just hours before Microsoft's scheduled earnings announcement. The company website, including its investor relations page, was still down on Wednesday afternoon, and the Azure status page where Microsoft provides updates was having intermittent issues as well.


Diffusion-based Time Series Data Imputation for Microsoft 365

Yang, Fangkai, Yin, Wenjie, Wang, Lu, Li, Tianci, Zhao, Pu, Liu, Bo, Wang, Paul, Qiao, Bo, Liu, Yudong, Björkman, Mårten, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei

arXiv.org Artificial Intelligence

Reliability is extremely important for large-scale cloud systems like Microsoft 365. Cloud failures such as disk and node failures threaten service reliability, resulting in online service interruptions and economic loss. Existing work focuses on predicting cloud failures and proactively taking action before they happen. However, these approaches suffer from poor data quality, such as missing data during model training and prediction, which limits their performance. In this paper, we focus on enhancing data quality through data imputation with our proposed Diffusion+, a sample-efficient diffusion model that imputes the missing data efficiently based on the observed data. Our experiments and application practice show that our model improves the performance of the downstream failure prediction task.
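
To make the idea of imputing missing values from observed ones concrete, here is a minimal sketch in Python. It is not Diffusion+ and not a diffusion model at all: it substitutes plain linear interpolation as a simple baseline, and the function name `impute_linear` and the sample `telemetry` values are hypothetical, chosen only for illustration.

```python
import math


def impute_linear(series):
    """Fill NaN entries by linear interpolation between observed neighbors.

    Baseline stand-in only (an assumption for illustration), NOT the
    Diffusion+ model described in the abstract. Gaps at either end are
    filled with the nearest observed value.
    """
    observed = [(i, v) for i, v in enumerate(series) if not math.isnan(v)]
    if not observed:
        raise ValueError("no observed values to impute from")

    filled = list(series)
    for i, v in enumerate(series):
        if not math.isnan(v):
            continue  # observed points are kept unchanged
        # Nearest observed points on each side of the gap, if any.
        left = max((p for p in observed if p[0] < i),
                   key=lambda p: p[0], default=None)
        right = min((p for p in observed if p[0] > i),
                    key=lambda p: p[0], default=None)
        if left is None:
            filled[i] = right[1]
        elif right is None:
            filled[i] = left[1]
        else:
            weight = (i - left[0]) / (right[0] - left[0])
            filled[i] = left[1] + weight * (right[1] - left[1])
    return filled


# Hypothetical monitoring signal with missing samples.
nan = float("nan")
telemetry = [1.0, nan, 3.0, nan, nan, 6.0]
imputed = impute_linear(telemetry)
```

A learned imputer such as the one the abstract proposes would replace the interpolation step with samples from a model conditioned on the observed entries; the input and output shapes would stay the same, so a downstream failure predictor could consume either.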