Machine learning: The AIOps system Azure uses to make the cloud reliable
Cloud services change all the time, whether it's adding new features or fixing bugs and security vulnerabilities; that's one of the big advantages over on-prem software. But every change is also an opportunity to introduce the bugs and regressions that are the main reasons for reliability issues and cloud downtime. In an attempt to avoid issues like that, Azure uses a safe deployment process that rolls out updates in phases, running them on progressively larger rings of infrastructure and using continuous, AI-powered monitoring to detect any issues that were missed during development and testing. When Microsoft launched its Chaos Studio service for testing how workloads cope with unexpected faults last year, Azure CTO Mark Russinovich explained the safe deployment process. "We go through a canary cluster as part of our safe deployment, which is an internal Azure region where we've got synthetic tests and we've got internal workloads that actually test services before they go out. This is the first production environment that the code for new service update reaches so we want to make sure that we can validate it and get a good sense for the quality of it before we move it out and actually have it touch customers."
Feb-25-2022, 02:54:45 GMT