Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides
Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
arXiv.org Artificial Intelligence
Effective incident management is pivotal for the smooth operation of Microsoft cloud services. To expedite incident mitigation, service teams gather troubleshooting knowledge into Troubleshooting Guides (TSGs) accessible to On-Call Engineers (OCEs). While automated pipelines are enabled to resolve the most frequent and easy incidents, there still exist complex incidents that require OCEs' intervention. In addition, TSGs are often unstructured and incomplete, which requires manual interpretation by OCEs, leading to on-call fatigue and decreased productivity, especially among new-hire OCEs. In this work, we propose Nissist, which leverages unstructured TSGs and incident mitigation history to provide proactive incident mitigation suggestions, reducing human intervention.

To investigate the effect of TSGs on incident mitigation, we analyze around 1000 high-severity incidents from the most recent twelve months that demanded immediate intervention from OCEs. Consistent with findings from prior studies [8, 18, 9], which demonstrate the efficacy of TSGs in incident mitigation, we found that incidents paired with TSGs exhibit a 60% shorter average time-to-mitigate (TTM) compared to those without TSGs, emphasizing the pivotal role played by TSGs. This trend is consistent across various companies, as evidenced by research [14, 10], even among those employing different forms of TSGs. However, despite their utility, as highlighted by [18, 2], the unstructured format, varying quantity, and propensity for …
May-10-2024
- Country:
- North America > United States (0.14)
- Genre:
- Research Report (0.40)
- Industry:
- Information Technology > Services (0.35)
- Technology: