incident management
Automated Traffic Incident Response Plans using Generative Artificial Intelligence: Part 1 -- Building the Incident Response Benchmark
Grigorev, Artur, Saleh, Khaled, Kim, Jiwon, Mihaita, Adriana-Simona
Traffic incidents remain a critical public safety concern worldwide, with Australia recording 1,300 road fatalities in 2024, the highest toll in 12 years. Similarly, the United States reports approximately 6 million crashes annually, which poses significant challenges for fast response times and operational management. Traditional response protocols rely on human decision-making, which introduces potential inconsistencies and delays during critical moments when every minute impacts both safety outcomes and network performance. To address this issue, we propose a novel Incident Response Benchmark that uses generative artificial intelligence to automatically generate response plans for incoming traffic incidents. Our approach aims to significantly reduce incident resolution times by suggesting context-appropriate actions such as variable message sign deployment, lane closures, and emergency resource allocation adapted to specific incident characteristics. First, the proposed methodology uses real-world incident reports from the Performance Measurement System (PeMS) as training and evaluation data. We extract historically implemented actions from these reports and compare them against AI-generated response plans that suggest specific actions, such as lane closures, variable message sign announcements, and/or dispatching appropriate emergency resources. Second, model evaluations reveal that advanced generative AI models like GPT-4o and Grok 2 achieve superior alignment with expert solutions, demonstrated by minimized Hamming distances (averaging 2.96-2.98) and low weighted differences (approximately 0.27-0.28). Conversely, while Gemini 1.5 Pro records the lowest count of missed actions, its extremely high number of unnecessary actions (1547 compared to 225 for GPT-4o) indicates an over-triggering strategy that reduces the overall plan efficiency.
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > California (0.04)
- Oceania > Australia > Queensland (0.04)
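The plan-comparison metrics named in the abstract above (Hamming distance, missed actions, unnecessary actions) can be illustrated with a minimal sketch. It assumes each plan is encoded as a binary vector over a fixed action vocabulary; the `ACTIONS` list, the function name `compare_plans`, and the sample vectors are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence

# Hypothetical action vocabulary; the benchmark's real action set is not given in the abstract.
ACTIONS = ["lane_closure", "vms_message", "dispatch_tow", "dispatch_ambulance"]


def compare_plans(expert: Sequence[int], generated: Sequence[int]) -> Dict[str, int]:
    """Compare two binary action vectors (1 = action taken, 0 = not taken)."""
    if len(expert) != len(generated):
        raise ValueError("plans must be encoded over the same action vocabulary")
    # Missed: the expert took the action but the generated plan did not.
    missed = sum(1 for e, g in zip(expert, generated) if e == 1 and g == 0)
    # Unnecessary: the generated plan took an action the expert did not.
    unnecessary = sum(1 for e, g in zip(expert, generated) if e == 0 and g == 1)
    # For binary vectors the Hamming distance is the number of differing positions.
    return {"hamming": missed + unnecessary, "missed": missed, "unnecessary": unnecessary}


# Illustrative (made-up) plans over ACTIONS.
expert_plan = [1, 1, 0, 0]
generated_plan = [1, 0, 1, 1]
result = compare_plans(expert_plan, generated_plan)
print(result)
```

Splitting the Hamming distance into missed and unnecessary actions shows how a model can keep missed actions low simply by triggering many actions, which is the over-triggering pattern the abstract reports for Gemini 1.5 Pro.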
IncidentResponseGPT: Generating Traffic Incident Response Plans with Generative Artificial Intelligence
Grigorev, Artur, Mihaita, Adriana-Simona, Saleh, Khaled, Ou, Yuming
The proposed IncidentResponseGPT framework is a novel system that applies generative artificial intelligence (AI) to potentially enhance the efficiency and effectiveness of traffic incident response. This model allows for the synthesis of region-specific incident response guidelines and generates incident response plans adapted to a specific area, aiming to expedite decision-making for traffic management authorities. This approach aims to accelerate incident resolution by suggesting various recommendations (e.g. optimal rerouting strategies, estimated resource needs) to minimize the overall impact on the urban traffic network. The system suggests specific actions, including dynamic lane closures, optimized rerouting, and dispatching appropriate emergency resources. IncidentResponseGPT employs the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to rank generated response plans on criteria such as impact minimization and resource efficiency, according to their proximity to a human-proposed solution.
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > New York > Onondaga County (0.04)
- North America > United States > Hawaii (0.04)
- North America > United States > Alaska (0.04)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Road (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Government (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.85)
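The TOPSIS ranking mentioned in the IncidentResponseGPT abstract can be sketched as follows. This is the textbook form of TOPSIS, in which the ideal and anti-ideal points are derived from the candidates themselves; the abstract describes proximity to a human-proposed solution, so the paper's variant may differ. The criteria, weights, and plan scores below are made-up illustrative values, and the function name `topsis` is hypothetical.

```python
import math
from typing import List, Sequence


def topsis(matrix: Sequence[Sequence[float]],
           weights: Sequence[float],
           benefit: Sequence[bool]) -> List[float]:
    """Return a closeness score in [0, 1] per alternative (higher is better)."""
    n_cols = len(weights)
    # Step 1: vector-normalise each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]
    # Step 2: ideal and anti-ideal points (max for benefit criteria, min for cost criteria).
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    # Step 3: relative closeness to the ideal point.
    scores = []
    for row in weighted:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, anti)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Made-up example: three candidate plans scored on two cost criteria
# (estimated delay in minutes, number of resources dispatched).
plans = [[30.0, 5.0], [20.0, 8.0], [45.0, 3.0]]
scores = topsis(plans, weights=[0.6, 0.4], benefit=[False, False])
ranking = sorted(range(len(plans)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

To rank by proximity to a human-proposed plan instead, one option (an assumption, not the paper's stated method) would be to replace `ideal` with the weighted, normalised criterion values of that plan.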
Exploring LLM-based Agents for Root Cause Analysis
Roy, Devjeet, Zhang, Xuchao, Bhave, Rashi, Bansal, Chetan, Las-Casas, Pedro, Fonseca, Rodrigo, Rajmohan, Saravan
The growing complexity of cloud-based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident-related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM-based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and highlight practical considerations for implementing such a system.
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Washington > King County > Redmond (0.04)
Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
Jiang, Yuxuan, Zhang, Chaoyun, He, Shilin, Yang, Zhihao, Ma, Minghua, Qin, Si, Kang, Yu, Dang, Yingnong, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study on the use of KQL queries, a DSL employed for incident management in a large-scale cloud management system at Microsoft. The findings underscore the importance and viability of KQL query recommendation to enhance incident management. Building upon these insights, we introduce Xpert, an end-to-end machine learning framework that automates the KQL recommendation process. By leveraging historical incident data and large language models, Xpert generates customized KQL queries tailored to new incidents. Furthermore, Xpert incorporates a novel performance metric called Xcore, enabling a thorough evaluation of query quality from three comprehensive perspectives. We conduct extensive evaluations of Xpert, demonstrating its effectiveness in offline settings. Notably, we deploy Xpert in the real production environment of a large-scale incident management system at Microsoft, validating its efficiency in supporting incident management. To the best of our knowledge, this paper represents the first empirical study of its kind, and Xpert stands as a pioneering DSL query recommendation framework designed for incident management.
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- Asia > India > NCT > New Delhi (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
AI-powered anomaly detection in log data for improved troubleshooting in devops
In summary, implementing a solution for AI-powered anomaly detection in log data for improved troubleshooting in DevOps requires a well-structured plan, a good understanding of the use case, and a good knowledge of the different AI-based anomaly detection techniques. With proper planning, implementation, and maintenance, AI-powered anomaly detection can be a valuable asset for any DevOps team.
Council Post: How To Leverage AI/ML For Predictive Incident Management
Digital technologies have led to the application of new-age technologies that operate with minimal human intervention. And while they may heighten productivity and drive growth, any failure can pose a significant challenge for IT and DevOps teams to resolve. An incident or service disruption is an IT manager's worst nightmare. Very often, factors such as cybersecurity breaches, human error, and the accelerated pace of innovation place significant pressure on enterprises' IT infrastructure, leading to system failures and outages impacting the bottom line. According to the ITIC 2021 Hourly Cost of Downtime Survey, 44% of participants (of 1,200 global organizations) said that hourly downtime costs anywhere from $1 million to over $5 million.
- Information Technology > Security & Privacy (0.77)
- Government > Military > Cyberwarfare (0.56)
Neural Knowledge Extraction From Cloud Service Incidents
Shetty, Manish, Bansal, Chetan, Kumar, Sumit, Rao, Nikitha, Nagappan, Nachiappan, Zimmermann, Thomas
In the last decade, two paradigm shifts have reshaped the software industry - the move from boxed products to services and the widespread adoption of cloud computing. This has had a huge impact on the software development life cycle and the DevOps processes. Particularly, incident management has become critical for developing and operating large-scale services. Incidents are created to ensure timely communication of service issues and also their resolution. Prior work on incident management has been heavily focused on the challenges with incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction from service incidents. We frame the knowledge extraction problem as a Named-entity Recognition task for extracting factual information. SoftNER leverages structural patterns like key-value pairs and tables for bootstrapping the training data. Further, we build a novel multi-task learning based BiLSTM-CRF model which leverages not just the semantic context but also the data-types for named-entity extraction. We have deployed SoftNER at Microsoft, a major cloud service provider, and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning based approach has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms the state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER we are able to build significantly more accurate models for important downstream tasks like incident triaging.
- North America > United States > District of Columbia > Washington (0.05)
- Asia > India > Karnataka > Bengaluru (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Information Technology > Software (1.00)
- Information Technology > Services (1.00)
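The SoftNER abstract above mentions bootstrapping training data from structural patterns such as key-value pairs. A minimal sketch of that idea is shown below, assuming incident text contains lines of the form `Key: value` or `Key = value`. The regular expression, the function name `extract_key_value_pairs`, and the sample incident text are hypothetical illustrations, not SoftNER's actual implementation.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a short key, then ':' or '=', then a non-empty value.
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 _-]{0,40}?)\s*[:=]\s*(\S.*?)\s*$")


def extract_key_value_pairs(text: str) -> List[Tuple[str, str]]:
    """Collect (key, value) pairs from lines that look like 'Key: value'."""
    pairs = []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            pairs.append((match.group(1).strip(), match.group(2)))
    return pairs


# Made-up incident description used only for illustration.
incident = """VM fails to start after resize.
Subscription ID: 1234-abcd
Error code = 0x80070005
Region: westus2
"""

pairs = extract_key_value_pairs(incident)
for key, value in pairs:
    # Each pair can act as a weak label: the key names an entity type, the value is an example of it.
    print(f"{key!r} -> {value!r}")
```

Pairs collected this way could serve as weakly labelled examples for training a named-entity recognition model, which is the role the abstract assigns to structural patterns.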
BigPanda Brings New Capabilities to Cloud Platform - ITChronicles
BigPanda Inc., provider of the first Autonomous Digital Operations solution, today introduced new capabilities to its cloud platform for IT Operations, including two major product components. First, BigPanda uniquely features Open Box Machine Learning™, a core component of its platform that offers unrivalled transparency, trust and control to enterprise IT customers. Second, BigPanda's new Unified Analytics offering provides deep insights into the real-time health and performance of IT Operations. BigPanda's Autonomous Digital Operations Platform helps large, global enterprises to lower operational costs, improve service availability and reduce the IT risks associated with digital transformation. BigPanda is the only machine learning solution for autonomous incident management that features an "open" approach.
Demisto - Taking On Incident Response With Machine Learning & Automation
Demisto is the first product to unify security orchestration, incident management and interactive investigation into one solution. Their machine-learning engine is unique as it learns from the real-life analyst interactions and past investigations. So the platform (and you!) gets smarter with every analyst action. Q: Tell us something more about Demisto? A: Demisto Enterprise delivers a complete solution that helps Tier-1 through Tier-3 analysts and SOC managers to optimize the entire incident life cycle while auto documenting and journaling all the evidence.
Incident Management for IoT @ThingsExpo @PagerDuty #AI #IoT #M2M #API
All major researchers estimate there will be tens of billions of devices - computers, smartphones, tablets, and sensors - connected to the Internet by 2020. With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend @CloudExpo @ThingsExpo, June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA. Join Cloud Expo / @ThingsExpo conference chair Roger Strukhoff (@IoT2040) for three days of intense Enterprise Cloud and 'Digital Transformation' discussion and focus, including Big Data's indispensable role in IoT, Smart Grids and Industrial Internet of Things (IIoT), Wearables and Consumer IoT, as well as (new) Digital Transformation in Vertical Markets. Accordingly, attendees at the upcoming 20th Cloud Expo / @ThingsExpo will find fresh new content in a new track called FinTech, which will incorporate machine learning, artificial intelligence, deep learning, and blockchain into one track.
- Information Technology (1.00)
- Law > Intellectual Property & Technology Law (0.69)
- Information Technology > Artificial Intelligence > Machine Learning (0.54)
- Information Technology > Communications > Mobile (0.36)
- Information Technology > Data Science > Data Mining > Big Data (0.36)