AITopics

2506.03381

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California (0.04)
Oceania > Australia > Queensland (0.04)

Genre: Research Report (1.00)

Industry:

Transportation (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Grigorev, Artur, Mihaita, Adriana-Simona, Saleh, Khaled, Ou, Yuming

IncidentResponseGPT: Generating Traffic Incident Response Plans with Generative Artificial Intelligence

arXiv.org Artificial Intelligence Jul-24-2024

The proposed IncidentResponseGPT framework is a novel system that applies generative artificial intelligence (AI) to potentially enhance the efficiency and effectiveness of traffic incident response. The model synthesizes region-specific incident response guidelines and generates incident response plans adapted to a specific area, aiming to expedite decision-making for traffic management authorities. The approach aims to shorten incident resolution times by offering recommendations (e.g., optimal rerouting strategies and estimates of resource needs) that minimize the overall impact on the urban traffic network. The system suggests specific actions, including dynamic lane closures, optimized rerouting, and dispatching appropriate emergency resources. IncidentResponseGPT employs the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to rank generated response plans on criteria such as impact minimization and resource efficiency, according to their proximity to a human-proposed solution.
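To make the ranking step concrete, here is a minimal sketch of textbook TOPSIS in Python. It is not the authors' implementation: the abstract describes ranking by proximity to a human-proposed solution, whereas this sketch derives the ideal and anti-ideal points from the candidate plans themselves. The plan names, criteria, values, and weights are all hypothetical.

```python
import math

def topsis(matrix, weights, benefit):
    """Return one closeness score per row of `matrix` (higher is better).

    matrix  -- one row per alternative, one column per criterion
    weights -- importance of each criterion
    benefit -- True where larger is better, False where smaller is better
    """
    n_criteria = len(weights)
    # Vector-normalise each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_criteria)]
    weighted = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_criteria)]
        for row in matrix
    ]
    # Ideal (best) and anti-ideal (worst) value for every criterion.
    columns = list(zip(*weighted))
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    # Closeness = distance to worst / (distance to best + distance to worst).
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores

# Hypothetical response plans: [network delay in minutes, units dispatched].
# Both criteria are costs, so lower values are preferred.
plans = {
    "plan_a": [12.0, 3.0],
    "plan_b": [20.0, 2.0],
    "plan_c": [15.0, 5.0],
}
scores = topsis(list(plans.values()), weights=[0.7, 0.3], benefit=[False, False])
ranking = sorted(zip(plans, scores), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up numbers, the plan with the lowest delay ranks first because delay carries most of the weight. Replacing the data-derived `best` vector with the criterion values of a human-written plan would move the sketch closer to the variant the abstract describes.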

guideline, incident, response plan, (16 more...)

2404.1855

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > Onondaga County (0.04)
North America > United States > Hawaii (0.04)
North America > United States > Alaska (0.04)

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.85)

arXiv.org Artificial Intelligence Mar-6-2024

Exploring LLM-based Agents for Root Cause Analysis

Roy, Devjeet, Zhang, Xuchao, Bhave, Rashi, Bansal, Chetan, Las-Casas, Pedro, Fonseca, Rodrigo, Rajmohan, Saravan

The growing complexity of cloud-based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident-related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM-based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We then extend this evaluation by incorporating discussions associated with incident reports as additional inputs for the models, which surprisingly does not yield significant performance improvements. Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and highlight practical considerations for implementing such a system.

agent, incident, information, (16 more...)

2403.04123

Country:

North America > United States > District of Columbia > Washington (0.05)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > Washington > King County > Redmond (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial Intelligence Dec-19-2023

Xpert: Empowering Incident Management with Query Recommendations via Large Language Models

Jiang, Yuxuan, Zhang, Chaoyun, He, Shilin, Yang, Zhihao, Ma, Minghua, Qin, Si, Kang, Yu, Dang, Yingnong, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei

Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study on the use of queries written in KQL, a DSL employed for incident management in a large-scale cloud management system at Microsoft. The findings underscore the importance and viability of KQL query recommendation for enhancing incident management. Building upon these insights, we introduce Xpert, an end-to-end machine learning framework that automates the KQL recommendation process. By leveraging historical incident data and large language models, Xpert generates customized KQL queries tailored to new incidents. Furthermore, Xpert incorporates a novel performance metric called Xcore, enabling a thorough evaluation of query quality from three comprehensive perspectives. We conduct extensive evaluations of Xpert, demonstrating its effectiveness in offline settings. Notably, we deploy Xpert in the real production environment of a large-scale incident management system at Microsoft, validating its efficiency in supporting incident management. To the best of our knowledge, this paper represents the first empirical study of its kind, and Xpert stands as a pioneering DSL query recommendation framework designed for incident management.

incident, kql query, query, (16 more...)

2312.11988

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Asia > India > NCT > New Delhi (0.04)
Asia > China > Anhui Province > Hefei (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligence Feb-22-2023, 04:55:26 GMT

AI-powered anomaly detection in log data for improved troubleshooting in devops

In summary, implementing a solution for AI-powered anomaly detection in log data for improved troubleshooting in DevOps requires a well-structured plan, a good understanding of the use case, and a good knowledge of the different AI-based anomaly detection techniques. With proper planning, implementation, and maintenance, AI-powered anomaly detection can be a valuable asset for any DevOps team.

anomaly detection, artificial intelligence, data mining, (19 more...)

Genre: Research Report > Promising Solution (0.38)

Industry: Energy > Oil & Gas > Upstream (0.82)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence (1.00)

#artificialintelligence Sep-20-2022, 02:15:04 GMT

Council Post: How To Leverage AI/ML For Predictive Incident Management

Digital technologies have led to the application of new-age technologies that operate with minimal human intervention. And while they may heighten productivity and drive growth, any failure can pose a significant challenge for IT and DevOps teams to resolve. An incident or service disruption is an IT manager's worst nightmare. Very often, factors such as cybersecurity breaches, human error, and the accelerated pace of innovation place significant pressure on enterprises' IT infrastructure, leading to system failures and outages impacting the bottom line. According to the ITIC 2021 Hourly Cost of Downtime Survey, 44% of participants (of 1,200 global organizations) said that hourly downtime costs anywhere from $1 million to over $5 million.

incident management, leverage ai ml, predictive incident management, (14 more...)

Industry:

Information Technology > Security & Privacy (0.77)
Government > Military > Cyberwarfare (0.56)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.90)

Shetty, Manish, Bansal, Chetan, Kumar, Sumit, Rao, Nikitha, Nagappan, Nachiappan, Zimmermann, Thomas

Neural Knowledge Extraction From Cloud Service Incidents

arXiv.org Artificial Intelligence Jul-15-2020

In the last decade, two paradigm shifts have reshaped the software industry: the move from boxed products to services and the widespread adoption of cloud computing. This has had a huge impact on the software development life cycle and the DevOps processes. In particular, incident management has become critical for developing and operating large-scale services. Incidents are created to ensure timely communication of service issues and their resolution. Prior work on incident management has focused heavily on the challenges of incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction from service incidents. We frame the knowledge extraction problem as a Named-entity Recognition task for extracting factual information. SoftNER leverages structural patterns like key-value pairs and tables for bootstrapping the training data. Further, we build a novel multi-task learning based BiLSTM-CRF model which leverages not just the semantic context but also the data-types for named-entity extraction. We have deployed SoftNER at Microsoft, a major cloud service provider, and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning based approach has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER, we are able to build significantly more accurate models for important downstream tasks like incident triaging.
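As a rough illustration of bootstrapping from structural patterns, the sketch below collects weak (entity type, value) labels from lines of an incident description that look like key-value pairs. It is an assumption-laden toy, not SoftNER's actual pipeline: the regular expression, the `bootstrap_labels` helper, and the sample incident text are all hypothetical.

```python
import re

# Assumed pattern: a short key, then ':' or '=', then a value on the same line.
KEY_VALUE = re.compile(r"^\s*([A-Za-z][\w ]{0,40}?)\s*[:=]\s*(\S.*?)\s*$")

def bootstrap_labels(text):
    """Collect (entity_type, value) pairs from lines shaped like 'key: value'."""
    pairs = []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            key, value = match.groups()
            # Normalise the key so it can serve as an entity-type label.
            pairs.append((key.strip().lower().replace(" ", "_"), value))
    return pairs

# Hypothetical incident description mixing free text and structured lines.
incident = """Customer reports VM fails to start.
Subscription Id: 1234-abcd
Error Code = 0x80070005
Region: westus2"""

labels = bootstrap_labels(incident)
for entity_type, value in labels:
    print(entity_type, "->", value)
```

Pairs harvested this way could serve as noisy training examples for a sequence tagger; the free-text first line yields no label because it contains no key-value separator.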

incident, machine learning, natural language, (20 more...)

2007.05505

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > India > Karnataka > Bengaluru (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Software (1.00)
Information Technology > Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

@machinelearnbot May-17-2018, 15:20:27 GMT

BigPanda Brings New Capabilities to Cloud Platform - ITChronicles

BigPanda Inc., provider of the first Autonomous Digital Operations solution, today introduced new capabilities to its cloud platform for IT Operations, including two major product components. First, BigPanda uniquely features Open Box Machine Learning™, a core component of its platform that offers unrivalled transparency, trust and control to enterprise IT customers. Second, BigPanda's new Unified Analytics offering provides deep insights into the real-time health and performance of IT Operations. BigPanda's Autonomous Digital Operations Platform helps large, global enterprises to lower operational costs, improve service availability and reduce the IT risks associated with digital transformation. BigPanda is the only machine learning solution for autonomous incident management that features an "open" approach.

bigpanda, cloud computing, machine learning, (12 more...)

Industry: Information Technology > Services (0.72)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Cloud Computing (0.72)

#artificialintelligence Aug-28-2017, 14:35:21 GMT

Demisto - Taking On Incident Response With Machine Learning & Automation

Demisto is the first product to unify security orchestration, incident management and interactive investigation into one solution. Their machine-learning engine is unique as it learns from the real-life analyst interactions and past investigations. So the platform (and you!) gets smarter with every analyst action. Q: Tell us something more about Demisto? A: Demisto Enterprise delivers a complete solution that helps Tier-1 through Tier-3 analysts and SOC managers to optimize the entire incident life cycle while auto documenting and journaling all the evidence.

artificial intelligence, demisto, machine learning, (16 more...)

Country: Asia (0.05)

Genre: Research Report (0.31)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (0.74)
Information Technology > Artificial Intelligence > Machine Learning (0.67)

#artificialintelligence May-18-2017, 09:25:43 GMT

Incident Management for IoT @ThingsExpo @PagerDuty #AI #IoT #M2M #API

All major researchers estimate there will be tens of billions of devices - computers, smartphones, tablets, and sensors - connected to the Internet by 2020. With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend @CloudExpo @ThingsExpo, June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA. Join Cloud Expo / @ThingsExpo conference chair Roger Strukhoff (@IoT2040), June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA for three days of intense Enterprise Cloud and 'Digital Transformation' discussion and focus, including Big Data's indispensable role in IoT, Smart Grids and Industrial Internet of Things (IIoT), Wearables and Consumer IoT, as well as (new) Digital Transformation in Vertical Markets. Accordingly, attendees at the upcoming 20th Cloud Expo / @ThingsExpo June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA will find fresh new content in a new track called FinTech, which will incorporate machine learning, artificial intelligence, deep learning, and blockchain into one track.

emergency response, IoT, PagerDuty, (26 more...)

Industry:

Information Technology (1.00)
Law > Intellectual Property & Technology Law (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.54)
Information Technology > Communications > Mobile (0.36)
Information Technology > Data Science > Data Mining > Big Data (0.36)