AITopics | failure type

failure type

Root Cause Analysis for Microservice Systems via Cascaded Conditional Learning with Hypergraphs

Xie, Shuaiyu, He, Hanbin, Wang, Jian, Li, Bing

arXiv.org Artificial Intelligence, Nov-25-2025

Abstract: Root cause analysis in microservice systems typically involves two core tasks: root cause localization (RCL) and failure type identification (FTI). Despite substantial research efforts, conventional diagnostic approaches still face two key challenges. First, these methods predominantly adopt a joint learning paradigm for RCL and FTI to exploit shared information and reduce training time. Second, these existing methods primarily focus on point-to-point relationships between instances, overlooking the group nature of inter-instance influences induced by deployment configurations and load balancing. To overcome these limitations, we propose CCLH, a novel root cause analysis framework that orchestrates diagnostic tasks based on cascaded conditional learning. CCLH provides a three-level taxonomy for group influences between instances and incorporates a heterogeneous hypergraph to model these relationships, facilitating the simulation of failure propagation. Extensive experiments conducted on datasets from three microservice benchmarks demonstrate that CCLH outperforms state-of-the-art methods in both RCL and FTI.

Microservice architecture has been widely adopted by cloud-native enterprises due to its flexibility, scalability, and loose coupling. In microservice systems (MSS), each microservice typically reproduces multiple instances, which collaborate with instances affiliated with other microservices to handle user requests [1], [2]. As these systems scale up, they may suffer from reliability issues, aka failures, attributable to the increasing complexity and dynamicity. Worse still, diagnosing failures in microservice systems is labor-intensive and time-consuming, due to the intricate failure propagation and the overwhelming volume of telemetry data. For example, GitHub once took approximately one and a half hours to resolve a failure that disrupted the codespace service, affecting millions of developers and repositories [3]. Traditional root cause analysis (RCA) in MSS encompasses two tasks: root cause localization (RCL) and failure type identification (FTI).
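The abstract above mentions modeling group influences between instances with a hypergraph. As a rough illustration only, the following sketch stores hyperedges as named sets of instances and looks up which instances share a group with a given one. The hyperedge names (`node:host-1`, `service:cart`, `lb:frontend`) and the function names are hypothetical and are not taken from CCLH or its three-level taxonomy.

```python
from collections import defaultdict


def build_incidence(hyperedges):
    """Map each instance to the set of hyperedge names that contain it."""
    incidence = defaultdict(set)
    for name, members in hyperedges.items():
        for instance in members:
            incidence[instance].add(name)
    return incidence


def group_neighbors(hyperedges, instance):
    """Return the instances that share at least one hyperedge with `instance`."""
    incidence = build_incidence(hyperedges)
    related = set()
    for name in incidence.get(instance, ()):
        related |= set(hyperedges[name])
    related.discard(instance)
    return related


# Hypothetical groups: co-location on a host, replicas of a service,
# and instances behind the same load balancer.
hyperedges = {
    "node:host-1": {"cart-0", "pay-0"},
    "service:cart": {"cart-0", "cart-1"},
    "lb:frontend": {"web-0", "web-1"},
}
print(sorted(group_neighbors(hyperedges, "cart-0")))
```

Unlike an ordinary edge, one hyperedge can connect any number of instances, which is what lets a single deployment or load-balancing group be represented as one relationship.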

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.17566

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

MicroRemed: Benchmarking LLMs in Microservices Remediation

Zhang, Lingzhe, Zhai, Yunpeng, Jia, Tong, Duan, Chiming, He, Minghua, Pan, Leyi, Liu, Zhaoyang, Ding, Bolin, Li, Ying

arXiv.org Artificial Intelligence, Nov-4-2025

Large Language Models (LLMs) integrated with agent-based reasoning frameworks have recently shown strong potential for autonomous decision-making and system-level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human-crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. The benchmark is available at https://github.com/LLM4AIOps/MicroRemed.

large language model, machine learning, microservice system, (20 more...)

arXiv.org Artificial Intelligence

2511.01166

Genre: Research Report > New Finding (0.87)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System

Sun, Yongqian, Pan, Xijie, Xiong, Xiao, Tao, Lei, Wang, Jiaju, Zhang, Shenglin, Yuan, Yuan, Li, Yuqi, Jian, Kunlin

arXiv.org Artificial Intelligence, Sep-23-2025

Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. A failure graph is constructed based on the output of the state classifier, and then it performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves high accuracy in diagnosing network failure for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
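The abstract describes localizing the root cause by a random walk on a failure graph. The sketch below is not ClusterRCA's customized walk; it is a generic random walk with restart, shown only to illustrate the idea. The node names, edge directions (symptom toward suspected cause), anomaly scores, and the `restart` value are all assumptions made for the example.

```python
from collections import defaultdict


def random_walk_scores(edges, anomaly, restart=0.15, iters=100):
    """Rank nodes by visit probability under a random walk with restart.

    edges:   list of (src, dst, weight) with positive weights; the walker
             moves from src to dst.
    anomaly: dict node -> non-negative score, used as the restart distribution.
    """
    nodes = sorted({n for s, d, _ in edges for n in (s, d)} | set(anomaly))
    out = defaultdict(list)
    for s, d, w in edges:
        out[s].append((d, w))

    total = sum(anomaly.get(n, 0.0) for n in nodes)
    restart_dist = {
        n: (anomaly.get(n, 0.0) / total if total else 1.0 / len(nodes))
        for n in nodes
    }

    p = dict(restart_dist)
    for _ in range(iters):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for n in nodes:
            successors = out.get(n)
            if not successors:
                # Node without outgoing edges: the walker stays in place.
                nxt[n] += (1 - restart) * p[n]
                continue
            weight_sum = sum(w for _, w in successors)
            for d, w in successors:
                nxt[d] += (1 - restart) * p[n] * w / weight_sum
        p = nxt
    return sorted(p.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical failure graph: anomalous NICs point at the switch they share.
edges = [
    ("nic_a", "switch_1", 1.0),
    ("nic_b", "switch_1", 1.0),
    ("nic_c", "switch_2", 1.0),
]
anomaly = {"nic_a": 1.0, "nic_b": 1.0, "nic_c": 0.2}
ranking = random_walk_scores(edges, anomaly)
print(ranking[0][0])
```

In this toy graph the walker accumulates at the node that most anomalous NICs point to, so that node ends up at the top of the ranking.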

data mining, machine learning, node, (19 more...)

arXiv.org Artificial Intelligence

2506.20673

Country: Europe (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Energy (0.47)
Telecommunications (0.47)
Information Technology (0.46)

Technology:

Information Technology > Scientific Computing (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.68)
(2 more...)

CVCM Track Circuits Pre-emptive Failure Diagnostics for Predictive Maintenance Using Deep Neural Networks

Mukherjee, Debdeep, Di Santi, Eduardo, Lefebvre, Clément, Mijatovic, Nenad, Martin, Victor, Josse, Thierry, Brown, Jonathan, Saiah, Kenza

arXiv.org Machine Learning, Aug-15-2025

Track circuits are critical for railway operations, acting as the main signalling sub-system to locate trains. Continuous Variable Current Modulation (CVCM) is one such technology. Like any field-deployed, safety-critical asset, it can fail, triggering cascading disruptions. Many failures originate as subtle anomalies that evolve over time, often not visually apparent in monitored signals. Conventional approaches, which rely on clear signal changes, struggle to detect them early. Early identification of failure types is essential to improve maintenance planning, minimising downtime and revenue loss. Leveraging deep neural networks, we propose a predictive maintenance framework that classifies anomalies well before they escalate into failures. Validated on 10 CVCM failure cases across different installations, the method is ISO-17359 compliant and outperforms conventional techniques, achieving 99.31% overall accuracy with detection within 1% of anomaly onset. Through conformal prediction, we provide uncertainty estimates, reaching 99% confidence with consistent coverage across classes. Given CVCM's global deployment, the approach is scalable and adaptable to other track circuits and railway systems, enhancing operational reliability.
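The abstract refers to conformal prediction for uncertainty estimates. The following is a minimal sketch of standard split conformal prediction for a classifier, assuming a held-out calibration set of softmax outputs is available. It is not the paper's implementation; the function names, the calibration values, and the `alpha` setting are made up for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the nonconformity threshold from a calibration set.

    The nonconformity score is 1 minus the probability the model assigned
    to the true class. The threshold is the ceil((n + 1) * (1 - alpha))-th
    smallest score.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: accept every class.
        return 1.0
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Return every class whose nonconformity score is within the threshold."""
    return [cls for cls, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical calibration data: class probabilities and true labels.
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
cal_labels = [0, 0, 1, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
print(prediction_set([0.75, 0.25], qhat))
```

A prediction set with more than one class, or with none, signals that the model is uncertain about that input at the chosen coverage level.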

artificial intelligence, machine learning, prediction, (17 more...)

arXiv.org Machine Learning

2508.09054

Country:

Europe > Ukraine (0.04)
Europe > Switzerland > Geneva > Geneva (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Ground > Rail (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Gazing at Failure: Investigating Human Gaze in Response to Robot Failure in Collaborative Tasks

Tabatabaei, Ramtin, Kostakos, Vassilis, Johal, Wafa

arXiv.org Artificial Intelligence, Feb-24-2025

Robots are prone to making errors, which can negatively impact their credibility as teammates during collaborative tasks with human users. Detecting and recovering from these failures is crucial for maintaining an effective level of trust from users. However, robots may fail without being aware of it. One way to detect such failures could be by analysing humans' non-verbal behaviours and reactions to failures. This study investigates how human gaze dynamics can signal a robot's failure and examines how different types of failures affect people's perception of the robot. We conducted a user study with 27 participants collaborating with a robotic mobile manipulator to solve tangram puzzles. The robot was programmed to experience two types of failures (executional and decisional), occurring either at the beginning or end of the task, with or without acknowledgement of the failure. Our findings reveal that the type and timing of the robot's failure significantly affect participants' gaze behaviour and perception of the robot. Specifically, executional failures led to more gaze shifts and increased focus on the robot, while decisional failures resulted in lower entropy in gaze transitions among areas of interest, particularly when the failure occurred at the end of the task. These results highlight that gaze can serve as a reliable indicator of robot failures and their types, and could also be used to predict the appropriate recovery actions.
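The abstract reports entropy in gaze transitions among areas of interest (AOIs). Several definitions of gaze entropy exist and the abstract does not say which one the study uses. The sketch below assumes one simple variant: the Shannon entropy of the empirical distribution of AOI-to-AOI shifts. The AOI labels and the function name are hypothetical.

```python
import math
from collections import Counter


def transition_entropy(aoi_sequence):
    """Shannon entropy (in bits) of the observed AOI-to-AOI gaze shifts.

    Consecutive samples on the same AOI are not counted as a shift.
    """
    transitions = [
        (a, b) for a, b in zip(aoi_sequence, aoi_sequence[1:]) if a != b
    ]
    if not transitions:
        return 0.0
    counts = Counter(transitions)
    n = len(transitions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical gaze trace alternating between two AOIs.
trace = ["robot", "puzzle", "robot", "puzzle", "robot"]
print(transition_entropy(trace))
```

Under this definition, a gaze pattern that cycles through a few fixed shifts gives low entropy, while gaze that moves unpredictably among many AOIs gives high entropy.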

interaction, participant, robot, (15 more...)

arXiv.org Artificial Intelligence

2502.16899

Country:

North America > United States (0.06)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.95)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Exploring the extent of similarities in software failures across industries using LLMs

Detloff, Martin

arXiv.org Artificial Intelligence, Aug-7-2024

The rapid evolution of software development necessitates enhanced safety measures. Information about software failures at companies is becoming increasingly available through news articles. This research utilizes the Failure Analysis Investigation with LLMs (FAIL) model to extract industry-specific information. Although the FAIL model's database is rich in information, it could benefit from further categorization and industry-specific insights to further assist software engineers. In previous work, news articles were collected from reputable sources and categorized by incidents inside a database. Prompt engineering and Large Language Models (LLMs) were then applied to extract relevant information regarding the software failure. This research extends these methods by categorizing articles into specific domains and types of software failures. The results are visually represented through graphs. The analysis shows that throughout the database some software failures occur significantly more often in specific industries. This categorization provides a valuable resource for software engineers and companies to identify and address common failures. This research highlights the synergy between software engineering and Large Language Models (LLMs) to automate and enhance the analysis of software failures. By transforming data from the database into an industry-specific model, we provide a valuable resource that can be used to identify common vulnerabilities, predict potential risks, and implement proactive measures for preventing software failures. Leveraging the power of the current FAIL database and data visualization, we aim to provide an avenue for safer and more secure software in the future.

information, software failure, vulnerability, (14 more...)

arXiv.org Artificial Intelligence

2408.03528

Country:

North America > United States > New York > New York County > New York City (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > New Mexico (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Beyond 5G Network Failure Classification for Network Digital Twin Using Graph Neural Network

Isah, Abubakar, Aliyu, Ibrahim, Shim, Jaechan, Ryu, Hoyong, Kim, Jinsul

arXiv.org Artificial Intelligence, Jun-6-2024

Fifth-generation (5G) core networks in network digital twins (NDTs) are complex systems with numerous components, generating considerable data. Analyzing these data can be challenging due to rare failure types, leading to imbalanced classes in multiclass classification. To address this problem, we propose a novel method of integrating a graph Fourier transform (GFT) into a message-passing neural network (MPNN) designed for NDTs. This approach transforms the data into a graph using the GFT to address class imbalance, whereas the MPNN extracts features and models dependencies between network components. This combined approach identifies failure types in real and simulated NDT environments, demonstrating its potential for accurate failure classification in 5G and beyond (B5G) networks. Moreover, the MPNN is adept at learning complex local structures among neighbors in an end-to-end setting. Extensive experiments have demonstrated that the proposed approach can identify failure types in three multiclass domain datasets at multiple failure points in real networks and NDT environments. The results demonstrate that the proposed GFT-MPNN can accurately classify network failures in B5G networks, especially when employed within NDTs to detect failure types.
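The abstract describes a message-passing neural network that models dependencies between network components. As a loose illustration of the message-passing idea only, the sketch below performs one round of mean aggregation over neighbors with a fixed 50/50 mix instead of learned weights. It does not include the graph Fourier transform step and is not the GFT-MPNN model; the node names and feature values are invented for the example.

```python
def message_passing_step(features, edges):
    """Run one round of mean-aggregation message passing on an undirected graph.

    features: dict node -> list of floats (all the same length)
    edges:    list of (u, v) pairs
    Each node's new feature vector is the average of its own vector and the
    mean of its neighbors' vectors.
    """
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, h in features.items():
        nbrs = neighbors[node]
        if nbrs:
            agg = [
                sum(features[m][i] for m in nbrs) / len(nbrs)
                for i in range(len(h))
            ]
        else:
            agg = [0.0] * len(h)
        updated[node] = [0.5 * own + 0.5 * msg for own, msg in zip(h, agg)]
    return updated


# Hypothetical 5G core components connected in a chain.
features = {"amf": [1.0, 0.0], "smf": [0.0, 1.0], "upf": [0.0, 0.0]}
edges = [("amf", "smf"), ("smf", "upf")]
print(message_passing_step(features, edges))
```

Stacking several such rounds lets information from components more than one hop away reach each node, which is how a message-passing network captures dependencies across the topology.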

classification, dataset, graph, (17 more...)

arXiv.org Artificial Intelligence

2406.06595

Country:

Asia > South Korea > Gwangju > Gwangju (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > South Korea > Daejeon > Daejeon (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Promising Solution (0.87)

Industry:

Telecommunications (1.00)
Information Technology > Security & Privacy (0.67)
Information Technology > Networks (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Quaye, Jessica, Parrish, Alicia, Inel, Oana, Rastogi, Charvi, Kirk, Hannah Rose, Kahng, Minsuk, van Liemt, Erin, Bartolo, Max, Tsang, Jess, White, Justin, Clement, Nathan, Mosquera, Rafael, Ciro, Juan, Reddi, Vijay Janapa, Aroyo, Lora

arXiv.org Artificial Intelligence, May-13-2024

With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on "implicitly adversarial" prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models. In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies and discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models.

annotation, classifier, participant, (15 more...)

arXiv.org Artificial Intelligence

2403.12075

Country:

South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.05)
South America > Colombia (0.04)
North America > Central America (0.04)
(34 more...)

Genre:

Research Report (1.00)
Overview (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.66)

Landslide Topology Uncovers Failure Movements

Rana, Kamal, Bhuyan, Kushanav, Ferrer, Joaquin Vicente, Cotton, Fabrice, Ozturk, Ugur, Catani, Filippo, Malik, Nishant

arXiv.org Artificial Intelligence, Oct-14-2023

Every year, landslides cause economic damages worth 20 billion US dollars [1], and between 2004 and 2019 non-seismic landslides alone caused about 70,000 fatalities worldwide [2]. Within the first two months of 2023, we have seen reports of devastating landslides in São Paulo, Brazil [3], Southern Peru [4], and New Zealand [5], injuring many and killing approximately 70 people. Adding to this, recent studies count over one million landslide occurrences with annual volumes estimated at fifty-six billion cubic meters globally [6], presenting a risk to sixty million people [7]. With the increase in urbanization, global climate change, and environmental change trends, the frequency of landslides and the associated risks will keep increasing globally over time [7]. In line with this, landslides are anticipated to evolve and remobilize with increased frequency under changing climatic conditions on a decadal scale [8, 9]. Our ability to identify hazards from emerging landslides and dynamically assess impact areas is essential in averting risk to rapidly urbanizing communities and adapting to changing environmental conditions [10, 7]. To address the rising landslide risk, predictive models for hazard, risk, and early warning systems are developed which assist in forecasting landslide occurrences and locating landslide-prone regions to mitigate the associated impacts [11]. However, the efficacy of these models is contingent on the quality of the underlying landslide databases.

failure type, landslide, topological property, (15 more...)

arXiv.org Artificial Intelligence

2310.09631

Country:

South America > Peru (0.24)
Oceania > New Zealand (0.24)
South America > Brazil > São Paulo (0.24)
(10 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Energy (0.93)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Software > Programming Languages (0.92)
Information Technology > Data Science (0.89)

Effects of Explanation Strategies to Resolve Failures in Human-Robot Collaboration

Khanna, Parag, Yadollahi, Elmira, Björkman, Mårten, Leite, Iolanda, Smith, Christian

arXiv.org Artificial Intelligence, Sep-18-2023

Despite significant improvements in their capabilities, robots are likely to fail in human-robot collaborative tasks due to high unpredictability in human environments and varying human expectations. In this work, we explore the role of explanation of failures by a robot in a human-robot collaborative task. We present a user study incorporating common failures in collaborative tasks with human assistance to resolve the failure. In the study, a robot and a human work together to fill a shelf with objects. Upon encountering a failure, the robot explains the failure and the resolution to overcome the failure, either through handovers or humans completing the task. The study is conducted using different levels of robotic explanation based on the failure action, failure cause, and action history, and different strategies in providing the explanation over the course of repeated interaction. Our results show that the success in resolving the failures is not only a function of the level of explanation but also the type of failures. Furthermore, while novice users rate the robot higher overall in terms of their satisfaction with the explanation, their satisfaction is not only a function of the robot's explanation level at a certain round but also the prior information they received from the robot.

explanation, participant, robot, (17 more...)

arXiv.org Artificial Intelligence

2309.10127

Country:

North America > United States (0.14)
Europe > Sweden (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.93)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.91)
