 autonomic computing


Self-Healing Machine Learning: A Framework for Autonomous Adaptation in Real-World Environments

Neural Information Processing Systems

Real-world machine learning systems often encounter model performance degradation due to distributional shifts in the underlying data generating process (DGP). Existing approaches to addressing shifts, such as concept drift adaptation, are limited by their *reason-agnostic* nature. By choosing from a pre-defined set of actions, such methods implicitly assume that the causes of model degradation are irrelevant to what actions should be taken, limiting their ability to select appropriate adaptations. In this paper, we propose an alternative paradigm to overcome these limitations, called *self-healing machine learning* (SHML). Contrary to previous approaches, SHML autonomously diagnoses the reason for degradation and proposes diagnosis-based corrective actions. We formalize SHML as an optimization problem over a space of adaptation actions to minimize the expected risk under the shifted DGP. We introduce a theoretical framework for self-healing systems and build an agentic self-healing solution *$\mathcal{H}$-LLM* which uses large language models to perform self-diagnosis by reasoning about the structure underlying the DGP, and self-adaptation by proposing and evaluating corrective actions. Empirically, we analyze different components of *$\mathcal{H}$-LLM* to understand *why* and *when* it works, demonstrating the potential of self-healing ML.


AutoGuard: A Self-Healing Proactive Security Layer for DevSecOps Pipelines Using Reinforcement Learning

Anugula, Praveen, Bhardwaj, Avdhesh Kumar, Chhibber, Navin, Tewari, Rohit, Khemka, Sunil, Ranjan, Piyush

arXiv.org Artificial Intelligence

Contemporary DevSecOps pipelines have to deal with the evolution of security in a continuously integrated and deployed environment. Existing methods, such as rule-based intrusion detection and static vulnerability scanning, are inadequate and unresponsive to changes in the system, causing longer response times and leaving organizations exposed to emerging attack vectors. In light of these constraints, we introduce AutoGuard to the DevSecOps ecosystem, a reinforcement learning (RL)-powered self-healing security framework built to preemptively protect DevSecOps environments. AutoGuard is a self-securing security environment that continuously observes pipeline activities for potential anomalies while preemptively remediating the environment. The model observes and reacts based on a policy that is learned continually over time. The RL agent improves each action over time through reward-based learning aimed at improving the agent's ability to prevent, detect and respond to a security incident in real time. Testing using simulated Continuous Integration / Continuous Deployment (CI/CD) environments showed AutoGuard to successfully improve threat detection accuracy by 22%, reduce mean time to recovery (MTTR) for incidents by 38% and increase overall resilience to incidents as compared to traditional methods. Keywords: DevSecOps, Reinforcement Learning, Self-Healing Security, Continuous Integration, Automated Threat Mitigation


Generative AI for Self-Adaptive Systems: State of the Art and Research Roadmap

Li, Jialong, Zhang, Mingyue, Li, Nianyu, Weyns, Danny, Jin, Zhi, Tei, Kenji

arXiv.org Artificial Intelligence

Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear. Yet, providing a comprehensive understanding of these benefits and challenges is complex due to several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI within SASs. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize them into two main categories of potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SAS. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.
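As an illustration of the four-stage feedback loop the abstract describes (monitoring, analyzing, planning, execution, with a shared knowledge base as the K in MAPE-K), here is a minimal sketch. It is not taken from the paper: the toy managed system, the latency threshold, the "add one replica" plan, and every function and field name are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared knowledge base (the K in MAPE-K). Fields are hypothetical."""
    latency_threshold_ms: float = 200.0
    history: list = field(default_factory=list)


def monitor(system: dict, k: Knowledge) -> float:
    """Monitor: read a metric from the managed system and record it."""
    latency = system["latency_ms"]
    k.history.append(latency)
    return latency


def analyze(latency: float, k: Knowledge) -> bool:
    """Analyze: decide whether the observed latency calls for adaptation."""
    return latency > k.latency_threshold_ms


def plan(system: dict) -> dict:
    """Plan: choose an adaptation action (here, always add one replica)."""
    return {"replicas": system["replicas"] + 1}


def execute(system: dict, action: dict) -> None:
    """Execute: apply the action. The latency model is an assumed toy model."""
    system.update(action)
    system["latency_ms"] = system["base_latency_ms"] / system["replicas"]


def mape_k_step(system: dict, k: Knowledge) -> bool:
    """Run one pass of the loop; return True if an adaptation was applied."""
    latency = monitor(system, k)
    if analyze(latency, k):
        execute(system, plan(system))
        return True
    return False


if __name__ == "__main__":
    toy_system = {"latency_ms": 600.0, "base_latency_ms": 600.0, "replicas": 1}
    knowledge = Knowledge()
    while mape_k_step(toy_system, knowledge):
        pass
    print(toy_system["replicas"], knowledge.history)
```

In a GenAI-enhanced variant of the kind the survey discusses, the `analyze` or `plan` step is where a language model could plausibly be consulted. That wiring is left out here because it depends on the specific system.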


Leveraging AI Agents for Autonomous Networks: A Reference Architecture and Empirical Studies

Wu, Binghan, Wang, Shoufeng, Liu, Yunxin, Zhang, Ya-Qin, Sifakis, Joseph, Ouyang, Ye

arXiv.org Artificial Intelligence

The evolution toward Level 4 (L4) Autonomous Networks (AN) represents a strategic inflection point in telecommunications, where networks must transcend reactive automation to achieve genuine cognitive capabilities--fulfilling AN's vision of self-configuring, self-healing, and self-optimizing systems that deliver zero-wait, zero-touch, and zero-fault services. This work bridges the gap between architectural theory and operational reality by implementing Joseph Sifakis's AN Agent reference architecture in a functional cognitive system, deploying coordinated proactive-reactive runtimes driven by hybrid knowledge representation. Specifically, the system demonstrates sub-10 ms real-time control in 5G NR sub-6 GHz environments. Empirical results show a 4% increase in downlink throughput over Outer Loop Link Adaptation (OLLA) algorithms for enhanced mobile broadband (eMBB). Furthermore, for the ultra-reliable low-latency communication (URLLC) scenario, the agent achieves an 85% reduction in Block Error Rate (BLER). These improvements confirm the architecture's viability in overcoming traditional autonomy barriers and advancing critical L4-enabling capabilities toward next-generation objectives.
Autonomous Networks (AN), a purpose-specific telecommunications technology pioneered by the TM Forum (TMF) in 2019, target networks with intrinsic self-configuration, self-healing, and self-optimization capabilities--collectively termed the Three-Self Capabilities [1]. These fundamental properties enable the realization of zero-wait, zero-touch, and zero-fault network services, known as the Three-Zero Objectives, which collectively deliver optimal user experiences while maximizing resource utilization throughout the entire network lifecycle. By strategically integrating emerging general-purpose technologies including artificial intelligence (AI), digital twins, and big data analytics, AN not only transforms conventional network operations but fundamentally reorients value creation paradigms from traditional device-centric and management-centric models toward customer-oriented, service-driven, and business-focused frameworks.


Self-Healing Machine Learning: A Framework for Autonomous Adaptation in Real-World Environments

Rauba, Paulius, Seedat, Nabeel, Kacprzyk, Krzysztof, van der Schaar, Mihaela

arXiv.org Artificial Intelligence

Real-world machine learning systems often encounter model performance degradation due to distributional shifts in the underlying data generating process (DGP). Existing approaches to addressing shifts, such as concept drift adaptation, are limited by their reason-agnostic nature. By choosing from a pre-defined set of actions, such methods implicitly assume that the causes of model degradation are irrelevant to what actions should be taken, limiting their ability to select appropriate adaptations. In this paper, we propose an alternative paradigm to overcome these limitations, called self-healing machine learning (SHML). Contrary to previous approaches, SHML autonomously diagnoses the reason for degradation and proposes diagnosis-based corrective actions. We formalize SHML as an optimization problem over a space of adaptation actions to minimize the expected risk under the shifted DGP. We introduce a theoretical framework for self-healing systems and build an agentic self-healing solution H-LLM which uses large language models to perform self-diagnosis by reasoning about the structure underlying the DGP, and self-adaptation by proposing and evaluating corrective actions. Empirically, we analyze different components of H-LLM to understand why and when it works, demonstrating the potential of self-healing ML.


Self-Replicating Mechanical Universal Turing Machine

Lano, Ralph P.

arXiv.org Artificial Intelligence

This paper presents the implementation of a self-replicating finite-state machine (FSM) and a self-replicating Turing Machine (TM) using bio-inspired mechanisms. Building on previous work that introduced self-replicating structures capable of sorting, copying, and reading information, this study demonstrates the computational power of these mechanisms by explicitly constructing a functioning FSM and TM. This study demonstrates the universality of the system by emulating the UTM(5,5) of Neary and Woods.


The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Zhang, Zhiyang, Yang, Fangkai, Qin, Xiaoting, Zhang, Jue, Lin, Qingwei, Cheng, Gong, Zhang, Dongmei, Rajmohan, Saravan, Zhang, Qi

arXiv.org Artificial Intelligence

The Vision of Autonomic Computing (ACV), proposed over two decades ago, envisions computing systems that self-manage akin to biological organisms, adapting seamlessly to changing environments. Despite decades of research, achieving ACV remains challenging due to the dynamic and complex nature of modern computing systems. Recent advancements in Large Language Models (LLMs) offer promising solutions to these challenges by leveraging their extensive knowledge, language understanding, and task automation capabilities. This paper explores the feasibility of realizing ACV through an LLM-based multi-agent framework for microservice management. We introduce a five-level taxonomy for autonomous service maintenance and present an online evaluation benchmark based on the Sock Shop microservice demo project to assess our framework's performance. Our findings demonstrate significant progress towards achieving Level 3 autonomy, highlighting the effectiveness of LLMs in detecting and resolving issues within microservice architectures. This study contributes to advancing autonomic computing by pioneering the integration of LLMs into microservice management frameworks, paving the way for more adaptive and self-managing computing systems. The code will be made available at https://aka.ms/ACV-LLM.


Knowledge Equivalence in Digital Twins of Intelligent Systems

Zhang, Nan, Bahsoon, Rami, Tziritas, Nikos, Theodoropoulos, Georgios

arXiv.org Artificial Intelligence

A digital twin contains up-to-date data-driven models of the physical world being studied and can use simulation to optimise the physical world. However, the analysis made by the digital twin is valid and reliable only when the model is equivalent to the physical world. Maintaining such an equivalent model is challenging, especially when the physical systems being modelled are intelligent and autonomous. The paper focuses in particular on digital twin models of intelligent systems where the systems are knowledge-aware but with limited capability. The digital twin improves the acting of the physical system at a meta-level by accumulating more knowledge in the simulated environment. The modelling of such an intelligent physical system requires replicating the knowledge-awareness capability in the virtual space. Novel equivalence maintaining techniques are needed, especially in synchronising the knowledge between the model and the physical system. This paper proposes the notion of knowledge equivalence and an equivalence maintaining approach by knowledge comparison and updates. A quantitative analysis of the proposed approach confirms that compared to state equivalence, knowledge equivalence maintenance can tolerate deviation thus reducing unnecessary updates and achieve more Pareto efficient solutions for the trade-off between update overhead and simulation reliability.


Self-Healing First-Order Distributed Optimization with Packet Loss

Ridgley, Israel L. Donato, Freeman, Randy A., Lynch, Kevin M.

arXiv.org Artificial Intelligence

We describe SH-SVL, a parameterized family of first-order distributed optimization algorithms that enable a network of agents to collaboratively calculate a decision variable that minimizes the sum of cost functions at each agent. These algorithms are self-healing in that their convergence to the correct optimizer can be guaranteed even if they are initialized randomly, agents join or leave the network, or local cost functions change. We also present simulation evidence that our algorithms are self-healing in the case of dropped communication packets. Our algorithms are the first single-Laplacian methods for distributed convex optimization to exhibit all of these characteristics. We achieve self-healing by sacrificing internal stability, a fundamental trade-off for single-Laplacian methods.
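The problem this abstract describes, a network of agents jointly computing a decision variable that minimizes the sum of their local cost functions, can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $n$ is the number of agents, $f_i$ is the local cost function held by agent $i$, and $x$ is the shared decision variable of assumed dimension $d$.

```latex
x^{\star} \in \arg\min_{x \in \mathbb{R}^{d}} \; \sum_{i=1}^{n} f_i(x)
```

In this reading, self-healing means that the agents' estimates still converge to $x^{\star}$ after a disturbance such as random initialization, a change in $n$ as agents join or leave, or a change in some $f_i$.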


Self-healing metal? It's not just the stuff of science fiction.

The Japan Times

WASHINGTON – In the 1991 film "Terminator 2: Judgment Day," a malevolent time-traveling and shape-shifting android called T-1000 that was made of liquid metal demonstrated a unique quality. Hit with blasts or bullets, its metal would heal itself. Self-healing metal is still just science fiction, right? Scientists on Wednesday described how pieces of pure platinum and copper spontaneously healed cracks caused by metal fatigue during nanoscale experiments that had been designed to study how such cracks form and spread in metal placed under stress. They expressed optimism that this ability can be engineered into metals to create self-healing machines and structures in the relatively near future.