Goto

Collaborating Authors

 Markov Models


Guidelines for Applying RL and MARL in Cybersecurity Applications

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL) offer promising solutions for complex, dynamic environments where decision-making entities must interact and adapt. In cybersecurity, particularly in Automated Cyber Defence(ACD), these techniques can address challenges posed by high-dimensional observations and actions. This document provides guidelines for: Cybersecurity professionals exploring RL and MARL for real-world applications. RL and MARL researchers aiming to tackle the nuanced demands of cybersecurity scenarios. By outlining when RL and MARL are appropriate, addressing cyber-specific challenges, and offering practical considerations for implementation, these guidelines aim to bridge the gap between theoretical research and practical deployment in adversarial settings. We expect that this document will offer support to researchers who are keen to explore topics at the intersection of RL, MARL and ACD by highlighting open research questions and topics that demand further investigation.


Quantum-Inspired Reinforcement Learning in the Presence of Epistemic Ambivalence

arXiv.org Artificial Intelligence

The complexity of online decision-making under uncertainty stems from the requirement of finding a balance between exploiting known strategies and exploring new possibilities. Naturally, the uncertainty type plays a crucial role in developing decision-making strategies that manage complexity effectively. In this paper, we focus on a specific form of uncertainty known as epistemic ambivalence (EA), which emerges from conflicting pieces of evidence or contradictory experiences. It creates a delicate interplay between uncertainty and confidence, distinguishing it from epistemic uncertainty that typically diminishes with new information. Indeed, ambivalence can persist even after additional knowledge is acquired. To address this phenomenon, we propose a novel framework, called the epistemically ambivalent Markov decision process (EA-MDP), aiming to understand and control EA in decision-making processes. This framework incorporates the concept of a quantum state from the quantum mechanics formalism, and its core is to assess the probability and reward of every possible outcome. We calculate the reward function using quantum measurement techniques and prove the existence of an optimal policy and an optimal value function in the EA-MDP framework. We also propose the EA-epsilon-greedy Q-learning algorithm. To evaluate the impact of EA on decision-making and the expedience of our framework, we study two distinct experimental setups, namely the two-state problem and the lattice problem. Our results show that using our methods, the agent converges to the optimal policy in the presence of EA.


Geometric Re-Analysis of Classical MDP Solving Algorithms

arXiv.org Artificial Intelligence

We extend a recently introduced geometric interpretation of Markov Decision Processes (MDPs) that provides a new perspective on MDP algorithms and their dynamics. Based on this view, we develop a novel analytical framework that simplifies the proofs of existing results and enables us to derive new ones. Specifically, we analyze the behavior of two classical MDP-solving algorithms: Policy Iteration (PI) and Value Iteration (VI). For each algorithm, we first describe its dynamics in geometric terms and then present an analysis along with several convergence results. We begin by introducing an MDP transformation that modifies the discount factor ฮณ and demonstrate how this transformation improves the convergence properties of both algorithms, provided that it can be applied such that the resulting system remains a regular MDP. Second, we present a new analysis of PI in a 2-state MDP case, showing that the number of iterations required for convergence is bounded by the number of state-action pairs. Finally, we reveal an additional convergence factor in the VI algorithm for cases with a connected optimal policy, which is attributed to an extra rotation component in the VI dynamics.


Training a Generally Curious Agent

arXiv.org Artificial Intelligence

Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present PAPRIKA, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, PAPRIKA teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.


Review of Machine Learning for Micro-Electronic Design Verification

arXiv.org Artificial Intelligence

Microelectronic design verification remains a critical bottleneck in device development, traditionally mitigated by expanding verification teams and computational resources. Since the late 1990s, machine learning (ML) has been proposed to enhance verification efficiency, yet many techniques have not achieved mainstream adoption. This review, from the perspective of verification and ML practitioners, examines the application of ML in dynamic-based techniques for functional verification of microelectronic designs, and provides a starting point for those new to this interdisciplinary field. Historical trends, techniques, ML types, and evaluation baselines are analysed to understand why previous research has not been widely adopted in industry. The review highlights the application of ML, the techniques used and critically discusses their limitations and successes. Although there is a wealth of promising research, real-world adoption is hindered by challenges in comparing techniques, identifying suitable applications, and the expertise required for implementation. This review proposes that the field can progress through the creation and use of open datasets, common benchmarks, and verification targets. By establishing open evaluation criteria, industry can guide future research. Parallels with ML in software verification suggest potential for collaboration. Additionally, greater use of open-source designs and verification environments can allow more researchers from outside the hardware verification discipline to contribute to the challenge of verifying microelectronic designs.


DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions

arXiv.org Machine Learning

We consider Inverse Optimal Stopping (IOS) problem where, based on stopped expert trajectories, one aims to recover the optimal stopping region through continuation and stopping gain functions approximation. The uniqueness of the stopping region allows the use of IOS in real-world applications with safety concerns. While current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, non-Markovian nature of the continuation gain, a proper treatment of boundary conditions, the need for a stable offline approach for risk-sensitive applications, and a lack of a quality evaluation metric. These challenges are addressed with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function without querying to the environment. Moreover, a confidence-based oversampling approach is proposed to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data including an optimal intervention for critical events problem.


POPGym Arcade: Parallel Pixelated POMDPs

arXiv.org Artificial Intelligence

We introduce POPGym Arcade, a benchmark consisting of 7 pixel-based environments each with three difficulties, utilizing a single observation and action space. Each environment offers both fully observable and partially observable variants, enabling counterfactual studies on partial observability. POPGym Arcade utilizes JIT compilation on hardware accelerators to achieve substantial speedups over CPU-bound environments. Moreover, this enables Podracer-style architectures to further increase hardware utilization and training speed. We evaluate memory models on our environments using a Podracer variant of Q learning, and examine the results. Finally, we generate memory saliency maps, uncovering how memories propagate through policies. Our library is available at https://github.com/bolt-research/


Perceptual Motor Learning with Active Inference Framework for Robust Lateral Control

arXiv.org Artificial Intelligence

This paper presents a novel Perceptual Motor Learning (PML) framework integrated with Active Inference (AIF) to enhance lateral control in Highly Automated Vehicles (HAVs). PML, inspired by human motor learning, emphasizes the seamless integration of perception and action, enabling efficient decision-making in dynamic environments. Traditional autonomous driving approaches--including modular pipelines, imitation learning, and reinforcement learning--struggle with adaptability, generalization, and computational efficiency. In contrast, PML with AIF leverages a generative model to minimize prediction error ("surprise") and actively shape vehicle control based on learned perceptual-motor representations. Our approach unifies deep learning with active inference principles, allowing HAVs to perform lane-keeping maneuvers with minimal data and without extensive retraining across different environments. Extensive experiments in the CARLA simulator demonstrate that PML with AIF enhances adaptability without increasing computational overhead while achieving performance comparable to conventional methods. These findings highlight the potential of PML-driven active inference as a robust alternative for real-world autonomous driving applications.


Quantitative Resilience Modeling for Autonomous Cyber Defense

arXiv.org Artificial Intelligence

Cyber resilience is the ability of a system to recover from an attack with minimal impact on system operations. However, characterizing a network's resilience under a cyber attack is challenging, as there are no formal definitions of resilience applicable to diverse network topologies and attack patterns. In this work, we propose a quantifiable formulation of resilience that considers multiple defender operational goals, the criticality of various network resources for daily operations, and provides interpretability to security operators about their system's resilience under attack. We evaluate our approach within the CybORG environment, a reinforcement learning (RL) framework for autonomous cyber defense, analyzing trade-offs between resilience, costs, and prioritization of operational goals. Furthermore, we introduce methods to aggregate resilience metrics across time-variable attack patterns and multiple network topologies, comprehensively characterizing system resilience. Using insights gained from our resilience metrics, we design RL autonomous defensive agents and compare them against several heuristic baselines, showing that proactive network hardening techniques and prompt recovery of compromised machines are critical for effective cyber defenses.


Human-aligned Safe Reinforcement Learning for Highway On-Ramp Merging in Dense Traffic

arXiv.org Artificial Intelligence

Most reinforcement learning (RL) approaches for the decision-making of autonomous driving consider safety as a reward instead of a cost, which makes it hard to balance the tradeoff between safety and other objectives. Human risk preference has also rarely been incorporated, and the trained policy might be either conservative or aggressive for users. To this end, this study proposes a human-aligned safe RL approach for autonomous merging, in which the high-level decision problem is formulated as a constrained Markov decision process (CMDP) that incorporates users' risk preference into the safety constraints, followed by a model predictive control (MPC)-based low-level control. The safety level of RL policy can be adjusted by computing cost limits of CMDP's constraints based on risk preferences and traffic density using a fuzzy control method. To filter out unsafe or invalid actions, we design an action shielding mechanism that pre-executes RL actions using an MPC method and performs collision checks with surrounding agents. We also provide theoretical proof to validate the effectiveness of the shielding mechanism in enhancing RL's safety and sample efficiency. Simulation experiments in multiple levels of traffic densities show that our method can significantly reduce safety violations without sacrificing traffic efficiency. Furthermore, due to the use of risk preference-aware constraints in CMDP and action shielding, we can not only adjust the safety level of the final policy but also reduce safety violations during the training stage, proving a promising solution for online learning in real-world environments.