
Collaborating Authors

Aggarwal



Primal-Only Actor Critic Algorithm for Robust Constrained Average Cost MDPs

Satheesh, Anirudh, Sathish, Sooraj, Ganesh, Swetha, Powell, Keenan, Aggarwal, Vaneet

arXiv.org Artificial Intelligence

In this work, we study the problem of finding robust and safe policies in Robust Constrained Average-Cost Markov Decision Processes (RCMDPs). A key challenge in this setting is the lack of strong duality, which prevents the direct use of standard primal-dual methods for constrained RL. Additional difficulties arise from the average-cost setting, where the Robust Bellman operator is not a contraction under any norm. To address these challenges, we propose an actor-critic algorithm for Average-Cost RCMDPs. We show that our method achieves both \(ε\)-feasibility and \(ε\)-optimality, and we establish sample complexities of \(\tilde{O}\left(ε^{-4}\right)\) and \(\tilde{O}\left(ε^{-6}\right)\) with and without a slackness assumption, respectively, comparable to the discounted setting.


AI-Powered Assistant for Long-Term Access to RHIC Knowledge

Atif, Mohammad, Garonne, Vincent, Lancon, Eric, Lauret, Jerome, Prozorov, Alexandr, Vranovsky, Michal

arXiv.org Artificial Intelligence

As the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory concludes 25 years of operation, preserving not only its vast data holdings (~1 exabyte) but also the embedded scientific knowledge becomes a critical priority. The RHIC Data and Analysis Preservation Plan (DAPP) introduces an AI-powered assistant system that provides natural language access to documentation, workflows, and software, with the aim of supporting reproducibility, education, and future discovery. Built upon Large Language Models using Retrieval-Augmented Generation and the Model Context Protocol, this assistant indexes structured and unstructured content from RHIC experiments and enables domain-adapted interaction. We report on the deployment, computational performance, ongoing multi-experiment integration, and architectural features designed for sustainable and explainable long-term AI access. Our experience illustrates how modern AI/ML tools can transform the usability and discoverability of scientific legacy data.
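The retrieval-augmented flow described above can be sketched in a few lines. The toy bag-of-words retriever, corpus entries, and prompt template below are illustrative stand-ins, not the DAPP assistant's actual components; a real system would use a learned embedding model and an LLM call.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real RAG system would use a learned encoder.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    na, nb = norm(a), norm(b)
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, corpus):
    # Augment the user question with retrieved context before the LLM call.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "STAR detector calibration workflow for heavy-ion runs",
    "PHENIX analysis software build instructions",
    "Site logistics and visitor parking notes",
]
prompt = build_prompt("How do I run the STAR detector calibration workflow?", corpus)
```

The retrieved context is prepended to the question, so the model answers from indexed documentation rather than from parametric memory alone.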


Variance Reduced Policy Gradient Method for Multi-Objective Reinforcement Learning

Guidobene, Davide, Benedetti, Lorenzo, Arapovic, Diego

arXiv.org Artificial Intelligence

Multi-Objective Reinforcement Learning (MORL) is a generalization of traditional Reinforcement Learning (RL) that aims to optimize multiple, often conflicting objectives simultaneously rather than focusing on a single reward. This approach is crucial in complex decision-making scenarios where agents must balance trade-offs between various goals, such as maximizing performance while minimizing costs. We consider the problem of MORL where the objectives are combined using a non-linear scalarization function. Just like in standard RL, policy gradient methods (PGMs) are amongst the most effective for handling large and continuous state-action spaces in MORL. However, existing PGMs for MORL suffer from high sample inefficiency, requiring large amounts of data to be effective. Previous attempts to solve this problem rely on overly strict assumptions, losing PGMs' benefits in scalability to large state-action spaces. In this work, we address the issue of sample efficiency by implementing variance-reduction techniques to reduce the sample complexity of policy gradients while maintaining general assumptions.
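As a toy illustration of why variance reduction matters for policy gradients, the sketch below compares a plain REINFORCE estimator against one with a running-mean baseline on a two-armed bandit. The bandit, the baseline choice, and the sample counts are invented for illustration; this is not the paper's estimator, only the generic variance-reduction idea.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # two-armed bandit as a one-step MDP

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log softmax(theta)[a] = onehot(a) - softmax(theta)
    g = -softmax(theta)
    g[a] += 1.0
    return g

def pg_samples(theta, n, use_baseline):
    # REINFORCE gradient samples; subtracting a baseline leaves the
    # expectation unchanged but can shrink the variance substantially.
    probs = softmax(theta)
    grads, seen = [], []
    for _ in range(n):
        a = rng.choice(2, p=probs)
        r = rng.normal(true_means[a], 0.1)
        b = float(np.mean(seen)) if (use_baseline and seen) else 0.0
        seen.append(r)
        grads.append((r - b) * grad_log_pi(theta, a))
    return np.array(grads)

theta = np.zeros(2)
var_plain = pg_samples(theta, 2000, use_baseline=False).var(axis=0).sum()
var_reduced = pg_samples(theta, 2000, use_baseline=True).var(axis=0).sum()
```

Lower estimator variance translates directly into fewer samples for the same gradient accuracy, which is the sample-efficiency gain the abstract targets.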


Accelerating Quantum Reinforcement Learning with a Quantum Natural Policy Gradient Based Approach

Xu, Yang, Aggarwal, Vaneet

arXiv.org Machine Learning

We address the problem of quantum reinforcement learning (QRL) under model-free settings with quantum oracle access to the Markov Decision Process (MDP). This paper introduces a Quantum Natural Policy Gradient (QNPG) algorithm, which replaces the random sampling used in classical Natural Policy Gradient (NPG) estimators with a deterministic gradient estimation approach, enabling seamless integration into quantum systems. While this modification introduces a bounded bias in the estimator, the bias decays exponentially with increasing truncation levels. This paper demonstrates that the proposed QNPG algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ for queries to the quantum oracle, significantly improving on the classical lower bound of $\tilde{\mathcal{O}}(\epsilon^{-2})$ queries to the MDP.
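The exponentially decaying truncation bias has a simple classical analogue: truncating the Neumann series for \((I - \gamma P)^{-1} r\) at level \(K\) leaves an error of order \(\gamma^K\). The sketch below illustrates only this truncation effect; the transition matrix and discount factor are made up, and nothing quantum is simulated.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3], [0.4, 0.6]])  # row-stochastic transition matrix
r = np.array([1.0, 0.0])                # per-state reward

# Exact resolvent value: v = (I - gamma * P)^{-1} r
exact = np.linalg.solve(np.eye(2) - gamma * P, r)

def truncated_value(K):
    # Partial Neumann series sum_{k=0}^{K-1} gamma^k P^k r; the omitted tail
    # is bounded by gamma^K / (1 - gamma), so the bias decays geometrically in K.
    v, term = np.zeros(2), r.copy()
    for _ in range(K):
        v += term
        term = gamma * (P @ term)
    return v

errors = [float(np.linalg.norm(exact - truncated_value(K))) for K in (5, 10, 20)]
```

Each additional truncation level multiplies the residual bias by roughly \(\gamma\), which is why a modest truncation depth already gives a near-unbiased estimator.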


Towards Neural Network based Cognitive Models of Dynamic Decision-Making by Humans

Chen, Changyu, Chirra, Shashank Reddy, Ferreira, Maria José, Gonzalez, Cleotilde, Sinha, Arunesh, Varakantham, Pradeep

arXiv.org Artificial Intelligence

Modeling human cognitive processes in dynamic decision-making tasks has been a long-standing endeavor in AI, because such models can make AI systems more intuitive and personalized, help mitigate human biases, and enhance training in simulation. Some initial work has attempted to utilize neural networks (and large language models) but often assumes one common model for all humans and aims to emulate human behavior in aggregate. However, the behavior of each human is distinct, heterogeneous, and relies on specific past experiences in certain tasks. For instance, consider two individuals responding to a phishing email: one who has previously encountered and identified similar threats may recognize it quickly, while another without such experience might fall for the scam. In this work, we build on Instance Based Learning (IBL), which posits that human decisions are based on similar situations encountered in the past. However, IBL relies on simple, fixed-form functions to capture the mapping from past situations to current decisions. To address this, we propose two new attention-based neural network models that use open-form non-linear functions to model distinct and heterogeneous human decision-making in dynamic settings. We experiment with two distinct datasets gathered from human subject experiments: one focusing on detection of phishing emails by humans, and another in which humans act as attackers in a cybersecurity setting and decide on an attack option. We conducted extensive experiments with our two neural network models, IBL, and GPT-3.5, and demonstrate that the neural network models significantly outperform IBL in representing human decision-making, while providing interpretability of human decisions similar to IBL's. Overall, our work yields promising results for further use of neural networks in cognitive modeling of human decision-making.
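A minimal sketch of the attention idea, assuming a toy feature encoding: the current situation attends over stored (situation, decision) pairs, and past decisions are blended by similarity, much as IBL does but with an open-form score that could be learned. The features, stored memories, and temperature below are all invented for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stored past instances: feature vectors and the decisions taken then.
memory_feats = np.array([[1.0, 0.0],   # looked like phishing
                         [0.9, 0.1],   # also looked like phishing
                         [0.0, 1.0]])  # looked safe
memory_decisions = np.array([1.0, 1.0, 0.0])  # 1 = flagged, 0 = trusted

def decide(query, temperature=0.1):
    # Dot-product attention over memory; low temperature sharpens the weights.
    scores = memory_feats @ query / temperature
    weights = softmax(scores)
    return float(weights @ memory_decisions)  # similarity-weighted decision

p_flag = decide(np.array([0.95, 0.05]))  # a query resembling past phishing
```

Because each person carries their own memory of instances, the same architecture yields distinct, individually conditioned behavior, while the attention weights remain inspectable in the way IBL activations are.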


Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Aggarwal, Vaneet, Mondal, Washim Uddin, Bai, Qinbo

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods: optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.


Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm

Bai, Qinbo, Mondal, Washim Uddin, Aggarwal, Vaneet

arXiv.org Artificial Intelligence

The framework of Reinforcement Learning (RL) is concerned with a class of problems where an agent learns to yield the maximum cumulative reward in an unknown environment via repeated interaction. RL finds applications in diverse areas, such as wireless communication, transportation, and epidemic control (Yang et al., 2020; Al-Abbasi et al., 2019; Ling et al., 2023). RL problems are mainly categorized into three setups: episodic, infinite horizon discounted reward, and infinite horizon average reward. Among them, the infinite horizon average reward setup is particularly significant for real-world applications.


Achieving Zero Constraint Violation for Concave Utility Constrained Reinforcement Learning via Primal-Dual Approach

Bai, Qinbo, Bedi, Amrit Singh, Agarwal, Mridul, Koppel, Alec, Aggarwal, Vaneet (Purdue University)

Journal of Artificial Intelligence Research

Reinforcement learning (RL) is widely used in applications where one needs to perform sequential decision-making while interacting with the environment. The standard RL problem with safety constraints is generally modeled as a constrained Markov Decision Process (CMDP), which is linear in both the objective and the constraints in the occupancy-measure space; the problem becomes challenging when the model is unknown a priori. It becomes more challenging still when the decision requirement includes optimizing a concave utility while satisfying nonlinear safety constraints. To solve such a nonlinear problem, we propose a conservative stochastic primal-dual algorithm (CSPDA) via a randomized primal-dual approach. By leveraging a generative model, we prove that CSPDA not only exhibits Õ(1/ε²) sample complexity but also achieves zero constraint violation for the concave-utility CMDP. Compared with previous work, the best available sample complexity for a CMDP with zero constraint violation is Õ(1/ε⁵); hence, the proposed algorithm provides a significant improvement over the state of the art.
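The generic primal-dual template behind such algorithms alternates a primal step on the policy variables with a projected dual step on the Lagrange multiplier. The toy problem below, with an invented quadratic objective, linear constraint, noise model, and step sizes, illustrates only that template; it is not the CSPDA algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Maximize f(x) = -(x - 1)^2 subject to g(x) = x - 0.5 <= 0, via the
# Lagrangian L(x, lam) = f(x) - lam * g(x). The constrained optimum is
# x* = 0.5 with multiplier lam* = 1.
x, lam = 0.0, 0.0
eta_x, eta_lam = 0.05, 0.05
for _ in range(2000):
    grad_x = -2.0 * (x - 1.0) - lam + rng.normal(0.0, 0.01)  # noisy dL/dx
    x += eta_x * grad_x                        # primal (stochastic) ascent
    lam = max(0.0, lam + eta_lam * (x - 0.5))  # dual ascent, projected to lam >= 0
```

The multiplier grows while the constraint is violated and relaxes when it is slack, steering the primal iterate toward the feasible optimum; conservative variants tighten the constraint slightly so that even the stochastic iterates end up with zero violation.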


Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

Mondal, Washim Uddin, Aggarwal, Vaneet

arXiv.org Artificial Intelligence

We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.