Goto

Collaborating Authors

 rl environment




Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient

Zhou, Zhongzhu, Yang, Yibo, Chen, Ziyan, Bie, Fengxiang, Xia, Haojun, Wu, Xiaoxia, Wu, Robert, Athiwaratkun, Ben, Ghanem, Bernard, Song, Shuaiwen Leon

arXiv.org Artificial Intelligence

Policy gradient (PG) methods in reinforcement learning frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations used to compute likelihoods in an action selection layer. Numerous studies have been conducted on the convergence and global optima of policy networks, but few have analyzed representational structures of those underlying networks. While training an optimal policy DNN, we observed that under certain constraints, a gentle structure resembling neural collapse, which we refer to as Action Collapse (AC), emerges. This suggests that 1) the state-action activations (i.e. last-layer features) sharing the same optimal actions collapse towards those optimal actions respective mean activations; 2) the variability of activations sharing the same optimal actions converges to zero; 3) the weights of action selection layer and the mean activations collapse to a simplex equiangular tight frame (ETF). Our early work showed those aforementioned constraints to be necessary for these observations. Since the collapsed ETF of optimal policy DNNs maximally separates the pair-wise angles of all actions in the state-action space, we naturally raise a question: can we learn an optimal policy using an ETF structure as a (fixed) target configuration in the action selection layer? Our analytical proof shows that learning activations with a fixed ETF as action selection layer naturally leads to the AC. We thus propose the Action Collapse Policy Gradient (ACPG) method, which accordingly affixes a synthetic ETF as our action selection layer. ACPG induces the policy DNN to produce such an ideal configuration in the action selection layer while remaining optimal. Our experiments across various OpenAI Gym environments demonstrate that our technique can be integrated into any discrete PG methods and lead to favorable reward improvements more quickly and robustly.





BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning

Liu, Ruohong, Umenberger, Jack, Chen, Yize

arXiv.org Artificial Intelligence

Recent years have seen significant advancements in designing reinforcement learning (RL)-based agents for building energy management. While individual success is observed in simulated or controlled environments, the scalability of RL approaches in terms of efficiency and generalization across building dynamics and operational scenarios remains an open question. In this work, we formally characterize the generalization space for the cross-environment, multi-objective building energy management task, and formulate the multi-objective contextual RL problem. Such a formulation helps understand the challenges of transferring learned policies across varied operational contexts such as climate and heat convection dynamics under multiple control objectives such as comfort level and energy consumption. We provide a principled way to parameterize such contextual information in realistic building RL environments, and construct a novel benchmark to facilitate the evaluation of generalizable RL algorithms in practical building control tasks. Our results show that existing multi-objective RL methods are capable of achieving reasonable trade-offs between conflicting objectives. However, their performance degrades under certain environment variations, underscoring the importance of incorporating dynamics-dependent contextual information into the policy learning process.


DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning

Wang, Borui, McKeown, Kathleen, Ying, Rex

arXiv.org Artificial Intelligence

Reinforcement learning from expert demonstrations has long remained a challenging research problem, and existing state-of-the-art methods using behavioral cloning plus further RL training often suffer from poor generalization, low sample efficiency, and poor model interpretability. Inspired by the strong reasoning abilities of large language models (LLMs), we propose a novel strategy-based reinforcement learning framework integrated with LLMs called DYnamic STrategy Induction with Llms for reinforcement learning (DYSTIL) to overcome these limitations. DYSTIL dynamically queries a strategy-generating LLM to induce textual strategies based on advantage estimations and expert demonstrations, and gradually internalizes induced strategies into the RL agent through policy optimization to improve its performance through boosting policy generalization and enhancing sample efficiency. It also provides a direct textual channel to observe and interpret the evolution of the policy's underlying strategies during training. We test DYSTIL over challenging RL environments from Minigrid and BabyAI, and empirically demonstrate that DYSTIL significantly outperforms state-of-the-art baseline methods by 17.75% in average success rate while also enjoying higher sample efficiency during the learning process.


pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild

Schiøtt, Jonas Myhre, Petersen, Viktor Sebastian, Papadopoulos, Dim P.

arXiv.org Artificial Intelligence

Computer vision models have seen increased usage in sports, and reinforcement learning (RL) is famous for beating humans in strategic games such as Chess and Go. In this paper, we are interested in building upon these advances and examining the game of classic 8-ball pool. We introduce pix2pockets, a foundation for an RL-assisted pool coach. Given a single image of a pool table, we first aim to detect the table and the balls and then propose the optimal shot suggestion. For the first task, we build a dataset with 195 diverse images where we manually annotate all balls and table dots, leading to 5748 object segmentation masks. For the second task, we build a standardized RL environment that allows easy development and benchmarking of any RL algorithm. Our object detection model yields an AP50 of 91.2 while our ball location pipeline obtains an error of only 0.4 cm. Furthermore, we compare standard RL algorithms to set a baseline for the shot suggestion task and we show that all of them fail to pocket all balls without making a foul move. We also present a simple baseline that achieves a per-shot success rate of 94.7% and clears a full game in a single turn 30% of the time.


Reinforcement Learning Environment with LLM-Controlled Adversary in D&D 5th Edition Combat

Dayo, Joseph Emmanuel DL, Ogbinar, Michel Onasis S., Naval, Prospero C. Jr

arXiv.org Artificial Intelligence

The objective of this study is to design and implement a reinforcement learning (RL) environment using D\&D 5E combat scenarios to challenge smaller RL agents through interaction with a robust adversarial agent controlled by advanced Large Language Models (LLMs) like GPT-4o and LLaMA 3 8B. This research employs Deep Q-Networks (DQN) for the smaller agents, creating a testbed for strategic AI development that also serves as an educational tool by simulating dynamic and unpredictable combat scenarios. We successfully integrated sophisticated language models into the RL framework, enhancing strategic decision-making processes. Our results indicate that while RL agents generally outperform LLM-controlled adversaries in standard metrics, the strategic depth provided by LLMs significantly enhances the overall AI capabilities in this complex, rule-based setting. The novelty of our approach and its implications for mastering intricate environments and developing adaptive strategies are discussed, alongside potential innovations in AI-driven interactive simulations. This paper aims to demonstrate how integrating LLMs can create more robust and adaptable AI systems, providing valuable insights for further research and educational applications.