
Collaborating Authors

Zhang, Fuxiang


Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models

arXiv.org Artificial Intelligence

Low sample efficiency is an enduring challenge of reinforcement learning (RL). With the advent of versatile large language models (LLMs), recent works impart common-sense knowledge to accelerate policy learning in RL. However, we note that such guidance is often tailored to one specific task and therefore does not generalize. In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which captures a general understanding of the entire environment, so that various downstream RL tasks can benefit from a one-time knowledge representation. We ground LLMs by feeding them a few pre-collected experiences and asking them to delineate the background knowledge of the environment. Afterward, we represent the output knowledge as potential functions for potential-based reward shaping, which has the desirable property of preserving the optimal policy induced by the task rewards. We instantiate three variants of prompting LLMs for background knowledge: writing code, annotating preferences, and assigning goals. Our experiments show that these methods achieve significant improvements in sample efficiency across a spectrum of downstream tasks from the Minigrid and Crafter domains.
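To make the reward-shaping mechanism concrete, the sketch below shows how a potential function could be folded into the task reward so that only the shaping term, not the optimal policy, changes. This is a minimal illustration, not the paper's implementation: the function names and the toy grid-world potential (standing in for LLM-derived background knowledge) are assumptions.

```python
# Minimal sketch of potential-based reward shaping, assuming an (LLM-derived)
# potential function `phi` over states; all names are illustrative.

def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    """Augment the task reward r with the shaping term gamma * phi(s') - phi(s).

    Because the shaping term is a potential difference, the optimal policy
    under the shaped reward matches the optimal policy under the task reward.
    """
    next_potential = 0.0 if done else gamma * phi(s_next)
    return r + next_potential - phi(s)


# A toy potential that could stand in for background knowledge such as
# "states closer to the goal are more promising" in a grid world.
def phi(state):
    goal = (7, 7)
    return -abs(state[0] - goal[0]) - abs(state[1] - goal[1])


if __name__ == "__main__":
    s, s_next = (2, 3), (3, 3)                    # one step toward the goal
    print(shaped_reward(0.0, s, s_next, phi))     # positive shaping bonus
```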


Q-Adapter: Training Your LLM Adapter as a Residual Q-Function

arXiv.org Artificial Intelligence

We consider the problem of adapting Large Language Models (LLMs) pre-trained with Reinforcement Learning from Human Feedback (RLHF) to downstream preference data. Naive approaches to this include supervised fine-tuning on preferred responses or reinforcement learning with a learned reward model. However, these risk the LLM forgetting its initial knowledge as fine-tuning progresses. To customize the LLM while preserving its existing capabilities, this paper proposes a novel method, named Q-Adapter. We start by formalizing LLM adaptation as the problem of maximizing a linear combination of two rewards, one corresponding to the reward optimized by the pre-trained LLM and the other to the downstream preference data. Although both rewards are unknown, we show that this problem can be solved by directly learning a new module from the preference data that approximates the residual Q-function. We regard this module as an adapter because, combined with the original pre-trained LLM, it forms the optimal customized LLM. Empirically, experiments on a range of domain-specific tasks and safety alignment tasks illustrate the superiority of Q-Adapter in both anti-forgetting and learning from new preferences.
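The sketch below illustrates the residual-adapter idea at a high level: a frozen base LLM provides next-token logits, and a small trainable head supplies residual Q-values that are blended in at decoding time. The specific combination rule (log-softmax of the base logits plus a temperature-scaled residual Q) and all module names are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: combining a frozen base LM with a learned residual-Q adapter
# head at decoding time. The combination rule below is an assumption for
# illustration; all module names are hypothetical.

import torch
import torch.nn as nn


class ResidualQAdapter(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        # Small trainable head; the base LLM stays frozen.
        self.q_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        # Residual Q-value for every candidate next token.
        return self.q_head(hidden_states)


def customized_next_token_dist(base_logits, residual_q, beta=1.0):
    """Blend the frozen base policy with the residual Q-values.

    base_logits: [batch, vocab] logits from the pre-trained (frozen) LLM.
    residual_q:  [batch, vocab] output of the adapter head.
    beta:        temperature trading off old capabilities vs. new preferences.
    """
    combined = torch.log_softmax(base_logits, dim=-1) + residual_q / beta
    return torch.softmax(combined, dim=-1)
```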


Policy Regularization with Dataset Constraint for Offline Reinforcement Learning

arXiv.org Artificial Intelligence

We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common class of existing offline RL methods is policy regularization, which typically constrains the learned policy to the distribution or support of the behavior policy. However, distribution and support constraints are overly conservative, since both force the policy to choose actions similar to those of the behavior policy at particular states. This limits the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective, and thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy at a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of this sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset paired with the given state. It is a softer constraint, but it still retains enough conservatism against out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate offline RL's fundamentally challenging value overestimation issue with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC
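To make the nearest state-action search concrete, here is a minimal sketch of a dataset-constraint regularizer in this spirit: it indexes the offline dataset on scaled state-action vectors and penalizes the distance between the policy's proposed action and the action of the retrieved nearest sample. The scaling factor, the KD-tree search, and the squared-L2 penalty are illustrative assumptions; the authors' own implementation is in the linked repository.

```python
# Hedged sketch of a dataset-constraint regularizer in the spirit of PRDC;
# see https://github.com/LAMDA-RL/PRDC for the authors' implementation.

import numpy as np
from scipy.spatial import cKDTree


class DatasetConstraint:
    def __init__(self, states, actions, beta=2.0):
        # Index the dataset on concatenated [beta * state, action] vectors so
        # the search is biased toward matching the query state.
        self.beta = beta
        self.actions = actions
        self.tree = cKDTree(np.concatenate([beta * states, actions], axis=1))

    def penalty(self, state, policy_action):
        """Squared L2 distance between the policy action and its nearest dataset action."""
        query = np.concatenate([self.beta * state, policy_action])
        _, idx = self.tree.query(query)
        nearest_action = self.actions[idx]
        return float(np.sum((policy_action - nearest_action) ** 2))
```

In use, a term such as `lambda_reg * constraint.penalty(s, pi(s))` would be added to the actor loss alongside the usual value-maximization objective.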


Multi-agent Continual Coordination via Progressive Task Contextualization

arXiv.org Artificial Intelligence

Cooperative Multi-agent Reinforcement Learning (MARL) has attracted prominent attention in recent years [1] and has achieved great progress in multiple areas, such as path finding [2], active voltage control [3], and dynamic algorithm configuration [4]. Among the many existing methods, some works focus on improving coordination ability by addressing specific challenges, including non-stationarity [5], credit assignment [6], and scalability [7]. Other works investigate cooperative MARL from additional angles, such as efficient communication [8], zero-shot coordination (ZSC) [9], and policy robustness [10]. Many methods have emerged as promising solutions for different scenarios, including policy-based ones [11,12], value-based series [13,14], and other variants, showing remarkable coordination ability in a wide range of tasks such as SMAC [15]. Despite this success, mainstream cooperative MARL methods are still restricted to being trained on a single task, or on multiple tasks simultaneously, assuming the agents have access to data from all tasks at all times; this is unrealistic for physical agents in the real world, which can only attend to one task at a time. Continual reinforcement learning offers a promising approach to this problem [16], where the agent aims to avoid catastrophic forgetting while enabling knowledge transfer to new tasks (a.k.a. forward transfer).