Goto



Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning

Neural Information Processing Systems

Goal-conditioned reinforcement learning is a powerful way to control an AI agent's behavior at runtime. However, popular goal representations, such as target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals as compositions of deterministic finite automata (cDFAs) and using cDFAs to guide RL agents.
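To make the goal representation concrete, here is a minimal sketch, assuming a conjunction of DFAs over hypothetical propositional events; the class names and the "key"/"door"/"lava" alphabet are illustrative, not the paper's implementation. The agent would be conditioned on the tuple of current DFA states and succeed by driving every component to acceptance.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    start: int
    accepting: frozenset
    delta: dict  # (state, symbol) -> state; missing entries self-loop

    def step(self, state, symbol):
        return self.delta.get((state, symbol), state)

class ConjunctionOfDFAs:
    """Conjunction (product) of DFAs: accept iff every component accepts."""
    def __init__(self, dfas):
        self.dfas = dfas
        self.states = tuple(d.start for d in dfas)

    def step(self, symbol):
        self.states = tuple(d.step(q, symbol)
                            for d, q in zip(self.dfas, self.states))

    def accepting(self):
        return all(q in d.accepting for d, q in zip(self.dfas, self.states))

# Example temporal goal: "reach the key, then the door" AND "never touch lava".
reach = DFA(start=0, accepting=frozenset({2}),
            delta={(0, "key"): 1, (1, "door"): 2})
avoid = DFA(start=0, accepting=frozenset({0}),
            delta={(0, "lava"): 1})  # state 1 is a rejecting sink
goal = ConjunctionOfDFAs([reach, avoid])
for event in ["key", "door"]:
    goal.step(event)
assert goal.accepting()
```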


Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data

Neural Information Processing Systems

Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. Many different algorithms exist for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must involve only the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size.
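A hedged sketch of the general shape of such a pipeline, not the paper's exact procedure: train each candidate (algorithm, hyperparameter) pair on one split of the historical data, score the resulting policies with an off-policy evaluation (OPE) estimate on a held-out split, and deploy the best-scoring configuration. The `train_fn` and `ope_estimate` hooks below are hypothetical.

```python
import numpy as np

def select_policy(dataset, candidates, train_fn, ope_estimate,
                  train_frac=0.8, seed=0):
    # Split the historical data; no environment interaction anywhere.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(dataset))
    cut = int(train_frac * len(dataset))
    train_set = [dataset[i] for i in idx[:cut]]
    valid_set = [dataset[i] for i in idx[cut:]]

    scored = []
    for algo, hparams in candidates:
        policy = train_fn(algo, hparams, train_set)  # offline training only
        score = ope_estimate(policy, valid_set)      # e.g. FQE or importance sampling
        scored.append((score, algo, hparams))

    best_score, algo, hparams = max(scored, key=lambda t: t[0])
    # Retrain the winning configuration on the full dataset before deployment.
    return train_fn(algo, hparams, dataset), (algo, hparams, best_score)
```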


Learning Versatile Skills with Curriculum Masking (Yao Tang, Zichuan Lin, Deheng Ye)

Neural Information Processing Systems

Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, which enable flexible inference across various downstream tasks with a unified model. Despite this versatility, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining to learn versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting and goal-conditioned planning tasks, as well as competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme. Code is available here.
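A minimal sketch of what a curriculum over masking schemes can look like; this is an illustration of the idea, not CurrMask's actual schedule. Early in pretraining, short scattered spans are masked so the model learns local, primitive behaviour; later, longer contiguous blocks are masked so it must infer temporally extended skills.

```python
import numpy as np

def masked_indices(seq_len, progress, rng):
    """progress in [0, 1]: fraction of pretraining completed."""
    mask_ratio = 0.15 + 0.35 * progress        # mask more as training proceeds
    block_len = max(1, int(1 + 9 * progress))  # grow span length from 1 to 10
    n_masked = int(mask_ratio * seq_len)
    masked = set()
    while len(masked) < n_masked:
        start = rng.integers(0, seq_len)
        masked.update(range(start, min(start + block_len, seq_len)))
    return sorted(masked)[:n_masked]

rng = np.random.default_rng(0)
print(masked_indices(100, progress=0.0, rng=rng))  # scattered single tokens
print(masked_indices(100, progress=1.0, rng=rng))  # long contiguous blocks
```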


Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning

Neural Information Processing Systems

Recent Reinforcement Learning (RL) algorithms that use Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a strong performance bound, the first to combine two desirable aspects: a linear dependency on the horizon (instead of quadratic) and an error propagation term involving an averaging effect of the estimation errors (instead of an accumulation effect). We also study the more general case of an additional entropy regularizer. The resulting abstract scheme encompasses many existing RL algorithms. Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
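To see where the averaging comes from, consider the KL-regularized greedy step in its simplest form; this is a textbook simplification consistent with the abstract's claim, not the paper's full scheme.

```latex
% KL-regularized greedy step with strength \lambda:
\pi_{k+1} = \arg\max_{\pi}\; \langle \pi, q_k \rangle - \lambda\,\mathrm{KL}(\pi \,\|\, \pi_k)
\quad\Longrightarrow\quad
\pi_{k+1} \propto \pi_k \exp(q_k / \lambda).
% Unrolling the recursion from a uniform \pi_0 gives
\pi_{k+1} \propto \exp\!\Big(\tfrac{1}{\lambda}\sum_{j=0}^{k} q_j\Big)
           = \exp\!\Big(\tfrac{k+1}{\lambda}\,\bar{q}_k\Big),
\qquad \bar{q}_k = \tfrac{1}{k+1}\sum_{j=0}^{k} q_j,
% i.e., the policy is softmax-greedy with respect to the *average* of past
% q-estimates, so estimation errors average out rather than accumulate.
```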


ODRL: A Benchmark for Off-Dynamics Reinforcement Learning

Neural Information Processing Systems

We consider off-dynamics reinforcement learning (RL), where one needs to transfer policies across domains with mismatched dynamics. Despite the focus on developing dynamics-aware algorithms, progress in this field has been hindered by the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings in which the source and target domains can each be either online or offline, and it provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform for comprehensively evaluating an agent's ability to adapt to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces extra baselines for the different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors.
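As a hedged illustration of what a source/target dynamics mismatch looks like in practice (the function below is hypothetical and is not ODRL's API): the two domains share observation and action spaces and reward, but differ in the physics, here by scaling body masses in a MuJoCo task.

```python
import gymnasium as gym

def make_domain_pair(env_id="HalfCheetah-v4", mass_scale=1.5):
    source = gym.make(env_id)
    target = gym.make(env_id)
    # Shift the target dynamics by scaling all body masses.
    target.unwrapped.model.body_mass[:] *= mass_scale
    return source, target

source_env, target_env = make_domain_pair()
# Train (online, or from an offline dataset collected in the source domain)
# on source_env, then evaluate how well the policy adapts to target_env.
```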


Supplementary Material: A Boolean Task Algebra For Reinforcement Learning

Neural Information Processing Systems

A Boolean algebra is a set $B$ equipped with the binary operators $\vee$ (disjunction) and $\wedge$ (conjunction), and the unary operator $\neg$ (negation), which satisfy the following Boolean algebra axioms for $a, b, c \in B$: (i) idempotence: $a \wedge a = a \vee a = a$. Let $\mathcal{M}$ be a set of tasks which adhere to Assumption 2. Then $\mathcal{M}$, equipped with $\vee$, $\wedge$, and $\neg$ (together with the appropriate top and bottom tasks), forms a Boolean algebra: we show that $\vee$, $\wedge$, and $\neg$ satisfy the Boolean properties (i)-(vii). But this is a contradiction, since the return obtained by following an optimal trajectory up to a terminal state, without the reward for entering the terminal state, must be strictly less than receiving that reward. This follows directly from Lemma 2, since all tasks in $\mathcal{M}$ share the same optimal policy $\pi^*$. Corollary 2, for a map $F: \mathcal{M} \to \mathcal{Q}$ from tasks to the corresponding value-function space, follows from Theorem 3. Below we list the pseudocode for the modified Q-learning algorithm used in the four-rooms domain. Note that the greedy action selection step is equivalent to generalised policy improvement (Barreto et al., 2017) over the set of extended value functions. The theoretical results presented in this work rely on Assumptions 1 and 2, which restrict the tasks' transition dynamics and reward functions in potentially problematic ways.
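As a hedged illustration of such a loop (not the authors' exact pseudocode): tabular Q-learning over extended value functions, one table per goal, where the greedy step maximizes over both goals and actions, i.e., GPI over the set of extended value functions. The `env` interface is hypothetical, and the per-goal extended reward from the paper is simplified here to the raw reward.

```python
import numpy as np
from collections import defaultdict

def modified_q_learning(env, goals, episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
    # Q[g][(s, a)]: extended value of action a in state s with respect to goal g.
    Q = {g: defaultdict(float) for g in goals}
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < eps:
                a = env.sample_action()
            else:
                # GPI-style greedy step: best action under the best goal.
                a = max(env.actions,
                        key=lambda a: max(Q[g][(s, a)] for g in goals))
            s2, r, done = env.step(a)
            for g in goals:
                target = r if done else r + gamma * max(Q[g][(s2, b)]
                                                        for b in env.actions)
                Q[g][(s, a)] += alpha * (target - Q[g][(s, a)])
            s = s2
    return Q
```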


A Boolean Task Algebra For Reinforcement Learning

Neural Information Processing Systems

The ability to compose learned skills to solve new tasks is an important property of lifelong-learning agents.


Safe and Efficient: A Primal-Dual Method for Offline Convex CMDPs under Partial Data Coverage

Neural Information Processing Systems

Offline safe reinforcement learning (RL) aims to find an optimal policy using a pre-collected dataset when data collection is impractical or risky. We propose a novel linear programming (LP) based primal-dual algorithm for convex MDPs that incorporates "uncertainty" parameters to improve data efficiency while requiring only a partial data coverage assumption. Our theoretical results achieve a sample complexity of $\mathcal{O}\!\left(1/\!\left((1-\gamma)\sqrt{n}\right)\right)$ under general function approximation, improving the current state of the art by a factor of $1/(1-\gamma)$, where $n$ is the number of data samples in the offline dataset and $\gamma$ is the discount factor. Numerical experiments validate our theoretical findings, demonstrating the practical efficacy of our approach in achieving improved safety and learning efficiency in safe offline settings.
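For context, the LP view of a constrained MDP that such primal-dual methods build on can be written over occupancy measures $\mu$; this is the standard formulation with a cost budget $\tau$, not necessarily the paper's exact program.

```latex
\max_{\mu \ge 0}\;\; \sum_{s,a} \mu(s,a)\, r(s,a)
\qquad \text{s.t.} \qquad
\sum_{a} \mu(s,a) \;=\; (1-\gamma)\,\rho_0(s)
  \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\,\mu(s',a') \quad \forall s,
\qquad
\sum_{s,a} \mu(s,a)\, c(s,a) \;\le\; \tau .
```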


Bayesian Bellman Operators

Neural Information Processing Systems

We introduce a novel perspective on Bayesian reinforcement learning (RL): whereas existing approaches infer a posterior over the transition distribution or the Q-function, we characterise the uncertainty in the Bellman operator itself. Our Bayesian Bellman operator (BBO) framework is motivated by the insight that, once bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions. In this paper, we use BBO to provide a rigorous theoretical analysis of model-free Bayesian RL and to better understand its relationship to established frequentist RL methodologies. We prove that Bayesian solutions are consistent with frequentist RL solutions, even when approximate inference is used, and derive conditions under which convergence properties hold. Empirically, we demonstrate that algorithms derived from the BBO framework have sophisticated deep exploration properties that enable them to solve continuous control tasks at which state-of-the-art regularised actor-critic algorithms fail catastrophically.
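Schematically, the shift the abstract describes can be summarized as follows; this is a notational sketch, not the paper's formal definitions. Instead of a posterior over value functions, one maintains a posterior over operators and propagates it to the values they induce.

```latex
% Classical model-free Bayesian RL infers   p(Q \mid \mathcal{D}).
% BBO (schematic): a posterior over Bellman operators B, each with its
% own fixed-point value function Q_B, aggregated under the posterior:
p(B \mid \mathcal{D}), \qquad Q_B = B\,Q_B, \qquad
\bar{Q}(s,a) = \mathbb{E}_{B \sim p(B \mid \mathcal{D})}\!\left[ Q_B(s,a) \right].
```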


Policy Improvement via Imitation of Multiple Oracles

Neural Information Processing Systems

Despite its promise, reinforcement learning's real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a given state. The existing IL literature provides only a limited treatment of such scenarios. Whereas in the single-oracle case the return of the oracle's policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies' values as a natural baseline for resolving conflicting advice from multiple oracles. Using a reduction of policy optimization to online learning, we introduce a novel IL algorithm, MAMBA, which can provably learn a policy competitive with this benchmark.
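The baseline named in the abstract can be written out directly: with $K$ oracle policies $\pi_1, \dots, \pi_K$ and their value functions $V^{\pi_k}$, the learner competes state-wise with the best oracle, the single-oracle benchmark being the special case $K = 1$.

```latex
f^{\max}(s) \;=\; \max_{k \in \{1, \dots, K\}} V^{\pi_k}(s).
```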