
Collaborating Authors: Jonsson, Anders


Provably Efficient Exploration in Reward Machines with Low Regret

arXiv.org Artificial Intelligence

We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge of the task in the form of reward machines is available to the learner. We consider probabilistic reward machines with initially unknown dynamics, and investigate RL under the average-reward criterion, where the learning performance is assessed through the notion of regret. Our main algorithmic contribution is a model-based RL algorithm for decision processes involving probabilistic reward machines that is capable of exploiting the structure induced by such machines. We further derive high-probability and non-asymptotic bounds on its regret, and demonstrate its gain, in terms of regret, over existing algorithms that could be applied but are oblivious to this structure. We also present a regret lower bound for the studied setting. To the best of our knowledge, the proposed algorithm constitutes the first attempt to tailor and analyze regret specifically for RL with probabilistic reward machines.
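As a rough illustration of the object being learned, a probabilistic reward machine can be sketched as a finite-state machine whose transitions, triggered by high-level events, stochastically pick a next machine state and a reward. The following minimal Python sketch uses hypothetical names and a toy machine; it is not the paper's implementation.

```python
import random

class ProbabilisticRewardMachine:
    """Toy probabilistic reward machine: machine states, high-level events,
    and stochastic (next state, reward) outcomes per (state, event) pair."""

    def __init__(self, initial_state, transitions):
        # transitions: (state, event) -> list of (prob, next_state, reward)
        self.transitions = transitions
        self.state = initial_state

    def step(self, event):
        """Sample a transition for the observed event and return the reward."""
        outcomes = self.transitions[(self.state, event)]
        r, acc = random.random(), 0.0
        for prob, next_state, reward in outcomes:
            acc += prob
            if r <= acc:
                self.state = next_state
                return reward
        self.state = outcomes[-1][1]   # guard against floating-point slack
        return outcomes[-1][2]

# Two-state machine: from state 0, event "goal" completes the task w.p. 0.9.
prm = ProbabilisticRewardMachine(0, {
    (0, "goal"): [(0.9, 1, 1.0), (0.1, 0, 0.0)],
    (0, "none"): [(1.0, 0, 0.0)],
    (1, "goal"): [(1.0, 1, 0.0)],
    (1, "none"): [(1.0, 1, 0.0)],
})
print(prm.step("goal"))
```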


Hierarchical Average-Reward Linearly-solvable Markov Decision Processes

arXiv.org Artificial Intelligence

We introduce a novel approach to hierarchical reinforcement learning for Linearly-solvable Markov Decision Processes (LMDPs) in the infinite-horizon average-reward setting. Unlike previous work, our approach allows learning low-level and high-level tasks simultaneously, without imposing limiting restrictions on the low-level tasks. Our method relies on partitions of the state space that create smaller subtasks that are easier to solve, and on the equivalence between such partitions to learn more efficiently. We then exploit the compositionality of low-level tasks to exactly represent the value function of the high-level task. Experiments show that our approach can outperform flat average-reward reinforcement learning by one or several orders of magnitude.
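For intuition on the average-reward LMDP setting underlying this work, the optimal desirability function z = exp(-v) solves a principal-eigenvalue problem and can be found by power iteration. A minimal sketch on a toy three-state chain, assuming the standard LMDP formulation with passive dynamics P and state costs q (illustrative numbers, not from the paper):

```python
import numpy as np

# Passive dynamics P (rows sum to 1) and state costs q of a toy 3-state LMDP.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])
q = np.array([0.0, 1.0, 2.0])

# Average-cost evaluation equation: lambda * z = diag(exp(-q)) @ P @ z.
# The principal eigenvalue gives the optimal average cost via -log(lambda).
G = np.diag(np.exp(-q)) @ P
z = np.ones(3)
for _ in range(1000):            # power iteration toward the Perron eigenvector
    z = G @ z
    z /= np.linalg.norm(z)
lam = (z @ G @ z) / (z @ z)      # Rayleigh quotient at the fixed point
print("average cost:", -np.log(lam))
print("desirability z:", z)
```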


Bisimulation Metrics are Optimal Transport Distances, and Can be Computed Efficiently

arXiv.org Machine Learning

We propose a new framework for formulating optimal transport distances between Markov chains. Previously known formulations studied couplings between the entire joint distribution induced by the chains, and derived solutions via a reduction to dynamic programming (DP) in an appropriately defined Markov decision process. This formulation has, however, not led to particularly efficient algorithms so far, since computing the associated DP operators requires fully solving a static optimal transport problem, and these operators need to be applied numerous times during the overall optimization process. In this work, we develop an alternative perspective by considering couplings between a flattened version of the joint distributions that we call discounted occupancy couplings, and show that calculating optimal transport distances in the full space of joint distributions can be equivalently formulated as solving a linear program (LP) in this reduced space. This LP formulation allows us to port several algorithmic ideas from other areas of optimal transport theory. In particular, our formulation makes it possible to introduce an appropriate notion of entropy regularization into the optimization problem, which in turn enables us to directly calculate optimal transport distances via a Sinkhorn-like method we call Sinkhorn Value Iteration (SVI). We show both theoretically and empirically that this method converges quickly to an optimal coupling, at essentially the same computational cost as running vanilla Sinkhorn in each pair of states. Along the way, we point out that our optimal transport distance exactly matches the common notion of bisimulation metrics between Markov chains, and thus our results also apply to computing such metrics; in fact, our algorithm turns out to be significantly more efficient than the best known methods developed so far for this purpose.
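For reference, the static building block that SVI resembles is the standard Sinkhorn iteration for entropy-regularized optimal transport between two discrete distributions. The sketch below shows only that step on a fixed cost matrix; SVI itself additionally updates the cost matrix over pairs of states in a DP fashion, which is omitted here.

```python
import numpy as np

def sinkhorn(cost, mu, nu, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport between discrete distributions
    mu and nu under the given cost matrix (vanilla Sinkhorn iterations)."""
    K = np.exp(-cost / reg)           # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)            # scale columns to match marginal nu
        u = mu / (K @ v)              # scale rows to match marginal mu
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost)), plan

# Toy example: transport between two 3-point distributions.
cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
dist, _ = sinkhorn(cost, mu, nu)
print(dist)
```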


Planning with a Learned Policy Basis to Optimally Solve Complex Tasks

arXiv.org Artificial Intelligence

Autonomous agents that interact with an environment usually face tasks that comprise complex, entangled behaviors over long horizons. Conventional reinforcement learning (RL) methods have successfully addressed this. However, in cases when the agent is meant to perform several tasks across similar environments, training a policy for every task separately can be time-consuming and requires a lot of data. In such cases, the agent can utilize a method that has built-in generalization capabilities. One such method relies on the assumption that reward functions of these tasks can be decomposed into a linear combination of successor features (Barreto et al.). To alleviate this issue, one can consider methods that condition the policy or the value function on the specification of the whole task (Schaul et al. 2015), and such approaches were recently also proposed for tasks with non-Markovian reward functions (Vaezipoor et al. 2021). However, the methods that specify the whole task usually rely on a blackbox neural network for planning when determining which sub-goal to reach next. This makes it hard to interpret the plan to solve the task, and although they show promising results in practice, it is unclear whether and when these approaches will generalize to a new task.
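The successor-feature assumption mentioned above says that each task's reward decomposes as r = w · φ; stored policies can then be re-evaluated on a new task by reweighting their successor features and combined via generalized policy improvement. A minimal sketch with made-up arrays (the numbers are hypothetical, not from the paper):

```python
import numpy as np

# Hypothetical successor features psi[pi, s, a, :] = discounted feature sums
# of stored policy pi. Under r = w . phi, its action values on a *new* task
# with reward weights w are just psi . w, with no further learning.
rng = np.random.default_rng(0)
psi = rng.random((2, 3, 2, 2))      # (policy, state, action, feature)
w_new = np.array([1.0, -0.5])       # reward weights of the new task

q_new = psi @ w_new                 # (policy, state, action) action values
# Generalized policy improvement: per state, act greedily with respect to
# the maximum over stored policies.
best_action = q_new.max(axis=0).argmax(axis=1)
print(best_action)
```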


Asymmetric Norms to Approximate the Minimum Action Distance

arXiv.org Artificial Intelligence

This paper presents a state representation for reward-free Markov decision processes. The idea is to learn, in a self-supervised manner, an embedding space where distances between pairs of embedded states correspond to the minimum number of actions needed to transition between them. Unlike previous methods, our approach incorporates an asymmetric norm parametrization, enabling accurate approximations of minimum action distances in environments with inherent asymmetry. We show how this representation can be leveraged to learn goal-conditioned policies, providing a notion of similarity between states and goals and a useful heuristic distance to guide planning. To validate our approach, we conduct empirical experiments on both symmetric and asymmetric environments. Our results show that our asymmetric norm parametrization performs comparably to symmetric norms in symmetric environments and surpasses symmetric norms in asymmetric environments.
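As a toy illustration of the core idea, an asymmetric norm weights positive and negative displacement components differently, so the distance from x to y need not equal the distance from y to x. The parametrization below is a simple hand-picked weighted norm; the paper learns its parametrization end-to-end, which is not reproduced here.

```python
import numpy as np

def asymmetric_norm(v, w_pos, w_neg):
    """An asymmetric norm: weights positive and negative components
    differently, so ||v|| != ||-v|| in general. With nonnegative weights
    it is positively homogeneous and satisfies the triangle inequality."""
    return float(np.sum(w_pos * np.maximum(v, 0.0) + w_neg * np.maximum(-v, 0.0)))

# Hypothetical 2-d embedding: moving "right" is cheap, moving back "left"
# is expensive, mimicking an environment with one-way transitions.
w_pos = np.array([1.0, 1.0])   # cost per unit of positive displacement
w_neg = np.array([5.0, 1.0])   # cost per unit of negative displacement

x, y = np.array([0.0, 0.0]), np.array([3.0, 0.0])
print(asymmetric_norm(y - x, w_pos, w_neg))  # 3.0: going right
print(asymmetric_norm(x - y, w_pos, w_neg))  # 15.0: going back left
```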


Hierarchies of Reward Machines

arXiv.org Artificial Intelligence

Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine whose edges encode subgoals of the task using high-level events. The structure of RMs enables the decomposition of a task into simpler and independently solvable subtasks that help tackle long-horizon and/or sparse reward tasks. We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs, thus composing a hierarchy of RMs. Hierarchical reinforcement learning (HRL; Barto & Mahadevan, 2003) frameworks, such as options (Sutton et al., 1999), have been used to exploit RMs by learning policies at two levels of abstraction: (i) select a formula (i.e., subgoal) from a given RM state, and (ii) select an action to (eventually) satisfy the chosen formula (Toro Icarte et al., 2018; Furelos-Blanco et al., 2021). The subtask decomposition powered by HRL enables learning at multiple scales simultaneously, and eases the handling of long-horizon and sparse reward tasks. In addition, several works have considered the problem of learning the RMs themselves from interaction (e.g., Toro Icarte et al., 2019; Xu et al., 2020; Furelos-Blanco et al., 2021; Hasanbeig et al., 2021).
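A hierarchy of RMs can be pictured as machines whose edges are labelled either by a primitive high-level event or by a call to another machine. The following minimal sketch uses illustrative names and omits much of the formalism (e.g., formulas on edges and reward outputs):

```python
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    """Toy RM whose edge labels are either event strings or *calls* to
    other reward machines, which must reach their accepting state before
    the calling edge is taken."""
    name: str
    # edges[state] = list of (label, next_state); label is an event string
    # or another RewardMachine.
    edges: dict = field(default_factory=dict)
    initial: int = 0
    accepting: int = 1

def call_depth(rm: RewardMachine, depth: int = 0) -> int:
    """Maximum nesting depth of RM calls reachable from `rm`."""
    deepest = depth
    for transitions in rm.edges.values():
        for label, _ in transitions:
            if isinstance(label, RewardMachine):
                deepest = max(deepest, call_depth(label, depth + 1))
    return deepest

get_key = RewardMachine("get_key", {0: [("key", 1)]})
open_door = RewardMachine("open_door",
                          {0: [(get_key, 1)], 1: [("door", 2)]},
                          accepting=2)
print(call_depth(open_door))  # 1: one level of nesting
```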


Generalized Planning as Heuristic Search: A new planning search-space that leverages pointers over objects

arXiv.org Artificial Intelligence

Planning as heuristic search is one of the most successful approaches to classical planning but unfortunately, it does not extend trivially to Generalized Planning (GP). GP aims to compute algorithmic solutions that are valid for a set of classical planning instances from a given domain, even if these instances differ in the number of objects, the number of state variables, their domain size, or their initial and goal configuration. The generalization requirements of GP make it impractical to perform the state-space search that is usually implemented by heuristic planners. This paper adapts the planning as heuristic search paradigm to the generalization requirements of GP, and presents the first native heuristic search approach to GP. First, the paper introduces a new pointer-based solution space for GP that is independent of the number of classical planning instances in a GP problem and the size of those instances (i.e., the number of objects, state variables and their domain sizes). Second, the paper defines a set of evaluation and heuristic functions for guiding a combinatorial search in our new GP solution space. The computation of these evaluation and heuristic functions does not require grounding states or actions in advance. Therefore, our GP as heuristic search approach can handle large sets of state variables with large numerical domains, e.g., integers. Lastly, the paper defines an upgraded version of our novel algorithm for GP called Best-First Generalized Planning (BFGP), that implements a best-first search in our pointer-based solution space, and that is guided by our evaluation/heuristic functions for GP.
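To make the search scheme concrete, the following skeleton runs a best-first search over candidate programs ordered by an evaluation function, in the spirit of BFGP. The instruction set, expansion rule, and scoring are placeholders, far simpler than the paper's pointer-based instructions and evaluation/heuristic functions:

```python
import heapq

INSTRUCTIONS = ["inc(ptr)", "dec(ptr)", "test(ptr)", "goto(0)"]

def expand(program, max_len=4):
    """Generate child programs by appending one instruction."""
    if len(program) >= max_len:
        return []
    return [program + (ins,) for ins in INSTRUCTIONS]

def evaluate(program, instances):
    """Placeholder score: (unsolved instance count, program length),
    lower is better. A toy 'instance' is solved once the program is at
    least as long as the instance's difficulty number."""
    solved = sum(1 for inst in instances if len(program) >= inst)
    return (len(instances) - solved, len(program))

def best_first_generalized_planning(instances):
    frontier = [(evaluate((), instances), ())]
    while frontier:
        score, program = heapq.heappop(frontier)
        if score[0] == 0:            # candidate solves every instance
            return program
        for child in expand(program):
            heapq.heappush(frontier, (evaluate(child, instances), child))
    return None

print(best_first_generalized_planning([1, 2, 3]))
```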


Globally Optimal Hierarchical Reinforcement Learning for Linearly-Solvable Markov Decision Processes

arXiv.org Artificial Intelligence

In this work we present a novel approach to hierarchical reinforcement learning for linearly-solvable Markov decision processes. Our approach assumes that the state space is partitioned, and the subtasks consist in moving between the partitions. We represent value functions on several levels of abstraction, and use the compositionality of subtasks to estimate the optimal values of the states in each partition. The policy is implicitly defined on these optimal value estimates, rather than being decomposed among the subtasks. As a consequence, our approach can learn the globally optimal policy, and does not suffer from the non-stationarity of high-level decisions. If several partitions have equivalent dynamics, the subtasks of those partitions can be shared. If the set of boundary states is smaller than the entire state space, our approach can have significantly smaller sample complexity than that of a flat learner, and we validate this empirically in several experiments.
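The compositionality argument can be made concrete in a few lines: because the LMDP desirability function z satisfies a linear equation, a subtask whose boundary values are a weighted combination of base subtasks' boundary values has, for the same weights, the combined interior solution. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Base subtask solutions z on one partition (rows: states), assumed given,
# e.g. "exit through door A" and "exit through door B".
z_base = np.array([
    [1.0, 0.2, 0.1],
    [0.1, 0.3, 1.0],
]).T                                   # shape (n_states, n_subtasks)

boundary = np.array([0, 2])            # indices of boundary states
z_new_boundary = np.array([0.5, 0.8])  # boundary values of a new subtask

# Solve for combination weights from the boundary values alone ...
w = np.linalg.solve(z_base[boundary], z_new_boundary)
# ... and obtain the new subtask's z on the whole partition by linearity.
z_new = z_base @ w
print(z_new)
```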


Hierarchical Representation Learning for Markov Decision Processes

arXiv.org Artificial Intelligence

In this paper we present a novel method for learning hierarchical representations of Markov decision processes. Our method works by partitioning the state space into subsets, and defines subtasks for performing transitions between the partitions. We formulate the problem of partitioning the state space as an optimization problem that can be solved using gradient descent given a set of sampled trajectories, making our method suitable for high-dimensional problems with large state spaces. We empirically validate the method by showing that it can successfully learn a useful hierarchical representation in a navigation domain. Once learned, the hierarchical representation can be used to solve different tasks in the given domain, thus generalizing knowledge across tasks.
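A toy version of such a partition-learning objective: give each state a soft partition membership and penalize sampled transitions that cross partitions, plus a balance term that rules out the trivial single-partition solution. This loss and its numerical-gradient optimizer are a simplified stand-in for the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_parts = 6, 2
# Toy trajectories: two clusters {0,1,2} and {3,4,5} with one crossing.
transitions = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def loss(theta):
    a = softmax(theta)                    # soft memberships, (states, parts)
    cut = sum(1.0 - a[s] @ a[t] for s, t in transitions)  # crossing mass
    balance = np.sum((a.mean(axis=0) - 1.0 / n_parts) ** 2)
    return cut + 10.0 * balance           # balance rules out one big partition

theta = rng.normal(size=(n_states, n_parts))
eps = 1e-5
for _ in range(300):                      # gradient descent, numerical gradient
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (loss(theta + d) - loss(theta - d)) / (2 * eps)
    theta -= 0.5 * grad

print(softmax(theta).argmax(axis=1))      # learned partition labels
```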


Generalized Planning as Heuristic Search

arXiv.org Artificial Intelligence

Although heuristic search is one of the most successful approaches to classical planning, this planning paradigm does not apply straightforwardly to Generalized Planning (GP). Planning as heuristic search traditionally addresses the computation of sequential plans by searching in a grounded state space. GP, on the other hand, aims at computing algorithm-like plans that can branch and loop, and that generalize to a (possibly infinite) set of classical planning instances. This paper adapts the planning as heuristic search paradigm to the particularities of GP, and presents the first native heuristic search approach to GP. First, the paper defines a novel GP solution space that is independent of the number of planning instances in a GP problem, and the size of these instances. Second, the paper defines different evaluation and heuristic functions for guiding a combinatorial search in our GP solution space. Lastly, the paper defines a GP algorithm, called Best-First Generalized Planning (BFGP), that implements a best-first search in the solution space guided by our evaluation/heuristic functions.
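To see why algorithm-like plans can generalize across instance sizes, consider a tiny interpreter for plans with branching and looping over pointers; a single three-instruction plan below sums input lists of any length. The instruction set is illustrative only, not the paper's:

```python
def run(program, memory):
    """Execute a numbered plan over pointers against a concrete instance."""
    ptr, acc, pc = 0, 0, 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "add":                  # acc += memory[ptr]
            acc += memory[ptr]
        elif op == "inc":                # advance the pointer
            ptr += 1
        elif op == "jump_if_in_range":   # loop while the pointer is valid
            if ptr < len(memory):
                pc = arg
                continue
        pc += 1
    return acc

# One plan sums lists of *any* length: add, advance, loop back.
sum_plan = [("add", None), ("inc", None), ("jump_if_in_range", 0)]
print(run(sum_plan, [1, 2, 3]))          # 6
print(run(sum_plan, list(range(10))))    # 45
```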