Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning
Schlaginhaufen, Andreas, Kamgarpour, Maryam
Inverse reinforcement learning (IRL) aims to infer a reward from expert demonstrations, motivated by the idea that the reward, rather than the policy, is the most succinct and transferable description of a task [Ng et al., 2000]. However, the reward corresponding to an optimal policy is not unique, making it unclear whether an IRL-learned reward is transferable to new transition laws, in the sense that its optimal policy aligns with the optimal policy corresponding to the expert's true reward. Past work has addressed this problem only under the assumption of full access to the expert's policy, guaranteeing transferability when learning from two experts with the same reward but different transition laws that satisfy a specific rank condition [Rolland et al., 2022]. In this work, we show that the conditions developed under full access to the expert's policy cannot guarantee transferability in the more practical scenario where we have access only to demonstrations of the expert. Instead of a binary rank condition, we propose principal angles as a more refined measure of the similarity and dissimilarity between transition laws. Based on this, we establish two key results: 1) a sufficient condition for transferability to any transition law when learning from at least two experts with sufficiently different transition laws, and 2) a sufficient condition for transferability to local changes in the transition law when learning from a single expert. Furthermore, we provide a probably approximately correct (PAC) algorithm and an end-to-end analysis for learning transferable rewards from demonstrations of multiple experts.
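To make the principal-angle measure concrete: for two transition laws over the same state-action space, one can stack the distributions P(.|s,a) into matrices and compare the subspaces they span. Unlike a rank condition, which is binary, the principal angles grade how close or far apart the spans are. The snippet below is a purely illustrative sketch (the paper's precise construction may differ); it uses scipy.linalg.subspace_angles, which computes principal angles between column spaces via the SVD.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
S, A = 6, 3                                    # small tabular MDP
P1 = rng.dirichlet(np.ones(S), size=(S, A))    # transition law 1, shape (S, A, S)
P2 = rng.dirichlet(np.ones(S), size=(S, A))    # transition law 2, shape (S, A, S)

# Flatten each law into an (S*A) x S matrix whose row (s, a) is P(.|s,a).
M1 = P1.reshape(S * A, S)
M2 = P2.reshape(S * A, S)

# Principal angles between the column spans, largest first (radians).
angles = subspace_angles(M1, M2)
print(np.degrees(angles))
```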
Identifiability and Generalizability in Constrained Inverse Reinforcement Learning
Schlaginhaufen, Andreas, Kamgarpour, Maryam
Two main challenges in Reinforcement Learning (RL) are designing appropriate reward functions and ensuring the safety of the learned policy. To address these challenges, we present a theoretical framework for Inverse Reinforcement Learning (IRL) in constrained Markov decision processes. From a convex-analytic perspective, we extend prior results on reward identifiability and generalizability to both the constrained setting and a more general class of regularizations. In particular, we show that identifiability up to potential shaping (Cao et al., 2021) is a consequence of entropy regularization and may generally no longer hold for other regularizations or in the presence of safety constraints. We also show that to ensure generalizability to new transition laws and constraints, the true reward must be identified up to a constant. Additionally, we derive a finite-sample guarantee for the suboptimality of the learned rewards, and validate our results in a gridworld environment.
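To unpack the potential shaping referenced above (Cao et al., 2021): adding gamma * E[phi(s')] - phi(s) to a reward, for any state potential phi, leaves the optimal policies unchanged, so demonstrations alone cannot distinguish the original reward from the shaped one. A minimal tabular sketch (shapes and names are ours, not the paper's):

```python
import numpy as np

def potential_shaping(r, phi, P, gamma):
    """Shaped reward r'(s,a) = r(s,a) + gamma * E_{s'~P(.|s,a)}[phi(s')] - phi(s).

    r: reward table, shape (S, A); phi: state potential, shape (S,);
    P: transition law, shape (S, A, S); gamma: discount factor in [0, 1).
    """
    return r + gamma * (P @ phi) - phi[:, None]
```

Note that the shaping term depends on the transition law P through the expectation, which is one way to see why a reward identified only up to shaping need not transfer to a new transition law, whereas identification up to a constant does.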
Accelerating the Computation of UCB and Related Indices for Reinforcement Learning
Cowan, Wesley, Katehakis, Michael N., Pirutinsky, Daniel
In this paper we derive an efficient method for computing the indices associated with an asymptotically optimal upper confidence bound algorithm (MDP-UCB) of Burnetas and Katehakis (1997) that requires solving only a system of two non-linear equations with two unknowns, irrespective of the cardinality of the state space of the Markovian decision process (MDP). In addition, we develop a similar acceleration for computing the indices for the MDP-Deterministic Minimum Empirical Divergence (MDP-DMED) algorithm developed in Cowan et al. (2019), based on ideas from Honda and Takemura (2011), that involves solving a single equation of one variable. We provide experimental results demonstrating the computational time savings and regret performance of these algorithms. In these comparisons we also consider the Optimistic Linear Programming (OLP) algorithm (Tewari and Bartlett, 2008) and a method based on posterior sampling (MDP-PS).
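The MDP-UCB equations themselves couple estimates across states, but the flavor of reducing an index computation to a low-dimensional root-finding problem is easiest to see in the simpler bandit analogue: the KL-UCB index (Garivier and Cappe, 2011) is the largest mean consistent with the empirical estimate at a given confidence level, obtained by bisection on a single one-dimensional equation. This is a swapped-in illustration, not the paper's method; function names are ours.

```python
import math

def bern_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n, t, tol=1e-9):
    # Largest q >= p_hat with n * KL(p_hat, q) <= log(t), found by bisection.
    target = math.log(max(t, 2)) / n
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bern_kl(p_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(p_hat=0.4, n=25, t=100))  # strictly above the empirical mean 0.4
```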
Deep learning based unsupervised concept unification in the embedding space
Nenadović, Luka, Prelovac, Vladimir
Humans are able to conceive of physical reality by jointly learning its different facets. To every pair of notions related to a perceived reality there may correspond a mutual relation, which is a notion in its own right, but one level higher. Thus, we may have a description of perceived reality on at least two levels, and the translation map between them is in general, due to their different content corpora, one-to-many. Following the success of unsupervised neural machine translation models, which are essentially one-to-one mappings trained separately on monolingual corpora, we examine further capabilities of the unsupervised deep learning methods used there and apply them to sets of notions of different levels and measures. Using graph- and word-embedding-like techniques, we build a one-to-many map without parallel data in order to establish a unified latent mental representation of the outer world, combining notions of different kinds into a unique conceptual framework. Owing to latent similarity, by aligning the two embedding spaces in a purely unsupervised way, one obtains a geometric relation between objects of cognition on the two levels, making it possible to express natural knowledge using one description in the context of the other.
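A standard building block for the kind of embedding-space alignment described above is orthogonal Procrustes: given anchor pairs between two spaces (which fully unsupervised pipelines bootstrap without parallel data, e.g. via adversarial initialization followed by refinement), the best orthogonal map between the spaces has a closed form via the SVD. A minimal sketch under these assumptions, not the paper's exact pipeline:

```python
import numpy as np

def procrustes_align(X, Y):
    # X, Y: (n, d) matrices of paired embeddings from the two spaces.
    # Returns the orthogonal W minimizing ||X @ W - Y||_F (closed form via SVD).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(1)
W_true, _ = np.linalg.qr(rng.standard_normal((50, 50)))  # hidden rotation
X = rng.standard_normal((200, 50))                       # embeddings in space 1
Y = X @ W_true                                           # same concepts in space 2
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))                  # True: rotation recovered
```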