spirl
NBDI: A Simple and Efficient Termination Condition for Skill Extraction from Task-Agnostic Demonstrations
Kim, Myunsoo, Lee, Hayeong, Shim, Seong-Woong, Seo, JunHo, Lee, Byung-Jun
Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning. In this work, we propose to learn a simple and effective termination condition that identifies decision points through a state-action novelty module that leverages agent experience data. Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning.
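To make the idea of a novelty-based termination condition concrete, the following is a minimal sketch, not the authors' implementation. It assumes discrete, hashable states and actions, uses a simple count-based score as a stand-in for the state-action novelty module, and ends a skill just before a step whose novelty exceeds a threshold. The names `make_count_novelty`, `segment_skills`, `threshold`, and `max_len` are hypothetical and do not come from the abstract.

```python
from collections import Counter


def make_count_novelty(demos):
    """Build a count-based novelty score from demonstration data.

    Assumption: this is a stand-in for a learned novelty module.
    Rarely seen (state, action) pairs score close to 1,
    frequently seen pairs score close to 0.
    """
    counts = Counter(pair for traj in demos for pair in traj)

    def novelty(state, action):
        return 1.0 / (1.0 + counts[(state, action)])

    return novelty


def segment_skills(trajectory, novelty, threshold=0.4, max_len=10):
    """Split a trajectory into variable-length skills.

    A skill ends right before a step whose novelty exceeds `threshold`
    (treated here as a candidate decision point), or once the skill
    has reached `max_len` steps.
    """
    skills, current = [], []
    for state, action in trajectory:
        is_decision_point = novelty(state, action) > threshold
        if current and (is_decision_point or len(current) >= max_len):
            skills.append(current)
            current = []
        current.append((state, action))
    if current:
        skills.append(current)
    return skills


# Toy demonstrations: moving right is common, the choice at state 3 varies.
demos = [
    [(0, "r"), (1, "r"), (2, "r"), (3, "u")],
    [(0, "r"), (1, "r"), (2, "r"), (3, "d")],
    [(0, "r"), (1, "r"), (2, "r"), (3, "u")],
]
novelty = make_count_novelty(demos)
skills = segment_skills([(0, "r"), (1, "r"), (2, "r"), (3, "d"), (0, "r")], novelty)
```

In this toy setting the rare pair `(3, "d")` starts a new skill, so the trajectory splits into segments of length 3 and 2 instead of one fixed-length chunk. With a fixed-length scheme the same decision point could fall in the middle of a skill, which is the failure mode the abstract describes.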
Unsupervised Salient Patch Selection for Data-Efficient Reinforcement Learning
To improve the sample efficiency of vision-based deep reinforcement learning (RL), we propose a novel method, called SPIRL, to automatically extract important patches from input images. Following Masked Auto-Encoders, SPIRL is based on Vision Transformer models pre-trained in a self-supervised fashion to reconstruct images from randomly-sampled patches. These pre-trained models can then be exploited to detect and select salient patches, defined as hard to reconstruct from neighboring patches. In RL, the SPIRL agent processes selected salient patches via an attention module. We empirically validate SPIRL on Atari games to test its data-efficiency against relevant state-of-the-art methods, including some traditional model-based methods and keypoint-based models. In addition, we analyze our model's interpretability capabilities.
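The selection criterion "hard to reconstruct from neighboring patches" can be illustrated with a small sketch. This is an assumption-heavy toy, not the SPIRL method: the pre-trained Vision Transformer is replaced by a trivial predictor that reconstructs each patch as the mean intensity of its adjacent patches, and the image is a plain 2D list of grayscale values. The function names `patch_means` and `salient_patches` and the parameters `patch` and `k` are hypothetical.

```python
def patch_means(image, patch):
    """Return a grid holding the mean intensity of each non-overlapping patch."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            vals = [image[r * patch + i][c * patch + j]
                    for i in range(patch) for j in range(patch)]
            row.append(sum(vals) / len(vals))
        grid.append(row)
    return grid


def salient_patches(image, patch=2, k=1):
    """Select the k patches with the highest reconstruction error.

    Assumption: the "reconstruction" is the mean of the adjacent patches,
    standing in for a pre-trained masked auto-encoder.
    """
    grid = patch_means(image, patch)
    rows, cols = len(grid), len(grid[0])
    scored = []
    for r in range(rows):
        for c in range(cols):
            neigh = [grid[rr][cc]
                     for rr in range(max(0, r - 1), min(rows, r + 2))
                     for cc in range(max(0, c - 1), min(cols, c + 2))
                     if (rr, cc) != (r, c)]
            # With a single-patch image there are no neighbors to predict from.
            predicted = sum(neigh) / len(neigh) if neigh else grid[r][c]
            # Mean squared error between the patch pixels and the prediction.
            err = sum((image[r * patch + i][c * patch + j] - predicted) ** 2
                      for i in range(patch) for j in range(patch)) / (patch * patch)
            scored.append((err, (r, c)))
    scored.sort(key=lambda item: -item[0])  # highest error first
    return [pos for _, pos in scored[:k]]


# 6x6 blank image with one bright 2x2 object in the center patch.
image = [[0.0] * 6 for _ in range(6)]
for i in (2, 3):
    for j in (2, 3):
        image[i][j] = 1.0
selected = salient_patches(image, patch=2, k=1)
```

Here the bright center patch cannot be predicted from its uniformly dark neighbors, so it receives the largest error and is the one selected. In an RL pipeline of the kind the abstract describes, only such selected patches would be passed on to the attention module.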
Skill-Critic: Refining Learned Skills for Reinforcement Learning
Hao, Ce, Weaver, Catherine, Tang, Chen, Kawamoto, Kenta, Tomizuka, Masayoshi, Zhan, Wei
Incorporating prior experience by learning from demonstration can facilitate efficient exploration in complex environments [9]. For example, statistical methods can infer the hidden structure of offline data and inform the decision-making process [6, 7]. However, offline data alone may not suffice for determining an optimal policy, particularly when the data originates from simpler environments or pertains to intricate or stochastic tasks. In such cases, online policy optimization is imperative to refine suboptimal policies. In this work, we present a hierarchical RL framework that can leverage offline data to accelerate RL training without limiting its performance by the quality of offline data. Our framework employs skills, temporally extended sequences of primitive actions [10]. Previous works extract skills from unstructured data and transfer them to downstream RL tasks with a skill selection policy whose action space is the skill itself [11].
Figure 1: Our Skill-Critic approach leverages low-coverage demonstrations to facilitate hierarchical reinforcement learning by (1) acquiring a basic skill-set from demonstrations that (2) guides learning online skill selection and skill improvement.
Interaction-limited Inverse Reinforcement Learning
Troussard, Martin, Pignat, Emmanuel, Kamalaruban, Parameswaran, Calinon, Sylvain, Cevher, Volkan
Learning from Demonstrations (LfD) is an active research area that addresses the problem of learning how to perform a task by observing the demonstrations provided by an expert. This approach plays an important role in many real-life learning settings, including human-to-robot interaction [1, 2, 3, 4, 5]. The two popular approaches for LfD include (i) behavioral cloning, which directly mimics the expert behavior, without understanding the objective [6], and (ii) inverse reinforcement learning (IRL), which infers the reward function (i.e., the objective of the task) explaining the expert behavior [7]. In this work, we focus on the IRL approach to LfD. Typically, the IRL learner assumes that the demonstrated expert behavior is optimal with respect to some reward function, even if the reward function cannot be specified explicitly as in typical reinforcement learning (RL).