
You Only Live Once: Single-Life Reinforcement Learning (Annie S. Chen, Chelsea Finn)

Neural Information Processing Systems

Reinforcement learning algorithms are typically designed to learn a performant policy that can repeatedly and autonomously complete a task, usually starting from scratch. However, in many real-world situations, the goal may not be to learn a policy that can perform the task repeatedly, but simply to complete a new task successfully once, in a single trial. For example, imagine a disaster relief robot tasked with retrieving an item from a fallen building, where it cannot get direct supervision from humans. It must retrieve this object within one test-time trial, and must do so while tackling unknown obstacles, though it may leverage knowledge of the building from before the disaster. We formalize this problem setting, which we call single-life reinforcement learning (SLRL), where an agent must complete a task within a single episode without interventions, utilizing its prior experience while contending with some form of novelty. SLRL provides a natural setting to study the challenge of autonomously adapting to unfamiliar situations, and we find that algorithms designed for standard episodic reinforcement learning often struggle to recover from out-of-distribution states in this setting.
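As a concrete illustration of the setting, below is a minimal sketch of a single-life evaluation loop: the agent is deployed with whatever it has learned from prior experience and must finish the task in one uninterrupted trial, adapting online without resets. The interfaces (`env`, `policy`, `adapt`) are hypothetical placeholders, not the paper's code.

```python
def run_single_life_trial(env, policy, adapt, max_steps=100_000):
    """Run one uninterrupted trial: no resets, no human interventions (illustrative sketch)."""
    obs = env.reset()                      # a single reset at deployment time only
    online_experience = []
    for step in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        online_experience.append((obs, action, reward, next_obs))
        adapt(policy, online_experience)   # online adaptation from the single life
        obs = next_obs
        if done:                           # task solved (or the trial ends irrecoverably)
            return step, info
    return max_steps, {}
```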


A Proof for Equation (7) in Section 3.2

Neural Information Processing Systems

In Section 3.2, we propose a shifting operation on the convex conjugate generator function f* of an f-divergence (Eq. (7)). Below, we summarize the shifting operation and prove its efficacy in Proposition A.1. The environments used in our experiments are from OpenAI Gym [10], including CartPole [8] from the classic RL literature and five complex tasks simulated with MuJoCo [32]: HalfCheetah, Hopper, Reacher, Walker, and Humanoid, with task screenshots and version numbers shown in Figure 1. Note that behavior cloning (BC) employs the same structure to train a policy network with supervised learning. The reward signal networks used in GAIL, BC+GAIL, AIRL, RKL-VIM, and f-GAIL each consist of three hidden layers of 100 units, with the first two layers activated by tanh and the final activation layers listed in Table 3. For the ablation study in Section 4.3, we varied the number of linear layers over 1, 2, 4, and 7 (with 100 nodes per layer) and the number of nodes per layer over 25, 50, 100, and 200 (with 4 layers).
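For concreteness, the following is a minimal sketch of a reward signal network with the layer sizes described above (three linear layers of 100 units, tanh after the first two, and a method-specific final activation). The class name, the input convention (concatenated state and action), and the default final activation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RewardSignalNet(nn.Module):
    """Sketch of the reward signal network described above (illustrative, not the paper's code).

    Three linear layers of 100 units each; the first two are followed by tanh, and the
    final activation varies per method (see Tab. 3), so it is passed in as a parameter.
    """

    def __init__(self, obs_dim: int, act_dim: int, final_activation: nn.Module = nn.Sigmoid()):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 100), nn.Tanh(),
            nn.Linear(100, 100), nn.Tanh(),
            nn.Linear(100, 1),
            final_activation,
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Discriminator/reward signal for a batch of state-action pairs.
        return self.net(torch.cat([obs, act], dim=-1))
```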


f-GAIL: Learning f-Divergence for Generative Adversarial Imitation Learning

Neural Information Processing Systems

Imitation learning (IL) aims to learn a policy from expert demonstrations that minimizes the discrepancy between the learner and expert behaviors. Various imitation learning algorithms have been proposed with different pre-determined divergences to quantify the discrepancy. This naturally gives rise to the following question: given a set of expert demonstrations, which divergence can recover the expert policy more accurately and with higher data efficiency? In this work, we propose f-GAIL, a new generative adversarial imitation learning (GAIL) model that automatically learns a discrepancy measure from the f-divergence family, as well as a policy capable of producing expert-like behaviors. Compared with IL baselines using various predefined divergence measures, f-GAIL learns better policies with higher data efficiency on six physics-based control tasks.
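As background on the divergence family referenced here (a standard identity, not a formula quoted from this abstract), any f-divergence between the expert occupancy P and the learner occupancy Q admits the variational representation used in f-GAN-style adversarial training, where f^{*} denotes the convex conjugate of the generator f and T ranges over a class of critic functions; with a restricted parametric class the supremum becomes a lower bound:

```latex
D_f(P \,\|\, Q) \;\ge\; \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}\big(T(x)\big)\big].
```

Learning the divergence, as f-GAIL does, amounts to optimizing over the choice of f in this family in addition to the critic and the policy.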


GAIL aims to match the state-action distributions between the learner and the expert.

Neural Information Processing Systems

We thank the reviewers for their comments. Please find our responses below, with reference indices consistent with the paper. Q3-5: What is the meaning of the learned divergence? We agree that BC minimizes the policy KL divergence, as noted in Sec. 4 (line 200). This is consistent with the literature, e.g., Sec. 2 in [Yu et al., arXiv:1909.09314].
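For reference, the claim that BC minimizes a policy KL divergence can be written as the following standard identity (background, not text from the rebuttal): minimizing the expected forward KL between the expert policy and the learner policy over expert-visited states is, up to a constant (the expert's entropy), the maximum-likelihood behavior cloning objective.

```latex
\min_{\pi} \; \mathbb{E}_{s \sim d_{\pi_E}}\!\Big[ D_{\mathrm{KL}}\big(\pi_E(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\big) \Big]
\;=\; \min_{\pi} \; -\,\mathbb{E}_{(s,a) \sim \pi_E}\big[ \log \pi(a \mid s) \big] \;+\; \mathrm{const}.
```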


Muscles in Time (MinT): Supplemental Material (Datasets and Benchmarks Track)

Neural Information Processing Systems

Currently the dataset can be downloaded under the following link (2.2 GB, compressed tar file). The Muscles in Time dataset will be published under a CC BY-NC 4.0 license, and our data generation pipeline is licensed under the Apache License, Version 2.0. Data structure: the structure of the provided MinT data is intentionally kept simple. The first and last 0.14 seconds of each sequence are cut off, since the muscle activation simulation is unreliable at the sequence boundaries. A short example of the musint package usage is displayed in Listing 2; the musint package can be installed via pip install musint. In Figure 9 we provide additional information on the data provided with Muscles in Time. TotalCapture makes up a small part of the dataset, with exceptionally long sequences, while the largest single contribution, 3.2 h of analyzed recordings, comes from another source dataset.


Muscles in Time: Learning to Understand Human Motion by Simulating Muscle Activations

Neural Information Processing Systems

Exploring the intricate dynamics between muscular and skeletal structures is pivotal for understanding human motion. This domain presents substantial challenges, primarily attributed to the intensive resources required for acquiring ground truth muscle activation data, resulting in a scarcity of datasets. In this work, we address this issue by establishing Muscles in Time (MinT), a large-scale synthetic muscle activation dataset. For the creation of MinT, we enriched existing motion capture datasets by incorporating muscle activation simulations derived from biomechanical human body models using the OpenSim platform, a common approach in biomechanics and human motion research. Starting from simple pose sequences, our pipeline enables us to extract detailed information about the timing of muscle activations within the human musculoskeletal system. Muscles in Time contains over nine hours of simulation data covering 227 subjects and 402 simulated muscle strands. We demonstrate the utility of this dataset by presenting results on neural network-based muscle activation estimation from human pose sequences with two different sequence-to-sequence architectures.
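To make the estimation task concrete, below is a minimal sketch of one possible sequence-to-sequence regressor from pose sequences to per-frame muscle activations. The GRU encoder, layer sizes, and input/output dimensions are illustrative assumptions (402 outputs matching the number of simulated muscle strands), not the architectures evaluated in the paper.

```python
import torch
import torch.nn as nn

class PoseToMuscleSeq2Seq(nn.Module):
    """Illustrative seq2seq regressor: pose sequence -> per-frame muscle activations."""

    def __init__(self, pose_dim=63, hidden_dim=256, num_muscles=402):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_muscles), nn.Sigmoid(),  # activations bounded in [0, 1]
        )

    def forward(self, poses):              # poses: (batch, time, pose_dim)
        features, _ = self.encoder(poses)  # per-frame features
        return self.head(features)         # (batch, time, num_muscles)
```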


A Theoretical Analysis

Neural Information Processing Systems

This section contains the theoretical analysis of the loss functions of offline experience replay (Proposition 2), augmented experience replay (Proposition 3), and online experience replay with reservoir sampling (Proposition 1). At each iteration t, t = 1, ..., T, a batch of data B_t is sampled from the incoming task. Note 3: Consider a balanced continual learning dataset (e.g., Split-CIFAR100, Split-Mini-ImageNet), where each task's dataset D_i has the same size. Note 4: Consider general continual learning datasets. Table 3 lists the image size, the number of classes, the number of tasks, and the data size per task of the four CL benchmarks. C.1 Continual Learning Implementation. The hyperparameter settings are summarized in Table 4. All models are optimized using vanilla SGD.
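Since the analysis covers online experience replay with reservoir sampling, a brief sketch of the standard reservoir-sampling memory update is included below (the classic algorithm, not code from the paper): after n examples have streamed by, each one is kept in the fixed-size buffer with probability capacity / n, so the memory remains a uniform sample of the stream.

```python
import random

class ReservoirMemory:
    """Fixed-size replay memory maintained with reservoir sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0

    def add(self, example):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.n_seen)   # uniform index in [0, n_seen)
            if j < self.capacity:
                self.buffer[j] = example        # replace a uniformly chosen slot

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```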


Repeated Augmented Rehearsal: A Simple but Strong Baseline for Online Continual Learning

Neural Information Processing Systems

Online continual learning (OCL) aims to train neural networks incrementally from a non-stationary data stream with a single pass through data. Rehearsal-based methods attempt to approximate the observed input distributions over time with a small memory and revisit them later to avoid forgetting. Despite their strong empirical performance, rehearsal methods still suffer from a poor approximation of past data's loss landscape with memory samples. This paper revisits the rehearsal dynamics in online settings. We provide theoretical insights on the inherent memory overfitting risk from the viewpoint of biased and dynamic empirical risk minimization, and examine the merits and limits of repeated rehearsal. Inspired by our analysis, a simple and intuitive baseline, repeated augmented rehearsal (RAR), is designed to address the underfitting-overfitting dilemma of online rehearsal. Surprisingly, across four rather different OCL benchmarks, this simple baseline outperforms vanilla rehearsal by 9%-17% and also significantly improves the state-of-the-art rehearsal-based methods MIR, ASER, and SCR. We also demonstrate that RAR successfully achieves an accurate approximation of the loss landscape of past data and high-loss ridge aversion in its learning trajectory. Extensive ablation studies are conducted to study the interplay between repeated and augmented rehearsal, and reinforcement learning (RL) is applied to dynamically adjust the hyperparameters of RAR to balance the stability-plasticity trade-off online.
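The core recipe of RAR is simple enough to sketch: for each incoming batch, repeat the rehearsal update several times, re-sampling and re-augmenting the memory batch on every repeat. The sketch below is an illustrative outline under assumed interfaces (a generic PyTorch classifier, a `memory` object whose `sample` returns an (inputs, labels) pair, and an `augment` transform); it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rar_update(model, optimizer, incoming_batch, memory, augment, n_repeats=4):
    """One online step of repeated augmented rehearsal (illustrative sketch)."""
    x_new, y_new = incoming_batch
    for _ in range(n_repeats):                          # repeated rehearsal
        x_mem, y_mem = memory.sample(len(x_new))        # re-sample memory every repeat
        x = torch.cat([augment(x_new), augment(x_mem)]) # re-augment every repeat
        y = torch.cat([y_new, y_mem])
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    memory.add(incoming_batch)                          # then update the memory
```

In practice the number of repeats and the augmentation strength trade off underfitting against memory overfitting, which is the dilemma the paper analyzes.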


OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations (Ying Tiffany He)

Neural Information Processing Systems

First-order optimization (FOO) algorithms are pivotal in numerous computational domains, such as reinforcement learning and deep learning. However, their application to complex tasks often entails significant optimization inefficiency due to their need for many sequential iterations to converge. In response, we introduce first-order optimization expedited with approximately parallelized iterations (OptEx), the first general framework that enhances the optimization efficiency of FOO by leveraging parallel computing to directly mitigate its requirement of many sequential iterations for convergence. To achieve this, OptEx uses a kernelized gradient estimation based on the history of evaluated gradients to predict the gradients required by the next few sequential iterations in FOO, which helps to break the inherent iterative dependency and hence enables the approximate parallelization of iterations in FOO. We further establish theoretical guarantees for the estimation error of our kernelized gradient estimation and the iteration complexity of SGD-based OptEx, confirming that the estimation error diminishes to zero as the history of gradients accumulates and that our SGD-based OptEx enjoys an effective acceleration rate of Θ(√N) over standard SGD given parallelism of N, in terms of the sequential iterations required for convergence. Finally, we provide extensive empirical studies, including synthetic functions, reinforcement learning tasks, and neural network training on various datasets, to underscore the substantial efficiency improvements achieved by OptEx in practice. Our implementation is available at https://github.com/youyve/OptEx.
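To illustrate the core mechanism (kernelized gradient estimation feeding speculative, parallelizable steps), below is a toy sketch that fits kernel ridge regression from past iterates to their gradients and uses the predictions to unroll several speculative SGD steps before verifying them with true gradients. The RBF kernel, ridge term, step sizes, and the overall round structure are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Pairwise RBF kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def estimate_gradients(X_query, X_hist, G_hist, lengthscale=1.0, ridge=1e-6):
    """Kernel ridge regression from past iterates (X_hist) to past gradients (G_hist),
    evaluated at the rows of X_query."""
    K = rbf_kernel(X_hist, X_hist, lengthscale) + ridge * np.eye(len(X_hist))
    k = rbf_kernel(X_query, X_hist, lengthscale)
    return k @ np.linalg.solve(K, G_hist)

def optex_like_round(x, grad_fn, X_hist, G_hist, lr=0.1, n_parallel=4):
    """One round of approximately parallelized SGD (toy sketch).

    Estimated gradients unroll n_parallel speculative iterates; the true gradients at
    those iterates are mutually independent, so they could be evaluated on parallel
    workers, and are then appended to the history for future estimation."""
    iterates, cur = [], x
    for _ in range(n_parallel):
        g_hat = estimate_gradients(cur[None, :], X_hist, G_hist)[0]
        cur = cur - lr * g_hat
        iterates.append(cur)
    true_grads = np.stack([grad_fn(p) for p in iterates])    # parallelizable in principle
    X_hist = np.vstack([X_hist, np.stack(iterates)])
    G_hist = np.vstack([G_hist, true_grads])
    x_next = iterates[-1] - lr * true_grads[-1]               # take the last verified step
    return x_next, X_hist, G_hist
```

The history must be seeded with at least one true (iterate, gradient) pair before the first round; afterwards each round adds n_parallel verified gradients.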