 reinforcement learning


Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization Zifeng Zhuang

Neural Information Processing Systems

Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) in training, while performing outer-level optimization (policy extraction) in testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts like reward-conditioned policy: Q1) What information should we transfer from the inner-level to the outer-level? Q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP (Design fROm Policies), which fully answers the above questions. Specifically, in the inner-level, DROP decomposes offline data into multiple subsets and learns an MBO score model (A1). To keep exploitation of the score model safe in the outer-level, we explicitly learn a behavior embedding and introduce a conservative regularization (A2). During testing, we show that DROP permits test-time adaptation, enabling an adaptive inference across states (A3). Empirically, we find that DROP, compared to prior non-iterative offline RL counterparts, gains an average improvement probability of more than 80%, and achieves comparable or better performance compared to prior iterative baselines.


A Unifying View of Optimism in Episodic Reinforcement Learning

Neural Information Processing Systems

In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. This framework is built upon Lagrangian duality, and demonstrates that every model-optimistic algorithm that constructs an optimistic MDP has an equivalent representation as a value-optimistic dynamic programming algorithm. Typically, it was thought that these two classes of algorithms were distinct, with model-optimistic algorithms benefiting from a cleaner probabilistic analysis while value-optimistic algorithms are easier to implement and thus more practical. With the framework developed in this paper, we show that it is possible to get the best of both worlds by providing a class of algorithms which have a computationally efficient dynamic-programming implementation and also a simple probabilistic analysis. Besides being able to capture many existing algorithms in the tabular setting, our framework can also address large-scale problems under realizable function approximation, where it enables a simple model-based analysis of some recently proposed methods.


Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Neural Information Processing Systems

By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.
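The excerpt above is truncated and does not spell out the reparameterization itself. As a minimal sketch, assuming the commonly cited form in which each weight vector is written as w = g * v / ||v|| (a scalar gain g times a unit direction), the idea might look like the following. The function names `weight_norm` and `neuron_output` are hypothetical and chosen for illustration only.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v|| for one weight vector v and scalar gain g.

    Assumed form of the reparameterization; the length of w is set by g
    alone, and v only contributes a direction.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector v must be non-zero")
    return [g * x / norm for x in v]


def neuron_output(v, g, b, x):
    """Pre-activation of a single neuron using the reparameterized weights."""
    w = weight_norm(v, g)
    return sum(wi * xi for wi, xi in zip(w, x)) + b


if __name__ == "__main__":
    # Rescaling v leaves w unchanged, so the norm of w is controlled by g only.
    print(weight_norm([3.0, 4.0], 2.0))
    print(weight_norm([30.0, 40.0], 2.0))
```

Note that nothing in this computation depends on other examples in a minibatch, which is consistent with the property the abstract highlights.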


Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing Systems

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, which is guaranteed to outperform a given baseline strategy. In this paper, we develop and analyze a new model-based approach that computes a safe policy, given an inaccurate model of the system's dynamics and guarantees on the accuracy of this model. The new robust method uses this model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose a simple approximate algorithm. Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches.


Linear Feature Encoding for Reinforcement Learning

Neural Information Processing Systems

Feature construction is of vital importance in reinforcement learning, as the quality of a value function or policy is largely determined by the corresponding features. Typical deep RL approaches use a linear output layer, which means that deep RL can be interpreted as a feature construction/encoding network followed by linear value function approximation. This paper develops and evaluates a theory of linear feature encoding. We extend theoretical results on feature quality for linear value function approximation from the uncontrolled case to the controlled case. We then develop a supervised linear feature encoding method that is motivated by insights from linear value function approximation theory, as well as empirical successes from deep RL.


Adaptive optimal training of animal behavior

Neural Information Processing Systems

Neuroscience experiments often require training animals to perform tasks designed to elicit various sensory, cognitive, and motor behaviors. Training typically involves a series of gradual adjustments of stimulus conditions and rewards in order to bring about learning. However, training protocols are usually hand-designed, relying on a combination of intuition, guesswork, and trial-and-error, and often require weeks or months to achieve a desired level of task performance. Here we combine ideas from reinforcement learning and adaptive optimal experimental design to formulate methods for adaptive optimal training of animal behavior. Our work addresses two intriguing problems at once: first, it seeks to infer the learning rules underlying an animal's behavioral changes during training; second, it seeks to exploit these rules to select stimuli that will maximize the rate of learning toward a desired objective.


Learning values across many orders of magnitude

Neural Information Processing Systems

Most learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates. This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior.
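To make "adaptively normalize the targets" concrete, here is a minimal sketch that tracks a running mean and standard deviation of the targets and rescales them before they would enter a learning update. It is an illustration under stated assumptions, not the paper's algorithm: the class name `RunningTargetNormalizer`, the exponential-moving-average update, and the step size `beta` are all hypothetical choices, and the sketch does not address how a network's outputs should be kept consistent when the statistics change.

```python
import math


class RunningTargetNormalizer:
    """Tracks running first and second moments of scalar targets (hypothetical helper)."""

    def __init__(self, beta=0.01, eps=1e-8):
        self.beta = beta      # step size of the moving averages (assumed value)
        self.eps = eps        # keeps the scale strictly positive
        self.mean = 0.0       # running estimate of E[target]
        self.sq_mean = 1.0    # running estimate of E[target^2]

    def update(self, target):
        """Move both running moments a small step toward the new target."""
        self.mean = (1.0 - self.beta) * self.mean + self.beta * target
        self.sq_mean = (1.0 - self.beta) * self.sq_mean + self.beta * target * target

    @property
    def std(self):
        variance = max(self.sq_mean - self.mean ** 2, 0.0)
        return math.sqrt(variance) + self.eps

    def normalize(self, target):
        return (target - self.mean) / self.std

    def denormalize(self, value):
        return value * self.std + self.mean


if __name__ == "__main__":
    norm = RunningTargetNormalizer()
    for step in range(5000):
        # Raw targets in the thousands, as might happen with unclipped rewards.
        norm.update(1000.0 if step % 2 == 0 else 3000.0)
    print(norm.normalize(3000.0), norm.normalize(1000.0))
```

With statistics like these, the learner sees targets of roughly unit scale regardless of the raw reward magnitude, which is the alternative to reward clipping that the abstract motivates.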


Learning Goal-Conditioned Representations for Language Reward Models Jeff Da Yuntao Ma Hugh Zhang Spencer Whitehead Sean Hendryx

Neural Information Processing Systems

Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning. Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback on language models. In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves reward model performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well - on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy.


Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL Qin-Wen Luo, Ye-Wen Wang, Sheng-Jun Huang

Neural Information Processing Systems

Offline-to-online (O2O) reinforcement learning (RL) provides an effective means of leveraging an offline pre-trained policy as initialization to improve performance rapidly with limited online interactions. Recent studies often design fine-tuning strategies for a specific offline RL method and cannot perform general O2O learning from any offline method. To deal with this problem, we show that there are evaluation and improvement mismatches between the offline dataset and the online environment, which hinder the direct application of pre-trained policies to online fine-tuning. In this paper, we propose to handle these two mismatches simultaneously, with the aim of achieving general O2O learning from any offline method to any online method. Before online fine-tuning, we re-evaluate the pessimistic critic trained on the offline dataset in an optimistic way and then calibrate the misaligned critic with the reliable offline actor to avoid erroneous updates. After obtaining an optimistic and aligned critic, we perform constrained fine-tuning to combat distribution shift during online learning. We show empirically that the proposed method can achieve stable and efficient performance improvement on multiple simulated tasks when compared to the state-of-the-art methods.


Normalization and effective learning rates in reinforcement learning

Neural Information Processing Systems

Layer normalization has demonstrated remarkable effectiveness at preventing plasticity loss in continual and reinforcement learning (RL), though the precise reasons for this effectiveness remain mysterious. In this work, we identify new mechanisms by which layer normalization can help - and hinder - training in neural networks, and leverage these insights to improve the robustness of gradient-based optimization algorithms to nonstationarity. Our analysis reveals a surprising ability of layer normalization to revive dormant ReLU units, along with an underappreciated vulnerability to unconstrained decay of the effective learning rate (ELR), which can drive loss of plasticity in long-running nonstationary tasks. Motivated by these findings, we propose Normalize-and-Project (NaP), a simple protocol designed to provide the numerous benefits of normalization while ensuring that the effective learning rate remains constant throughout training. To do so, NaP couples the insertion of normalization layers with weight projection.
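The abstract does not describe the projection step in detail. As a rough sketch, assuming "weight projection" means rescaling a layer's weights back to a fixed norm after every gradient step, the update could look like the code below. The helper names (`l2_norm`, `project_to_norm`, `sgd_step_with_projection`) and the use of plain SGD are assumptions made for illustration. The intuition is that when a normalization layer makes the output insensitive to the weight scale, a growing weight norm shrinks the relative size of each update, and holding the norm fixed avoids that drift.

```python
import math


def l2_norm(w):
    """Euclidean norm of a flat list of weights."""
    return math.sqrt(sum(x * x for x in w))


def project_to_norm(w, target_norm):
    """Rescale w so that its Euclidean norm equals target_norm."""
    norm = l2_norm(w)
    if norm == 0.0:
        raise ValueError("cannot project the zero vector")
    scale = target_norm / norm
    return [scale * x for x in w]


def sgd_step_with_projection(w, grad, lr, target_norm):
    """One plain SGD step followed by projection back to a fixed norm."""
    stepped = [wi - lr * gi for wi, gi in zip(w, grad)]
    return project_to_norm(stepped, target_norm)


if __name__ == "__main__":
    w = [3.0, 4.0]               # initial weights, norm 5
    target = l2_norm(w)          # keep the norm at its initial value
    for _ in range(100):
        w = sgd_step_with_projection(w, [1.0, -2.0], lr=0.1, target_norm=target)
    print(l2_norm(w))            # stays at the target norm up to rounding
```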