Reinforcement Learning
Hierarchical Reinforcement Learning for Open-Domain Dialog
Saleh, Abdelrhman, Jaques, Natasha, Ghandeharioun, Asma, Shen, Judy Hanwen, Picard, Rosalind
Open-domain dialog generation is a challenging problem; maximum likelihood training can lead to repetitive outputs, models have difficulty tracking long-term conversational goals, and training on standard movie or online datasets may lead to the generation of inappropriate, biased, or offensive text. Reinforcement Learning (RL) is a powerful framework that could potentially address these issues, for example by allowing a dialog model to optimize for reducing toxicity and repetitiveness. However, previous approaches which apply RL to open-domain dialog generation do so at the word level, making it difficult for the model to learn proper credit assignment for long-term conversational rewards. In this paper, we propose a novel approach to hierarchical reinforcement learning, VHRL, which uses policy gradients to tune the utterance-level embedding of a variational sequence model. This hierarchical approach provides greater flexibility for learning long-term, conversational rewards. We use self-play and RL to optimize for a set of human-centered conversation metrics, and show that our approach provides significant improvements -- in terms of both human evaluation and automatic metrics -- over state-of-the-art dialog models, including Transformers.
Value function estimation in Markov reward processes: Instance-dependent $\ell_\infty$-bounds for policy evaluation
Pananjady, Ashwin, Wainwright, Martin J.
A variety of applications spanning science and engineering use Markov reward processes as models for real-world phenomena, including queueing systems, transportation networks, robotic exploration, game playing, and epidemiology. In some of these settings, the underlying parameters that govern the process are known to the modeller, but in others, these must be estimated from observed data. A salient example of the latter setting, which forms the main motivation for this paper, is the policy evaluation problem encountered in Markov decision processes (MDPs) and reinforcement learning [Ber95a; Ber95b; SB18]. Here an agent operates in an environment whose dynamics are unknown: at each step, it observes the current state of the environment, and takes an action that changes its state according to some stochastic transition function determined by the environment. The goal is to evaluate the utility of some policy (that is, a mapping from states to actions), where utility is measured using rewards that the agent receives from the environment. These rewards are usually assumed to be additive over time, and since the policy determines the action to be taken at each state, the reward obtained at any time is simply a function of the current state of the agent. Thus, this setting induces a Markov reward process (MRP) on the state space, in which both the underlying transitions and rewards are unknown to the agent. The agent only observes samples of state transitions and rewards.
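The policy evaluation setting described above can be illustrated with a minimal sketch. It assumes a tabular MRP with a discount factor `gamma` (the discount factor, the function name `plugin_value_estimate`, and the sample format are assumptions made for illustration, not taken from the paper). It builds empirical estimates of the transition matrix and rewards from observed samples and solves the Bellman equation $V = \hat{r} + \gamma \hat{P} V$ for them; this is one natural estimator, not necessarily the one the paper analyzes.

```python
import numpy as np

def plugin_value_estimate(transitions, n_states, gamma):
    """Plug-in estimate of the value function of a tabular MRP.

    transitions: iterable of (state, reward, next_state) samples.
    Returns the solution V of V = r_hat + gamma * P_hat @ V, where
    P_hat and r_hat are empirical averages over the samples.
    """
    counts = np.zeros((n_states, n_states))
    reward_sums = np.zeros(n_states)
    for s, r, s_next in transitions:
        counts[s, s_next] += 1.0
        reward_sums[s] += r
    visits = counts.sum(axis=1)
    if np.any(visits == 0):
        raise ValueError("every state needs at least one observed transition")
    p_hat = counts / visits[:, None]   # empirical transition matrix
    r_hat = reward_sums / visits       # empirical mean reward per state
    # Solve (I - gamma * P_hat) V = r_hat instead of forming an inverse.
    return np.linalg.solve(np.eye(n_states) - gamma * p_hat, r_hat)

# Hypothetical two-state chain: state 0 moves to state 1, state 1 loops.
samples = [(0, 1.0, 1), (1, 2.0, 1)]
values = plugin_value_estimate(samples, n_states=2, gamma=0.5)
```

For this toy chain the looping state has value 2 / (1 - 0.5) = 4 and the first state has value 1 + 0.5 * 4 = 3, which is what the linear solve returns.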
Sample Efficient Policy Gradient Methods with Recursive Variance Reduction
Xu, Pan, Gao, Felicia, Gu, Quanquan
Improving the sample efficiency in reinforcement learning has been a long-standing research problem. In this work, we aim to reduce the sample complexity of existing policy gradient methods. We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2\leq\epsilon$). This sample complexity improves the best known result $O(1/\epsilon^{5/3})$ for policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.
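The stated improvement factor follows directly from dividing the two rates. The short derivation below makes the arithmetic explicit; the concrete value $\epsilon = 10^{-6}$ is only an illustrative choice, and constants hidden by the $O(\cdot)$ notation are ignored.

$$
\frac{1/\epsilon^{5/3}}{1/\epsilon^{3/2}} = \epsilon^{3/2 - 5/3} = \epsilon^{-1/6},
\qquad
\text{so for } \epsilon = 10^{-6}: \quad
\epsilon^{-3/2} = 10^{9}, \quad
\epsilon^{-5/3} = 10^{10}, \quad
\epsilon^{-1/6} = 10 .
$$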
Fine-Tuning Language Models from Human Preferences
Ziegler, Daniel M., Stiennon, Nisan, Wu, Jeffrey, Brown, Tom B., Radford, Alec, Amodei, Dario, Christiano, Paul, Irving, Geoffrey
Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
Predicting optimal value functions by interpolating reward functions in scalarized multi-objective reinforcement learning
Kusari, Arpan, How, Jonathan P.
A common approach for defining a reward function for Multi-objective Reinforcement Learning (MORL) problems is the weighted sum of the multiple objectives. The weights are then treated as design parameters dependent on the expertise (and preference) of the person performing the learning, with the typical result that a new solution is required for any change in these settings. This paper investigates the relationship between the reward function and the optimal value function for MORL; specifically addressing the question of how to approximate the optimal value function well beyond the set of weights for which the optimization problem was actually solved, thereby avoiding the need to recompute for any particular choice. We prove that the value function transforms smoothly given a transformation of weights of the reward function (and thus a smooth interpolation in the policy space). A Gaussian process is used to obtain a smooth interpolation over the reward function weights of the optimal value function for three well-known examples: GridWorld, Objectworld and Pendulum. The results show that the interpolation can provide very robust values for sample states and action space in discrete and continuous domain problems. Significant advantages arise from utilizing this interpolation technique in the domain of autonomous vehicles: easy, instant adaptation of user preferences while driving and true randomization of obstacle vehicle behavior preferences during training.
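As a rough illustration of the two ingredients named above, the sketch below scalarizes a vector-valued reward with a weighted sum and then interpolates optimal values across weights with a zero-mean Gaussian process. Everything concrete here is an assumption for illustration: the one-step, two-armed toy problem, the RBF kernel, the length scale, and the function names are not taken from the paper, and the paper's GridWorld, Objectworld, and Pendulum experiments are not reproduced.

```python
import numpy as np

def scalarize(reward_vector, weights):
    """Weighted-sum scalarization of a vector-valued reward."""
    return float(np.dot(weights, reward_vector))

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two arrays of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_mean(w_train, v_train, w_query, noise=1e-6):
    """Posterior mean of a zero-mean GP fitted to (w_train, v_train)."""
    k_tt = rbf_kernel(w_train, w_train) + noise * np.eye(len(w_train))
    k_qt = rbf_kernel(w_query, w_train)
    return k_qt @ np.linalg.solve(k_tt, v_train)

# Hypothetical one-step problem with two arms and two objectives.
arm_rewards = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

def optimal_value(w):
    """Optimal scalarized value for weights (w, 1 - w) in the toy problem."""
    weights = np.array([w, 1.0 - w])
    return max(scalarize(r, weights) for r in arm_rewards)

# "Solve" the problem only on a coarse grid of weights ...
w_train = np.linspace(0.0, 1.0, 5)
v_train = np.array([optimal_value(w) for w in w_train])

# ... and predict the optimal value at a weight that was never solved.
w_query = np.array([0.6])
v_pred = gp_posterior_mean(w_train, v_train, w_query)
```

The point of the design is that `gp_posterior_mean` is cheap to evaluate for a new weight setting, whereas `optimal_value` stands in for a full (and usually expensive) re-solve of the scalarized problem.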
A Human-Centered Data-Driven Planner-Actor-Critic Architecture via Logic Programming
Lyu, Daoming, Yang, Fangkai, Liu, Bo, Gustafson, Steven
Recent successes of Reinforcement Learning (RL) allow an agent to learn policies that surpass human experts, but RL suffers from being time-hungry and data-hungry. By contrast, human learning is significantly faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a Planner-Actor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses its prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions, and also integrates the Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. This work is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback.
Multi-Robot Deep Reinforcement Learning with Macro-Actions
Xiao, Yuchen, Hoffman, Joshua, Xia, Tian, Amato, Christopher
A. MacDec-POMDPs. Decentralized fully collaborative multi-agent decision-making under uncertainty can be modeled as a decentralized POMDP (Dec-POMDP) [14]. Because Dec-POMDPs assume synchronous actions that require the same amount of time for each agent, they are not applicable to multi-robot planning and learning scenarios in the real world. MacDec-POMDPs, formalized by introducing macro-actions into Dec-POMDPs, inherently allow asynchronous execution among robots with temporally extended macro-actions that can begin and end at different times for each agent. Formally, a MacDec-POMDP is defined as a tuple $\langle I, S, A, \Omega, M, \zeta, O, T, Z, R \rangle$, where $I$ is a finite set of agents; $S$ is a finite set of environment states; $A = \times_i A_i$ and $\Omega = \times_i \Omega_i$ are the spaces of joint primitive actions and joint primitive observations, respectively; $M = \times_i M_i$ is the joint set of each agent's finite macro-action space $M_i$; and $\zeta = \times_i \zeta_i$ is the set of joint macro-observations over the agents' finite macro-observation spaces $\zeta_i$. Given a macro-action-based policy, each agent $i$ is allowed to asynchronously choose a macro-action $m_i = \langle \beta_{m_i}, I_{m_i}, \pi_{m_i} \rangle$ that depends on individual macro-action-observation histories, where $\beta_{m_i}: H^A_i \to [0, 1]$ is the stochastic termination condition and $I_{m_i} \subset H^M_i$ is the initiation set of the corresponding macro-action $m_i$, depending respectively on the primitive-action-observation history space $H^A_i$ and the macro-action-observation history space $H^M_i$ of agent $i$; $\pi_{m_i}: H^A_i \to A_i$ denotes the low-level policy that achieves the macro-action $m_i$. During execution, each agent's primitive observation $o_i \in \Omega_i$ is generated according to the observation probability function $O_i(o_i, a_i, s) = \Pr(o_i \mid a_i, s)$, and a shared immediate reward $r(s, \vec{a})$, where $\vec{a} \in \times_i A_i$, is issued according to the reward function $R: S \times A \to \mathbb{R}$.
Robust Opponent Modeling via Adversarial Ensemble Reinforcement Learning in Asymmetric Imperfect-Information Games
Shen, Macheng, How, Jonathan P.
This paper presents an algorithmic framework for learning robust policies in asymmetric imperfect-information games, where the joint reward could depend on the uncertain opponent type (private information known only to the opponent itself and its ally). In order to maximize the reward, the protagonist agent has to infer the opponent type through agent modeling. We use multiagent reinforcement learning (MARL) to learn opponent models through self-play, which captures the full strategy interaction and reasoning between agents. However, agent policies learned from self-play can suffer from mutual overfitting. Ensemble training methods can be used to improve the robustness of the agent policy against different opponents, but they also significantly increase the computational overhead. In order to achieve a good trade-off between the robustness of the learned policy and the computational complexity, we propose to train a separate opponent policy against the protagonist agent for evaluation purposes. The reward achieved by this opponent is a noisy measure of the robustness of the protagonist agent policy due to the intrinsic stochastic nature of a reinforcement learner. To handle this stochasticity, we apply a stochastic optimization scheme to dynamically update the opponent ensemble to optimize an objective function that strikes a balance between robustness and computational complexity. We empirically show that, under the same limited computational budget, the proposed method results in more robust policy learning than standard ensemble training.
Segregation Dynamics with Reinforcement Learning and Agent Based Modeling
Sert, Egemen, Bar-Yam, Yaneer, Morales, Alfredo J.
Societies are complex. Properties of social systems can be explained by the interplay and weaving of individual actions. Incentives are key to understanding people's choices and decisions. For instance, individual preferences of where to live may lead to the emergence of social segregation. In this paper, we combine Reinforcement Learning (RL) with Agent Based Models (ABM) in order to address the self-organizing dynamics of social segregation and explore the space of possibilities that emerge from considering different types of incentives. Our model promotes the creation of interdependencies and interactions among multiple agents of two different kinds that want to segregate from each other. For this purpose, agents use Deep Q-Networks to make decisions based on the rules of the Schelling Segregation model and the Predator-Prey model. Despite the segregation incentive, our experiments show that spatial integration can be achieved by establishing interdependencies among agents of different kinds. They also reveal that segregated areas are more likely to host older people than diverse areas, which attract younger ones. Through this work, we show that the combination of RL and ABMs can create an artificial environment for policy makers to observe potential and existing behaviors associated with incentives.
A Hierarchical Two-tier Approach to Hyper-parameter Optimization in Reinforcement Learning
Barsce, Juan Cruz, Palombarini, Jorge A., Martínez, Ernesto
Optimization of hyper-parameters in reinforcement learning (RL) algorithms is a key task, because they determine how the agent will learn its policy by interacting with its environment, and thus what data is gathered. In this work, an approach that uses Bayesian optimization to perform a two-step optimization is proposed: first, categorical RL structure hyper-parameters are taken as binary variables and optimized with an acquisition function tailored for such variables. Then, at a lower level of abstraction, solution-level hyper-parameters are optimized by resorting to the expected improvement acquisition function, while using the best categorical hyper-parameters found in the optimization at the upper-level of abstraction. This two-tier approach is validated in a simulated control task. Results obtained are promising and open the way for more user-independent applications of reinforcement learning.
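The lower tier described above relies on the expected improvement acquisition function. A minimal sketch of that function for a maximization problem is given below. It assumes the surrogate model supplies a Gaussian posterior mean and standard deviation at a candidate point; the function name, the exploration offset `xi`, and the maximization convention are assumptions for illustration. The acquisition function used for the categorical upper tier is not specified in the abstract and is not shown.

```python
import math

def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected improvement over best_so_far under a N(mean, std^2) posterior.

    Written for maximization; xi is an optional exploration offset.
    """
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return improvement * cdf + std * pdf

# Hypothetical candidates as (posterior mean, posterior std) pairs.
candidates = [(0.80, 0.05), (0.70, 0.30)]
scores = [expected_improvement(m, s, best_so_far=0.78) for m, s in candidates]
```

A candidate with a lower posterior mean can still score higher when its posterior standard deviation is large, which is how the criterion trades off exploitation against exploration.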