Reinforcement Learning
Combining Parametric and Nonparametric Models for Off-Policy Evaluation
Gottesman, Omer, Liu, Yao, Sussex, Scott, Brunskill, Emma, Doshi-Velez, Finale
We consider a model-based approach to perform batch off-policy evaluation in reinforcement learning. Our method takes a mixture-of-experts approach to combine parametric and non-parametric models of the environment such that the final value estimate has the least expected error. We do so by first estimating the local accuracy of each model and then using a planner to select which model to use at every time step so as to minimize the return error estimate along entire trajectories. Across a variety of domains, our mixture-based approach outperforms the individual models alone as well as state-of-the-art importance sampling-based estimators.
Successor Options: An Option Discovery Framework for Reinforcement Learning
Ramesh, Rahul, Tomar, Manan, Ravindran, Balaraman
The options framework in reinforcement learning models the notion of a skill or a temporally extended sequence of actions. The discovery of a reusable set of skills has typically entailed building options that navigate to bottleneck states. This work adopts a complementary approach, where we attempt to discover options that navigate to landmark states. These states are prototypical representatives of well-connected regions and can hence access the associated region with relative ease. In this work, we propose Successor Options, which leverages Successor Representations to build a model of the state space. The intra-option policies are learnt using a novel pseudo-reward and the model scales to high-dimensional spaces easily. Additionally, we propose an Incremental Successor Options model that iterates between constructing Successor Representations and building options, which is useful when robust Successor Representations cannot be built solely from primitive actions. We demonstrate the efficacy of our approach on a collection of grid-worlds, and on the high-dimensional robotic control environment of Fetch.
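As a rough illustration of the Successor Representation that the abstract builds on, the sketch below computes it for a tiny tabular problem. It assumes a fixed policy whose state-to-state transition matrix `P` is known, and the three-state chain is a made-up example rather than one of the paper's domains. It shows only the representation itself, not the option discovery or the pseudo-reward.

```python
def successor_representation(P, gamma=0.9, iters=500):
    """Iterate M <- I + gamma * P @ M, which converges to (I - gamma * P)^-1.

    M[i][j] is the expected discounted number of visits to state j
    when starting in state i and following the fixed policy behind P.
    """
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [
            [identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return M

# Hypothetical 3-state chain under a uniform random policy: the agent moves
# left or right with equal probability, and bumping into a wall keeps it in place.
P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
]
M = successor_representation(P)
```

Because every row of `P` sums to one, every row of `M` sums to `1 / (1 - gamma)`, which gives a quick sanity check on the result.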
Trajectory-Based Off-Policy Deep Reinforcement Learning
Doerr, Andreas, Volpp, Michael, Toussaint, Marc, Trimpe, Sebastian, Daniel, Christian
Policy gradient methods are powerful reinforcement learning algorithms and have been demonstrated to solve many complex tasks. However, these methods are also data-inefficient, afflicted with high variance gradient estimates, and frequently get stuck in local optima. This work addresses these weaknesses by combining recent improvements in the reuse of off-policy data and exploration in parameter space with deterministic behavioral policies. The resulting objective is amenable to standard neural network optimization strategies like stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo. Incorporation of previous rollouts via importance sampling greatly improves data-efficiency, whilst stochastic optimization schemes facilitate the escape from local optima. We evaluate the proposed approach on a series of continuous control benchmark tasks. The results show that the proposed algorithm is able to successfully and reliably learn solutions using fewer system interactions than standard policy gradient methods.
TauRieL: Targeting Traveling Salesman Problem with a deep reinforcement learning inspired architecture
Malazgirt, Gorker Alp, Unsal, Osman S., Kestelman, Adrian Cristal
In this paper, we propose TauRieL and target the Traveling Salesman Problem (TSP) since it has broad applicability in theoretical and applied sciences. TauRieL utilizes an actor-critic inspired architecture that adopts ordinary feedforward nets to obtain a policy update vector $v$. Then, we use $v$ to improve the state transition matrix from which we generate the policy. Also, the state transition matrix allows the solver to initialize from precomputed solutions such as nearest neighbors. In an online learning setting, TauRieL unifies the training and the search where it can generate near-optimal results in seconds. The inputs to the neural nets in the actor-critic architecture are raw 2-D inputs, and the design idea behind this decision is to keep the neural nets smaller than architectures with wide embeddings, at the cost of omitting any distributed representations of the embeddings. Consequently, TauRieL generates TSP solutions two orders of magnitude faster per TSP instance as compared to state-of-the-art offline techniques with a performance impact of 6.1\% in the worst case.
Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
Huang, Chen, Zhai, Shuangfei, Talbott, Walter, Bautista, Miguel Angel, Sun, Shih-Yu, Guestrin, Carlos, Susskind, Josh
In most machine learning training paradigms a fixed, often handcrafted, loss function is assumed to be a good proxy for an underlying evaluation metric. In this work we assess this assumption by meta-learning an adaptive loss function to directly optimize the evaluation metric. We propose a sample efficient reinforcement learning approach for adapting the loss dynamically during training. We empirically show how this formulation improves performance by simultaneously optimizing the evaluation metric and smoothing the loss landscape. We verify our method in metric learning and classification scenarios, showing considerable improvements over the state-of-the-art on a diverse set of tasks. Importantly, our method is applicable to a wide range of loss functions and evaluation metrics. Furthermore, the learned policies are transferable across tasks and data, demonstrating the versatility of the method.
Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates
Soemers, Dennis J. N. J., Piette, Éric, Stephenson, Matthew, Browne, Cameron
In recent years, state-of-the-art game-playing agents often involve policies that are trained in self-playing processes where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project with future goals including the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates, rather than MCTS visit counts. We empirically evaluate various properties of resulting policies, in a variety of board games.
Bias-Reduced Hindsight Experience Replay with Virtual Goal Prioritization
Manela, Binyamin, Biess, Armin
Hindsight Experience Replay (HER) is a multi-goal reinforcement learning algorithm for sparse reward functions. The algorithm treats every failure as a success for an alternative (virtual) goal that has been achieved in the episode. Virtual goals are randomly selected, irrespective of which are most instructive for the agent. In this paper, we present two improvements over the existing HER algorithm. First, we prioritize virtual goals from which the agent will learn more valuable information. We call this property the instructiveness of the virtual goal and define it by a heuristic measure, which expresses how well the agent will be able to generalize from that virtual goal to actual goals. Secondly, we reduce existing bias in HER by the removal of misleading samples. To test our algorithms, we built two challenging environments with sparse reward functions. Our empirical results in both environments show vast improvement in the final success rate and sample efficiency when compared to the original HER algorithm.
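The relabeling idea behind the baseline HER algorithm can be sketched in a few lines. This is a minimal illustration under stated assumptions: the transition layout `(state, action, achieved_goal, goal)`, the 0 / -1 reward convention, the function names, and the toy episode are all hypothetical, and virtual goals are drawn uniformly from outcomes achieved at the current or a later step. It does not implement the prioritization or bias-reduction improvements described in the abstract.

```python
import random

def sparse_reward(achieved, goal):
    # Assumed sparse reward convention: 0 on success, -1 otherwise.
    return 0.0 if achieved == goal else -1.0

def her_relabel(episode, k=2, seed=0):
    """Return replay tuples (state, action, goal, reward).

    episode: list of (state, action, achieved_goal, goal) tuples.
    Each transition is stored once with its actual goal and k more times
    with randomly chosen virtual goals that were achieved in the episode.
    """
    rng = random.Random(seed)
    replay = []
    for t, (state, action, achieved, goal) in enumerate(episode):
        # Original transition, evaluated against the actual goal.
        replay.append((state, action, goal, sparse_reward(achieved, goal)))
        # Candidate virtual goals: outcomes achieved at this step or later.
        future = [step[2] for step in episode[t:]]
        for _ in range(k):
            virtual = rng.choice(future)
            replay.append(
                (state, action, virtual, sparse_reward(achieved, virtual))
            )
    return replay

# Toy failed episode: the agent reaches positions 1, 2, 3 but the goal was 5.
episode = [
    ((0,), "right", (1,), (5,)),
    ((1,), "right", (2,), (5,)),
    ((2,), "right", (3,), (5,)),
]
replay = her_relabel(episode, k=2)
```

Every transition fails against the actual goal, yet the relabeled copies of the final transition receive the success reward, which is how a failed episode still yields a learning signal.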
QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning
Son, Kyunghwan, Kim, Daewoo, Kang, Wan Ju, Hostallero, David Earl, Yi, Yung
We explore value-based solutions for multi-agent reinforcement learning (MARL) tasks in the recently popularized centralized training with decentralized execution (CTDE) regime. VDN and QMIX are representative examples that factorize the joint action-value function into individual ones for decentralized execution. However, VDN and QMIX address only a fraction of factorizable MARL tasks due to their structural constraints in factorization, such as additivity and monotonicity. In this paper, we propose a new factorization method for MARL, QTRAN, which is free from such structural constraints and takes a new approach of transforming the original joint action-value function into an easily factorizable one with the same optimal actions. QTRAN guarantees more general factorization than VDN or QMIX, thus covering a much wider class of MARL tasks than previous methods do. Our experiments on the tasks of multi-domain Gaussian-squeeze and modified predator-prey demonstrate QTRAN's superior performance, with especially large margins in games whose payoffs penalize non-cooperative behavior more aggressively.
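To make the additivity constraint concrete, the sketch below builds a joint action-value as a plain sum of per-agent values, which is the VDN-style structure the abstract refers to. The two agents, their actions, and the numeric values are invented for illustration. It shows why such a factorization permits decentralized execution; it is not an implementation of QTRAN.

```python
from itertools import product

# Hypothetical per-agent action values for a single state.
q1 = {"a": 1.0, "b": 3.0}
q2 = {"x": 2.0, "y": 0.5}

def q_tot(u1, u2):
    # Additive (VDN-style) factorization of the joint action-value.
    return q1[u1] + q2[u2]

# Centralized choice: search over all joint actions.
joint_best = max(product(q1, q2), key=lambda u: q_tot(*u))

# Decentralized choice: each agent maximizes only its own value.
decentralized_best = (max(q1, key=q1.get), max(q2, key=q2.get))
```

Under additivity the two choices coincide, so each agent can act greedily on its own values. A joint payoff that cannot be written as such a sum falls outside what this structure can represent, which is the limitation the abstract describes.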
Generalizable Deep Reinforcement Learning
Transfer learning is all the rage in the machine learning community these days. Transfer learning serves as the basis for many of the managed AutoML services that Google, Salesforce, IBM, and Azure provide. It now figures prominently in the latest NLP research -- appearing in Google's Bidirectional Encoder Representations from Transformers (BERT) model and in Sebastian Ruder and Jeremy Howard's Universal Language Model Fine-tuning for Text Classification (ULMFIT). As Sebastian writes in his blog post, 'NLP's ImageNet moment has arrived.' We're also starting to see examples of neural networks that can handle multiple tasks using transfer learning across domains. Paras Chopra has an excellent tutorial for one PyTorch network that can conduct an image search based on a textual description, search for similar images and words, and write captions for images (link to his post below). The main question at hand is: could transfer learning have applications within reinforcement learning?
Robots that learn to adapt
Humans have the ability to seamlessly adapt to changes in their environments: adults can learn to walk on crutches in just a few seconds, people can adapt almost instantaneously to picking up an object that is unexpectedly heavy, and children who can walk on flat ground can quickly adapt their gait to walk uphill without having to relearn how to walk. This adaptation is critical for functioning in the real world. Robots, on the other hand, are typically deployed with a fixed behavior (be it hard-coded or learned), allowing them to succeed in specific settings, but leading to failure in others: experiencing a system malfunction, encountering a new terrain or environment changes such as wind, or needing to cope with a payload or other unexpected perturbations. The idea behind our latest research is that the mismatch between predicted and observed recent states should inform the robot to update its model into one that more accurately describes the current situation. Noticing our car skidding on the road, for example, informs us that our actions are having a different effect than expected, and thus allows us to plan our subsequent actions accordingly (Figure 1).