Goto

Collaborating Authors

 Reinforcement Learning


Towards a Data Efficient Off-Policy Policy Gradient

AAAI Conferences

The ability to learn from off-policy data -- data generated from past interaction with the environment -- is essential to data efficient reinforcement learning. Recent work has shown that the use of off-policy data not only allows the re-use of data but can even improve performance in comparison to on-policy reinforcement learning. In this work we investigate if a recently proposed method for learning a better data generation policy, commonly called a behavior policy, can also increase the data efficiency of policy gradient reinforcement learning. Empirical results demonstrate that with an appropriately selected behavior policy we can estimate the policy gradient more accurately. The results also motivate further work into developing methods for adapting the behavior policy as the policy we are learning changes.


Optimizing Sponsored Search Ranking Strategy by Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Sponsored search is an indispensable business model and a major revenue contributor of almost all the search engines. From the advertisers' side, participating in ranking the search results by paying for the sponsored search advertisement to attract more awareness and purchase facilitates their commercial goal. From the users' side, presenting personalized advertisement reflecting their propensity would make their online search experience more satisfactory. Sponsored search platforms rank the advertisements by a ranking function to determine the list of advertisements to show and the charging price for the advertisers. Hence, it is crucial to find a good ranking function which can simultaneously satisfy the platform, the users and the advertisers. Moreover, advertisements showing positions under different queries from different users may associate with advertisement candidates of different bid price distributions and click probability distributions, which requires the ranking functions to be optimized adaptively to the traffic characteristics. In this work, we proposed a generic framework to optimize the ranking functions by deep reinforcement learning methods. The framework is composed of two parts: an offline learning part which initializes the ranking functions by learning from a simulated advertising environment, allowing adequate exploration of the ranking function parameter space without hurting the performance of the commercial platform. An online learning part which further optimizes the ranking functions by adapting to the online data distribution. Experimental results on a large-scale sponsored search platform confirm the effectiveness of the proposed method.


Meta Reinforcement Learning with Latent Variable Gaussian Processes

arXiv.org Machine Learning

Data efficiency, i.e., learning from small data sets, is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard coded or relies in some other way on human expertise. In this paper, we propose to automatically learn the relationship between tasks using a latent variable model. Our approach finds a variational posterior over tasks and averages over all plausible (according to this posterior) tasks when making predictions. We apply this framework within a model-based reinforcement learning setting for learning dynamics models and controllers of many related tasks. We apply our framework in a model-based reinforcement learning setting, and show that our model effectively generalizes to novel tasks, and that it reduces the average interaction time needed to solve tasks by up to 60% compared to strong baselines.


Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation

arXiv.org Artificial Intelligence

Existing research studies on vision and language grounding for robot navigation focus on improving model-free deep reinforcement learning (DRL) models in synthetic environments. However, model-free DRL models do not consider the dynamics in the real-world environments, and they often fail to generalize to new scenes. In this paper, we take a radical approach to bridge the gap between synthetic studies and real-world practices---We propose a novel, planned-ahead hybrid reinforcement learning model that combines model-free and model-based reinforcement learning to solve a real-world vision-language navigation task. Our look-ahead module tightly integrates a look-ahead policy model with an environment model that predicts the next state and the reward. Experimental results suggest that our proposed method significantly outperforms the baselines and achieves the best on the real-world Room-to-Room dataset. Moreover, our scalable method is more generalizable when transferring to unseen environments, and the relative success rate is increased by 15.5% on the unseen test set.


Expected Policy Gradients

arXiv.org Machine Learning

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to the exponential of the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.


Reinforcement Learning Cheat Sheet โ€“ Towards Data Science

@machinelearnbot

Disclaimer: This is a work in progress project there may be errors! In order to fast recap my knowledge of Reinforcement Learning, I created this Cheat Sheet with all the basic formulas and algorithms. I hope this may be useful to you. You can find the full pdf here, and the repo here. Thanks to AlexandreBeaulne that added Contraction Mapping, Sarsa and cleanup the latex.


Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

Journal of Artificial Intelligence Research

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.


Setting up a Reinforcement Learning Task with a Real-World Robot

arXiv.org Machine Learning

Despite some recent successes (e.g., Levine et al. 2016, Gu et al. 2017), real-world robots are under-utilized in the quest for general reinforcement learning (RL) agents, which at this time is primarily confined to simulation. This underutilization is largely due to frustrations around unreliable and poor learning performance with robots. Although several RL methods are recently shown to be highly effective in simulations (Duan et al. 2016), they often yield poor performance when applied off-the-shelf to real-world tasks. Such ineffectiveness is sometimes attributed to some of the integral aspects of the real world including slow rate of data collection, partial observability, noisy sensors, safety, and frailty of physical devices. This barrier contributed to a reliance on indirect approaches such as simulation-to-reality transfer (Rusu et al. 2017) and collective learning (Yahya et al. 2017, Gu et al. 2017), which sometimes compensate for failures to learn from a single stream of real experience. One oft-ignored shortcoming in real-world RL research is the lack of benchmark learning tasks or standards for setting up experiments with robots. Experiments with simulated robots are typically done on benchmark tasks with easily available simulators and standardized interfaces, relieving experimenters of many task-setup details such as the action space, the action cycle time and system delays. On the other hand, setting up a learning task with real-world robots is far from obvious. An experimenter has to put a lot of work into establishing a device-specific sensorimotor interface between the learning agent and the robot as well as determining all the aspects of the environment that define the learning task.


Simple random search provides a competitive approach to reinforcement learning

arXiv.org Machine Learning

A common belief in model-free reinforcement learning is that methods based on random search in the parameter space of policies exhibit significantly worse sample complexity than those that explore the space of actions. We dispel such beliefs by introducing a random search method for training static, linear policies for continuous control problems, matching state-of-the-art sample efficiency on the benchmark MuJoCo locomotion tasks. Our method also finds a nearly optimal controller for a challenging instance of the Linear Quadratic Regulator, a classical problem in control theory, when the dynamics are not known. Computationally, our random search algorithm is at least 15 times more efficient than the fastest competing model-free methods on these benchmarks. We take advantage of this computational efficiency to evaluate the performance of our method over hundreds of random seeds and many different hyperparameter configurations for each benchmark task. Our simulations highlight a high variability in performance in these benchmark tasks, suggesting that commonly used estimations of sample efficiency do not adequately evaluate the performance of RL algorithms.


What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

arXiv.org Machine Learning

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the "Doubling Trick". In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (minimax) bounds in $R\_T = O(\sqrt{T})$ but cannot conserve (distribution-dependent) bounds in $R\_T = O(\log T)$. We give insights as to why exponential doubling tricks may be better, as they conserve bounds in $R\_T = O(\log T)$, and are close to conserving bounds in $R\_T = O(\sqrt{T})$.