Reinforcement Learning
Does Knowledge Transfer Always Help to Learn a Better Policy?
Feng, Fei, Yin, Wotao, Yang, Lin F.
One of the key approaches to save samples when learning a policy for a reinforcement learning problem is to use knowledge from an approximate model such as its simulator. However, does knowledge transfer from approximate models always help to learn a better policy? Despite numerous empirical studies of transfer reinforcement learning, an answer to this question is still elusive. In this paper, we provide a strong negative result, showing that even the full knowledge of an approximate model may not help reduce the number of samples for learning an accurate policy of the true model. We construct an example of reinforcement learning models and show that the complexity with or without knowledge transfer has the same order. On the bright side, effective knowledge transferring is still possible under additional assumptions. In particular, we demonstrate that knowing the (linear) bases of the true model significantly reduces the number of samples for learning an accurate policy.
RoboNet: A Dataset for Large-Scale Multi-Robot Learning
Our goal is to pre-train reinforcement learning models on a diverse dataset and then transfer knowledge (either zero-shot or with fine-tuning) to a different test environment. In the last decade, we've seen learning-based systems provide transformative solutions for a wide range of perception and reasoning problems, from recognizing objects in images to recognizing and translating human speech. If fruitful, this line of work could allow learning-based systems to tackle active control tasks, such as robotics and autonomous driving, alongside the passive perception tasks to which they have already been successfully applied. While deep reinforcement learning methods, like Soft Actor Critic, can learn impressive motor skills, they are challenging to train on large and broad data that is not from the target environment. In contrast, the success of deep networks in fields like computer vision was arguably predicated just as much on large datasets, such as ImageNet, as on large neural network architectures.
Machine Learning
The NICE Customer Engagement Analytics offers a best-of-breed machine learning system that can be deployed across multiple channels at enterprise scale to enable automated marketing decision-making for customers. Built on top of the NICE Customer Engagement Analytics core infrastructure including the identity graph and predictive profile, NICE utilizes breakthrough technology for reinforcement learning to deliver a scalable, reliable and robust decision engine.
3 Microsoft Reinforcement Learning Environments Every ML Researcher Should Know
Reinforcement learning is the study of decision making over time with consequences. The field has developed systems to make decisions in complex environments based on external, and possibly delayed, feedback. "Microsoft Research works on developing the theory, algorithms and systems for technology that learns from its own successes (and failures), explores the world "just enough" to learn, and can infer which decisions have led to those outcomes. Our primary goal is reinforcement learning in the real world: understanding how to build systems that work, even when simulation is unavailable and samples are scarce." To celebrate hosting the Reinforcement Learning Israel Meetup, organized by the talented Shani Gamrian at the Microsoft Reactor, here is a list of three reinforcement learning environments every ML enthusiast should know.
Introducing SafeLife: Safety Benchmarks for Reinforcement Learning - The Partnership on AI
SafeLife is part of a broader PAI initiative to develop benchmarks for safety, fairness, and other ethical objectives for machine learning systems. Since so much of machine learning is driven, shaped, and measured by benchmarks (and the datasets and environments they are based on), we believe it is essential that those benchmarks come to incorporate safety and ethics goals on a widespread basis, and we're working to make that happen.
Learning Human Objectives by Evaluating Hypothetical Behavior
Reddy, Siddharth, Dragan, Anca D., Levine, Sergey, Legg, Shane, Leike, Jan
We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.
A pedestrian path-planning model in accordance with obstacle's danger with reinforcement learning
Trinh, Thanh-Trung, Vu, Dinh-Minh, Kimura, Masaomi
Most microscopic pedestrian navigation models use the concept of "forces" applied to the pedestrian agents to replicate the navigation environment. While the approach could provide believable results in regular situations, it does not always resemble natural pedestrian navigation behaviour in many typical settings. In our research, we proposed a novel approach using reinforcement learning for simulation of pedestrian agent path planning and collision avoidance problem. The primary focus of this approach is using human perception of the environment and danger awareness of interferences. The implementation of our model has shown that the path planned by the agent shares many similarities with a human pedestrian in several aspects such as following common walking conventions and human behaviours.
Risk-Aware MMSE Estimation
Kalogerias, Dionysios S., Chamon, Luiz F. O., Pappas, George J., Ribeiro, Alejandro
Despite the simplicity and intuitive interpretation of Minimum Mean Squared Error (MMSE) estimators, their effectiveness in certain scenarios is questionable. Indeed, minimizing squared errors on average does not provide any form of stability, as the volatility of the estimation error is left unconstrained. When this volatility is statistically significant, the average and realized performance of the MMSE estimator can differ drastically. To address this issue, we introduce a new risk-aware MMSE formulation which trades between mean performance and risk by explicitly constraining the expected predictive variance of the involved squared error. We show that, under mild moment boundedness conditions, the corresponding risk-aware optimal solution can be evaluated explicitly, and has the form of an appropriately biased nonlinear MMSE estimator. We further illustrate the effectiveness of our approach via several numerical examples, which also showcase the advantages of risk-aware MMSE estimation against risk-neutral MMSE estimation, especially in models involving skewed, heavy-tailed distributions.
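The trade-off described in this abstract can be made concrete in the scalar case. What follows is only a sketch under assumptions not stated in the abstract: a scalar quantity X estimated from an observation Y, a fixed Lagrange multiplier \mu \ge 0 standing in for the explicit constraint level, and finite conditional moments up to order four. The symbols m, \sigma^2, \kappa_3 and d are introduced here for illustration and are not taken from the paper.

```latex
% Constrained problem (constraint level \varepsilon is hypothetical notation):
\min_{\hat{X}(\cdot)} \; \mathbb{E}\big[(X-\hat{X}(Y))^2\big]
\quad \text{s.t.} \quad
\mathbb{E}\big[\operatorname{Var}\big((X-\hat{X}(Y))^2 \mid Y\big)\big] \le \varepsilon .

% Conditional moments given Y, with Z = X - m:
m = \mathbb{E}[X \mid Y], \qquad
\sigma^2 = \mathbb{E}[Z^2 \mid Y], \qquad
\kappa_3 = \mathbb{E}[Z^3 \mid Y], \qquad
d = m - \hat{X}(Y).

% The error is X - \hat{X} = Z + d, so
\mathbb{E}\big[(Z+d)^2 \mid Y\big] = \sigma^2 + d^2, \qquad
\operatorname{Var}\big((Z+d)^2 \mid Y\big)
  = \mathbb{E}[Z^4 \mid Y] - \sigma^4 + 4 d\,\kappa_3 + 4 d^2 \sigma^2 .

% Pointwise Lagrangian in d (strictly convex, since 1 + 4\mu\sigma^2 > 0):
L(d) = \sigma^2 + d^2
     + \mu\big(\mathbb{E}[Z^4 \mid Y] - \sigma^4 + 4 d\,\kappa_3 + 4 d^2 \sigma^2\big),
\qquad
L'(d) = 2d + 4\mu\kappa_3 + 8\mu\sigma^2 d = 0 .

% Minimizer:
\hat{X}(Y) = \mathbb{E}[X \mid Y] + \frac{2\mu\,\kappa_3}{1 + 4\mu\,\sigma^2} .
```

Under these assumptions, setting \mu = 0 recovers the ordinary MMSE estimator, while \mu > 0 shifts the estimate in the direction of the conditional skewness. That is consistent with the abstract's description of a biased nonlinear MMSE estimator whose benefit shows up for skewed distributions.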
Combining Q-Learning and Search with Amortized Value Estimates
Hamrick, Jessica B., Bapst, Victor, Sanchez-Gonzalez, Alvaro, Pfaff, Tobias, Weber, Theophane, Buesing, Lars, Battaglia, Peter W.
We introduce "Search with Amortized Value Estimates" (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps and, in contrast to typical model-based search approaches, yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.
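The update described in this abstract, where search-improved Q-estimates and real experience jointly train the prior, might be sketched as a two-term loss. This is a minimal illustration, not the authors' implementation: the choice of a cross-entropy term between softmax distributions, the one-step TD term, and the names `amortization_loss`, `td_loss`, `save_style_loss`, `beta_q` and `beta_a` are all assumptions made here.

```python
import math


def log_softmax(values, temperature=1.0):
    """Numerically stable log-softmax over a list of Q-values."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def amortization_loss(q_prior, q_search, temperature=1.0):
    """Cross-entropy pulling softmax(q_prior) toward softmax(q_search).

    q_search plays the role of the improved estimates returned by search
    and is treated as a fixed target (assumed form of the amortization).
    """
    target = [math.exp(lp) for lp in log_softmax(q_search, temperature)]
    log_pred = log_softmax(q_prior, temperature)
    return -sum(t * lp for t, lp in zip(target, log_pred))


def td_loss(q_prior, action, reward, q_next, gamma=0.99):
    """One-step Q-learning loss computed from a real transition."""
    target = reward + gamma * max(q_next)
    return 0.5 * (q_prior[action] - target) ** 2


def save_style_loss(q_prior, q_search, action, reward, q_next,
                    gamma=0.99, beta_q=1.0, beta_a=1.0):
    """Weighted sum of the real-experience term and the search term."""
    return (beta_q * td_loss(q_prior, action, reward, q_next, gamma)
            + beta_a * amortization_loss(q_prior, q_search))
```

In this sketch the amortization term is smallest when the prior already ranks actions the way search does, so repeated updates would let a small search budget start from better estimates. The weights `beta_q` and `beta_a` are hypothetical knobs for balancing the two signals.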
Dynamic Pricing on E-commerce Platform with Deep Reinforcement Learning
Liu, Jiaxi, Zhang, Yidong, Wang, Xiaoqing, Deng, Yuming, Wu, Xingyu
In this paper we present an end-to-end framework for addressing the problem of dynamic pricing on an E-commerce platform using methods based on deep reinforcement learning (DRL). By using four groups of different business data to represent the states of each time period, we model the dynamic pricing problem as a Markov Decision Process (MDP). Compared with the state-of-the-art DRL-based dynamic pricing algorithms, our approaches make the following three contributions. First, we extend the problem from a discrete price set to a continuous price set. Second, instead of using revenue as the reward function directly, we define a new function named difference of revenue conversion rates (DRCR). Third, the cold-start problem of the MDP is tackled by pre-training and evaluation using some carefully chosen historical sales data. Our approaches are evaluated by both an offline evaluation method using a real dataset from Alibaba Inc. and online field experiments on Tmall.com, a major online shopping website owned by Alibaba Inc. In particular, experiment results suggest that DRCR is a more appropriate reward function than revenue, which is widely used in the current literature. In the end, field experiments, which lasted for months on 1000 stock keeping units (SKUs) of products, demonstrate that continuous price sets perform better than discrete sets and show that our approaches significantly outperformed manual pricing by operation experts.