Reinforcement Learning
A Distributed Reinforcement Learning Solution With Knowledge Transfer Capability for A Bike Rebalancing Problem
Rebalancing is a critical service bottleneck for many transportation services, such as Citi Bike. Citi Bike relies on manual orchestrations of rebalancing bikes between dispatchers and field agents. Motivated by such problem and the lack of smart autonomous solutions in this area, this project explored a new RL architecture called Distributed RL (DiRL) with Transfer Learning (TL) capability. The DiRL solution is adaptive to changing traffic dynamics when keeping bike stock under control at the minimum cost. DiRL achieved a 350% improvement in bike rebalancing autonomously and TL offered a 62.4% performance boost in managing an entire bike network. Lastly, a field trip to the dispatch office of Chariot, a ride-sharing service, provided insights to overcome challenges of deploying an RL solution in the real world.
Robot Representing and Reasoning with Knowledge from Reinforcement Learning
Lu, Keting, Zhang, Shiqi, Stone, Peter, Chen, Xiaoping
Reinforcement learning (RL) agents aim at learning by interacting with an environment, and are not designed for representing or reasoning with declarative knowledge. Knowledge representation and reasoning (KRR) paradigms are strong in declarative KRR tasks, but are ill-equipped to learn from such experiences. In this work, we integrate logical-probabilistic KRR with model-based RL, enabling agents to simultaneously reason with declarative knowledge and learn from interaction experiences. The knowledge from humans and RL is unified and used for dynamically computing task-specific planning models under potentially new environments. Experiments were conducted using a mobile robot working on dialog, navigation, and delivery tasks. Results show significant improvements, in comparison to existing model-based RL methods.
Multi-agent Deep Reinforcement Learning for Zero Energy Communities
Advances in renewable energy generation and introduction of the government targets to improve energy efficiency gave rise to a concept of a Zero Energy Building (ZEB). A ZEB is a building whose net energy usage over a year is zero, i.e., its energy use is not larger than its overall renewables generation. A collection of ZEBs forms a Zero Energy Community (ZEC). This paper addresses the problem of energy sharing in such a community. This is different from previously addressed energy sharing between buildings as our focus is on the improvement of community energy status, while traditionally research focused on reducing losses due to transmission and storage, or achieving economic gains. We model this problem in a multi-agent environment and propose a Deep Reinforcement Learning (DRL) based solution. Each building is represented by an intelligent agent that learns over time the appropriate behaviour to share energy. We have evaluated the proposed solution in a multi-agent simulation built using osBrain. Results indicate that with time agents learn to collaborate and learn a policy comparable to the optimal policy, which in turn improves the ZEC's energy status. Buildings with no renewables preferred to request energy from their neighbours rather than from the supply grid.
Energy-Based Hindsight Experience Prioritization
In Hindsight Experience Replay (HER), a reinforcement learning agent is trained by treating whatever it has achieved as virtual goals. However, in previous work, the experience was replayed at random, without considering which episode might be the most valuable for learning. In this paper, we develop an energy-based framework for prioritizing hindsight experience in robotic manipulation tasks. Our approach is inspired by the work-energy principle in physics. We define a trajectory energy function as the sum of the transition energy of the target object over the trajectory. We hypothesize that replaying episodes that have high trajectory energy is more effective for reinforcement learning in robotics. To verify our hypothesis, we designed a framework for hindsight experience prioritization based on the trajectory energy of goal states. The trajectory energy function takes the potential, kinetic, and rotational energy into consideration. We evaluate our Energy-Based Prioritization (EBP) approach on four challenging robotic manipulation tasks in simulation. Our empirical results show that our proposed method surpasses state-of-the-art approaches in terms of both performance and sample-efficiency on all four tasks, without increasing computational time. A video showing experimental results is available at https://youtu.be/jtsF2tTeUGQ
Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
Vecerik, Mel, Hester, Todd, Scholz, Jonathan, Wang, Fumin, Pietquin, Olivier, Piot, Bilal, Heess, Nicolas, Rothรถrl, Thomas, Lampe, Thomas, Riedmiller, Martin
We propose a general and model-free approach for Reinforcement Learning (RL) on real robotics with sparse rewards. We build upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations. Both demonstrations and actual interactions are used to fill a replay buffer and the sampling ratio between demonstrations and transitions is automatically tuned via a prioritized replay mechanism. Typically, carefully engineered shaping rewards are required to enable the agents to efficiently explore on high dimensional control problems such as robotics. They are also required for model-based acceleration methods relying on local solvers such as iLQG (e.g. Guided Policy Search and Normalized Advantage Function). The demonstrations replace the need for carefully engineered rewards, and reduce the exploration problem encountered by classical RL approaches in these domains. Demonstrations are collected by a robot kinesthetically force-controlled by a human demonstrator. Results on four simulated insertion tasks show that DDPG from demonstrations out-performs DDPG, and does not require engineered rewards. Finally, we demonstrate the method on a real robotics task consisting of inserting a clip (flexible object) into a rigid object.
Reinforcement Evolutionary Learning Method for self-learning
Pathak, Kumarjit, Kapila, Jitin
In statistical modelling the biggest threat is concept drift which makes the model gradually showing deteriorating performance over time. There are state of the art methodologies to detect the impact of concept drift, however general strategy considered to overcome the issue in performance is to rebuild or re-calibrate the model periodically as the variable patterns for the model changes significantly due to market change or consumer behavior change etc. Quantitative research is the most widely spread application of data science in Marketing or financial domain where applicability of state of the art reinforcement learning for auto-learning is less explored paradigm. Reinforcement learning is heavily dependent on having a simulated environment which is majorly available for gaming or online systems, to learn from the live feedback. However, there are some research happened on the area of online advertisement, pricing etc where due to the nature of the online learning environment scope of reinforcement learning is explored. Our proposed solution is a reinforcement learning based, true self-learning algorithm which can adapt to the data change or concept drift and auto learn and self-calibrate for the new patterns of the data solving the problem of concept drift. Index Terms-- Reinforcement learning, Genetic Algorithm, Q-learning, Classification modelling, CMA-ES, NES, Multi objective optimization, Concept drift, Population stability index, Incremental learning, F1-measure, Predictive Modelling, Self-learning, MCTS, AlphaGo, AlphaZero 1. Introduction Concept drift is well known challenge for sustainability of any machine learning predictive model over time. Machine learning offers diverse techniques to understand the underlying pattern of the data and associate the same with prediction objective. Any predictive modelling activity in either Marketing, Finance, Management are heavily dependent on the assumption that the training data represents the pattern of target population under specific study such as Fraud Identification, Customer churn prediction, Marketing mix modelling, Target customer identification for specific type of promotion etc. However due to social & economic development, customer behavior changes combined with other external factors making past learned pattern, irrelevant for current predictions.
Economics of Artificial Intelligence
An NBER conference on Economics of Artificial Intelligence took place in Toronto on September 13-14, 2018. Research Associates Ajay K. Agrawal, Joshua S. Gans and Avi Goldfarb of University of Toronto and Catherine Tucker of MIT organized the meeting, sponsored by the Alfred P. Sloan Foundation, CIFAR, and the Creative Destruction Lab. These researchers' papers were presented and discussed: Emilio Calvano, Vencenzo Denicolรฒ, and Sergio Pastorello, University of Bologna, and Giacomo Calzolari, European University Institute Q-Learning to Cooperate AI algorithms are increasingly replacing human decision making in real marketplaces. To inform the debate on potential consequences, Calvano, Calzolari, Denicolรฒ, and Pastorello run experiments with AI agents powered by reinforcement learning in controlled environments (computer simulations). In particular, the researchers study multi-agent interaction in the context of a workhorse oligopoly model: price competition with Logit demand and constant marginal costs.
Reinforcement Learning with Perturbed Rewards
Wang, Jingkang, Liu, Yang, Li, Bo
Recent studies have shown the vulnerability of reinforcement learning (RL) models in noisy settings. The sources of noises differ across scenarios. For instance, in practice, the observed reward channel is often subject to noise (e.g., when observed rewards are collected through sensors), and thus observed rewards may not be credible as a result. Also, in applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors. In this paper, we consider noisy RL problems where observed rewards by RL agents are generated with a reward confusion matrix. We call such observed rewards as perturbed rewards. We develop an unbiased reward estimator aided robust RL framework that enables RL agents to learn in noisy environments while observing only perturbed rewards. Our framework draws upon approaches for supervised learning with noisy data. The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 67.5% and 46.7% improvements in average on five Atari games, when the error rates are 10% and 30% respectively.
Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods
Henderson, Peter, Romoff, Joshua, Pineau, Joelle
Recent analyses of certain gradient descent optimization methods have shown that performance can degrade in some settings - such as with stochasticity or implicit momentum. In deep reinforcement learning (Deep RL), such optimization methods are often used for training neural networks via the temporal difference error or policy gradient. As an agent improves over time, the optimization target changes and thus the loss landscape (and local optima) change. Due to the failure modes of those methods, the ideal choice of optimizer for Deep RL remains unclear. As such, we provide an empirical analysis of the effects that a wide range of gradient descent optimizers and their hyperparameters have on policy gradient methods, a subset of Deep RL algorithms, for benchmark continuous control tasks. We find that adaptive optimizers have a narrow window of effective learning rates, diverging in other cases, and that the effectiveness of momentum varies depending on the properties of the environment. Our analysis suggests that there is significant interplay between the dynamics of the environment and Deep RL algorithm properties which aren't necessarily accounted for by traditional adaptive gradient methods. We provide suggestions for optimal settings of current methods and further lines of research based on our findings.
Model-Ensemble Trust-Region Policy Optimization
Kurutach, Thanard, Clavera, Ignasi, Duan, Yan, Tamar, Aviv, Abbeel, Pieter
Model-free reinforcement learning (RL) methods are succeeding in a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity, which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and to date have succeeded mainly in restrictive domains where simple models are sufficient for learning. In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training. To overcome this issue, we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process. We further show that the use of likelihood ratio derivatives yields much more stable learning than backpropagation through time. Altogether, our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark tasks.