Reinforcement Learning
How to Combine Tree-Search Methods in Reinforcement Learning
Efroni, Yonathan, Dalal, Gal, Scherrer, Bruno, Mannor, Shie
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves while the information obtained at the root is not leveraged other than for updating the policy. Here, we question the potency of this approach.Namely, the latter procedure is non-contractive in general, and its convergence is not guaranteed. Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a \gamma^h-contracting procedure, where \gamma is the discount factor and $h$ is the tree depth. To establish our results, we first introduce a notion called multiple-step greedy consistency. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected to both the tree search stage and value estimation stage.
ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay
Experience replay is an important technique for addressing sample-inefficiency in deep reinforcement learning (RL), but faces difficulty in learning from binary and sparse rewards due to disproportionately few successful experiences in the replay buffer. Hindsight experience replay (HER) (Andrychowicz et al. 2017) was recently proposed to tackle this difficulty by manipulating unsuccessful transitions, but in doing so, HER introduces a significant bias in the replay buffer experiences and therefore achieves a suboptimal improvement in sample-efficiency. In this paper, we present an analysis on the source of bias in HER, and propose a simple and effective method to counter the bias, to most effectively harness the sample-efficiency provided by HER. Our method, motivated by counterfactual reasoning and called ARCHER, extends HER with a tradeoff to make rewards calculated for hindsight experiences numerically greater than real rewards. We validate our algorithm on two continuous control environments from DeepMind Control Suite (Tassa et al. 2018) - Reacher and Finger, which simulate manipulation tasks with a robotic arm - in combination with various reward functions, task complexities and goal sampling strategies. Our experiments consistently demonstrate that countering bias using more aggressive hindsight rewards increases sample efficiency, thus establishing the greater benefit of ARCHER in RL applications with limited computing budget.
ANS: Adaptive Network Scaling for Deep Rectifier Reinforcement Learning Models
Wu, Yueh-Hua, Sun, Fan-Yun, Chang, Yen-Yu, Lin, Should-De
This work provides a thorough study on how reward scaling can affect performance of deep reinforcement learning agents. In particular, we would like to answer the question that How does reward scaling affect non-saturating ReLU networks in RL? This question matters because ReLU is one of the most effective activation functions for deep learning models. We also propose an Adaptive Network Scaling framework to find a suitable scale of the rewards during learning for better performance. We conducted empirical studies to justify the solution.
Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising
Wu, Di, Chen, Xiujun, Yang, Xun, Wang, Hao, Tan, Qing, Zhang, Xiaoxun, Xu, Jian, Gai, Kun
Real-time bidding (RTB) is an important mechanism in online display advertising, where a proper bid for each page view plays an essential role for good marketing results. Budget constrained bidding is a typical scenario in RTB where the advertisers hope to maximize the total value of the winning impressions under a pre-set budget constraint. However, the optimal bidding strategy is hard to be derived due to the complexity and volatility of the auction environment. To address these challenges, in this paper, we formulate budget constrained bidding as a Markov Decision Process and propose a model-free reinforcement learning framework to resolve the optimization problem. Our analysis shows that the immediate reward from environment is misleading under a critical resource constraint. Therefore, we innovate a reward function design methodology for the reinforcement learning problems with constraints. Based on the new reward design, we employ a deep neural network to learn the appropriate reward so that the optimal policy can be learned effectively. Different from the prior model-based work, which suffers from the scalability problem, our framework is easy to be deployed in large-scale industrial applications. The experimental evaluations demonstrate the effectiveness of our framework on large-scale real datasets.
Reinforcement Learning under Threats
Gallego, Vรญctor, Naveiro, Roi, Insua, David Rรญos
In several reinforcement learning (RL) scenarios, mainly in security settings, there may be adversaries trying to interfere with the reward generating process. In this paper, we introduce Threatened Markov Decision Processes (TMDPs), which provide a framework to support a decision maker against a potential adversary in RL. Furthermore, we propose a level-$k$ thinking scheme resulting in a new learning framework to deal with TMDPs. After introducing our framework and deriving theoretical results, relevant empirical evidence is given via extensive experiments, showing the benefits of accounting for adversaries while the agent learns.
Deep Reinforcement Learning in High Frequency Trading
Ganesh, Prakhar, Rakheja, Puneet
The ability to give a precise and fast prediction for the price movement of stocks is the key to profitability in High Frequency Trading. The main objective of this paper is to propose a novel way of modeling the high frequency trading problem using Deep Reinforcement Learning and to argue why Deep RL can have a lot of potential in the field of High Frequency Trading. We have analyzed the model's performance based on it's prediction accuracy as well as prediction speed across full-day trading simulations.
Recurrent World Models Facilitate Policy Evolution
Ha, David, Schmidhuber, Jรผrgen
A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various environments. We also train our agent entirely inside of an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of paper at https://worldmodels.github.io
Effective Exploration for Deep Reinforcement Learning via Bootstrapped Q-Ensembles under Tsallis Entropy Regularization
Chen, Gang, Peng, Yiming, Zhang, Mengjie
Recently deep reinforcement learning (DRL) has achieved outstanding success on solving many difficult and large-scale RL problems. However the high sample cost required for effective learning often makes DRL unaffordable in resource-limited applications. With the aim of improving sample efficiency and learning performance, we will develop a new DRL algorithm in this paper that seamless integrates entropy-induced and bootstrap-induced techniques for efficient and deep exploration of the learning environment. Specifically, a general form of Tsallis entropy regularizer will be utilized to drive entropy-induced exploration based on efficient approximation of optimal action-selection policies. Different from many existing works that rely on action dithering strategies for exploration, our algorithm is efficient in exploring actions with clear exploration value. Meanwhile, by employing an ensemble of Q-networks under varied Tsallis entropy regularization, the diversity of the ensemble can be further enhanced to enable effective bootstrap-induced exploration. Experiments on Atari game playing tasks clearly demonstrate that our new algorithm can achieve more efficient and effective exploration for DRL, in comparison to recently proposed exploration methods including Bootstrapped Deep Q-Network and UCB Q-Ensemble.
A Contextual-bandit-based Approach for Informed Decision-making in Clinical Trials
Varatharajah, Yogatheesan, Berry, Brent, Koyejo, Sanmi, Iyer, Ravishankar
Clinical trials involving multiple treatments utilize randomization of the treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of having not received the most appropriate treatments during the trial. Reinforcement-learning methods improve the situation by making it possible to learn the treatment efficacies dynamically during the course of the trial, and to adapt treatment assignments accordingly. Recent efforts using \textit{multi-arm bandits}, a type of reinforcement-learning methods, have focused on maximizing clinical outcomes for a population that was assumed to be homogeneous. However, those approaches have failed to account for the variability among participants that is becoming increasingly evident as a result of recent clinical-trial-based studies. We present a contextual-bandit-based online treatment optimization algorithm that, in choosing treatments for new participants in the study, takes into account not only the maximization of the clinical outcomes but also the patient characteristics. We evaluated our algorithm using a real clinical trial dataset from the International Stroke Trial. The results of our retrospective analysis indicate that the proposed approach performs significantly better than either a random assignment of treatments (the current gold standard) or a multi-arm-bandit-based approach, providing substantial gains in the percentage of participants who are assigned the most suitable treatments. The contextual-bandit and multi-arm bandit approaches provide 72.63% and 64.34% gains, respectively, compared to a random assignment.
APES: a Python toolbox for simulating reinforcement learning environments
Labash, Aqeel, Tampuu, Ardi, Matiisen, Tambet, Aru, Jaan, Vicente, Raul
Assisted by neural networks, reinforcement learning agents have been able to solve increasingly complex tasks over the last years. The simulation environment in which the agents interact is an essential component in any reinforcement learning problem. The environment simulates the dynamics of the agents' world and hence provides feedback to their actions in terms of state observations and external rewards. To ease the design and simulation of such environments this work introduces APES, a highly customizable and open source package in Python to create 2D grid-world environments for reinforcement learning problems. APES equips agents with algorithms to simulate any field of vision, it allows the creation and positioning of items and rewards according to user-defined rules, and supports the interaction of multiple agents.