Reinforcement Learning
Optimal Bidding Strategy without Exploration in Real-time Bidding
Ghosh, Aritra, Mitra, Saayan, Sarkhel, Somdeb, Swaminathan, Viswanathan
Maximizing utility with a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems. The policy maximizing the utility is referred to as the optimal bidding strategy. Earlier works on optimal bidding strategy apply model-based batch reinforcement learning methods which can not generalize to unknown budget and time constraint. Further, the advertiser observes a censored market price which makes direct evaluation infeasible on batch test datasets. Previous works ignore the losing auctions to alleviate the difficulty with censored states; thus significantly modifying the test distribution. We address the challenge of lacking a clear evaluation procedure as well as the error propagated through batch reinforcement learning methods in RTB systems. We exploit two conditional independence structures in the sequential bidding process that allow us to propose a novel practical framework using the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic. Moreover, the framework allows us to train a model that can generalize to the unseen budget conditions than limit only to those observed in history. We compare our methods on two real-world RTB datasets with several baselines and demonstrate significantly improved performance under various budget settings.
Robotic Table Tennis with Model-Free Reinforcement Learning
Gao, Wenbo, Graesser, Laura, Choromanski, Krzysztof, Song, Xingyou, Lazic, Nevena, Sanketi, Pannag, Sindhwani, Vikas, Jaitly, Navdeep
We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures for non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned curriculum learning on the task and rewards, policies are capable of developing multi-modal styles, specifically forehand and backhand stroke, whilst achieving 80\% return rate on a wide range of ball throws. We observe that multi-modality does not require any architectural priors, such as multi-head architectures or hierarchical policies.
Importance of using appropriate baselines for evaluation of data-efficiency in deep reinforcement learning for Atari
Reinforcement learning (RL) has seen great advancements in the past few years. Nevertheless, the consensus among the RL community is that currently used methods, despite all their benefits, suffer from extreme data inefficiency, especially in the rich visual domains like Atari. To circumvent this problem, novel approaches were introduced that often claim to be much more efficient than popular variations of the state-of-the-art DQN algorithm. In this paper, however, we demonstrate that the newly proposed techniques simply used unfair baselines in their experiments. Namely, we show that the actual improvement in the efficiency came from allowing the algorithm for more training updates for each data sample, and not from employing the new methods. By allowing DQN to execute network updates more frequently we manage to reach similar or better results than the recently proposed advancement, often at a fraction of complexity and computational costs. Furthermore, based on the outcomes of the study, we argue that the agent similar to the modified DQN that is presented in this paper should be used as a baseline for any future work aimed at improving sample efficiency of deep reinforcement learning.
Learning to Ask Medical Questions using Reinforcement Learning
Shaham, Uri, Zahavy, Tom, Caraballo, Cesar, Mahajan, Shiwani, Massey, Daisy, Krumholz, Harlan
Feature selection is an important topic in traditional machine learning [Li et al., 2018], which motivated a large number of widely adopted works, e.g., Lasso [Tibshirani, 1996]. In various cases, the process of obtaining input measurements requires considerable effort (e.g., time, money, technology). For example, in medical datasets input features may correspond to lab tests, medical imaging results, or even answers to questionnaires, which are expensive and slow to produce. Allowing oneself to to be able to accurately predict a response variable from a small set of input features is thus a desirable goal, which can be manifested in saving time, money, and sometimes even human lives. As a running example, consider the case of a patient complaining to a family doctor about not feeling well. The doctor then asks the patients several questions about his current condition and medical background, and may also ask the patient to do some lab tests. Implicitly, the doctor is aiming at quickly collecting relevant details on the patient that will allow her to have a clear understanding of the patients' medical status, and consequently decide on an appropriate action (e.g., medication prescription, admit for hospitalization etc.).
Mimicking Evolution with Reinforcement Learning
Abrantes, João P., Abrantes, Arnaldo J., Oliehoek, Frans A.
Evolution gave rise to human and animal intelligence here on Earth. We argue that the path to developing artificial human-like-intelligence will pass through mimicking the evolutionary process in a nature-like simulation. In Nature, there are two processes driving the development of the brain: evolution and learning. Evolution acts slowly, across generations, and amongst other things, it defines what agents learn by changing their internal reward function. Learning acts fast, across one's lifetime, and it quickly updates agents' policy to maximise pleasure and minimise pain. The reward function is slowly aligned with the fitness function by evolution, however, as agents evolve the environment and its fitness function also change, increasing the misalignment between reward and fitness. It is extremely computationally expensive to replicate these two processes in simulation. This work proposes Evolution via Evolutionary Reward (EvER) that allows learning to single-handedly drive the search for policies with increasingly evolutionary fitness by ensuring the alignment of the reward function with the fitness function. In this search, EvER makes use of the whole state-action trajectories that agents go through their lifetime. In contrast, current evolutionary algorithms discard this information and consequently limit their potential efficiency at tackling sequential decision problems. We test our algorithm in two simple bio-inspired environments and show its superiority at generating more capable agents at surviving and reproducing their genes when compared with a state-of-the-art evolutionary algorithm.
Optimising Lockdown Policies for Epidemic Control using Reinforcement Learning
Khadilkar, Harshad, Ganu, Tanuja, Seetharam, Deva P
In the context of the ongoing Covid-19 pandemic, several reports and studies have attempted to model and predict the spread of the disease. There is also intense debate about policies for limiting the damage, both to health and to the economy. On the one hand, the health and safety of the population is the principal consideration for most countries. On the other hand, we cannot ignore the potential for long-term economic damage caused by strict nation-wide lockdowns. In this working paper, we present a quantitative way to compute lockdown decisions for individual cities or regions, while balancing health and economic considerations. Furthermore, these policies are \textit{learnt} automatically by the proposed algorithm, as a function of disease parameters (infectiousness, gestation period, duration of symptoms, probability of death) and population characteristics (density, movement propensity). We account for realistic considerations such as imperfect lockdowns, and show that the policy obtained using reinforcement learning is a viable quantitative approach towards lockdowns.
Enhanced Rolling Horizon Evolution Algorithm with Opponent Model Learning: Results for the Fighting Game AI Competition
Tang, Zhentao, Zhu, Yuanheng, Zhao, Dongbin, Lucas, Simon M.
The Fighting Game AI Competition (FTGAIC) provides a challenging benchmark for 2-player video game AI. The challenge arises from the large action space, diverse styles of characters and abilities, and the real-time nature of the game. In this paper, we propose a novel algorithm that combines Rolling Horizon Evolution Algorithm (RHEA) with opponent model learning. The approach is readily applicable to any 2-player video game. In contrast to conventional RHEA, an opponent model is proposed and is optimized by supervised learning with cross-entropy and reinforcement learning with policy gradient and Q-learning respectively, based on history observations from opponent. The model is learned during the live gameplay. With the learned opponent model, the extended RHEA is able to make more realistic plans based on what the opponent is likely to do. This tends to lead to better results. We compared our approach directly with the bots from the FTGAIC 2018 competition, and found our method to significantly outperform all of them, for all three character. Furthermore, our proposed bot with the policy-gradient-based opponent model is the only one without using Monte-Carlo Tree Search (MCTS) among top five bots in the 2019 competition in which it achieved second place, while using much less domain knowledge than the winner.
Key Concepts of Modern Reinforcement Learning
As the Agent interacts with the Environment, it learns a policy. A policy is a "learned strategy" that governs the agents' behaviour in selecting an action at a particular time t of the Environment. A policy can be seen as a mapping from states of an Environment to the actions taken in those states. The goal of the reinforcement Agent is to maximize its long-term rewards as it interacts with the Environment in the feedback configuration. The response the Agent gets from each state-action cycle (where an Agent selects an action from a set of actions at each state of the Environment) is called the reward function. The reward function (or simply rewards) is a signal of the desirability of that state based on the action made by the Agent.
Agent57: Outperforming the Atari Human Benchmark
Badia, Adrià Puigdomènech, Piot, Bilal, Kapturowski, Steven, Sprechmann, Pablo, Vitvitskyi, Alex, Guo, Daniel, Blundell, Charles
Atari games have been a long-standing benchmark in the reinforcement learning (RL) community for the past decade. This benchmark was proposed to test general competency of RL algorithms. Previous work has achieved good average performance by doing outstandingly well on many games of the set, but very poorly in several of the most challenging games. We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games. To achieve this result, we train a neural network which parameterizes a family of policies ranging from very exploratory to purely exploitative. We propose an adaptive mechanism to choose which policy to prioritize throughout the training process. Additionally, we utilize a novel parameterization of the architecture that allows for more consistent and stable learning.
Model-Reference Reinforcement Learning Control of Autonomous Surface Vehicles with Uncertainties
Zhang, Qingrui, Pan, Wei, Reppa, Vasso
This paper presents a novel model-reference reinforcement learning control method for uncertain autonomous surface vehicles. The proposed control combines a conventional control method with deep reinforcement learning. With the conventional control, we can ensure the learning-based control law provides closed-loop stability for the overall system, and potentially increase the sample efficiency of the deep reinforcement learning. With the reinforcement learning, we can directly learn a control law to compensate for modeling uncertainties. In the proposed control, a nominal system is employed for the design of a baseline control law using a conventional control approach. The nominal system also defines the desired performance for uncertain autonomous vehicles to follow. In comparison with traditional deep reinforcement learning methods, our proposed learning-based control can provide stability guarantees and better sample efficiency. We demonstrate the performance of the new algorithm via extensive simulation results.