Reinforcement Learning
Prediction problems inspired by animal learning
Rafiee, Banafsheh, Ghiassian, Sina, Kumaraswamy, Raksha, Sutton, Richard, Ludvig, Elliot, White, Adam
We present three problems modeled after animal learning experiments designed to test online state construction or representation learning algorithms. Our test problems require the learning system to construct compact summaries of their past interaction with the world in order to predict the future, updating online and incrementally on each time step without an explicit training-testing split. The majority of recent work in Deep Reinforcement Learning focuses on either fully observable tasks, or games where stacking a handful of recent frames is sufficient for good performance. Current benchmarks used for evaluating memory and recurrent learning make use of 3D visual environments (e.g., DeepMind Lab) which require billions of training samples, complex agent architectures, and cloud-scale compute. These domains are thus not well suited for rapid prototyping, hyper-parameter study, or extensive replication study. In this paper, we contribute a set of test problems and benchmark results to fill this gap. Our test problems are designed to be the simplest instantiation and test of learning capabilities which animals readily exhibit, including (1) trace conditioning (remembering a cue in order to predict another far in the future), (2) patterning (a particular combination of cues predict another), (3) and combinations of both with additional non-relevant distracting signals. We provide baselines for each problem including heuristics from the early days of neural network learning and simple ideas inspired by computational models of animal learning. Our results highlight the difficulty of our test problems for online recurrent learning systems and how the agent's performance often exhibits substantial sensitivity to the choice of key problem and agent parameters.
Geometric Deep Reinforcement Learning for Dynamic DAG Scheduling
Grinsztajn, Nathan, Beaumont, Olivier, Jeannot, Emmanuel, Preux, Philippe
In practice, it is quite common to face combinatorial optimization problems which contain uncertainty along with non-determinism and dynamicity. These three properties call for appropriate algorithms; reinforcement learning (RL) is dealing with them in a very natural way. Today, despite some efforts, most real-life combinatorial optimization problems remain out of the reach of reinforcement learning algorithms. In this paper, we propose a reinforcement learning approach to solve a realistic scheduling problem, and apply it to an algorithm commonly executed in the high performance computing community, the Cholesky factorization. On the contrary to static scheduling, where tasks are assigned to processors in a predetermined ordering before the beginning of the parallel execution, our method is dynamic: task allocations and their execution ordering are decided at runtime, based on the system state and unexpected events, which allows much more flexibility. To do so, our algorithm uses graph neural networks in combination with an actor-critic algorithm (A2C) to build an adaptive representation of the problem on the fly. We show that this approach is competitive with state-of-the-art heuristics used in high-performance computing runtime systems. Moreover, our algorithm does not require an explicit model of the environment, but we demonstrate that extra knowledge can easily be incorporated and improves performance. We also exhibit key properties provided by this RL approach, and study its transfer abilities to other instances.
Reward Conditioned Neural Movement Primitives for Population Based Variational Policy Optimization
Akbulut, M. Tuluhan, Bozdogan, Utku, Tekden, Ahmet, Ugur, Emre
The aim of this paper is to study the reward based policy exploration problem in a supervised learning approach and enable robots to form complex movement trajectories in challenging reward settings and search spaces. For this, the experience of the robot, which can be bootstrapped from demonstrated trajectories, is used to train a novel Neural Processes-based deep network that samples from its latent space and generates the required trajectories given desired rewards. Our framework can generate progressively improved trajectories by sampling them from high reward landscapes, increasing the reward gradually. Variational inference is used to create a stochastic latent space to sample varying trajectories in generating population of trajectories given target rewards. We benefit from Evolutionary Strategies and propose a novel crossover operation, which is applied in the self-organized latent space of the individual policies, allowing blending of the individuals that might address different factors in the reward function. Using a number of tasks that require sequential reaching to multiple points or passing through gaps between objects, we showed that our method provides stable learning progress and significant sample efficiency compared to a number of state-of-the-art robotic reinforcement learning methods. Finally, we show the real-world suitability of our method through real robot execution involving obstacle avoidance.
Exploring market power using deep reinforcement learning for intelligent bidding strategies
Kell, Alexander J. M., Forshaw, Matthew, McGough, A. Stephen
Decentralized electricity markets are often dominated by a small set of generator companies who control the majority of the capacity. In this paper, we explore the effect of the total controlled electricity capacity by a single, or group, of generator companies can have on the average electricity price. We demonstrate this through the use of ElecSim, a simulation of a country-wide energy market. We develop a strategic agent, representing a generation company, which uses a deep deterministic policy gradient reinforcement learning algorithm to bid in a uniform pricing electricity market. A uniform pricing market is one where all players are paid the highest accepted price. ElecSim is parameterized to the United Kingdom for the year 2018. This work can help inform policy on how to best regulate a market to ensure that the price of electricity remains competitive. We find that capacity has an impact on the average electricity price in a single year. If any single generator company, or a collaborating group of generator companies, control more than ${\sim}$11$\%$ of generation capacity and bid strategically, prices begin to increase by ${\sim}$25$\%$. The value of ${\sim}$25\% and ${\sim}$11\% may vary between market structures and countries. For instance, different load profiles may favour a particular type of generator or a different distribution of generation capacity. Once the capacity controlled by a generator company, which bids strategically, is higher than ${\sim}$35\%, prices increase exponentially. We observe that the use of a market cap of approximately double the average market price has the effect of significantly decreasing this effect and maintaining a competitive market. A fair and competitive electricity market provides value to consumers and enables a more competitive economy through the utilisation of electricity by both industry and consumers.
Reliable Off-policy Evaluation for Reinforcement Learning
Wang, Jie, Gao, Rui, Zha, Hongyuan
Reinforcement learning (RL) has achieved phenomenal success in games and robotics [,, ] in the past decade, which also stimulates the enthusiasm of extending these techniques in other areas including healthcare [, ], education [ ], autonomous driving [ ], recommendation systems [, ], etc. One of the major challenges in applying RL to these real-world applications, especially those involve high-stake environments, is the problem of o -policy evaluation (OPE): how one can evaluate a new policy before deployment, using only historical data collected from a di erent policy, known as the behavior policy. Indeed, for many practical applications, one may not have a faithful simulator of the domain from which su cient amount of data can be exploited to train the RL system, and it may not always be feasible to try out a new policy without causing unintended harms. For example, consider the problem of finding the best treatment plan for a patient, or testing the performance of an automated driving system, or suggesting a personalized curriculum for a student. In those tasks, conducting experimentation involves interactions with real people, thus it can be costly to collect data and even worse, a bad policy can be risky or unethical and may result in severe consequences. Therefore, it is important for the RL system to have the ability to predict how well a new policy would perform without having to deploy it first. While most existing works on OPE aim to provide accurate point estimates for short-horizon problems [,,, ] as well as long-or infinite-horizon problems [,,,,, ], it is equally important to quantify the uncertainty of the OPE point estimates for both safe exploration and optimistic planning.
Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient
Hao, Botao, Duan, Yaqi, Lattimore, Tor, Szepesvári, Csaba, Wang, Mengdi
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation. When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient. We first consider the off-policy policy evaluation problem. To evaluate a new target policy, we analyze a Lasso fitted Q-evaluation method and establish a finite-sample error bound that has no polynomial dependence on the ambient dimension. To reduce the Lasso bias, we further propose a post model-selection estimator that applies fitted Q-evaluation to the features selected via group Lasso. Under an additional signal strength assumption, we derive a sharper instance-dependent error bound that depends on a divergence function measuring the distribution mismatch between the data distribution and occupancy measure of the target policy. Further, we study the Lasso fitted Q-iteration for batch policy optimization and establish a finite-sample error bound depending on the ratio between the number of relevant features and restricted minimal eigenvalue of the data's covariance. In the end, we complement the results with minimax lower bounds for batch-data policy evaluation/optimization that nearly match our upper bounds. The results suggest that having well-conditioned data is crucial for sparse batch policy learning.
Online Sparse Reinforcement Learning
Hao, Botao, Lattimore, Tor, Szepesvári, Csaba, Wang, Mengdi
Sparse models in classical statistics often yield the best of both worlds: high representation power is achieved by including many features while sparsity leads to efficient estimation. There is a growing interest in applying the tools developed by statisticians to sequential settings such as contextual bandits and reinforcement learning (RL). As we now explore, in online RL this leads to a number of delicate tradeoffs between assumptions and sample complexity. The use of sparsity in reinforcement learning (RL) has been explored before in the context of policy evaluation or policy optimization in the batch setting [Kolter and Ng, 2009, Geist and Scherrer, 2011, Hoffman et al., 2011, Painter-Wakefield and Parr, 2012, Ghavamzadeh et al., 2011]. As far as we know, there has been very little work on the role of sparsity in online RL. In batch RL, the dataset is given a priori and the focus is typically on evaluating a given target policy or learning a near-optimal policy. By contrast, the central question in online RL is how to sequentially interact with the environment to balance the tradeoff between exploration and exploitation, measured here by the cumulative regret. We ask the following question: Under what circumstances does sparsity help when minimizing regret in online RL?
Identifying Critical States by the Action-Based Variance of Expected Return
Karino, Izumi, Ohmura, Yoshiyuki, Kuniyoshi, Yasuo
The balance of exploration and exploitation plays a crucial role in accelerating reinforcement learning (RL). To deploy an RL agent in human society, its explainability is also essential. However, basic RL approaches have difficulties in deciding when to choose exploitation as well as in extracting useful points for a brief explanation of its operation. One reason for the difficulties is that these approaches treat all states the same way. Here, we show that identifying critical states and treating them specially is commonly beneficial to both problems. These critical states are the states at which the action selection changes the potential of success and failure substantially. We propose to identify the critical states using the variance in the Q-function for the actions and to perform exploitation with high probability on the identified states. These simple methods accelerate RL in a grid world with cliffs and two baseline tasks of deep RL. Our results also demonstrate that the identified critical states are intuitively interpretable regarding the crucial nature of the action selection. Furthermore, our analysis of the relationship between the timing of the identification of especially critical states and the rapid progress of learning suggests there are a few especially critical states that have important information for accelerating RL rapidly.
Multi-Agent Reinforcement Learning in Time-varying Networked Systems
Lin, Yiheng, Qu, Guannan, Huang, Longbo, Wierman, Adam
In comparison to single-agent reinforcement learning (RL), MARL poses many challenges, chief of which is scalability [49]. Even if each agent's local state/action spaces are small, the size of the global state/action space can be large, potentially exponentially large in the number of agents, which renders many RL algorithms such as -learning not applicable. A promising approach for addressing the scalability challenge that has received attention in recent years is to exploit application-specific structures, e.g., [16, 32, 35]. A particularly important example of such a structure is a networked structure, e.g., applications in multi-agent networked systems such as social networks [6, 24], communication networks [44, 52], queueing networks [31], and smart transportation networks [51]. In these networked systems, it is often possible to exploit static, local dependency structures [1, 14, 15, 29], e.g., the fact that agents only interact with a fixed set of neighboring agents throughout the game. This sort of dependency structure often leads to scalable, distributed algorithms for optimization and control [1, 14, 29], and has proven effective for designing scalable and distributed MARL algorithms, e.g.
On the role of planning in model-based deep reinforcement learning
Hamrick, Jessica B., Friesen, Abram L., Behbahani, Feryal, Guez, Arthur, Viola, Fabio, Witherspoon, Sims, Anthony, Thomas, Buesing, Lars, Veličković, Petar, Weber, Théophane
Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm, under a number of interventions and ablations and across a wide range of environments including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) The primary benefit of planning is in driving policy learning. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.