Undirected Networks
On Overfitting and Asymptotic Bias in Batch Reinforcement Learning with Partial Observability
Francois-Lavet, Vincent, Rabusseau, Guillaume, Pineau, Joelle, Ernst, Damien, Fonteneau, Raphael
This paper provides an analysis of the tradeoff between asymptotic bias (suboptimality with unlimited data) and overfitting (additional suboptimality due to limited data) in the context of reinforcement learning with partial observability.ย Our theoretical analysis formally characterizes that while potentially increasing the asymptotic bias, a smaller state representation decreases the risk of overfitting.ย This analysis relies on expressing the quality of a state representation by bounding $L_1$ error terms of the associated belief states.ย ย Theoretical results are empirically illustrated when the state representation is a truncated history of observations, both on synthetic POMDPs and on a large-scale POMDP in the context of smartgrids, with real-world data.ย Finally, similarly to known results in the fully observable setting, we also briefly discuss and empirically illustrate how using function approximators and adapting the discount factor may enhance the tradeoff between asymptotic bias and overfitting in the partially observable context.
Model-Free Reinforcement Learning for Financial Portfolios: A Brief Survey
Financial portfolio management is one of the problems that are most frequently encountered in the investment industry. Nevertheless, it is not widely recognized that both Kelly Criterion and Risk Parity collapse into Mean Variance under some conditions, which implies that a universal solution to the portfolio optimization problem could potentially exist. In fact, the process of sequential computation of optimal component weights that maximize the portfolio's expected return subject to a certain risk budget can be reformulated as a discrete-time Markov Decision Process (MDP) and hence as a stochastic optimal control, where the system being controlled is a portfolio consisting of multiple investment components, and the control is its component weights. Consequently, the problem could be solved using model-free Reinforcement Learning (RL) without knowing specific component dynamics. By examining existing methods of both value-based and policy-based model-free RL for the portfolio optimization problem, we identify some of the key unresolved questions and difficulties facing today's portfolio managers of applying model-free RL to their investment portfolios.
Behavior Planning of Autonomous Cars with Social Perception
Sun, Liting, Zhan, Wei, Chan, Ching-Yao, Tomizuka, Masayoshi
Autonomous cars have to navigate in dynamic environment which can be full of uncertainties. The uncertainties can come either from sensor limitations such as occlusions and limited sensor range, or from probabilistic prediction of other road participants, or from unknown social behavior in a new area. To safely and efficiently drive in the presence of these uncertainties, the decision-making and planning modules of autonomous cars should intelligently utilize all available information and appropriately tackle the uncertainties so that proper driving strategies can be generated. In this paper, we propose a social perception scheme which treats all road participants as distributed sensors in a sensor network. By observing the individual behaviors as well as the group behaviors, uncertainties of the three types can be updated uniformly in a belief space. The updated beliefs from the social perception are then explicitly incorporated into a probabilistic planning framework based on Model Predictive Control (MPC). The cost function of the MPC is learned via inverse reinforcement learning (IRL). Such an integrated probabilistic planning module with socially enhanced perception enables the autonomous vehicles to generate behaviors which are defensive but not overly conservative, and socially compatible. The effectiveness of the proposed framework is verified in simulation on an representative scenario with sensor occlusions.
A tutorial on recursive models for analyzing and predicting path choice behavior
Zimmermann, Maรซlle, Frejinger, Emma
The problem at the heart of this tutorial consists in modeling the path choice behavior of network users. This problem has extensively been studied in transportation science and econometrics, where it is known as the route choice problem. In this literature, individuals' choice of paths are typically predicted from discrete choice models. The aim of this tutorial is to present this problem from the novel and more general perspective of inverse optimization, in order to describe the modeling approaches proposed in related research areas and thereby motivate the use of so-called recursive models. The latter have the advantage of predicting path choices without generating choice sets. In this paper, we contextualize discrete choice models as a probabilistic approach to an inverse shortest path problem with noisy data, highlighting that recursive discrete choice models in particular originate from viewing the inner shortest path problem as a parametric Markov Decision Process. We also illustrate through simple numerical examples that recursive models overcome issues associated with the path-based discrete choice models commonly found in the transportation literature.
Restricted Boltzmann Machine Assignment Algorithm: Application to solve many-to-one matching problems on weighted bipartite graph
In this work an iterative algorithm based on unsupervised learning is presented, specifically on a Restricted Boltzmann Machine (RBM) to solve a perfect matching problem on a bipartite weighted graph. Iteratively is calculated the weights $w_{ij}$ and the bias parameters $\theta = ( a_i, b_j) $ that maximize the energy function and assignment element $i$ to element $j$. An application of real problem is presented to show the potentiality of this algorithm.
Learning higher-order sequential structure with cloned HMMs
Dedieu, Antoine, Gothoskar, Nishad, Swingle, Scott, Lehrach, Wolfgang, Lรกzaro-Gredilla, Miguel, George, Dileep
Sequence modeling is a fundamental real-world problem with a wide range of applications. Recurrent neural networks (RNNs) are currently preferred in sequence prediction tasks due to their ability to model long-term and variable order dependencies. However, RNNs have disadvantages in several applications because of their inability to natively handle uncertainty, and because of the inscrutable internal representations. Probabilistic sequence models like Hidden Markov Models (HMM) have the advantage of more interpretable representations and the ability to handle uncertainty. Although overcomplete HMMs with many more hidden states compared to the observed states can, in theory, model long-term temporal dependencies [23], training HMMs is challenging due to credit diffusion [3]. For this reason, simpler and inflexible n-gram models are preferred to HMMs for tasks like language modeling. Tensor decomposition methods [1] have been suggested for the learning of HMMs in order to overcome the credit diffusion problem, but current methods are not applicable to the overcomplete setting where the full-rank requirements on the transition and emission matrices are not fulfilled. Recently there has been renewed interest in the topic of training overcomplete HMMs for higher-order dependencies with the expectation that sparsity structures could potentially alleviate the credit diffusion problem [23]. In this paper we demonstrate that a particular sparsity structure on the emission matrix can help HMMs learn higher-order temporal structure using the standard Expectation-Maximization algorithms [26] (Baum-Welch) and its online variants.
Efficient Model-free Reinforcement Learning in Metric Spaces
Model-free Reinforcement Learning (RL) algorithms such as Q-learning [Watkins, Dayan 92] have been widely used in practice and can achieve human level performance in applications such as video games [Mnih et al. 15]. Recently, equipped with the idea of optimism in the face of uncertainty, Q-learning algorithms [Jin, Allen-Zhu, Bubeck, Jordan 18] can be proven to be sample efficient for discrete tabular Markov Decision Processes (MDPs) which have finite number of states and actions. In this work, we present an efficient model-free Q-learning based algorithm in MDPs with a natural metric on the state-action space--hence extending efficient model-free Q-learning algorithms to continuous state-action space. Compared to previous model-based RL algorithms for metric spaces [Kakade, Kearns, Langford 03], our algorithm does not require access to a black-box planning oracle.
Generative Adversarial Imagination for Sample Efficient Deep Reinforcement Learning
Reinforcement learning has seen great advancements in the past five years. The successful introduction of deep learning in place of more traditional methods allowed reinforcement learning to scale to very complex domains achieving super-human performance in environments like the game of Go or numerous video games. Despite great successes in multiple domains, these new methods suffer from their own issues that make them often inapplicable to the real world problems. Extreme lack of data efficiency, together with huge variance and difficulty in enforcing safety constraints, is one of the three most prominent issues in the field. Usually, millions of data points sampled from the environment are necessary for these algorithms to converge to acceptable policies. This thesis proposes novel Generative Adversarial Imaginative Reinforcement Learning algorithm. It takes advantage of the recent introduction of highly effective generative adversarial models, and Markov property that underpins reinforcement learning setting, to model dynamics of the real environment within the internal imagination module. Rollouts from the imagination are then used to artificially simulate the real environment in a standard reinforcement learning process to avoid, often expensive and dangerous, trial and error in the real environment. Experimental results show that the proposed algorithm more economically utilises experience from the real environment than the current state-of-the-art Rainbow DQN algorithm, and thus makes an important step towards sample efficient deep reinforcement learning.
Towards Sampling from Nondirected Probabilistic Graphical models using a D-Wave Quantum Annealer
Koshka, Yaroslav, Novotny, M. A.
A D-Wave quantum annealer (QA) having a 2048 qubit lattice, with no missing qubits and couplings, allowed embedding of a complete graph of a Restricted Boltzmann Machine (RBM). A handwritten digit OptDigits data set having 8x7 pixels of visible units was used to train the RBM using a classical Contrastive Divergence. Embedding of the classically-trained RBM into the D-Wave lattice was used to demonstrate that the QA offers a high-efficiency alternative to the classical Markov Chain Monte Carlo (MCMC) for reconstructing missing labels of the test images as well as a generative model. At any training iteration, the D-Wave-based classification had classification error more than two times lower than MCMC. The main goal of this study was to investigate the quality of the sample from the RBM model distribution and its comparison to a classical MCMC sample. For the OptDigits dataset, the states in the D-Wave sample belonged to about two times more local valleys compared to the MCMC sample. All the lowest-energy (the highest joint probability) local minima in the MCMC sample were also found by the D-Wave. The D-Wave missed many of the higher-energy local valleys, while finding many "new" local valleys consistently missed by the MCMC. It was established that the "new" local valleys that the D-Wave finds are important for the model distribution in terms of the energy of the corresponding local minima, the width of the local valleys, and the height of the escape barrier.
Challenges of Real-World Reinforcement Learning
Dulac-Arnold, Gabriel, Mankowitz, Daniel, Hester, Todd
Reinforcement learning (RL) has proven its worth in a series of artificial domains, and is beginning to show some successes in real-world scenarios. However, much of the research advances in RL are often hard to leverage in real-world systems due to a series of assumptions that are rarely satisfied in practice. We present a set of nine unique challenges that must be addressed to productionize RL to real world problems. For each of these challenges, we specify the exact meaning of the challenge, present some approaches from the literature, and specify some metrics for evaluating that challenge. An approach that addresses all nine challenges would be applicable to a large number of real world problems. We also present an example domain that has been modified to present these challenges as a testbed for practical RL research.