Tram, Tommy, Batkovic, Ivo, Ali, Mohammad, Sjöberg, Jonas

Learning When to Drive in Intersections by Combining Reinforcement Learning and Model Predictive Control Tommy Tram 1, 2, 3, Ivo Batkovic 1, 2, 3, Mohammad Ali 1, and Jonas Sj oberg 2 Abstract -- In this paper, we propose a decision making algorithm intended for automated vehicles that negotiate with other possibly non-automated vehicles in intersections. The decision algorithm is separated into two parts: a high-level decision module based on reinforcement learning, and a low-level planning module based on model predictive control. Traffic is simulated with numerous predefined driver behaviors and intentions, and the performance of the proposed decision algorithm was evaluated against another controller . The results show that the proposed decision algorithm yields shorter training episodes and an increased performance in success rate compared to the other controller . Interactions between road users in intersections is a complex problem to solve, making it difficult to address using conventional rule based systems. Many advancements aim to solve this problem by trying to imitate human drivers [1] or predicting what other drivers in traffic are planning to do [2]. In [3], the authors show that by modeling the decision process as a partially observable Markov decision process, the model can account for uncertainty in sensing the environment and [4] showed some probabilistic guarantees when solving the problem using reinforcement learning (RL).

We consider Bayesian analysis of a class of multiple changepoint models. While there are a variety of efficient ways to analyse these models if the parameters associated with each segment are independent, there are few general approaches for models where the parameters are dependent. Under the assumption that the dependence is Markov, we propose an efficient online algorithm for sampling from an approximation to the posterior distribution of the number and position of the changepoints. In a simulation study, we show that the approximation introduced is negligible. We illustrate the power of our approach through fitting piecewise polynomial models to data, under a model which allows for either continuity or discontinuity of the underlying curve at each changepoint. This method is competitive with, or out-performs, other methods for inferring curves from noisy data; and uniquely it allows for inference of the locations of discontinuities in the underlying curve.

Morimura, Tetsuro, Osogami, Takayuki, Ide, Tsuyoshi

The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to figure out properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on every single link on a road network from a very limited number of observation points. We formulate this task as a regularized optimization problem for probability functions, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.

Fox, Emily B., Sudderth, Erik B., Jordan, Michael I., Willsky, Alan S.

We propose a Bayesian nonparametric approach to the problem of jointly modeling multiple related time series. Our approach is based on the discovery of a set of latent, shared dynamical behaviors. Using a beta process prior, the size of the set and the sharing pattern are both inferred from data. We develop efficient Markov chain Monte Carlo methods based on the Indian buffet process representation of the predictive distribution of the beta process, without relying on a truncated model. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth and death proposals. We examine the benefits of our proposed feature-based model on several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.

Cai, Chenghui, Liao, Xuejun, Carin, Lawrence

A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.