Autonomous vehicles must be comprehensively evaluated before deployed in cities and highways. Current evaluation procedures lack the abilities of weakness-aiming and evolving, thus they could hardly generate adversarial environments for autonomous vehicles, leading to insufficient challenges. To overcome the shortage of static evaluation methods, this paper proposes a novel method to generate adversarial environments with deep reinforcement learning, and to cluster them with a nonparametric Bayesian method. As a representative task of autonomous driving, lane-change is used to demonstrate the superiority of the proposed method. First, two lane-change models are separately developed by a rule-based method and a learning-based method, waiting for evaluation and comparison. Next, adversarial environments are generated by training surrounding interactive vehicles with deep reinforcement learning for local optimal ensembles. Then, a nonparametric Bayesian approach is utilized to cluster the adversarial policies of the interactive vehicles. Finally, the adversarial environment patterns are illustrated and the performances of two lane-change models are evaluated and compared. The simulation results indicate that both models perform significantly worse in adversarial environments than in naturalistic environments, with plenty of weaknesses successfully extracted in a few tests.
Machine learning is driving development across many fields in science and engineering. A simple and efficient programming language could accelerate applications of machine learning in various fields. Currently, the programming languages most commonly used to develop machine learning algorithms include Python, MATLAB, and C/C ++. However, none of these languages well balance both efficiency and simplicity. The Julia language is a fast, easy-to-use, and open-source programming language that was originally designed for high-performance computing, which can well balance the efficiency and simplicity. This paper summarizes the related research work and developments in the application of the Julia language in machine learning. It first surveys the popular machine learning algorithms that are developed in the Julia language. Then, it investigates applications of the machine learning algorithms implemented with the Julia language. Finally, it discusses the open issues and the potential future directions that arise in the use of the Julia language in machine learning.
Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.
Adversarial Machine Learning (AML) is emerging as a major field aimed at the protection of automated ML systems against security threats. The majority of work in this area has built upon a game-theoretic framework by modelling a conflict between an attacker and a defender. After reviewing game-theoretic approaches to AML, we discuss the benefits that a Bayesian Adversarial Risk Analysis perspective brings when defending ML based systems. A research agenda is included.
We propose a novel actor-critic, model-free reinforcement learning algorithm which employs a Bayesian method of parameter space exploration to solve environments. A Gaussian process is used to learn the expected return of a policy given the policy's parameters. The system is trained by updating the parameters using gradient descent on a new surrogate loss function consisting of the Proximal Policy Optimization 'Clipped' loss function and a bonus term representing the expected improvement acquisition function given by the Gaussian process. This new method is shown to be comparable to and at times empirically outperform current algorithms on environments that simulate robotic locomotion using the MuJoCo physics engine.
Reinforcement learning (RL) methods often rely on massive exploration data to search optimal policies, and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm by simultaneously using dual representations of environmental dynamics to search the optimal policy with the purpose of improving both learning accuracy and training speed. The dual representations indicate the environmental model and the state-action data: the former can accelerate the learning process of RL, while its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the framework design of the mixed RL, the compensation of the additive stochastic model uncertainty is embedded inside the policy iteration RL framework by using explored state-action data via iterative Bayesian estimator (IBE). The optimal policy is then computed in an iterative way by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of the mixed RL is proved using the Bellman's principle of optimality, and the recursive stability of the generated policy is proved via the Lyapunov's direct method. The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of stochastic non-affine nonlinear systems (i.e., double lane change task with an automated vehicle).
The central tenet of reinforcement learning (RL) is that agents seek to maximize the sum of cumulative rewards. In contrast, active inference, an emerging framework within cognitive and computational neuroscience, proposes that agents act to maximize the evidence for a biased generative model. Here, we illustrate how ideas from active inference can augment traditional RL approaches by (i) furnishing an inherent balance of exploration and exploitation, and (ii) providing a more flexible conceptualization of reward. Inspired by active inference, we develop and implement a novel objective for decision making, which we term the free energy of the expected future. We demonstrate that the resulting algorithm successfully balances exploration and exploitation, simultaneously achieving robust performance on several challenging RL benchmarks with sparse, well-shaped, and no rewards.
Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally intractable for complex control problems. We propose a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by first pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference. We evaluate our proposed approach on the task of learning to play Atari games from demonstrations, without access to the game score. For Atari games our approach enables us to generate 100,000 samples from the posterior over reward functions in only 5 minutes using a personal laptop. Furthermore, our proposed approach achieves comparable or better imitation learning performance than state-of-the-art methods that only find a point estimate of the reward function. Finally, we show that our approach enables efficient high-confidence policy performance bounds. We show that these high-confidence performance bounds can be used to rank the performance and risk of a variety of evaluation policies, despite not having samples of the reward function. We also show evidence that high-confidence performance bounds can be used to detect reward hacking in complex imitation learning problems.
Policy evaluation is a key process in Reinforcement Learning (RL). It assesses a given policy by estimating the corresponding value function. When using parameterized value functions, common approaches minimize the sum of squared Bellman temporal-difference errors and receive a point-estimate for the parameters. Kalman-based and Gaussian-processes based frameworks were suggested to evaluate the policy by treating the value as a random variable. These frameworks can learn uncertainties over the value parameters and exploit them for policy exploration. When adopting these frameworks to solve deep RL tasks, several limitations are revealed: excessive computations in each optimization step, difficulty with handling batches of samples which slows training and the effect of memory in stochastic environments which prevents off-policy learning. In this work, we discuss these limitations and propose to overcome them by an alternative general framework, based on the extended Kalman filter. We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA) that can be incorporated as a policy evaluation component in policy optimization algorithms. KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties. We analyze the properties of KOVA and present its performance on deep RL control tasks.
This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. Papers published at the Neural Information Processing Systems Conference.