Goto

Collaborating Authors

 Reinforcement Learning


Tensor-based Cooperative Control for Large Scale Multi-intersection Traffic Signal Using Deep Reinforcement Learning and Imitation Learning

arXiv.org Machine Learning

Traffic signal control has long been considered as a critical topic in intelligent transportation systems. Most existing learning methods mainly focus on isolated intersections and suffer from inefficient training. This paper aims at the cooperative control for large scale multi-intersection traffic signal, in which a novel end-to-end learning based model is established and the efficient training method is proposed correspondingly. In the proposed model, the input traffic status in multi-intersections is represented by a tensor, which not only significantly reduces dimensionality than using a single matrix but also avoids information loss. For the output, a multidimensional boolean vector is employed for the control policy to indicate whether the signal state changes or not, which simplifies the representation and abides the practical phase changing rules. In the proposed model, a multi-task learning structure is used to get the cooperative policy by learning. Instead of only using the reinforcement learning to train the model, we employ imitation learning to integrate a rule based model with neural networks to do the pre-training, which provides a reliable and satisfactory stage solution and greatly accelerates the convergence. Afterwards, the reinforcement learning method is adopted to continue the fine training, where proximal policy optimization algorithm is incorporated to solve the policy collapse problem in multi-dimensional output situation. In numerical experiments, the advantages of the proposed model are demonstrated with comparison to the related state-of-the-art methods.


Learning from Observations Using a Single Video Demonstration and Human Feedback

arXiv.org Machine Learning

In this paper, we present a method for learning from video demonstrations by using human feedback to construct a mapping between the standard representation of the agent and the visual representation of the demonstration. In this way, we leverage the advantages of both these representations, i.e., we learn the policy using standard state representations, but are able to specify the expected behavior using video demonstration. We train an autonomous agent using a single video demonstration and use human feedback (using numerical similarity rating) to map the standard representation to the visual representation with a neural network. We show the effectiveness of our method by teaching a hopper agent in the MuJoCo to perform a backflip using a single video demonstration generated in MuJoCo as well as from a real-world YouTube video of a person performing a backflip. Additionally, we show that our method can transfer to new tasks, such as hopping, with very little human feedback.


Optimizing Design Verification using Machine Learning: Doing better than Random

arXiv.org Machine Learning

As integrated circuits have become progressively more complex, constrained random stimulus has become ubiquitous as a means of stimulating a designs functionality and ensuring it fully meets expectations. In theory, random stimulus allows all possible combinations to be exercised given enough time, but in practice with highly complex designs a purely random approach will have difficulty in exercising all possible combinations in a timely fashion. As a result it is often necessary to steer the Design Verification (DV) environment to generate hard to hit combinations. The resulting constrained-random approach is powerful but often relies on extensive human expertise to guide the DV environment in order to fully exercise the design. As designs become more complex, the guidance aspect becomes progressively more challenging and time consuming often resulting in design schedules in which the verification time to hit all possible design coverage points is the dominant schedule limitation. This paper describes an approach which leverages existing constrained-random DV environment tools but which further enhances them using supervised learning and reinforcement learning techniques. This approach provides better than random results in a highly automated fashion thereby ensuring DV objectives of full design coverage can be achieved on an accelerated timescale and with fewer resources. Two hardware verification examples are presented, one of a Cache Controller design and one using the open-source RISCV-Ariane design and Google's RISCV Random Instruction Generator. We demonstrate that a machine-learning based approach can perform significantly better on functional coverage and reaching complex hard-to-hit states than a random or constrained-random approach.


Accelerating the Computation of UCB and Related Indices for Reinforcement Learning

arXiv.org Artificial Intelligence

In this paper we derive an efficient method for computing the indices associated with an asymptotically optimal upper confidence bound algorithm (MDP-UCB) of Burnetas and Katehakis (1997) that only requires solving a system of two non-linear equations with two unknowns, irrespective of the cardinality of the state space of the Markovian decision process (MDP). In addition, we develop a similar acceleration for computing the indices for the MDP-Deterministic Minimum Empirical Divergence (MDP-DMED) algorithm developed in Cowan et al. (2019), based on ideas from Honda and Takemura (2011), that involves solving a single equation of one variable. We provide experimental results demonstrating the computational time savings and regret performance of these algorithms. In these comparison we also consider the Optimistic Linear Programming (OLP) algorithm (Tewari and Bartlett, 2008) and a method based on Posterior sampling (MDP-PS).


MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics

arXiv.org Machine Learning

Transfer reinforcement learning (RL) aims at improving learning efficiency of an agent by exploiting knowledge from other source agents trained on relevant tasks. However, it remains challenging to transfer knowledge between different environmental dynamics without having access to the source environments. In this work, we explore a new challenge in transfer RL, where only a set of source policies collected under unknown diverse dynamics is available for learning a target task efficiently. To address this problem, the proposed approach, MULTI-source POLicy AggRegation (MULTIPOLAR), comprises two key techniques. We learn to aggregate the actions provided by the source policies adaptively to maximize the target task performance. Meanwhile, we learn an auxiliary network that predicts residuals around the aggregated actions, which ensures the target policy's expressiveness even when some of the source policies perform poorly. We demonstrated the effectiveness of MULTIPOLAR through an extensive experimental evaluation across six simulated environments ranging from classic control problems to challenging robotics simulations, under both continuous and discrete action spaces.


How to Evaluate Machine Learning Approaches for Combinatorial Optimization: Application to the Travelling Salesman Problem

arXiv.org Artificial Intelligence

Combinatorial optimization is the field devoted to the study and practice of algorithms that solve NP-hard problems. As Machine Learning (ML) and deep learning have popularized, several research groups have started to use ML to solve combinatorial optimization problems, such as the well-known Travelling Salesman Problem (TSP). Based on deep (reinforcement) learning, new models and architecture for the TSP have been successively developed and have gained increasing performances. At the time of writing, state-of-the-art models provide solutions to TSP instances of 100 cities that are roughly 1.33% away from optimal solutions. However, despite these apparently positive results, the performances remain far from those that can be achieved using a specialized search procedure. In this paper, we address the limitations of ML approaches for solving the TSP and investigate two fundamental questions: (1) how can we measure the level of accuracy of the pure ML component of such methods; and (2) what is the impact of a search procedure plugged inside a ML model on the performances? To answer these questions, we propose a new metric, ratio of optimal decisions (ROD), based on a fair comparison with a parametrized oracle, mimicking a ML model with a controlled accuracy. All the experiments are carried out on four state-of-the-art ML approaches dedicated to solve the TSP. Finally, we made ROD open-source in order to ease future research in the field.


The Differentiable Cross-Entropy Method

arXiv.org Machine Learning

T HE D IFFERENTIABLEC ROSS-E NTROPYM ETHOD Brandon Amos 1 Denis Y arats 12 1 Facebook AI Research 2 New Y ork University A BSTRACT We study the Cross-Entropy Method (CEM) for the non-convex optimization of a continuous and parameterized objective function and introduce a differentiable variant (DCEM) that enables us to differentiate the output of CEM with respect to the objective function's parameters. In the machine learning setting this brings CEM inside of the end-to-end learning pipeline where this has otherwise been impossible. We show applications in a synthetic energy-based structured prediction task and in non-convex continuous control. In this paper we focus on the setting of optimizing an unconstrained, non-convex, and continuous objective function f ฮธ(x): R n ฮ˜ R as ห† x arg min x f ฮธ(x), where f is parameterized by ฮธ ฮ˜ and has inputs x R n . If it exists, some (sub-)derivative ฮธห† x is useful in the machine learning setting to make the output of the optimization procedure end-to-end learnable. For example, ฮธ could parameterize a predictive model that is generating potential outcomes conditional on x happening that you want to optimize over. End-to-end learning in these settings can be done by defining a loss function L on top of ห† x and taking gradient steps ฮธL . If f ฮธ were convex this gradient is easy to analyze and compute when it exists and is unique (Gould et al., 2016; Johnson et al., 2016; Amos et al., 2017; Amos & Kolter, 2017). Unfortunately analyzing and computing a "derivative" through the non-convex arg min here is not as easy and is challenging in theory and practice. No such derivative may exist in theory, it might not be unique, and even if it uniquely exists, the numerical solver being used to compute the solution may not find a global or even local optimum of f . One promising direction to sidestep these issues is to approximate the arg min operation with an explicit optimization procedure that is interpreted as just another compute graph and unrolled through.


Safe Reinforcement Learning on Autonomous Vehicles

arXiv.org Artificial Intelligence

-- There have been numerous advances in reinforcement learning, but the typically unconstrained exploration of the learning process prevents the adoption of these methods in many safety critical applications. Recent work in safe reinforcement learning uses idealized models to achieve their guarantees, but these models do not easily accommodate the stochasticity or high-dimensionality of real world systems. We investigate how prediction provides a general and intuitive framework to constraint exploration, and show how it can be used to safely learn intersection handling behaviors on an autonomous vehicle. I. INTRODUCTION With the increasing complexity of robotic systems, and the continued advances in machine learning, it can be tempting to apply reinforcement learning (RL) to challenging control problems. However the trial and error searches typical to RL methods are not appropriate to physical systems which act in the real world where failure cases result in real consequences. To mitigate the safety concerns associated with training an RL agent, there have been various efforts at designing learning processes with safe exploration. As noted by Garcia and Fernandez [1], these approaches can be broadly classified into approaches that modify the objective function and approaches that constrain the search space. Modifying the objective function mostly focuses on catastrophic rare events which do not necessarily have a large impact on the expected return over many trials. Proposed methods take into account the variance of return [2], the worst-outcome [3], [2], [4], and the probability of visiting error states [5].


Deep Coordination Graphs

arXiv.org Artificial Intelligence

This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible trade-off between representational capacity and generalization by factorizing the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks and parameter sharing improves generalization over the state-action space. We show that DCG can solve challenging predator-prey tasks that are vulnerable to the relative overgeneralization pathology and in which all other known value factorization approaches fail.


Counterfactual States for Atari Agents via Generative Deep Learning

arXiv.org Artificial Intelligence

Although deep reinforcement learning agents have produced impressive results in many domains, their decision making is difficult to explain to humans. To address this problem, past work has mainly focused on explaining why an action was chosen in a given state. A different type of explanation that is useful is a counterfactual, which deals with "what if?" scenarios. In this work, we introduce the concept of a counterfactual state to help humans gain a better understanding of what would need to change (minimally) in an Atari game image for the agent to choose a different action. We introduce a novel method to create counterfactual states from a generative deep learning architecture. In addition, we evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our user study results suggest that our generated counterfactual states are useful in helping non-expert participants gain a better understanding of an agent's decision making process.