 Reinforcement Learning


Machine Learning and System Identification for Estimation in Physical Systems

arXiv.org Machine Learning

In this thesis, we draw inspiration from both classical system identification and modern machine learning in order to solve estimation problems for real-world, physical systems. The main approach to estimation and learning adopted is optimization-based. Concepts such as regularization will be utilized for encoding of prior knowledge, and basis-function expansions will be used to add nonlinear modeling power while keeping data requirements practical. The thesis covers a wide range of applications, many inspired by applications within robotics, but also extending outside this already wide field. Usage of the proposed methods and algorithms is in many cases illustrated in the real-world applications that motivated the research. Topics covered include dynamics modeling and estimation, model-based reinforcement learning, spectral estimation, friction modeling, and state estimation and calibration in robotic machining. In the work on modeling and identification of dynamics, we develop regularization strategies that allow us to incorporate prior domain knowledge into flexible, overparameterized models. We make use of classical control theory to gain insight into training and regularization while using flexible tools from modern deep learning. A particular focus of the work is to allow use of modern methods in scenarios where gathering data is associated with a high cost. In the robotics-inspired parts of the thesis, we develop methods that are practically motivated and ensure that they are implementable also outside the research setting. We demonstrate this by performing experiments in realistic settings and providing open-source implementations of all proposed methods and algorithms.


Measurement-based Online Available Bandwidth Estimation employing Reinforcement Learning

arXiv.org Machine Learning

Accurate and fast estimation of the available bandwidth in a network with varying cross-traffic is a challenging task. The accepted probing tools, based on the fluid-flow model of a bottleneck link with first-in, first-out multiplexing, estimate the available bandwidth by measuring packet dispersions. The estimation becomes more difficult if packet dispersions deviate from the assumptions of the fluid-flow model in the presence of non-fluid bursty cross-traffic, multiple bottleneck links, and inaccurate time-stamping. This motivates us to explore the use of machine learning tools for available bandwidth estimation. Hence, we consider reinforcement learning and implement the single-state multi-armed bandit technique, which follows the $\epsilon$-greedy algorithm to find the available bandwidth. Our measurements and tests reveal that our proposed method identifies the available bandwidth with high precision. Furthermore, our method converges to the available bandwidth under a variety of notoriously difficult conditions, such as heavy traffic burstiness, different cross-traffic intensities, multiple bottleneck links, and in networks where the tight link and the bottleneck link are not the same. Compared to a direct probing technique that is based on the piecewise-linear network model and employs a Kalman filter, our method shows more accurate estimates and faster convergence in certain network scenarios and does not require measurement noise statistics.
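As a rough illustration of the single-state, $\epsilon$-greedy multi-armed bandit idea mentioned in this abstract, the sketch below treats each candidate probing rate as an arm. The rate grid, the available bandwidth of 50 Mbps, and the reward function (the probing rate if the path is not congested, zero otherwise, plus Gaussian noise) are hypothetical stand-ins chosen for this example; the abstract does not specify the reward the authors use.

```python
import random

# Hypothetical candidate probing rates (the "arms") and a toy ground truth.
RATES_MBPS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
AVAILABLE_MBPS = 50  # assumed value, only used by the toy reward below


def toy_reward(arm, rng):
    """Hypothetical reward: the probing rate if it does not exceed the
    available bandwidth, zero if it does, plus measurement noise."""
    rate = RATES_MBPS[arm]
    congested = rate > AVAILABLE_MBPS
    return (0.0 if congested else float(rate)) + rng.gauss(0.0, 2.0)


def epsilon_greedy_bandit(reward_fn, n_arms, n_steps=2000, epsilon=0.1, seed=0):
    """Single-state epsilon-greedy bandit with sample-average value estimates."""
    rng = random.Random(seed)
    counts = [0] * n_arms    # number of pulls per arm
    values = [0.0] * n_arms  # running mean reward per arm
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts


if __name__ == "__main__":
    values, counts = epsilon_greedy_bandit(toy_reward, len(RATES_MBPS))
    best = max(range(len(values)), key=lambda a: values[a])
    print("estimated available bandwidth (Mbps):", RATES_MBPS[best])
```

In this toy setting the arm with the highest estimated value is read off as the bandwidth estimate; a real probing tool would derive the reward from measured packet dispersions instead.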


Escaping the State of Nature: A Hobbesian Approach to Cooperation in Multi-agent Reinforcement Learning

arXiv.org Artificial Intelligence

Cooperation is a phenomenon that has been widely studied across many different disciplines. In the field of computer science, the modularity and robustness of multi-agent systems offer significant practical advantages over individual machines. At the same time, agents using standard reinforcement learning algorithms often fail to achieve long-term, cooperative strategies in unstable environments when there are short-term incentives to defect. Political philosophy, on the other hand, studies the evolution of cooperation in humans who face similar incentives to act individualistically, but nevertheless succeed in forming societies. Thomas Hobbes in Leviathan provides the classic analysis of the transition from a pre-social State of Nature, where consistent defection results in a constant state of war, to stable political community through the institution of an absolute Sovereign. This thesis argues that Hobbes's natural and moral philosophy are strikingly applicable to artificially intelligent agents and aims to show that his political solutions are experimentally successful in producing cooperation among modified Q-Learning agents. Cooperative play is achieved in a novel Sequential Social Dilemma called the Civilization Game, which models the State of Nature by introducing the Hobbesian mechanisms of opponent learning awareness and majoritarian voting, leading to the establishment of a Sovereign.


Deep Q-Learning for Directed Acyclic Graph Generation

arXiv.org Machine Learning

We present a method to generate directed acyclic graphs (DAGs) using deep reinforcement learning, specifically deep Q-learning. Generating graphs with specified structures is an important and challenging task in various application fields; however, most current graph generation methods produce graphs with undirected edges. We demonstrate that this method is capable of generating DAGs with topology and node types satisfying specified criteria in highly sparse reward environments.


Interactive Teaching Algorithms for Inverse Reinforcement Learning

arXiv.org Artificial Intelligence

We study the problem of inverse reinforcement learning (IRL) with the added twist that the learner is assisted by a helpful teacher. More formally, we tackle the following algorithmic question: How could a teacher provide an informative sequence of demonstrations to an IRL learner to speed up the learning process? We present an interactive teaching framework where a teacher adaptively chooses the next demonstration based on the learner's current policy. In particular, we design teaching algorithms for two concrete settings: an omniscient setting where a teacher has full knowledge about the learner's dynamics and a blackbox setting where the teacher has minimal knowledge. Then, we study a sequential variant of the popular MCE-IRL learner and prove convergence guarantees of our teaching algorithm in the omniscient setting. Extensive experiments with a car driving simulator environment show that the learning progress can be sped up drastically compared to an uninformative teacher.


Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

This paper investigates the use of intrinsic reward to guide exploration in multi-agent reinforcement learning. We discuss the challenges in applying intrinsic reward to multiple collaborative agents and demonstrate how unreliable reward can prevent decentralized agents from learning the optimal policy. We address this problem with a novel framework, Independent Centrally-assisted Q-learning (ICQL), in which decentralized agents share control and an experience replay buffer with a centralized agent. Only the centralized agent is intrinsically rewarded, but the decentralized agents still benefit from improved exploration, without the distraction of unreliable incentives.


Probabilistic hypergraph grammars for efficient molecular optimization

arXiv.org Machine Learning

We present an approach to make molecular optimization more efficient. We infer a hypergraph replacement grammar from the ChEMBL database, count the frequencies of particular rules being used to expand particular nonterminals in other rules, and use these as conditional priors for the policy model. Simulating random molecules from the resulting probabilistic grammar, we show that conditional priors result in a molecular distribution closer to the training set than using equal rule probabilities or unconditional priors. We then treat molecular optimization as a reinforcement learning problem, using a novel modification of the policy gradient algorithm, batch-advantage, which uses individual rewards minus the batch average reward to weight the log probability loss. The reinforcement learning agent is tasked with building molecules using this grammar, with the goal of maximizing benchmark scores available from the literature. To do so, the agent has policies both to choose the next node in the graph to expand and to select the next grammar rule to apply. The policies are implemented using the Transformer architecture with the partially expanded graph as the input at each step. We show that using the empirical priors as the starting point for a policy eliminates the need for pre-training and allows us to reach optima faster. We achieve competitive performance on common benchmarks from the literature, such as penalized logP and QED, with only hundreds of training steps on a budget GPU instance.
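The batch-advantage weighting described in this abstract (individual reward minus the batch average reward, used to weight the log probability loss) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the loss is averaged over the batch (the abstract does not say whether it is summed or averaged), and plain Python floats stand in for the tensors an autodiff framework would use.

```python
def batch_advantage_loss(log_probs, rewards):
    """Policy-gradient surrogate loss with the batch mean reward as baseline.

    log_probs: log probability of each sampled trajectory under the policy.
    rewards:   scalar reward obtained by each trajectory.
    Returns (loss, advantages). In an autodiff framework the advantages
    would be treated as constants when differentiating the loss.
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("log_probs and rewards must be non-empty and equal length")
    baseline = sum(rewards) / len(rewards)       # batch average reward
    advantages = [r - baseline for r in rewards]  # individual minus average
    # Minimizing this loss raises the log probability of above-average samples
    # and lowers it for below-average ones.
    loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
    return loss, advantages


if __name__ == "__main__":
    loss, adv = batch_advantage_loss([-1.0, -2.0, -3.0], [1.0, 2.0, 3.0])
    print(loss, adv)
```

Subtracting the batch mean leaves the expected gradient unchanged while typically reducing its variance, which is the usual motivation for a baseline of this kind.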


Finding Friend and Foe in Multi-Agent Games

arXiv.org Machine Learning

AI for multi-agent games like Go, Poker, and Dota has seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on The Resistance: Avalon, the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.


Continuous Control for Automated Lane Change Behavior Based on Deep Deterministic Policy Gradient Algorithm

arXiv.org Machine Learning

Lane change is a challenging task which requires delicate actions to ensure safety and comfort. Some recent studies have attempted to solve the lane-change control problem with Reinforcement Learning (RL), yet the action is confined to a discrete action space. To overcome this limitation, we formulate the lane change behavior with continuous action in a model-free dynamic driving environment based on Deep Deterministic Policy Gradient (DDPG). The reward function, which is critical for learning the optimal policy, is defined by control values, position deviation status, and maneuvering time to provide the RL agent with informative signals. The RL agent is trained from scratch without resorting to any prior knowledge of the environment and vehicle dynamics since they are not easy to obtain. Seven models under different hyperparameter settings are compared. A video showing the learning progress of the driving behavior is available. It shows that the RL vehicle agent initially runs off the road frequently but eventually manages to change smoothly and stably to the target lane with a success rate of 100% under diverse driving situations in simulation.


Temporal-difference learning for nonlinear value function approximation in the lazy training regime

arXiv.org Machine Learning

In recent years, deep reinforcement learning has pushed the boundaries of Artificial Intelligence to an unprecedented level, achieving what was expected to be possible only in a decade and outperforming human intelligence in a number of highly complex tasks. Paramount examples of this potential have appeared over the past few years, with such algorithms mastering games and tasks of increasing complexity, from playing Atari to learning to walk and beating world grandmasters at the game of Go [16, 23, 24, 31-33]. Such impressive success would be impossible without using neural networks to approximate value functions and/or policy functions in reinforcement learning algorithms. While neural networks, in particular deep neural networks, provide a powerful and versatile tool to approximate high-dimensional functions [4, 12, 17], their intrinsic nonlinearity might also lead to trouble in training, in particular in the context of reinforcement learning. For example, it is well known that nonlinear approximation of the value function can cause classical temporal-difference learning to diverge due to instability [40].
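For orientation, the classical temporal-difference update with a parametric value function can be written as below. This is the standard textbook form of semi-gradient TD(0), given here as a reading aid; the symbols ($V_\theta$ for the approximate value function, $\alpha$ for the step size, $\gamma$ for the discount factor, $r_{t+1}$ for the reward) are notation assumed for this note and are not taken from the excerpt.

```latex
\begin{align}
  \delta_t     &= r_{t+1} + \gamma\, V_{\theta_t}(s_{t+1}) - V_{\theta_t}(s_t), \\
  \theta_{t+1} &= \theta_t + \alpha\, \delta_t\, \nabla_\theta V_{\theta_t}(s_t).
\end{align}
```

The update is not the gradient of a fixed objective, because the bootstrapped target $r_{t+1} + \gamma V_{\theta_t}(s_{t+1})$ is treated as a constant even though it depends on $\theta_t$. When $V_\theta$ is nonlinear in $\theta$, this is one source of the possible divergence the passage refers to.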