 Reinforcement Learning



Roberto G.E. Martín on LinkedIn: #AI #RL

#artificialintelligence

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems.


Efficient Object Detection in Large Images using Deep Reinforcement Learning

#artificialintelligence

Reinforcement Learning for Efficient Detection

Reinforcement Learning (RL) has recently been used to (1) replace classical detectors such as SSD and Faster-RCNN, (2) replace exhaustive box proposal techniques in two-stage detectors, and (3) find ROIs in very large images to run a detector on. Most of the methods proposed in these categories focus on learning sequential policies. Under category (1), [3, 29] proposed top-down sequential object detection models trained with the Q-learning algorithm. Most of the RL methods associated with object detection fall into category (2). For example, [16] recursively divides up an image in a top-down approach where the divisions are decided by the RL agent. The box proposals returned by the agent are then passed through Fast-RCNN.


How Should an Agent Practice?

arXiv.org Artificial Intelligence

We present a method for learning intrinsic reward functions to drive the learning of an agent during periods of practice in which extrinsic task rewards are not available. During practice, the environment may differ from the one available for training and evaluation with extrinsic rewards. We refer to this setup of alternating periods of practice and objective evaluation as practice-match, drawing an analogy to regimes of skill acquisition common for humans in sports and games. The agent must effectively use periods in the practice environment so that performance improves during matches. In the proposed method the intrinsic practice reward is learned through a meta-gradient approach that adapts the practice reward parameters to reduce the extrinsic match reward loss computed from matches. We illustrate the method on a simple grid world, and evaluate it in two games in which the practice environment differs from match: Pong with practice against a wall without an opponent, and PacMan with practice in a maze without ghosts. The results show gains from learning in practice in addition to match periods over learning in matches only.

Introduction

There are many applications of reinforcement learning (RL) in which the natural formulation of the reward function gives rise to difficult computational challenges, or in which the reward itself is unavailable for extended periods of time or is difficult to specify. These include settings with very sparse or delayed reward, multiple tasks or goals, reward uncertainty, and learning in the absence of reward or in advance of unknown future reward. A range of approaches address these challenges through reward design, providing intrinsic rewards to the agent that augment or replace the objective or extrinsic reward.
The aim is to provide useful and proximal learning signals that drive behavior and learning in a way that improves performance on the main objective of interest (Ng, Harada, and Russell 1999; Barto, Singh, and Chentanez 2004; Singh et al. 2010). The optimal rewards framework (Singh et al. 2010) provides a general meta-optimization formulation of intrinsic reward design, and has served as the basis for algorithms that discover good intrinsic rewards; we discuss this further in Related Work.


Pseudo Random Number Generation: a Reinforcement Learning approach

arXiv.org Artificial Intelligence

Pseudo-Random Number Generators (PRNGs) are algorithms designed to generate long sequences of statistically uncorrelated numbers, i.e. Pseudo-Random Numbers (PRNs). These numbers are widely employed in mid-level cryptography and in software applications. Test suites are used to evaluate the quality of a PRNG by checking statistical properties of the generated sequences. Machine learning techniques are often used to break these generators, for instance by approximating a certain generator or a certain sequence using a neural network. But what about using machine learning to generate PRNGs? This paper proposes a Reinforcement Learning (RL) approach to the task of generating PRNGs from scratch by learning a policy to solve an N-dimensional navigation problem. In this context, N is the length of the period of the generated sequence, and the policy is iteratively improved using the average value of an appropriate test suite run over that period. The aim of this work is to demonstrate the feasibility of the proposed approach, to compare it with classical methods, and to lay the foundation of a research path which combines RL and PRNGs.
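To make the idea of a test-suite score used as a reward more concrete, here is a minimal sketch. It assumes that the reward is the mean p-value of a list of statistical tests, and it uses only the frequency (monobit) test as a stand-in for a full suite. The function names `monobit_p_value` and `suite_reward` are hypothetical and are not taken from the paper.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test: p-value is high when 0s and 1s are balanced."""
    n = len(bits)
    # Map each bit to +1 / -1 and sum; a balanced sequence sums to about 0.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def suite_reward(bits, tests=(monobit_p_value,)):
    """Hypothetical reward: average p-value over the supplied tests."""
    return sum(test(bits) for test in tests) / len(tests)


if __name__ == "__main__":
    balanced = [0, 1] * 50   # equal number of 0s and 1s
    constant = [1] * 100     # heavily biased sequence
    print(suite_reward(balanced), suite_reward(constant))
```

A sequence that passes the test receives a reward close to 1, while a biased sequence receives a reward close to 0. A real suite would average many more tests than this single one.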


Natural Actor-Critic Converges Globally for Hierarchical Linear Quadratic Regulator

arXiv.org Machine Learning

Multi-agent reinforcement learning has been successfully applied to a number of challenging problems. Despite these empirical successes, theoretical understanding of different algorithms is lacking, primarily due to the curse of dimensionality caused by the exponential growth of the state-action space with the number of agents. We study a fundamental multi-agent linear quadratic regulator problem in a setting where the agents are partially exchangeable. In this setting, we develop a hierarchical actor-critic algorithm, whose computational complexity is independent of the total number of agents, and prove its global linear convergence to the optimal policy. As linear quadratic regulators are often used to approximate general dynamic systems, this paper provides an important step towards better understanding of general hierarchical mean-field multi-agent reinforcement learning.
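For readers unfamiliar with the linear quadratic regulator, the sketch below shows what "optimal policy" means in the simplest case. It covers only a single-agent, scalar system x' = a·x + b·u with stage cost q·x² + r·u², solved by iterating the Riccati equation. It is not the hierarchical actor-critic algorithm from the paper, and all names are illustrative assumptions.

```python
def solve_scalar_lqr(a, b, q, r, iterations=1000):
    """Iterate the scalar discrete-time Riccati equation to a fixed point.

    Returns (p, k): the optimal cost is p * x0**2 and the optimal
    policy is the linear feedback u = -k * x.
    """
    p = q
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = (a * b * p) / (r + b * b * p)
    return p, k


def policy_cost(k, a, b, q, r):
    """Infinite-horizon cost of the policy u = -k * x starting from x0 = 1."""
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        return float("inf")  # the closed loop is unstable, so the cost diverges
    return (q + r * k * k) / (1.0 - closed_loop ** 2)


if __name__ == "__main__":
    p, k = solve_scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0)
    print(p, k, policy_cost(k, 1.0, 1.0, 1.0, 1.0))
```

Because the cost of any linear policy has a closed form here, one can check directly that the gain returned by the Riccati iteration gives a lower cost than nearby gains. Policy-gradient and actor-critic methods for LQR search over the same gain k without using the Riccati equation.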


Adapting Behaviour for Learning Progress

arXiv.org Artificial Intelligence

Abstract

Determining what experience to generate to best facilitate learning (i.e. [...] The advent of distributed agents that interact with parallel instances of the environment has enabled larger scales and greater flexibility, but has not removed the need to tune exploration to the task, because the ideal data for the learning algorithm necessarily depends on its process of learning. We propose to dynamically adapt the data generation by using a non-stationary multi-armed bandit to optimize a proxy of the learning progress. The data distribution is controlled by modulating multiple parameters of the policy (such as stochasticity, consistency or optimism) without significant overhead. The adaptation speed of the bandit can be increased by exploiting the factored modulation structure. We demonstrate on a suite of Atari 2600 games how this unified approach produces results comparable to per-task tuning at a fraction of the cost.

1 Introduction

Reinforcement learning (RL) is a general formalism modelling sequential decision making, which supports making minimal assumptions about the task at hand and reducing the need for prior knowledge. By learning behaviour from scratch, RL agents have the potential to surpass human expertise or tackle complex domains where human intuition is not applicable. In practice, however, generality is often traded for performance and efficiency, with RL practitioners tuning algorithms, architectures and hyper-parameters to the task at hand (Hessel et al., 2019). A side-effect is that the resulting methods can be brittle, or difficult to reliably reproduce (Nagarajan et al., 2018). Exploration is one of the main aspects commonly designed or tuned specifically for the task being solved.
Previous work has shown that large sample-efficiency gains are possible, for example, when the exploratory behaviour's level of stochasticity is adjusted to the environment's hazard rate (García & Fernández, 2015), or when an appropriate prior is used in large action spaces (Dulac-Arnold et al., 2015; Czarnecki et al., 2018; Vinyals et al., 2019). Exploration in the presence of function approximation should ideally be agent-centred. It ought to focus more on generating data that supports the agent's learning at its current parameters θ, rather than making progress on objective measurements of information gathering.
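The idea of letting a non-stationary bandit pick the behaviour modulation can be sketched as follows. This is a generic stand-in, not the bandit used in the paper. It assumes that each arm is a candidate exploration rate, that a scalar learning-progress proxy is reported after each episode, and that a constant step size is enough to track drift. The class name `NonStationaryBandit` and its parameters are hypothetical.

```python
import random


class NonStationaryBandit:
    """Epsilon-greedy bandit with constant-step-size value tracking.

    The constant step size makes old observations decay exponentially,
    so the estimates can follow a best arm that changes over time.
    """

    def __init__(self, arms, step_size=0.1, explore_prob=0.1, seed=0):
        self.arms = list(arms)                 # e.g. candidate exploration rates
        self.values = [0.0] * len(self.arms)   # running estimate per arm
        self.step_size = step_size
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)

    def select(self):
        """Return the index of the arm to use for the next episode."""
        if self.rng.random() < self.explore_prob:
            return self.rng.randrange(len(self.arms))
        best = max(self.values)
        ties = [i for i, v in enumerate(self.values) if v == best]
        return self.rng.choice(ties)

    def update(self, index, progress):
        """Move the estimate for one arm towards the observed progress proxy."""
        self.values[index] += self.step_size * (progress - self.values[index])


if __name__ == "__main__":
    bandit = NonStationaryBandit(arms=[0.01, 0.1, 0.5])
    arm = bandit.select()
    # The progress signal would come from the learner; 1.0 is a placeholder.
    bandit.update(arm, progress=1.0)
    print(bandit.arms[arm], bandit.values)
```

In an actor loop, `select` would be called at the start of each episode and `update` at its end, so that the modulation used for data generation follows whichever setting currently yields the most progress.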


Spatial Influence-aware Reinforcement Learning for Intelligent Transportation System

arXiv.org Artificial Intelligence

Intelligent transportation systems (ITSs) are envisioned to be crucial for smart cities, which aim to improve traffic flow to raise the quality of life of urban residents and to reduce congestion to make commuting more efficient. However, several challenges need to be resolved before such systems can be deployed. For example, conventional solutions for Markov decision processes (MDPs) and single-agent Reinforcement Learning (RL) algorithms suffer from poor scalability, and multi-agent systems suffer from poor communication and coordination. In this paper, we explore the potential of mutual information sharing, or in other words, spatial influence based communication, to optimize traffic light control policy. First, we mathematically analyze the transportation system. We conclude that the transportation system does not have a stationary Nash equilibrium, so reinforcement learning algorithms offer suitable solutions. Second, we describe how to build a multi-agent Deep Deterministic Policy Gradient (DDPG) system with spatial influence and social group utility incorporated. Then we use a grid-topology road network to empirically demonstrate the scalability of the new system. We demonstrate three types of directed communication to show the effect of the direction of social influence on the entire network utility and individual utility. Lastly, we define a "selfish index" and analyze its effect on total group utility.


Resolving Congestions in the Air Traffic Management Domain via Multiagent Reinforcement Learning Methods

arXiv.org Artificial Intelligence

In this article, we report on the efficiency and effectiveness of multiagent reinforcement learning (MARL) methods for the computation of flight delays to resolve congestion problems in the Air Traffic Management (ATM) domain. Specifically, we aim to resolve cases where demand for airspace use exceeds capacity (demand-capacity problems) by imposing ground delays on flights at the pre-tactical stage of operations (i.e. a few days to a few hours before operation). Casting this into the multiagent domain, agents, representing flights, need to decide on their own delays w.r.t. their own preferences, having no information about others' payoffs, preferences and constraints, while they plan to execute their trajectories jointly with others, adhering to operational constraints. Specifically, we formalize the problem as a multiagent Markov Decision Process (MA-MDP) and we show that it can be considered as a Markov game in which interacting agents need to reach an equilibrium. What makes the problem more interesting is the dynamic setting in which agents operate, which is also due to the unforeseen, emergent effects of their decisions in the whole system. We propose collaborative multiagent reinforcement learning methods to resolve demand-capacity imbalances. An extensive experimental study on real-world cases shows the potential of the proposed approaches in resolving problems, while advanced visualizations provide detailed views towards understanding the quality of solutions provided.


The AI Overview - 30 Influential Presentations in 2019

#artificialintelligence

It feels as though 2019 has gone by in a flash. That said, it has been a year in which we have seen great advancement in AI application methods and technical discovery, paving the way for future development. We are incredibly grateful to have had the leading minds in AI & Deep Learning present their latest work at our summits in San Francisco, Boston, Montreal and more, so we thought we would share thirty of our highlight videos with you, as we think everybody needs to see them! We were delighted to be joined by Dawn at the Deep Reinforcement Learning Summit in June of 2019, presenting the latest industry research on Secure Deep Reinforcement Learning, covering the lessons learnt in the lead-up to her presentation, the current challenges faced for advancement, and the future direction her research is set to take. You can see Dawn's full presentation from June here. Reinforcement Learning is somewhat of a hotbed for research. This year alone we have seen several presentations that have broken down the ins and outs of RL. That said, Doina's talk just last month gave us some new angles on the latest algorithmic development.