AITopics

2110.1284

Country: North America > United States (0.46)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)

arXiv.org Artificial Intelligence, Oct-8-2021

Pick Your Battles: Interaction Graphs as Population-Level Objectives for Strategic Diversity

Garnelo, Marta, Czarnecki, Wojciech Marian, Liu, Siqi, Tirumala, Dhruva, Oh, Junhyuk, Gidel, Gauthier, van Hasselt, Hado, Balduzzi, David

Strategic diversity is often essential in games: in multi-player games, for example, evaluating a player against a diverse set of strategies will yield a more accurate estimate of its performance. Furthermore, in games with non-transitivities, diversity allows a player to cover several winning strategies. However, despite the significance of strategic diversity, training agents that exhibit diverse behaviour remains a challenge. In this paper we study how to construct diverse populations of agents by carefully structuring how individuals within a population interact. Our approach is based on interaction graphs, which control the flow of information between agents during training and can encourage agents to specialise on different strategies, leading to improved overall performance. We provide evidence for the importance of diversity in multi-agent training and analyse the effect of applying different interaction graphs on the training trajectories, diversity and performance of populations in a range of games. This is an extended version of the long abstract published at AAMAS.
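As a rough illustration of what an interaction graph might look like in code, the sketch below treats the graph as a weight matrix in which entry `graph[i][j]` says how much agent `i` cares about its payoff against agent `j`. The rock-paper-scissors game, the function names and the weighted-sum objective are all assumptions made for this example; the abstract does not specify the paper's actual formulation.

```python
# Hypothetical sketch: an interaction graph as a weight matrix over a population.
# Payoff for the row player in rock-paper-scissors: +1 win, -1 loss, 0 draw.
PAYOFF = [
    [0, -1, 1],   # rock     vs (rock, paper, scissors)
    [1, 0, -1],   # paper    vs (rock, paper, scissors)
    [-1, 1, 0],   # scissors vs (rock, paper, scissors)
]


def expected_payoff(p, q):
    """Expected payoff of mixed strategy p against mixed strategy q."""
    return sum(p[a] * PAYOFF[a][b] * q[b] for a in range(3) for b in range(3))


def population_objectives(strategies, graph):
    """Per-agent objective: payoffs against other agents, weighted by the graph.

    graph[i][j] is the (assumed) weight agent i places on beating agent j.
    """
    n = len(strategies)
    return [
        sum(graph[i][j] * expected_payoff(strategies[i], strategies[j])
            for j in range(n))
        for i in range(n)
    ]


# Three agents playing pure rock, paper and scissors respectively.
population = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Cycle graph: each agent is only scored against the agent it beats.
cycle = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]

# Fully connected graph: each agent is scored uniformly against everyone.
uniform = [[1 / 3] * 3 for _ in range(3)]

cycle_scores = population_objectives(population, cycle)
uniform_scores = population_objectives(population, uniform)
```

Under the cycle graph every specialist is rewarded for its own matchup, whereas the uniform graph gives each pure strategy the same averaged score, which is one way to see how the choice of graph can change what each agent is encouraged to learn.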

artificial intelligence, game theory, machine learning

2110.04041

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Game Theory (0.94)

arXiv.org Machine Learning, Sep-22-2021

Introducing Symmetries to Black Box Meta Reinforcement Learning

Kirsch, Louis, Flennerhag, Sebastian, van Hasselt, Hado, Friesen, Abram, Oh, Junhyuk, Chen, Yutian

Meta reinforcement learning (RL) attempts to discover new RL algorithms automatically from environment interaction. In so-called black-box approaches, the policy and the learning algorithm are jointly represented by a single neural network. These methods are very flexible, but they tend to underperform in terms of generalisation to new, unseen environments. In this paper, we explore the role of symmetries in meta-generalisation. We show that a recent successful meta RL approach that meta-learns an objective for backpropagation-based learning exhibits certain symmetries (specifically the reuse of the learning rule, and invariance to input and output permutations) that are not present in typical black-box meta RL systems. We hypothesise that these symmetries can play an important role in meta-generalisation. Building off recent work in black-box supervised meta learning, we develop a black-box meta RL system that exhibits these same symmetries. We show through careful experimentation that incorporating these symmetries can lead to algorithms with a greater ability to generalise to unseen action & observation spaces, tasks, and environments.

air transportation, deep learning, symmetry

2109.10781

Genre: Research Report (0.64)

Industry: Transportation > Air (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

arXiv.org Machine Learning, Sep-9-2021

Bootstrapped Meta-Learning

Flennerhag, Sebastian, Schroecker, Yannick, Zahavy, Tom, van Hasselt, Hado, Silver, David, Singh, Satinder

Meta-learning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging meta-optimisation problem that often exhibits ill-conditioning and myopic meta-objectives. We propose an algorithm that tackles these issues by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric. Focusing on meta-learning with gradients, we establish conditions that guarantee performance improvements and show that the improvement is related to the target distance. Thus, by controlling curvature, the distance measure can be used to ease meta-optimisation, for instance by reducing ill-conditioning. Further, the bootstrapping mechanism can extend the effective meta-learning horizon without requiring backpropagation through all updates. The algorithm is versatile and easy to implement. We achieve a new state of the art for model-free agents on the Atari ALE benchmark, improve upon MAML in few-shot learning, and demonstrate how our approach opens up new possibilities by meta-learning efficient exploration in a Q-learning agent.
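The bootstrap-then-match idea described above can be sketched on a toy problem. In this minimal example, which is not the paper's implementation, the meta-parameter is a scalar learning rate, the learner minimises f(x) = 0.5 x^2, the target is produced by unrolling a few extra learner steps and is then held fixed, and the distance is the squared Euclidean distance with its gradient worked out by hand. Every name and hyperparameter here is an assumption chosen for illustration.

```python
# Toy sketch of a bootstrapped meta-update (all names are hypothetical).

def learner_step(x, lr):
    """One gradient step on f(x) = 0.5 * x**2, whose gradient is x."""
    return x - lr * x


def bootstrapped_meta_step(x0, lr, extra_steps=3, meta_lr=0.1):
    """Update the learning rate by matching a bootstrapped target.

    Returns the new learning rate, the one-step iterate and the target.
    """
    # One learner step under the current meta-parameter (the learning rate).
    x1 = learner_step(x0, lr)

    # Bootstrap a target by unrolling further steps from x1.
    # The target is treated as a constant: no gradient flows through it.
    target = x1
    for _ in range(extra_steps):
        target = learner_step(target, lr)

    # Matching loss: 0.5 * (target - x1)**2.
    # Since x1 = x0 - lr * x0, we have d(x1)/d(lr) = -x0, hence
    # d(loss)/d(lr) = -(target - x1) * (-x0) = (target - x1) * x0.
    meta_grad = (target - x1) * x0

    return lr - meta_lr * meta_grad, x1, target


new_lr, x1, target = bootstrapped_meta_step(x0=1.0, lr=0.1)
```

In this toy setting the target lies closer to the minimum than the one-step iterate, so the meta-update raises the learning rate, and only a single learner step is differentiated even though the target looks several steps ahead.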

deep learning, neural network, update rule

2109.04504

Country:

Europe (0.14)
Asia (0.14)

Genre:

Research Report (0.81)
Workflow (0.68)

Industry: Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine Learning, Jul-12-2021

Learning Expected Emphatic Traces for Deep RL

Jiang, Ray, Zhang, Shangtong, Chelu, Veronica, White, Adam, van Hasselt, Hado

Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, this combination is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested the approach at scale on Atari 2600 video games, and observed that the new X-ETD($n$) agent improved over baseline agents, highlighting both the scalability and broad applicability of our approach.

computer game, neural network, x-etd

2107.05405

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.14)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games > Computer Games (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Machine Learning, Jun-21-2021

Emphatic Algorithms for Deep Reinforcement Learning

Jiang, Ray, Zahavy, Tom, Xu, Zhongwen, White, Adam, Hessel, Matteo, Blundell, Charles, van Hasselt, Hado

Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this is known as the "deadly triad". The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observed improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
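For readers unfamiliar with the linear case mentioned above, the following is a minimal sketch of linear ETD($\lambda$) as it is commonly stated, not of the deep variants the paper derives. It assumes a constant discount `gamma`, a constant `lam`, an interest of 1 in every state, and a hypothetical transition format `(x, rho, reward, x_next)` where `rho` is the importance-sampling ratio.

```python
# Sketch of linear ETD(lambda) with unit interest (transition format is assumed).

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def etd_lambda(transitions, n_features, alpha, gamma, lam):
    """Run linear ETD(lambda) over one trajectory and return the weights.

    transitions: list of (x, rho, reward, x_next), with x and x_next
    feature vectors and rho the importance-sampling ratio of the action.
    """
    theta = [0.0] * n_features   # value-function weights
    trace = [0.0] * n_features   # eligibility trace
    followon = 0.0               # follow-on trace F
    prev_rho = 0.0               # no history before the first step, so F_0 = 1

    for x, rho, reward, x_next in transitions:
        # Follow-on trace: discounted, importance-weighted count of past interest.
        followon = gamma * prev_rho * followon + 1.0
        # Emphasis interpolates between unit interest and the follow-on trace.
        emphasis = lam + (1.0 - lam) * followon
        # Emphasis-weighted eligibility trace.
        trace = [rho * (gamma * lam * e + emphasis * xi)
                 for e, xi in zip(trace, x)]
        # Ordinary TD error under the current weights.
        delta = reward + gamma * dot(theta, x_next) - dot(theta, x)
        theta = [t + alpha * delta * e for t, e in zip(theta, trace)]
        prev_rho = rho

    return theta


# Two-step trajectory with one-hot features; the first action is up-weighted.
trajectory = [
    ([1.0, 0.0], 2.0, 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 1.0, 1.0, [0.0, 0.0]),
]
weights = etd_lambda(trajectory, n_features=2, alpha=0.1, gamma=0.5, lam=0.0)
```

The only difference from off-policy TD($\lambda$) in this sketch is the `emphasis` factor, which scales each state's contribution to the trace by how heavily the behaviour policy's history weights it.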

artificial intelligence, computer game, machine learning

2106.11779

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games > Computer Games (0.53)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial Intelligence, Apr-13-2021

Muesli: Combining Improvements in Policy Optimization

Hessel, Matteo, Danihelka, Ivo, Viola, Fabio, Guez, Arthur, Schmitt, Simon, Sifre, Laurent, Weber, Theophane, Silver, David, van Hasselt, Hado

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.

computer game, deep learning, muesli

2104.06159

Country:

Asia (0.14)
Europe > United Kingdom > England (0.14)
North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

arXiv.org Artificial Intelligence, Feb-12-2021

Discovery of Options via Meta-Learned Subgoals

Veeriah, Vivek, Zahavy, Tom, Hessel, Matteo, Xu, Zhongwen, Oh, Junhyuk, Kemaev, Iurii, van Hasselt, Hado, Silver, David, Singh, Satinder

Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster. However, despite prior work on this topic, the problem of discovering options through interaction with an environment remains a challenge. In this paper, we introduce a novel meta-gradient approach for discovering useful options in multi-task RL environments. Our approach is based on a manager-worker decomposition of the RL agent, in which a manager maximises rewards from the environment by learning a task-dependent policy over both a set of task-independent discovered options and primitive actions. The option-reward and termination functions that define a subgoal for each option are parameterised as neural networks and trained via meta-gradients to maximise their usefulness. Empirical analysis on gridworld and DeepMind Lab tasks shows that: (1) our approach can discover meaningful and diverse temporally-extended options in multi-task RL domains, (2) the discovered options are frequently used by the agent while learning to solve the training tasks, and (3) the discovered options help a randomly initialised manager learn faster in completely new tasks.

agent, deep learning, neural network

2102.06741

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

arXiv.org Artificial Intelligence, Oct-26-2020

Forethought and Hindsight in Credit Assignment

Chelu, Veronica, Precup, Doina, van Hasselt, Hado

We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re-)evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from explicit environment-dynamics predictors to more abstract planner-aware models.

artificial intelligence, backward model, planning & scheduling

2010.13685

Country:

North America > United States > California (0.28)
North America > Canada > Quebec > Montreal (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.93)

arXiv.org Artificial Intelligence, Jul-17-2020

Discovering Reinforcement Learning Algorithms

Oh, Junhyuk, Hessel, Matteo, Czarnecki, Wojciech M., Xu, Zhongwen, van Hasselt, Hado, Singh, Satinder, Silver, David

Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific challenge, it remains an open question whether it is feasible to discover alternatives to fundamental concepts of RL such as value functions and temporal-difference learning. This paper introduces a new meta-learning approach that discovers an entire update rule which includes both 'what to predict' (e.g. value functions) and 'how to learn from it' (e.g. bootstrapping) by interacting with a set of environments. The output of this method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore, it discovers a bootstrapping mechanism to maintain and use its predictions. Surprisingly, when trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance. This shows the potential to discover general RL algorithms from data.

algorithm, computer game, deep learning

2007.08794

Country: Europe > Germany (0.14)

Genre: Research Report > New Finding (0.66)

Industry:

Energy > Oil & Gas (1.00)
Leisure & Entertainment > Games > Computer Games (0.57)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)