Goto

Collaborating Authors

 nac



Neural Auto-Curricula

Neural Information Processing Systems

When solving two-player zero-sum games, multi-agent reinforcement learning (MARL) algorithms often create populations of agents where, at each iteration, a new agent is discovered as the best response to a mixture over the opponent population. Within such a process, the update rules of "who to compete with" (i.e., the opponent mixture) and "how to beat them" (i.e., finding best responses) are underpinned by manually developed game theoretical principles such as fictitious play and Double Oracle. In this paper1, we introduce a novel framework--Neural Auto-Curricula (NAC)--that leverages meta-gradient descent to automate the discovery of the learning update rule without explicit human design. Specifically, we parameterise the opponent selection module by neural networks and the bestresponse module by optimisation subroutines, and update their parameters solely via interaction with the game engine, where both players aim to minimise their exploitability. Surprisingly, even without human design, the discovered MARL algorithms achieve competitive or even better performance with the state-of-the-art population-based game solvers (e.g., PSRO) on Games of Skill, differentiable Lotto, non-transitive Mixture Games, Iterated Matching Pennies, and Kuhn Poker. Additionally, we show that NAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO on Leduc Poker. Our work inspires a promising future direction to discover general MARL algorithms solely from data.


Neural Arithmetic Logic Units

Neural Information Processing Systems

Specifically,one frequently observes failures when quantities that lie outside the numerical range used during training are encountered at test time, even when the target functionissimple (e.g., itdepends only onaggregating counts orlinear extrapolation). This failure patternindicates that the learned behavior is better characterized by memorization than by systematic abstraction.



NeuralAttentiveCircuits

Neural Information Processing Systems

General purpose models typically makefewassumptions about theunderlyingdata-structure and are known to perform well in the large-data regime. At the same time, there hasbeen growing interest inmodular neural architectures that represent thedata using sparsely interacting modules.




Neural Attentive Circuits

Neural Information Processing Systems

Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned.


Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Neural Information Processing Systems

The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. In the infinite horizon scenario, the finite-sample convergence rate for the AC and natural actor-critic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data for each iteration, and with actor having general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-2}/\log(1/\epsilon))$. Moreover, the sample complexity of AC and NAC characterized in this work outperforms that of policy gradient (PG) and natural policy gradient (NPG) by a factor of $\mathcal{O}((1-\gamma)^{-3})$ and $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-2}/\log(1/\epsilon))$, respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvement over PG and NPG under infinite horizon due to the incorporation of critic.