 Statistical Learning



Differentiable Analog Quantum Computing for Optimization and Control

Neural Information Processing Systems

We formulate the first differentiable analog quantum computing framework with specific parameterization design at the analog signal (pulse) level to better exploit near-term quantum devices via variational methods. We further propose a scalable approach to estimate the gradients of quantum dynamics using a forward pass with Monte Carlo sampling, which leads to a quantum stochastic gradient descent algorithm for scalable gradient-based training in our framework. Applying our framework to quantum optimization and control, we observe a significant advantage of differentiable analog quantum computing against SOTAs based on parameterized digital quantum circuits by orders of magnitude.



The alignment property of SGD noise and how it helps select flat minima: A stability analysis

Neural Information Processing Systems

The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F \le O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F$, $B$, and $\eta$ denote the Frobenius norm of the Hessian at $\theta^*$, the batch size, and the learning rate, respectively. Otherwise, SGD will escape from that minimum exponentially fast. Hence, for minima accessible to SGD, the sharpness--as measured by the Frobenius norm of the Hessian--is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: the noise concentrates in the sharp directions of the local landscape, and its magnitude is proportional to the loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are also supported by extensive experiments on the CIFAR-10 dataset.
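
To make the stability condition concrete, the following minimal sketch computes the Frobenius norm of the Hessian for a linear model trained with square loss and compares it with $\sqrt{B}/\eta$. It rests on assumptions not stated in the abstract: the loss is taken as $(1/2n)\sum_i (x_i^\top w - y_i)^2$, so the Hessian is $(1/n) X^\top X$, and the constant hidden in $O(\cdot)$ is not modeled. The function names and the toy data are hypothetical.

```python
import math

def hessian_square_loss(X):
    """Hessian of (1/2n) * sum_i (x_i . w - y_i)^2 with respect to w.

    For a linear model this is (1/n) * X^T X and does not depend on w.
    """
    n, d = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(d)]
            for a in range(d)]

def frobenius_norm(H):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in H for v in row))

def sharpness_ratio(X, batch_size, lr):
    """Return ||H||_F divided by sqrt(B) / lr.

    A small ratio is consistent with the stated bound; the unknown
    constant inside O(.) means this is only a rough indicator.
    """
    return frobenius_norm(hessian_square_loss(X)) * lr / math.sqrt(batch_size)

# Toy data (hypothetical): four samples, two features.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 0.0]]
H = hessian_square_loss(X)
ratio = sharpness_ratio(X, batch_size=4, lr=0.1)
```

Raising the learning rate or shrinking the batch size increases the ratio, which matches the abstract's claim that such settings tolerate only flatter minima.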




Neural Auto-Curricula

Neural Information Processing Systems

When solving two-player zero-sum games, multi-agent reinforcement learning (MARL) algorithms often create populations of agents where, at each iteration, a new agent is discovered as the best response to a mixture over the opponent population. Within such a process, the update rules of "who to compete with" (i.e., the opponent mixture) and "how to beat them" (i.e., finding best responses) are underpinned by manually developed game-theoretical principles such as fictitious play and Double Oracle. In this paper, we introduce a novel framework--Neural Auto-Curricula (NAC)--that leverages meta-gradient descent to automate the discovery of the learning update rule without explicit human design. Specifically, we parameterise the opponent selection module by neural networks and the best-response module by optimisation subroutines, and update their parameters solely via interaction with the game engine, where both players aim to minimise their exploitability. Surprisingly, even without human design, the discovered MARL algorithms achieve performance competitive with, or even better than, state-of-the-art population-based game solvers (e.g., PSRO) on Games of Skill, differentiable Lotto, non-transitive Mixture Games, Iterated Matching Pennies, and Kuhn Poker. Additionally, we show that NAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO on Leduc Poker. Our work inspires a promising future direction to discover general MARL algorithms solely from data.
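
For readers unfamiliar with the hand-designed baselines mentioned above, here is a minimal sketch of fictitious play on a zero-sum matrix game, together with an exploitability measure. It is not the NAC method: it only illustrates the kind of manually specified "who to compete with" rule (best-respond to the opponent's empirical mixture) that NAC aims to learn. The function names, the Rock-Paper-Scissors payoff matrix, and the choice to define exploitability as the sum of both players' best-response gains (some papers halve it) are assumptions made for illustration.

```python
def normalize(counts):
    """Turn action counts into a mixed strategy."""
    total = sum(counts)
    return [c / total for c in counts]

def row_values(A, q):
    """Expected payoff of each row action against column mixture q."""
    return [sum(a * qj for a, qj in zip(row, q)) for row in A]

def col_values(A, p):
    """Expected payoff of each column action against row mixture p.

    The column player receives -A in a zero-sum game.
    """
    return [-sum(A[i][j] * p[i] for i in range(len(A)))
            for j in range(len(A[0]))]

def exploitability(A, p, q):
    """Sum of both players' best-response payoffs.

    In a zero-sum game the payoff of the current profile cancels out,
    so this is zero exactly at a Nash equilibrium and positive otherwise.
    """
    return max(row_values(A, q)) + max(col_values(A, p))

def fictitious_play(A, iterations):
    """Each player repeatedly best-responds to the opponent's empirical mixture."""
    row_counts = [1] + [0] * (len(A) - 1)      # both players start with action 0
    col_counts = [1] + [0] * (len(A[0]) - 1)
    for _ in range(iterations):
        p, q = normalize(row_counts), normalize(col_counts)
        rv, cv = row_values(A, q), col_values(A, p)
        row_counts[rv.index(max(rv))] += 1     # ties go to the lowest index
        col_counts[cv.index(max(cv))] += 1
    return normalize(row_counts), normalize(col_counts)

# Rock-Paper-Scissors payoffs for the row player (illustrative game).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

p, q = fictitious_play(RPS, iterations=2000)
final_gap = exploitability(RPS, p, q)
```

In this sketch the opponent mixture is fixed to the empirical average of past play; NAC, as described above, replaces that fixed rule with a neural network trained by meta-gradients to reduce exploitability.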


Appendix: Learning Compact Representations of Neural Networks using DiscriminAtive Masking (DAM) A Analysis of the DAM Gate Function Dynamics During Training

Neural Information Processing Systems

In this section, we theoretically analyze the dynamics of the DAM mask $g_i$ at the $i$-th layer as the training process unfolds. The loss function for training the neural network for the target task can then be denoted as $L = L(f(x, \Theta, \beta_i))$ (e.g., cross-entropy loss for supervised structured pruning problems and reconstruction error for representation learning problems), where $x$ denotes the input features to the neural network. Using gradient descent methods with a learning rate of $\eta$, the expected update formula of $\beta_i$ in DAM is given by:

$$\Delta\beta_i = -\eta\, \mathbb{E}_{x \sim D_{tr}}\left[\nabla_{\beta_i} L(f(x, \Theta, \beta_i)) + \lambda \nabla_{\beta_i} \beta_i/(l-1)\right] \quad (2)$$

$$= -\eta\, \mathbb{E}_{x \sim D_{tr}}\left[\nabla_{\beta_i} L(f(x, \Theta, \beta_i))\right] - \eta\lambda/(l-1) \quad (3)$$

Let $h_i$ be the layer output before applying the DAM mask, and let the masked output be represented as $o_i = h_i \odot g_i$ after applying the gate. For the $j$-th neuron, $\partial g_{ij}/\partial \beta_i = 0$ if and only if $\partial \xi_j(\beta_i)/\partial \beta_i = 0$. Since $\tanh(z)$ has non-zero gradients for $z > 0$, the gradient of $\xi_j(\beta_i)$ is 0 only when $kj/n_i + \beta_i \le 0$, i.e., the mask value of the neuron is 0 (or in other words, it is deactivated or dead). Let us denote the set of all neuron indices with non-zero mask values (also referred to as active neurons) as $J$. Equation 4 can then be simplified as: $\nabla_{\beta_i} L(f(x, \Theta, \beta_i)) = \alpha_i \sum \cdots$

We can make the following two observations: (i) only those neurons that are active (i.e., have non-zero mask values) contribute towards updating $\beta_i$ and moving the gate function. We call these neurons support neurons and their position in the ordering of neurons the transitioning zone of the gate function.