 critical initialization


Critical Initialization of Wide and Deep Neural Networks using Partial Jacobians: General Theory and Applications

Neural Information Processing Systems

Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description is possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality.


Scaling and Resizing Symmetry in Feedforward Networks

Cardona, Carlos

arXiv.org Artificial Intelligence

Weight initialization in deep neural networks has a strong impact on the speed of convergence of the learning map. Recent studies have shown that, for random initializations, an order/chaos phase transition occurs in the space of variances of the random weights and biases. Experiments have then shown that training speed improves substantially when a neural network is initialized at values along the critical line of this phase transition. In this contribution, we show evidence that the scaling property exhibited by physical systems at criticality is also present in untrained feedforward networks whose random weights are initialized on the critical line. We further suggest a data-resizing symmetry, which is directly inherited from the scaling symmetry at criticality.
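
To make the "critical line" in the space of weight and bias variances concrete, here is a minimal sketch. It is not taken from the paper above. It assumes a tanh activation and the standard mean-field recursion from the order/chaos literature: the preactivation variance q settles at a fixed point of q = sigma_w2 * E[tanh(sqrt(q) z)^2] + sigma_b2, and the quantity chi = sigma_w2 * E[tanh'(sqrt(q) z)^2] separates the ordered phase (chi < 1) from the chaotic phase (chi > 1). The function names and the quadrature order are illustrative choices.

```python
import numpy as np

# Gauss-Hermite nodes and weights for the weight function exp(-z^2 / 2).
_nodes, _weights = np.polynomial.hermite_e.hermegauss(80)
_weights = _weights / _weights.sum()  # normalise: expectation under N(0, 1)


def gauss_mean(f):
    """Approximate E[f(z)] for z ~ N(0, 1) by quadrature."""
    return float(np.sum(_weights * f(_nodes)))


def fixed_point_variance(sigma_w2, sigma_b2, iters=500):
    """Iterate the (assumed) mean-field variance map for tanh to its fixed point."""
    q = 1.0
    for _ in range(iters):
        q = sigma_w2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2
    return q


def chi(sigma_w2, sigma_b2):
    """chi < 1: ordered phase; chi > 1: chaotic phase; chi = 1: critical line."""
    q = fixed_point_variance(sigma_w2, sigma_b2)
    # Derivative of tanh is 1 - tanh^2.
    return sigma_w2 * gauss_mean(lambda z: (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)


if __name__ == "__main__":
    for sw2, sb2 in [(0.5, 0.1), (4.0, 0.05)]:
        print(f"sigma_w2={sw2}, sigma_b2={sb2}, chi={chi(sw2, sb2):.3f}")
```

Scanning `chi` over a grid of `(sigma_w2, sigma_b2)` and locating where it crosses 1 traces out the critical line under these assumptions.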


AutoInit: Automatic Initialization via Jacobian Tuning

He, Tianyu, Doshi, Darshil, Gromov, Andrey

arXiv.org Machine Learning

Good initialization is essential for training Deep Neural Networks (DNNs). Such an initialization is often found by trial and error, which has to be repeated every time an architecture is substantially modified, or it is inherited from smaller networks, which leads to sub-optimal initialization. In this work we introduce a new and cheap algorithm that finds a good initialization automatically for general feed-forward DNNs. The algorithm utilizes the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality. We solve the dynamics of the algorithm for fully connected networks with ReLU and derive conditions for its convergence. We then extend the discussion to more general architectures with BatchNorm and residual connections. Finally, we apply our method to ResMLP and VGG architectures, where the automatic one-shot initialization found by our method shows good performance on vision tasks.
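
The quantity such a procedure monitors can be illustrated with a small numerical experiment. The sketch below is not the authors' algorithm. It only measures the normalised squared Frobenius norm of the Jacobian between adjacent layers of a random fully connected ReLU network, under assumptions chosen here for illustration: zero biases, weights drawn from N(0, sigma_w2 / width), and a hypothetical helper name `layer_jacobian_norm`. For ReLU one expects this norm to be close to sigma_w2 / 2, so it is near 1 at sigma_w2 = 2.

```python
import numpy as np


def layer_jacobian_norm(sigma_w2, width=512, depth=10, seed=0):
    """Average of ||d h^{l+1} / d h^l||_F^2 / width over `depth` random ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)  # illustrative first-layer preactivations
    norms = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma_w2 / width)
        gate = (h > 0).astype(float)  # derivative of ReLU at each preactivation
        J = W * gate                  # scales column j by gate[j], i.e. W @ diag(gate)
        norms.append(np.sum(J ** 2) / width)
        h = W @ np.maximum(h, 0.0)    # next preactivations (biases assumed zero)
    return float(np.mean(norms))


if __name__ == "__main__":
    for sw2 in (1.0, 2.0, 4.0):
        print(f"sigma_w2={sw2}: averaged Jacobian norm = {layer_jacobian_norm(sw2):.3f}")
```

A tuning loop in this spirit would adjust `sigma_w2` until the measured norm is close to 1. How the update is performed and when it converges is the subject of the paper and is not reproduced here.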


Critical initialization of wide and deep neural networks through partial Jacobians: general theory and applications to LayerNorm

Doshi, Darshil, He, Tianyu, Gromov, Andrey

arXiv.org Machine Learning

Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description is possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new way to diagnose (both theoretically and empirically) this criticality. To that end, we introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0