Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initialization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets have also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation.
As a beginner at deep learning, one of the things I realized is that there isn't much online documentation that covers all the deep learning tricks in one place. There are lots of small best practices, ranging from simple tricks like initializing weights, regularization to slightly complex techniques like cyclic learning rates that can make training and debugging neural nets easier and efficient. This inspired me to write this series of blogs where I will cover as many nuances as I can to make implementing deep learning simpler for you. While writing this blog, the assumption is that you have a basic idea of how neural networks are trained. An understanding of weights, biases, hidden layers, activations and activation functions will make the content clearer.
We present some novel, straightforward methods for training the connection graph of a randomly initialized neural network without training the weights. These methods do not use hyperparameters defining cutoff thresholds and therefore remove the need for iteratively searching optimal values of such hyperparameters. We can achieve similar or higher performances than in the case of training all weights, with a similar computational cost as for standard training techniques. Besides switching connections on and off, we introduce a novel way of training a network by flipping the signs of the weights. If we try to minimize the number of changed connections, by changing less than 10\% of the total it is already possible to reach more than 90\% of the accuracy achieved by standard training. We obtain good results even with weights of constant magnitude or even when weights are drawn from highly asymmetric distributions. These results shed light on the over-parameterization of neural networks and on how they may be reduced to their effective size.
Deep learning relies on good initialization schemes and hyperparameter choices prior to training a neural network. Random weight initializations induce random network ensembles, which give rise to the trainability, training speed, and sometimes also generalization ability of an instance. In addition, such ensembles provide theoretical insights into the space of candidate models of which one is selected during training. The results obtained so far rely on mean field approximations that assume infinite layer width and that study average squared signals. We derive the joint signal output distribution exactly, without mean field assumptions, for fully-connected networks with Gaussian weights and biases, and analyze deviations from the mean field results.
Convergence rate of training algorithms for neural networks is heavily affected by initialization of weights. In this paper, an original algorithm for initialization of weights in backpropagation neural net is presented with application to character recognition. The initialization method is mainly based on a customization of the Kalman filter, translating it into Bayesian statistics terms. A metrological approach is used in this context considering weights as measurements modeled by mutually dependent normal random variables. The algorithm performance is demonstrated by reporting and discussing results of simulation trials. Results are compared with random weights initialization and other methods. The proposed method shows an improved convergence rate for the backpropagation training algorithm.