Critical initialisation for deep signal propagation in noisy rectifier neural networks

Neural Information Processing Systems

Stochastic regularisation is an important weapon in the arsenal of a deep learning practitioner. However, despite recent theoretical advances, our understanding of how noise influences signal propagation in deep neural networks remains limited. By extending recent work based on mean field theory, we develop a new framework for signal propagation in stochastic regularised neural networks. Our noisy signal propagation theory can incorporate several common noise distributions, including additive and multiplicative Gaussian noise as well as dropout. We use this framework to investigate initialisation strategies for noisy ReLU networks. We show that no critical initialisation strategy exists for additive noise: signal propagation explodes regardless of the selected noise distribution. For multiplicative noise (e.g. dropout), we identify alternative critical initialisation strategies that depend on the second moment of the noise distribution. Simulations and experiments on real-world data confirm that our proposed initialisation stably propagates signals in deep networks, while an initialisation that disregards the noise fails to do so.
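To make the result concrete, here is a minimal sketch (not the paper's code) of such an initialisation, assuming the critical weight variance takes the He-style form sigma_w^2 = 2 / (fan_in * E[eps^2]) for multiplicative noise eps; the keep probability, layer width, and depth below are placeholder values:

```python
import numpy as np

def critical_weight_std(fan_in, noise_second_moment):
    """Critical He-style weight scale for a noisy ReLU layer,
    assuming sigma_w^2 = 2 / (fan_in * E[eps^2])."""
    return np.sqrt(2.0 / (fan_in * noise_second_moment))

# Second moments for two common multiplicative noise distributions:
p = 0.8                  # dropout keep probability, eps in {0, 1/p}
mu2_dropout = 1.0 / p    # E[eps^2] = 1/p
mu2_gauss = 1.0 + 0.25   # eps ~ N(1, sigma^2) with sigma^2 = 0.25

# Quick depth check: at criticality the pre-activation variance
# should stay of order one instead of exploding or vanishing.
fan_in = 1024
x = np.random.randn(fan_in)
for _ in range(50):
    W = np.random.randn(fan_in, fan_in) * critical_weight_std(fan_in, mu2_dropout)
    h = W @ x                                # pre-activations
    mask = (np.random.rand(fan_in) < p) / p  # dropout noise, scaled by 1/p
    x = np.maximum(0.0, h) * mask            # noisy ReLU activations
print(np.var(h))  # remains O(1) across the 50 layers
```

Without the 1/E[eps^2] correction (i.e. plain He initialisation), the per-layer variance is multiplied by E[eps^2] at every layer and so grows geometrically with depth, which is exactly the instability the abstract describes.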



Appendix 1: Bayes-by-backprop. The Bayesian posterior distribution over the network weights, P(w | D), is approximated by a variational distribution q(w | θ) whose parameters θ are learned by backpropagation.

Neural Information Processing Systems

In Algorithm 1 we give the full clustering algorithm used for each of the T fixing iterations. In Figure 1 we show how the layers' […]. In Figure 2 we show the impact of increasing the regularisation strength.
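A minimal sketch of the Bayes-by-backprop approximation mentioned above, assuming a factorised Gaussian variational posterior with a softplus-parameterised scale; the layer shape, prior scale, and loss line are illustrative placeholders, not the paper's implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Bayes-by-backprop sketch: approximate the weight posterior
    P(w | D) with a factorised Gaussian q(w | mu, sigma) whose
    parameters are trained by ordinary backpropagation."""

    def __init__(self, n_in, n_out, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))
        self.prior = torch.distributions.Normal(0.0, prior_std)

    def forward(self, x):
        sigma = F.softplus(self.rho)                    # keep sigma positive
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterised sample
        q = torch.distributions.Normal(self.mu, sigma)
        # single-sample Monte-Carlo estimate of KL(q || prior) for the ELBO
        self.kl = (q.log_prob(w) - self.prior.log_prob(w)).sum()
        return x @ w.t()

# Training minimises the negative ELBO, e.g. for a classification head:
# loss = F.cross_entropy(layer(x), y) + layer.kl / num_minibatches
```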


(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability

Neural Information Processing Systems

Most current theoretical work on implicit regularisation focuses on continuous-time approximations of (S)GD in which the impact of crucial hyperparameters, such as the stepsize and the minibatch size, is ignored. One common simplification is to analyse gradient flow, the continuous-time limit of GD and of minibatch SGD with an infinitesimal stepsize. By definition, this analysis captures neither the effect of the stepsize nor that of stochasticity.
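As a toy illustration of this limitation (not taken from the paper): GD is exactly the forward-Euler discretisation of the gradient-flow ODE dw/dt = -grad L(w), so its iterates approach the flow only as the stepsize eta tends to zero. The one-dimensional quadratic loss below is a placeholder chosen so that the flow has a closed form:

```python
import numpy as np

# Gradient flow on L(w) = 0.5 * a * w^2 solves to w(t) = w0 * exp(-a * t),
# so we can compare GD iterates against the exact continuous-time solution.
a, w0, T = 2.0, 1.0, 1.0

def gd_at_time_T(eta):
    w, steps = w0, int(round(T / eta))
    for _ in range(steps):
        w -= eta * a * w        # one GD step on L(w) = 0.5 * a * w^2
    return w

flow = w0 * np.exp(-a * T)      # exact gradient-flow solution at time T
for eta in (0.5, 0.1, 0.01):
    print(eta, abs(gd_at_time_T(eta) - flow))  # gap shrinks as eta -> 0
```

At large stepsizes the discrete iterates deviate substantially from the flow (here eta = 0.5 collapses to zero in two steps), which is precisely the regime that a gradient-flow analysis cannot describe.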





48cb136b65a69e8c2aa22913a0d91b2f-Supplemental.pdf

Neural Information Processing Systems

The handwritten digits are written by 500 writers (250 writers for the training and test sets, respectively), introducing a large variation in the style of these characters.