Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Pennington, Jeffrey, Schoenholz, Samuel, Ganguli, Surya

Feb-14-2020, 16:13:04 GMT–Neural Information Processing Systems

It is well known that weight initialization in deep networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients. Moreover, in deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1 can yield a dramatic additional speed-up in learning; this is a property known as dynamical isometry. However, it is unclear how to achieve dynamical isometry in nonlinear deep networks. We address this question by employing powerful tools from free probability theory to analytically compute the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity.
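The difference between "mean squared singular value is O(1)" and "all singular values are near 1" can be illustrated numerically for the deep linear case mentioned in the abstract. The sketch below is not the paper's free-probability calculation. It is a small dependency-free experiment, and the function names, the width `n`, the depth, and the weight scale `sigma_w` are all assumptions chosen for illustration. It builds the Jacobian J = W_L ... W_1 of a deep linear network and reports the mean and variance of its squared singular values, computed from traces of J^T J.

```python
import math
import random


def gaussian_matrix(n, sigma_w, rng):
    """n x n matrix with i.i.d. N(0, sigma_w^2 / n) entries (illustrative scaling)."""
    std = sigma_w / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]


def orthogonal_matrix(n, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(2):  # two passes improve numerical orthogonality
            for u in rows:
                dot = sum(a * b for a, b in zip(u, v))
                v = [b - dot * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows


def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def deep_linear_jacobian(weights):
    """Input-output Jacobian of a deep linear network: J = W_L ... W_1."""
    j = weights[0]
    for w in weights[1:]:
        j = matmul(w, j)
    return j


def squared_sv_stats(j):
    """Mean and variance of the squared singular values of J.

    The eigenvalues of G = J^T J are the squared singular values, so their
    mean is tr(G)/n and their second moment is tr(G^2)/n.
    """
    n = len(j)
    g = matmul(list(zip(*j)), j)
    mean = sum(g[i][i] for i in range(n)) / n
    second_moment = sum(x * x for row in g for x in row) / n  # G is symmetric
    return mean, second_moment - mean * mean


# Assumed illustrative settings: width 40, depth 10, sigma_w = 1.
rng = random.Random(0)
n, depth = 40, 10
gauss_weights = [gaussian_matrix(n, 1.0, rng) for _ in range(depth)]
orth_weights = [orthogonal_matrix(n, rng) for _ in range(depth)]

gauss_mean, gauss_var = squared_sv_stats(deep_linear_jacobian(gauss_weights))
orth_mean, orth_var = squared_sv_stats(deep_linear_jacobian(orth_weights))
print("Gaussian:   mean s^2 = %.3f, var s^2 = %.3f" % (gauss_mean, gauss_var))
print("Orthogonal: mean s^2 = %.3f, var s^2 = %.3e" % (orth_mean, orth_var))
```

With orthogonal weights the product is itself orthogonal, so every singular value equals 1 up to rounding and the variance is essentially zero. With Gaussian weights at this scale the mean squared singular value stays O(1), but the squared singular values should be widely spread, which is the gap between well-conditioned on average and dynamical isometry.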

deep learning, dynamical isometry, weight initialization
