Scaling up deep neural networks: a capacity allocation perspective
Capacity analysis was introduced in [2] as a way to analyze which dependencies a linear model focuses its modelling capacity on when trained on a given task. The concept was then extended in [3] to neural networks with nonlinear activations, where capacity propagation through layers was studied. When the layers are residual (or differential), and in one limiting case with extremely irregular activations (called the pseudo-random limit), capacity propagation through layers was shown to follow a discrete Markov equation. This discrete equation can then be approximated by a continuous Kolmogorov forward equation in the deep limit, provided a specific scaling relation holds between the network depth and the scale of its residual connections: the residual weights must scale as the inverse square root of the number of layers. Following [1], it was then hypothesized that the success of residual networks lies in their ability to propagate capacity through a large number of layers in a non-degenerate manner. Notably, the inverse square root scaling is the only scaling relation that leads to a non-degenerate propagation PDE in that case: larger weights would lead to shattering, while smaller ones would lead to no spatial propagation at all. In this paper, we take this idea one step further and conjecture that enforcing the right scaling relations, i.e. the ones that lead to a non-degenerate continuous limit for capacity propagation, is key to avoiding the shattering problem: we call this the neural network scaling conjecture. In the example above, this would mean that the inverse square root scaling must be enforced if one wants to use residual networks at their full power.
In the second part of this paper, we use the PDE capacity propagation framework to study a number of commonly used network architectures, and determine the scaling relations required for non-degenerate capacity propagation in each case.
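The role of the inverse square root scaling can be illustrated with a toy analogy. This is a rough sketch and not the capacity propagation equation of [3]: it assumes that each layer contributes an independent zero-mean step whose standard deviation is c * L**(-alpha), where L is the number of layers. The function name `total_spread_variance` and the parameters `alpha` and `c` are hypothetical and introduced only for this illustration. Under these assumptions, the variance accumulated over L layers is c**2 * L**(1 - 2 * alpha). It stays of order one as L grows only when alpha = 1/2, diverges for smaller alpha (larger steps), and vanishes for larger alpha (smaller steps).

```python
import math


def total_spread_variance(num_layers: int, alpha: float, c: float = 1.0) -> float:
    """Variance accumulated after num_layers independent zero-mean steps.

    Toy assumption (not taken from the paper): each step has standard
    deviation c * num_layers ** (-alpha), so the variances simply add up.
    """
    step_std = c * num_layers ** (-alpha)
    return num_layers * step_std ** 2


if __name__ == "__main__":
    depths = (10, 100, 1000, 10000)
    for alpha in (0.25, 0.5, 1.0):
        variances = [total_spread_variance(depth, alpha) for depth in depths]
        # alpha < 0.5: variance grows with depth (steps too large)
        # alpha = 0.5: variance stays of order one (non-degenerate limit)
        # alpha > 0.5: variance shrinks to zero (no propagation)
        print(f"alpha={alpha}: {variances}")
```

In this toy setting, only the exponent 1/2 gives a diffusion-like limit with finite, non-zero spread, which loosely mirrors the dichotomy between shattering and absence of spatial propagation described in the abstract.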
Mar-27-2019