A note on Linear Bottleneck networks and their Transition to Multilinearity
Libin Zhu, Parthe Pandit, Mikhail Belkin
For a wide neural network (WNN) whose width is sufficiently large, there exists a linear function of the parameters that is arbitrarily close to the network function within a ball of radius O(1) in parameter space around a random initialization. This local linearity explains the equivalence, first shown in [13], between training wide neural networks with small learning rates and neural tangent kernel (NTK) regression. However, an important assumption for this transition to linearity [18] to hold is that every layer must be sufficiently wide. If even one hidden "bottleneck" layer is narrow, yielding a so-called bottleneck neural network (BNN), the work [18] showed that the transition to linearity does not occur. An immediate question then is: what functions of the weights does a neural network with a bottleneck layer represent?
Jun-30-2022
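The transition to linearity described above can be probed numerically: compare a network f(w) with its first-order Taylor expansion around a random initialization w0 at a point w0 + dw with ||dw|| = O(1). Below is a minimal sketch of such a check, not taken from the paper; the architecture sizes, the NTK-style 1/sqrt(fan_in) initialization, and the helper names (`init_mlp`, `linearization_gap`) are illustrative assumptions. Under the abstract's claim, the gap should be small for the all-wide network and visibly larger when one hidden layer is a narrow bottleneck.

```python
# Sketch (assumed setup, not the paper's code): measure the deviation of a
# ReLU MLP from its linearization at init, for a wide vs. a bottleneck network.
import jax
import jax.numpy as jnp

def init_mlp(key, widths):
    """Random Gaussian weights with 1/sqrt(fan_in) scaling (NTK-style init)."""
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_out, d_in)) / jnp.sqrt(d_in))
    return params

def mlp(params, x):
    """Plain ReLU MLP with a linear output layer."""
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(W @ h)
    return params[-1] @ h

def linearization_gap(key, widths, radius=1.0):
    """|f(w0 + dw) - (f(w0) + <grad f(w0), dw>)| for a random dw of norm `radius`."""
    k_x, k_w, k_d = jax.random.split(key, 3)
    x = jax.random.normal(k_x, (widths[0],))
    w0 = init_mlp(k_w, widths)
    # Random perturbation direction dw, rescaled to total parameter norm `radius` = O(1).
    dirs = [jax.random.normal(k, W.shape)
            for k, W in zip(jax.random.split(k_d, len(w0)), w0)]
    norm = jnp.sqrt(sum(jnp.sum(d ** 2) for d in dirs))
    dw = [radius * d / norm for d in dirs]
    f = lambda p: mlp(p, x)[0]
    f0, df = jax.jvp(f, (w0,), (dw,))            # f(w0) and its directional derivative
    w1 = [W + d for W, d in zip(w0, dw)]
    return jnp.abs(f(w1) - (f0 + df))            # deviation from the linear model

key = jax.random.PRNGKey(0)
wide = [32, 4096, 4096, 1]            # all hidden layers wide
bottleneck = [32, 4096, 4, 4096, 1]   # one narrow hidden layer of width 4 (assumed size)
print("wide network gap:      ", linearization_gap(key, wide))
print("bottleneck network gap:", linearization_gap(key, bottleneck))
```

The specific widths (4096 and 4) are arbitrary choices for illustration; the qualitative contrast, rather than the exact numbers, is the point of the sketch.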