relu net
Half-Space Feature Learning in Neural Networks
Yadav, Mahesh Lorik, Ramaswamy, Harish Guruprasad, Lakshminarayanan, Chandrashekar
There currently exist two extreme viewpoints on neural network feature learning: (i) neural networks simply implement a kernel method (a la NTK) and hence no features are learned; (ii) neural networks can represent (and hence learn) intricate hierarchical features suitable for the data. Based on a novel viewpoint, we argue in this paper that neither interpretation is likely to be correct. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a path through a sequence of hidden units, with path length equal to the number of layers. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple -- each feature is effectively an indicator function for a region compactly described as an intersection of as many half-spaces in the input space as there are layers. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations for neurons based on saliency/activation/gradient maps. We show that feature learning does occur in DLGNs, and that its mechanism is the learning of half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers -- they all represent a half-space -- however, the dynamics of gradient descent impart a distinct clustering to the later layer neurons. We hypothesize that ReLU networks also exhibit similar feature learning behaviour.
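The path-as-expert view described in this abstract can be illustrated with a small sketch. It is not the authors' DLGN implementation: it assumes hard 0/1 gates, takes the half-space parameters of each path as given (the names `half_space_gate`, `path_feature`, `mixture_output`, `path_a`, and `path_b` are hypothetical), and omits training entirely. It only shows that the product of gates along a path is the indicator of an intersection of half-spaces, and that the output is a linear combination of such indicators.

```python
def half_space_gate(w, b, x):
    """Hard gate: 1.0 if x lies in the half-space w.x + b > 0, else 0.0."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0.0


def path_feature(half_spaces, x):
    """Product of the gates along one path.

    With one half-space per layer, this is the indicator of the
    intersection of those half-spaces in input space.
    """
    feature = 1.0
    for w, b in half_spaces:
        feature *= half_space_gate(w, b, x)
    return feature


def mixture_output(paths, path_values, x):
    """Linear combination of path features (one scalar value per path)."""
    return sum(v * path_feature(p, x) for p, v in zip(paths, path_values))


# Two hypothetical paths in a 2-layer model over 2-D inputs.
path_a = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]   # x0 > 0 and x1 > 0
path_b = [((-1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]  # x0 < 0 and x1 > 0

print(mixture_output([path_a, path_b], [3.0, -2.0], (1.0, 2.0)))
print(mixture_output([path_a, path_b], [3.0, -2.0], (-1.0, 2.0)))
```

Each input activates only the paths whose region contains it, so the two calls pick out the value attached to `path_a` and `path_b` respectively.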
Generalization Performance of Empirical Risk Minimization on Over-parameterized Deep ReLU Nets
Lin, Shao-Bo, Wang, Yao, Zhou, Ding-Xuan
In this paper, we study the generalization performance of global minima for implementing empirical risk minimization (ERM) on over-parameterized deep ReLU nets. Using a novel deepening scheme for deep ReLU nets, we rigorously prove that there exist perfect global minima achieving almost optimal generalization error bounds for numerous types of data under mild conditions. Since over-parameterization is crucial to guarantee that the global minima of ERM on deep ReLU nets can be realized by the widely used stochastic gradient descent (SGD) algorithm, our results indeed fill a gap between optimization and generalization.
Catapult Dynamics and Phase Transitions in Quadratic Nets
Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. Lewkowycz et al. (2020) discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogeneous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.
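A toy simulation can make the catapult behaviour concrete. The sketch below is an assumption-laden illustration, not the model class analysed in the paper: it uses a scalar two-layer linear model f = u * v fitted to the target 0 with loss f^2 / 2, for which the tangent kernel is the squared weight norm u^2 + v^2. The function name `train_scalar_model`, the initial weights, and the learning rate `ETA` are all hypothetical choices, picked so that `ETA` times the initial kernel lies between 2 and 4.

```python
def train_scalar_model(u, v, eta, steps):
    """Gradient descent on L = 0.5 * (u * v) ** 2 for a scalar model f = u * v.

    Returns the loss and the squared weight norm u^2 + v^2 (the tangent
    kernel of this toy model), both recorded before each update.
    """
    losses, kernels = [], []
    for _ in range(steps):
        f = u * v
        losses.append(0.5 * f * f)
        kernels.append(u * u + v * v)
        # dL/du = f * v and dL/dv = f * u; update both weights simultaneously.
        u, v = u - eta * f * v, v - eta * f * u
    return losses, kernels


ETA = 3.0  # hypothetical learning rate; ETA * kernels[0] = 3.03 lies in (2, 4)
losses, kernels = train_scalar_model(u=1.0, v=0.1, eta=ETA, steps=100)

print("initial loss:", losses[0])
print("peak loss:   ", max(losses))
print("final loss:  ", losses[-1])
print("kernel start -> end:", kernels[0], "->", kernels[-1])
```

In this toy setting the loss first rises above its initial value, the squared weight norm shrinks while the loss is large, and training then settles once `ETA` times the kernel has dropped below 2.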
ReLU nets adapt to intrinsic dimensionality beyond the target domain
Cloninger, Alexander, Klock, Timo
We study the approximation of two-layer compositions $f(x) = g(\phi(x))$ via deep ReLU networks, where $\phi$ is a nonlinear, geometrically intuitive, and dimensionality reducing feature map. We focus on two complementary choices for $\phi$ that are intuitive and appear frequently in the statistical literature. The resulting approximation rates are near optimal and show adaptivity to intrinsic notions of complexity, which significantly extend a series of recent works on approximating targets over low-dimensional manifolds. Specifically, we show that ReLU nets can express functions that are invariant to the input up to an orthogonal projection onto a low-dimensional manifold, with the same efficiency as if the target domain were the manifold itself. This implies approximation via ReLU nets is faithful to an intrinsic dimensionality governed by the target $f$ itself, rather than the dimensionality of the approximation domain. As an application of our approximation bounds, we study empirical risk minimization over a space of sparsely constrained ReLU nets under the assumption that the conditional expectation satisfies one of the proposed models. We show near-optimal estimation guarantees in regression and classification problems, for which, to the best of our knowledge, no efficient estimator has been developed so far.
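To make the target class $f(x) = g(\phi(x))$ concrete, the following sketch builds one such function. It is only an illustration under stated assumptions: the manifold is taken to be the unit circle in the plane, $\phi$ is the nearest-point projection onto it, and `g` is an arbitrary hypothetical function on the circle. None of these choices come from the paper, and no ReLU network is trained or constructed here. The point is that `f` depends on `x` only through its projection, so moving `x` along the ray through the origin leaves `f` unchanged.

```python
import math


def relu(t):
    """Rectified linear unit."""
    return t if t > 0 else 0.0


def phi(x):
    """Nearest-point projection of a nonzero 2-D point onto the unit circle."""
    norm = math.hypot(x[0], x[1])
    if norm == 0.0:
        raise ValueError("projection onto the circle is undefined at the origin")
    return (x[0] / norm, x[1] / norm)


def g(p):
    """Hypothetical function defined on the circle (chosen only for illustration)."""
    return relu(p[0]) + relu(p[1])


def f(x):
    """Composition f(x) = g(phi(x)): depends on x only through its projection."""
    return g(phi(x))


x = (3.0, 4.0)
x_scaled = (6.0, 8.0)  # same projection onto the circle as x
print(f(x), f(x_scaled))
```

Both points project to the same location on the circle, so the two printed values agree even though the inputs differ in the ambient space.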