On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

Su, Lili, Yang, Pengkun

Neural Information Processing Systems

We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) activations using the gradient descent (GD) method. Inspired by a recent line of work, we study the evolution of the network prediction errors across GD iterations, which can be neatly described in a matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate {\em an} integral operator which is determined by the feature vector distribution $\rho$ only. Consequently, the GD method can be viewed as {\em approximately} applying the powers of this integral operator on the underlying/target function $f^*$ that generates the responses/labels. We show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low-rank approximation error at a linear rate which is determined by $f^*$ and $\rho$ only, i.e., the rate is independent of the sample size $n$. Furthermore, if $f^*$ has zero low-rank approximation error, then, as long as the width of the neural network is $\Omega(n\log n)$, the empirical risk decreases to $\Theta(1/\sqrt{n})$. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization. We provide an application of our general results to the setting where $\rho$ is the uniform distribution on the sphere and $f^*$ is a polynomial. Throughout this paper, we consider the scenario where the input dimension $d$ is fixed.
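To make the operator view concrete, here is a minimal sketch of the dynamics the abstract describes, in our own illustrative notation (the symbols $u_k$, $\eta$, $K_n$, $K$, and $L_K$ are assumptions of this sketch, not necessarily the paper's). Let $u_k \in \mathbb{R}^n$ denote the vector of prediction errors on the $n$ training points at GD iteration $k$, and let $\eta$ be the step size. In this regime the error evolution is approximately linear:

$$u_{k+1} \approx (I - \eta K_n)\, u_k \quad\Longrightarrow\quad u_k \approx (I - \eta K_n)^k\, u_0,$$

where $K_n \in \mathbb{R}^{n \times n}$ is a kernel matrix built from the training features. For a sufficiently wide network, $K_n$ concentrates around an integral operator determined by $\rho$ alone,

$$(L_K f)(x) = \int K(x, x')\, f(x')\, \mathrm{d}\rho(x'),$$

with eigenpairs $\{(\lambda_i, \phi_i)\}$. Expanding $f^* = \sum_i \theta_i \phi_i$ in this eigenbasis, each GD iteration shrinks the component along $\phi_i$ by roughly a factor $(1 - \eta \lambda_i)$, so the components with large eigenvalues (the low-rank part of $f^*$) are fitted at a linear rate, while the mass of $f^*$ on the small-eigenvalue tail determines the residual error floor.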


Reviews: On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

Neural Information Processing Systems

One of the biggest reasons that I am not too thrilled about this submission is that the two-layer fully connected neural network model under consideration is non-standard; the weights of the second layer are fixed to binary values (i.e., +1 and -1 with uniform scaling) and are not updated via the gradient descent procedure. Correct me if I am wrong: this assumption is neither commonly adopted in experiments nor identical to that of comparable theoretical works such as [DLL+18] or [ADH+19]. If the authors are convinced of the significance or general applicability of the suggested framework, they should have taken more care communicating it to the audience. A relatively minor issue concerns the significance of the c_1 term in Theorem 3. The authors nicely demonstrate that the constant c_1 satisfying (13) can be controlled via a sum whose dominating term is inversely proportional to (lambda_{m_l} - lambda_{m_l+1}), which is later formalized in Theorem 4. (By the way, I had trouble locating a formal definition of eps(f^*, l).) The question is: can we guarantee that this eigengap is large enough to ensure that the c_1 term is negligible? I am particularly worried about this, as the authors have already mentioned (line 145) that the spectrum of the random matrix concentrates as n grows.
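For reference, the architecture the reviewer is describing can be written as follows (a minimal sketch in our own notation; the $1/\sqrt{m}$ factor is our assumption for the "uniform scaling" mentioned above, not stated in the review):

$$f(x; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma(\langle w_r, x \rangle), \qquad a_r \in \{+1, -1\}, \qquad \sigma(z) = \max(z, 0),$$

where only the first-layer weights $W = (w_1, \dots, w_m)$ are trained by gradient descent; the second-layer signs $a_r$ are fixed at initialization and never updated.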


Reviews: On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

Neural Information Processing Systems

This paper gives convergence guarantees for training neural networks via gradient descent. The approach consists of viewing GD as an operator and analyzing it through its eigenvalues. The analysis focuses on the over-parameterized setting of such networks. This is an interesting paper with interesting theoretical results. It is clearly above the acceptance threshold.

