 eigenvalue decay


Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

Surbhi Goel, Adam Klivans

Neural Information Processing Systems

We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs).
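The assumption in this abstract is quantitative: the eigenvalues of the kernel Gram matrix over samples from the input distribution should decay, e.g. polynomially as lambda_i <= C * i^(-p). As a minimal sketch of what such a condition looks like empirically, and not the paper's algorithm, the snippet below computes the spectrum of an RBF Gram matrix and tests it against a polynomial envelope; the kernel choice, bandwidth, the constants C and p, and the function names are all illustrative assumptions.

```python
# Sketch (not the paper's algorithm): empirically inspect eigenvalue decay
# of a kernel Gram matrix, the distributional quantity the assumption is
# stated in terms of. The RBF kernel, bandwidth, and the envelope
# parameters C and p are illustrative choices, not taken from the paper.
import numpy as np

def gram_eigenvalues(X, bandwidth=1.0):
    """Eigenvalues, in descending order, of the RBF Gram matrix
    K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return np.sort(np.linalg.eigvalsh(K))[::-1]

def satisfies_poly_decay(eigvals, C=1.0, p=2.0):
    """Check the illustrative decay condition lambda_i <= C * i^(-p)
    (1-indexed), after normalizing so that lambda_1 = 1."""
    idx = np.arange(1, len(eigvals) + 1, dtype=float)
    return bool(np.all(eigvals / eigvals[0] <= C * idx ** (-p)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))  # 200 samples standing in for the input distribution
print(satisfies_poly_decay(gram_eigenvalues(X)))  # True iff the envelope holds here
```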


d0f5edad9ac19abed9e235c0fe0aa59f-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewer for providing constructive feedback and suggestions. By now, the number of papers (and books!) using these two parametrizations is substantial. In this view, the specific regime in which our rates are better is not really important; instead, we just consider it an interesting setting that researchers have ignored for a long time. An example of a "strong" assumption is the case of zero Bayes error w.r.t. the square loss; whether an assumption is weak or strong is completely a problem-dependent judgment. Our final comments on the bias of the community towards "weak assumptions" were meant to invite exactly this kind of exchange, so we are happy that the reviewer engaged with us in this discussion! Moreover, we do plan to extend the results we presented to smooth classification losses, such as the squared hinge loss.
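For readers unfamiliar with the loss named in the last sentence: the squared hinge loss is l(y, f(x)) = max(0, 1 - y * f(x))^2 for labels y in {-1, +1}; unlike the plain hinge loss it is differentiable everywhere, which is what makes it a "smooth" classification loss. A minimal illustration (the function name is ours):

```python
# Squared hinge loss: max(0, 1 - y * score)^2 for labels y in {-1, +1}.
# Its derivative is continuous at margin 0, unlike the plain hinge loss.
import numpy as np

def squared_hinge(y, score):
    margin = 1.0 - y * score
    return np.maximum(0.0, margin) ** 2

print(squared_hinge(np.array([1, -1]), np.array([0.3, 0.3])))  # [0.49 1.69]
```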




72e6d3238361fe70f22fb0ac624a7072-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all reviewers for their helpful feedback. Below we address the questions and comments individually. We will correct typos in the main text and bibliography, and refer to Figure 1 in the introduction. We apologize for the confusion. The VAMP framework does not capture our "aligned" or "misaligned" cases.


Optimal Learning Rates for Regularized Conditional Mean Embedding

Zhu Li

Neural Information Processing Systems

Given random variables X and Y, the conditional expectation operator applied to a function f is defined by [Pf](x) := E[f(Y) | X = x]. Existing analyses require an explicit relation between the smoothness of the target CME and the size of the RKHS.
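As a hedged sketch of how the operator defined above is typically estimated, not necessarily the exact estimator analyzed in this paper: the regularized (kernel ridge regression) CME estimate approximates [Pf](x) by f(y)^T (K + n*lambda*I)^{-1} k(x), where K is the Gram matrix over the inputs x_i and k(x) is the vector of kernel evaluations at the query point. The RBF kernel, bandwidth, regularization constant, and function names below are illustrative assumptions.

```python
# Sketch of the standard regularized CME estimator for E[f(Y) | X = x]:
#   [Pf](x) ~ f(y)^T (K + n * lam * I)^{-1} k(x).
# Kernel and hyperparameters are illustrative, not from the paper.
import numpy as np

def rbf(A, B, s=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d / (2 * s**2))

def cme_estimate(X, Y, f, x_query, lam=1e-2):
    """Estimate E[f(Y) | X = x] at each query point via kernel ridge regression."""
    n = len(X)
    K = rbf(X, X)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), f(Y))  # (K + n lam I)^{-1} f(y)
    return rbf(x_query, X) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal((500, 1))  # so E[Y | X = x] = sin(x)
est = cme_estimate(X, Y, lambda y: y[:, 0], np.array([[1.0]]))
print(est, np.sin(1.0))  # estimate vs. ground truth at x = 1
```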


