When Do Neural Networks Outperform Kernel Methods?
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NNs) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS methods, and we know of special examples for which SGD-trained NNs provably outperform RKHS methods. This is true even in the wide-network limit, for a different scaling of the initialization. How can we reconcile the above claims?
Review for NeurIPS paper: When Do Neural Networks Outperform Kernel Methods?
Summary and Contributions: This paper mainly studies the approximation error of random feature (RF) models, neural tangent (NT) kernel models, and two-layer neural network (NN) models under non-uniform data distributions. Specifically, an "effective dimension" is defined to characterize the informative dimension of the data, which depends both on the number of coordinates used to generate the target (d0) and on the noise level in the remaining coordinates. For the RF and NT models, the approximation error bounds depend on the effective dimension, while for the NN model the bound depends only on d0. This difference between the two types of models helps explain the performance gap between kernel methods and two-layer neural networks.
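To make the reviewed setup concrete, the following is a minimal sketch (not the paper's exact construction) of an anisotropic data model in which only the first d0 coordinates carry signal and the remaining d - d0 coordinates are low-variance noise; the noise scale is the knob that, in the paper's terminology, controls the effective dimension seen by kernel methods. The target function here (`np.sin(...).sum(...)`) and the parameter values are illustrative assumptions.

```python
import numpy as np

# Sketch of an anisotropic data model (illustrative, not the paper's exact one):
# only the first d0 coordinates carry signal; the remaining d - d0 coordinates
# are pure noise whose scale controls the "effective dimension".
rng = np.random.default_rng(0)
d, d0, n = 100, 5, 1000        # assumed values, for illustration only
noise_scale = 0.1              # assumed noise level on uninformative coordinates

X_signal = rng.standard_normal((n, d0))
X_noise = noise_scale * rng.standard_normal((n, d - d0))
X = np.hstack([X_signal, X_noise])

# The target depends only on the d0 informative coordinates, so a model whose
# error scales with d0 (the NN bound) is insensitive to the noise coordinates,
# while a model whose error scales with the effective dimension (RF/NT) is not.
y = np.sin(X[:, :d0]).sum(axis=1)
```

Shrinking `noise_scale` toward zero shrinks the effective dimension toward d0, which is the regime where the reviewed bounds predict kernel methods and NNs behave similarly.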