Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Wei, Colin, Lee, Jason D., Liu, Qiang, Ma, Tengyu

Neural Information Processing Systems

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\Omega(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable.
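
As a reading aid, the display below sketches the weak-regularization limit that result i) refers to; the notation ($f_\Theta$, $\lambda$, $r$, the homogeneity degree $a$, and $\gamma^\star$) is our own shorthand, and the exact norm and homogeneity conditions are left to the paper.

For a feedforward ReLU net $f_\Theta$ that is positively homogeneous of degree $a$ in its parameters, and data $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{\pm 1\}$, consider the weakly-regularized cross-entropy objective
$$L_\lambda(\Theta) \;=\; \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i f_\Theta(x_i)}\bigr) \;+\; \lambda\, \|\Theta\|^{r}.$$
Result i) says, roughly, that as $\lambda \to 0$ every global minimizer $\Theta_\lambda$ of $L_\lambda$ attains the maximum normalized margin
$$\gamma^\star \;=\; \max_{\|\Theta\| \le 1} \; \min_{1 \le i \le n} \; y_i f_\Theta(x_i), \qquad \text{in the sense that} \quad \frac{\min_i y_i f_{\Theta_\lambda}(x_i)}{\|\Theta_\lambda\|^{a}} \;\longrightarrow\; \gamma^\star,$$
after which generalization follows from margin-based bounds.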


Reviews: Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Neural Information Processing Systems

Summary: The paper studies the generalization and optimization aspects of regularized neural networks and provides two key contributions: (a) it shows an O(d) sample-complexity gap between the global minima of the regularized loss and the induced kernel method; (b) it establishes that for infinite-width two-layer nets, a variant of gradient descent converges to a global minimum of the (weakly) regularized cross-entropy loss in polynomially many iterations. The paper studies a natural and important problem and makes fundamental contributions in this direction. Recent results in deep learning theory exploit the neural tangent kernel connection to prove optimization and generalization results, so it is important to study the limitations of this connection.
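
For concreteness, the sketch below sets up the kind of training the summary describes: a wide two-layer ReLU net fit by full-batch gradient descent on a weakly $\ell_2$-regularized cross-entropy (logistic) loss. Plain gradient descent stands in for the specific variant analyzed in the paper, and the synthetic data, width, step size, and regularization strength are illustrative assumptions, not the paper's construction.

import numpy as np

# Minimal sketch (not the paper's algorithm): a wide two-layer ReLU net trained by
# full-batch gradient descent on a weakly l2-regularized cross-entropy (logistic) loss.
# Width, step size, lambda, and the synthetic data are illustrative assumptions.

rng = np.random.default_rng(0)
d, n, width = 20, 200, 2048
lam, lr, steps = 1e-4, 0.1, 2000          # weak regularizer, step size, iterations

X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] * X[:, 1] + 1e-12)    # toy labels in {-1, +1} (placeholder data)

W = rng.standard_normal((width, d)) / np.sqrt(d)
a = rng.standard_normal(width) / np.sqrt(width)

def forward(X, W, a):
    H = np.maximum(X @ W.T, 0.0)                 # ReLU hidden activations, shape (n, width)
    return H @ a / np.sqrt(W.shape[0]), H        # f(x) = a^T relu(W x) / sqrt(width)

for t in range(steps):
    f, H = forward(X, W, a)
    margins = y * f
    p = 1.0 / (1.0 + np.exp(np.clip(margins, -30, 30)))   # sigmoid(-y f), clipped for stability
    g_f = -(y * p) / n                                     # d(mean logistic loss) / d f, per example
    grad_a = H.T @ g_f / np.sqrt(width) + 2 * lam * a
    grad_W = ((np.outer(g_f, a) * (H > 0)).T @ X) / np.sqrt(width) + 2 * lam * W
    a -= lr * grad_a
    W -= lr * grad_W

f, _ = forward(X, W, a)
reg = lam * (np.sum(W ** 2) + np.sum(a ** 2))
loss = np.mean(np.logaddexp(0.0, -y * f)) + reg
print(f"regularized loss {loss:.4f}, train accuracy {np.mean(np.sign(f) == y):.3f}")

The $\ell_2$ penalty is deliberately small here, matching the weak-regularization regime in which the max-margin characterization sketched above applies.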


Reviews: Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Neural Information Processing Systems

This paper investigates how regularization helps in training neural networks, in contrast to the unregularized neural tangent kernel (NTK) method. It is shown that the regularized network captures the "informative signal" while the NTK model does not, which highlights the effectiveness of regularization. Moreover, the paper shows polynomial-time convergence of the gradient flow corresponding to the infinite-width neural network. The contribution is novel and the implications are quite instructive for neural tangent kernel learning. In particular, the lower-bound analysis for kernel learning is a novel contribution.
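
To make the contrast with the kernel side concrete, the sketch below builds an empirical-NTK baseline: the feature map is the parameter gradient of a two-layer ReLU net at random initialization, and the predictor is the minimum-norm least-squares fit in that feature space (equivalently, ridgeless regression with the empirical NTK). The data, width, and scaling choices are illustrative assumptions, not the distribution or kernel construction analyzed in the paper.

import numpy as np

# Sketch of an empirical-NTK baseline: kernel regression with the feature map
# phi(x) = grad_theta f_theta(x) evaluated at a frozen random initialization.
# Data, width, and scalings are illustrative; this is not the paper's construction.

rng = np.random.default_rng(1)
d, n, width = 20, 200, 512

X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] * X[:, 1] + 1e-12)               # toy labels in {-1, +1}

W0 = rng.standard_normal((width, d)) / np.sqrt(d)    # frozen initialization
a0 = rng.standard_normal(width) / np.sqrt(width)

def ntk_features(X, W, a):
    # Gradients of f(x) = a^T relu(W x) / sqrt(width) w.r.t. (a, W); their inner
    # products define the empirical NTK: K(x, x') = phi(x) . phi(x').
    m = W.shape[0]
    H = np.maximum(X @ W.T, 0.0)                      # d f / d a_k   (up to 1/sqrt(m))
    G = (X @ W.T > 0).astype(float) * a               # a_k * 1[w_k . x > 0]
    gradW = (G[:, :, None] * X[:, None, :]).reshape(len(X), -1)   # d f / d W_{kj}, flattened
    return np.concatenate([H, gradW], axis=1) / np.sqrt(m)

Phi = ntk_features(X, W0, a0)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # min-norm fit = "ridgeless" kernel regression
print("empirical-NTK train accuracy:", np.mean(np.sign(Phi @ coef) == y))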

