AITopics | Colin Wei

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Neural Information Processing SystemsMar-26-2025, 18:30:13 GMT

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing. Our experiments show that this causes the small learning rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.

artificial intelligence, learning rate, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > Canada (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

Neural Information Processing SystemsMar-26-2025, 03:04:34 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Colin Wei, Tengyu Ma

Neural Information Processing SystemsMar-22-2025, 15:55:09 GMT

Existing Rademacher complexity bounds for neural networks rely only on norm control of the weight matrices and depend exponentially on depth via a product of the matrix norms. Lower bounds show that this exponential dependence on depth is unavoidable when no additional properties of the training data are considered. We suspect that this conundrum comes from the fact that these bounds depend on the training data only through the margin. In practice, many data-dependent techniques such as Batchnorm improve the generalization performance. For feedforward neural nets as well as RNNs, we obtain tighter Rademacher complexity bounds by considering additional data-dependent properties of the network: the norms of the hidden layers of the network, and the norms of the Jacobians of each layer with respect to all previous layers. Our bounds scale polynomially in depth when these empirical quantities are small, as is usually the case in practice. To obtain these bounds, we develop general tools for augmenting a sequence of functions to make their composition Lipschitz and then covering the augmented functions. Inspired by our theory, we directly regularize the network's Jacobians during training and empirically demonstrate that this improves test performance.

artificial intelligence, generalization, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)

Add feedback

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Yuanzhi Li, Colin Wei, Tengyu Ma

Neural Information Processing SystemsJan-26-2025, 18:43:45 GMT

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing. Our experiments show that this causes the small learning rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.

artificial intelligence, learning rate, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > Canada (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

Neural Information Processing SystemsJan-25-2025, 09:02:15 GMT

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK).

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Add feedback

Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Colin Wei, Tengyu Ma

Neural Information Processing SystemsJan-21-2025, 13:45:49 GMT

Existing Rademacher complexity bounds for neural networks rely only on norm control of the weight matrices and depend exponentially on depth via a product of the matrix norms. Lower bounds show that this exponential dependence on depth is unavoidable when no additional properties of the training data are considered. We suspect that this conundrum comes from the fact that these bounds depend on the training data only through the margin. In practice, many data-dependent techniques such as Batchnorm improve the generalization performance. For feedforward neural nets as well as RNNs, we obtain tighter Rademacher complexity bounds by considering additional data-dependent properties of the network: the norms of the hidden layers of the network, and the norms of the Jacobians of each layer with respect to all previous layers. Our bounds scale polynomially in depth when these empirical quantities are small, as is usually the case in practice. To obtain these bounds, we develop general tools for augmenting a sequence of functions to make their composition Lipschitz and then covering the augmented functions. Inspired by our theory, we directly regularize the network's Jacobians during training and empirically demonstrate that this improves test performance.

artificial intelligence, generalization, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)

Add feedback

Filters

Collaborating Authors

Colin Wei

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation