AITopics | Tengyu Ma

Verified Uncertainty Calibration

Neural Information Processing Systems http://nips.cc/

calibration error, data mining, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > Canada (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Yuanzhi Li, Colin Wei, Tengyu Ma

Neural Information Processing Systems | Mar-26-2025, 18:30:13 GMT

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two-layer network trained with a large initial learning rate and annealing generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing. Our experiments show that this causes the small learning rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.
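
To make "large initial learning rate with annealing" concrete, here is a minimal sketch of a one-step annealing schedule applied to plain gradient descent on a toy quadratic. The function names, the values 0.1 and 0.01, and the annealing step are illustrative assumptions, not the paper's settings, and the toy only shows the mechanics of the schedule, not the generalization effect the abstract describes.

```python
def step_schedule(step, initial_lr=0.1, anneal_at=100, factor=0.1):
    """Hypothetical schedule: a large rate at first, multiplied by `factor` once."""
    return initial_lr if step < anneal_at else initial_lr * factor


def run_gd(lr_fn, steps=200, w0=5.0):
    """Gradient descent on f(w) = 0.5 * w**2, whose gradient is simply w."""
    w = w0
    for t in range(steps):
        w -= lr_fn(t) * w  # the step size comes from the schedule
    return w


if __name__ == "__main__":
    print(step_schedule(0), step_schedule(150))
    print(run_gd(step_schedule))
```

A small-learning-rate baseline in the same sketch would be `run_gd(lambda t: 0.01)`, which uses the annealed rate from the first step.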

artificial intelligence, learning rate, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > Canada (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Colin Wei, Tengyu Ma

Neural Information Processing Systems | Mar-22-2025, 15:55:09 GMT

Existing Rademacher complexity bounds for neural networks rely only on norm control of the weight matrices and depend exponentially on depth via a product of the matrix norms. Lower bounds show that this exponential dependence on depth is unavoidable when no additional properties of the training data are considered. We suspect that this conundrum comes from the fact that these bounds depend on the training data only through the margin. In practice, many data-dependent techniques such as Batchnorm improve the generalization performance. For feedforward neural nets as well as RNNs, we obtain tighter Rademacher complexity bounds by considering additional data-dependent properties of the network: the norms of the hidden layers of the network, and the norms of the Jacobians of each layer with respect to all previous layers. Our bounds scale polynomially in depth when these empirical quantities are small, as is usually the case in practice. To obtain these bounds, we develop general tools for augmenting a sequence of functions to make their composition Lipschitz and then covering the augmented functions. Inspired by our theory, we directly regularize the network's Jacobians during training and empirically demonstrate that this improves test performance.
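
As a rough illustration of what "regularizing the network's Jacobians" can look like, the sketch below computes the squared Frobenius norm of the hidden-layer Jacobian of a single tanh layer and adds it to a data loss. The layer form, the names `hidden_jacobian`, `jacobian_penalty`, `regularized_loss`, and the weight `lam` are assumptions made for this example; the abstract does not specify the exact regularizer used in the paper.

```python
import numpy as np


def hidden_jacobian(W1, x):
    """Jacobian of h = tanh(W1 @ x) with respect to x: diag(1 - h**2) @ W1."""
    h = np.tanh(W1 @ x)
    return (1.0 - h ** 2)[:, None] * W1


def jacobian_penalty(W1, x):
    """Squared Frobenius norm of the hidden-layer Jacobian at input x."""
    J = hidden_jacobian(W1, x)
    return float(np.sum(J ** 2))


def regularized_loss(data_loss, W1, x, lam=0.1):
    """Hypothetical objective: data loss plus a weighted Jacobian penalty."""
    return data_loss + lam * jacobian_penalty(W1, x)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 3))
    x = rng.standard_normal(3)
    print(jacobian_penalty(W1, x))
```

At `x = 0` the tanh derivative is 1 everywhere, so the penalty reduces to the squared Frobenius norm of `W1`, which is a convenient sanity check.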

artificial intelligence, generalization, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

Neural Information Processing Systems | Jan-25-2025, 09:02:15 GMT

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK).

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

A Non-generative Framework and Convex Relaxations for Unsupervised Learning

Elad Hazan, Tengyu Ma

Neural Information Processing Systems | Jan-20-2025, 19:04:24 GMT

We give a novel formal theoretical framework for unsupervised learning with two distinctive characteristics. First, it does not assume any generative model and is based on a worst-case performance metric. Second, it is comparative, namely performance is measured with respect to a given hypothesis class. This makes it possible to avoid known computational hardness results and to use improper algorithms based on convex relaxations. We show how several families of unsupervised learning models, which were previously only analyzed under probabilistic assumptions and are otherwise provably intractable, can be efficiently learned in our framework by convex optimization.

artificial intelligence, deep learning, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe (1.00)
Asia > Middle East (0.28)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Matrix Completion has No Spurious Local Minimum

Rong Ge, Jason D. Lee, Tengyu Ma

Neural Information Processing Systems | Jan-20-2025, 14:07:08 GMT

Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima - all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve positive semidefinite matrix completion with arbitrary initialization in polynomial time. The result can be generalized to the setting when the observed entries contain noise. We believe that our main proof strategy can be useful for understanding geometric properties of other statistical problems involving partial or noisy observations.
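
The kind of non-convex objective the abstract refers to can be sketched as follows: factor the unknown positive semidefinite matrix as X Xᵀ, penalize squared error on the observed entries only, and run gradient descent from a random start. This is a minimal sketch under stated assumptions: it omits any regularizer the paper may use, and the function names, step size, and iteration count are illustrative choices rather than values from the paper.

```python
import numpy as np


def masked_loss(X, M, mask):
    """0.5 * sum of squared errors of X @ X.T against M on observed entries."""
    R = mask * (X @ X.T - M)
    return 0.5 * float(np.sum(R ** 2))


def masked_grad(X, M, mask):
    """Gradient of masked_loss with respect to X."""
    R = mask * (X @ X.T - M)
    return (R + R.T) @ X


def complete_psd(M, mask, rank, lr=0.05, steps=1000, seed=0):
    """Gradient descent on the factor X from a random initialization."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    X = rng.standard_normal((n, rank)) / np.sqrt(n)
    for _ in range(steps):
        X -= lr * masked_grad(X, M, mask)
    return X


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, r = 20, 2
    Z = rng.standard_normal((n, r)) / np.sqrt(n)
    M = Z @ Z.T                                  # ground-truth PSD matrix
    upper = np.triu(rng.random((n, n)) < 0.5)    # observe about half the entries
    mask = (upper | upper.T).astype(float)       # symmetric observation pattern
    X = complete_psd(M, mask, rank=r)
    print(masked_loss(X, M, mask))
```

Calling `complete_psd` with `steps=0` returns the random starting point, which gives a baseline loss to compare against.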

artificial intelligence, machine learning, optimality condition, (16 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)

On the Optimization Landscape of Tensor Decompositions

Rong Ge, Tengyu Ma

Neural Information Processing Systems | Oct-8-2024, 05:33:06 GMT

Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that "all local optima are (approximately) global optima", and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult. In this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models.

artificial intelligence, kac-rice formula, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.89)
