Zhu, Libin
Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
Zhu, Libin, Liu, Chaoyue, Radhakrishnan, Adityanarayanan, Belkin, Mikhail
In this paper, we first explain the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that these spikes are "catapults", an optimization phenomenon originally observed in gradient descent (GD) with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we explain how catapults lead to better generalization by demonstrating that they promote feature learning, increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
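For concreteness, the AGOP of a predictor $f$ on inputs $x_1, \dots, x_n$ is the matrix $\frac{1}{n}\sum_{i=1}^n \nabla_x f(x_i)\,\nabla_x f(x_i)^\top$, and one natural way to quantify alignment between two predictors is the cosine similarity of their AGOP matrices under the Frobenius inner product (the paper may use a related measure). The JAX sketch below illustrates such a measurement on a toy example; the two-layer network, data, and target function are illustrative assumptions, not the paper's experimental setup.

    # Minimal sketch: AGOP of a predictor and its alignment with the AGOP of a
    # target function. The toy model, data, and target are illustrative
    # assumptions, not the setup used in the paper.
    import jax
    import jax.numpy as jnp

    def agop(f, xs):
        """Average Gradient Outer Product: (1/n) sum_i grad f(x_i) grad f(x_i)^T."""
        grads = jax.vmap(jax.grad(f))(xs)              # (n, d)
        return grads.T @ grads / xs.shape[0]           # (d, d)

    def alignment(A, B):
        """Cosine similarity of two matrices under the Frobenius inner product."""
        return jnp.sum(A * B) / (jnp.linalg.norm(A) * jnp.linalg.norm(B))

    key = jax.random.PRNGKey(0)
    k1, k2, k3 = jax.random.split(key, 3)
    d, n, m = 10, 200, 64
    xs = jax.random.normal(k1, (n, d))

    # Toy target: depends only on the first two coordinates.
    target = lambda x: x[0] * x[1]

    # Toy two-layer predictor with random weights (stands in for a trained network).
    W = jax.random.normal(k2, (m, d)) / jnp.sqrt(d)
    v = jax.random.normal(k3, (m,)) / jnp.sqrt(m)
    model = lambda x: v @ jnp.tanh(W @ x)

    print("AGOP alignment:", alignment(agop(model, xs), agop(target, xs)))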
Quadratic models for understanding neural network dynamics
Zhu, Libin, Liu, Chaoyue, Radhakrishnan, Adityanarayanan, Belkin, Mikhail
A recent remarkable finding on neural networks, originating in [9] and termed the "transition to linearity" in [16], is that, as network width goes to infinity, such models become linear functions in the parameter space. Thus, a linear (in parameters) model can be built to accurately approximate wide neural networks under certain conditions. While this finding has helped improve our understanding of trained neural networks [4, 20, 29, 18, 11, 3], not all properties of finite-width neural networks can be understood in terms of linear models, as shown in several recent works [27, 21, 17, 6]. In this work, we show that properties of finite-width neural networks in optimization and generalization that cannot be captured by linear models are, in fact, manifested in quadratic models.
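To make the distinction concrete, the sketch below compares the first-order (linear) and second-order (quadratic) Taylor expansions of a small network's output in parameter space around a random initialization; the tiny tanh network and the size of the parameter displacement are illustrative assumptions, not the models analyzed in the paper.

    # Minimal sketch: linear vs. quadratic Taylor approximation of a small
    # network's output as a function of its parameters. The network, input,
    # and displacement size are illustrative assumptions.
    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    kx, kw, kd = jax.random.split(key, 3)
    d, m = 5, 8
    x = jax.random.normal(kx, (d,))

    def f(w):
        """Scalar output of a two-layer tanh network; w packs both layers."""
        W1 = w[: m * d].reshape(m, d)
        v = w[m * d :]
        return v @ jnp.tanh(W1 @ x) / jnp.sqrt(m)

    w0 = jax.random.normal(kw, (m * d + m,)) / jnp.sqrt(d)
    dw = jax.random.normal(kd, w0.shape)
    dw = 0.5 * dw / jnp.linalg.norm(dw)          # fixed O(1) displacement in parameter space

    g = jax.grad(f)(w0)
    H = jax.hessian(f)(w0)

    exact = f(w0 + dw)
    linear = f(w0) + g @ dw
    quadratic = linear + 0.5 * dw @ H @ dw

    print("linear approx error:   ", abs(exact - linear))
    print("quadratic approx error:", abs(exact - quadratic))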
Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture
Zhu, Libin, Liu, Chaoyue, Belkin, Mikhail
In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo a transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, excluding the input and first layers. Our results identify the mathematical structure underlying the transition to linearity and generalize a number of recent works aimed at characterizing the transition to linearity, or constancy of the Neural Tangent Kernel, for standard architectures.
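Schematically, and deferring to the paper for the precise statement and constants, transition to linearity means that the second-order remainder in the Taylor expansion of the network function in its parameters becomes negligible, within an O(1) ball around the initialization $w_0$, as the width $m$ (here, the minimum in-degree) grows:

    $f(w) \;=\; f(w_0) \;+\; \nabla f(w_0)^{\top}(w - w_0) \;+\; \underbrace{\tfrac{1}{2}\,(w - w_0)^{\top} H(\xi)\,(w - w_0)}_{\text{remainder}}, \qquad \|H(\xi)\| \to 0 \text{ as } m \to \infty,$

so that within such a ball the network function is approximately linear in its parameters and, equivalently, its tangent kernel $K(x, z) = \langle \nabla_w f(w; x), \nabla_w f(w; z) \rangle$ stays approximately constant.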
Restricted Strong Convexity of Deep Learning Models with Smooth Activations
Banerjee, Arindam, Cisneros-Velarde, Pedro, Zhu, Libin, Belkin, Mikhail
We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the "near initialization" perspective, we shed considerable new light on it. In particular, we make two key technical contributions for such models with $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish an $O(\frac{\text{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC), which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC-based analysis does not need the "near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result establishing geometric convergence of GD based on RSC for deep learning models, thus providing an alternative sufficient condition for convergence that does not depend on the widely used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.
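The Hessian bound can be probed empirically. The sketch below estimates the spectral norm of the Hessian (with respect to the parameters) of a small smooth-activation (tanh) network's output via power iteration on Hessian-vector products, for a few widths $m$; the specific network, input, and widths are illustrative assumptions, not the experimental setup of the paper.

    # Minimal sketch: spectral norm of the Hessian of a tanh network's output
    # w.r.t. its parameters, estimated by power iteration on Hessian-vector
    # products, for several widths m. Illustrative toy setup only.
    import jax
    import jax.numpy as jnp

    def hessian_spectral_norm(f, w, key, iters=50):
        """Power iteration on v -> H v, with H v computed as a JVP of grad f."""
        v = jax.random.normal(key, w.shape)
        v = v / jnp.linalg.norm(v)
        for _ in range(iters):
            hv = jax.jvp(jax.grad(f), (w,), (v,))[1]   # Hessian-vector product
            v = hv / jnp.linalg.norm(hv)
        return jnp.linalg.norm(jax.jvp(jax.grad(f), (w,), (v,))[1])

    key = jax.random.PRNGKey(0)
    kx, kw = jax.random.split(key)
    d = 10
    x = jax.random.normal(kx, (d,))

    for m in [32, 128, 512, 2048]:
        k1, k2 = jax.random.split(jax.random.fold_in(kw, m))
        W1 = jax.random.normal(k1, (m, d)) / jnp.sqrt(d)     # NTK-style scaling
        a_out = jax.random.normal(k2, (m,))
        w0 = jnp.concatenate([W1.ravel(), a_out])

        def f(w, m=m):
            W = w[: m * d].reshape(m, d)
            a = w[m * d :]
            return a @ jnp.tanh(W @ x) / jnp.sqrt(m)

        print(m, hessian_spectral_norm(f, w0, kw))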
A note on Linear Bottleneck networks and their Transition to Multilinearity
Zhu, Libin, Pandit, Parthe, Belkin, Mikhail
For a wide neural network (WNN), when the network width is sufficiently large, there exists a linear function of the parameters, arbitrarily close to the network function, in a ball of radius O(1) around random initialization in parameter space. This local linearity explains the equivalence between optimizing wide neural networks with small learning rates and neural tangent kernel (NTK) regression, first shown in [13]. However, an important assumption for this transition to linearity [18] to hold is that every layer must be sufficiently wide. If there is even one narrow "bottleneck" hidden layer, yielding a so-called bottleneck neural network (BNN), the transition to linearity does not occur, as shown in [18]. An immediate question is then: what functions of the weights does a neural network with a bottleneck layer represent?
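As the title indicates, the answer involves a transition to multilinearity. As a toy illustration of what "multilinear in the weights" means, consider a purely linear network with a narrow hidden layer: its output is linear in each weight matrix separately, but not in all of them jointly. The sketch below checks this numerically; the depth and dimensions are illustrative assumptions, not the paper's setting.

    # Minimal sketch of multilinearity for a deep *linear* network
    # f(x) = W3 @ W2 @ W1 @ x: the output is linear in each weight matrix
    # separately, but scales as alpha**3 when all three are scaled by alpha.
    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    k1, k2, k3, kx = jax.random.split(key, 4)
    W1 = jax.random.normal(k1, (4, 10))   # narrow "bottleneck" hidden layers
    W2 = jax.random.normal(k2, (4, 4))
    W3 = jax.random.normal(k3, (1, 4))
    x = jax.random.normal(kx, (10,))

    f = lambda W1, W2, W3: (W3 @ W2 @ W1 @ x)[0]

    alpha = 3.0
    print(f(alpha * W1, W2, W3) / f(W1, W2, W3))                   # = alpha: linear in W1 alone
    print(f(alpha * W1, alpha * W2, alpha * W3) / f(W1, W2, W3))   # = alpha**3: not jointly linear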
On the linearity of large non-linear models: when and why the tangent kernel is constant
Liu, Chaoyue, Zhu, Libin, Belkin, Mikhail
The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling, applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of the constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks: it does not hold when the last layer of the network is non-linear, and it is not necessary for successful optimization by gradient descent.
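Constancy of the tangent kernel can be checked directly. The sketch below computes the tangent kernel $K = J J^\top$ (with $J$ the Jacobian of the outputs with respect to the parameters) of a two-layer tanh network at initialization and after an O(1) parameter perturbation, and reports the relative change for increasing widths; the toy data, parameterization, and perturbation size are illustrative assumptions.

    # Minimal sketch: relative change of the tangent kernel of a two-layer tanh
    # network under an O(1) parameter perturbation, as the width m grows.
    # Illustrative toy setup only.
    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    kx, kp = jax.random.split(key)
    d, n = 10, 20
    X = jax.random.normal(kx, (n, d))

    def tangent_kernel(f, w):
        """K = J J^T, where J is the Jacobian of the n outputs w.r.t. parameters."""
        J = jax.jacobian(f)(w)                       # (n, p)
        return J @ J.T

    for m in [32, 128, 512, 2048]:
        k1, k2, k3 = jax.random.split(jax.random.fold_in(kp, m), 3)
        w0 = jnp.concatenate([
            jax.random.normal(k1, (m * d,)) / jnp.sqrt(d),
            jax.random.normal(k2, (m,)),
        ])

        def f(w, m=m):
            W = w[: m * d].reshape(m, d)
            a = w[m * d :]
            return jnp.tanh(X @ W.T) @ a / jnp.sqrt(m)   # (n,) outputs

        dw = jax.random.normal(k3, w0.shape)
        dw = dw / jnp.linalg.norm(dw)                    # O(1) step in parameter space

        K0 = tangent_kernel(f, w0)
        K1 = tangent_kernel(f, w0 + dw)
        print(m, float(jnp.linalg.norm(K1 - K0) / jnp.linalg.norm(K0)))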