
Collaborating Authors

 Morwani, Depen


Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

arXiv.org Artificial Intelligence

Recent advances in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the weight placed on the current gradient. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M-parameter language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which matches the performance of AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at: https://github.com/DepenM/Simplified-AdEMAMix/.
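As a rough illustration of the decoupling referenced above, the sketch below writes a generic momentum update in which the weight on the current gradient is a separate hyperparameter from the momentum coefficient, rather than being tied to it as in classical heavy-ball SGD. The function and hyperparameter names (`decoupled_momentum_step`, `grad_weight`, `beta`, `lr`) are illustrative and are not the paper's notation or the Simplified-AdEMAMix update itself.

```python
import torch

def decoupled_momentum_step(param, grad, buf, lr=1e-3, beta=0.9, grad_weight=1.0):
    """One SGD-style step where the current gradient's weight (grad_weight) is
    decoupled from the momentum coefficient (beta); tying grad_weight to beta
    recovers classical heavy-ball momentum. Illustrative sketch only."""
    buf.mul_(beta).add_(grad)                        # buf <- beta * buf + grad
    param.add_(grad_weight * grad + buf, alpha=-lr)  # param <- param - lr * (grad_weight * grad + buf)
    return param, buf
```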


How Does Critical Batch Size Scale in Pre-training?

arXiv.org Machine Learning

Efficient optimization is critical in pre-training large language models (LMs) at scale (McCandlish et al., 2018; Shoeybi et al., 2019; Kaplan et al., 2020). In particular, large-batch training is key to accelerating training, as it enables more efficient parallelism across hardware accelerators (You et al., 2017; Goyal et al., 2018). Specifically, understanding the scaling behavior of the critical batch size (CBS) is essential for optimizing data parallelism, as it defines the point beyond which increasing the batch size begins to degrade computational efficiency. Below the CBS, approximately linear scaling is achievable: doubling the batch size proportionally reduces the number of optimization steps required to reach a target loss. Beyond this threshold, however, further increases in batch size lead to diminishing returns, making it essential to balance computational efficiency with model performance (McCandlish et al., 2018; Shallue et al., 2019). This trade-off presents a challenge for studying pre-training under resource constraints, as practitioners are compelled to navigate difficult decisions in balancing compute, data, and training time. We investigate the scaling laws governing CBS in the context of autoregressive transformer-based language modeling (Vaswani et al., 2017; Radford et al., 2018). Analyzing CBS in pre-training is challenging due to the absence of a precise formalism relating it to model and data sizes in the literature (McCandlish et al., 2018; Kaplan et al., 2020).
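One common way to make this trade-off precise, following the empirical model of McCandlish et al. (2018) cited above (and not necessarily the exact formalism adopted in this paper), relates the number of optimization steps $S$ and the number of training examples $E$ needed to reach a fixed loss at batch size $B$:

$$\frac{S}{S_{\min}} = 1 + \frac{B_{\mathrm{crit}}}{B}, \qquad \frac{E}{E_{\min}} = 1 + \frac{B}{B_{\mathrm{crit}}}, \qquad \left(\frac{S}{S_{\min}} - 1\right)\!\left(\frac{E}{E_{\min}} - 1\right) = 1.$$

For $B \ll B_{\mathrm{crit}}$, doubling the batch size roughly halves the number of steps; for $B \gg B_{\mathrm{crit}}$, additional data parallelism buys little reduction in steps while consuming proportionally more data.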


Deconstructing What Makes a Good Optimizer for Language Models

arXiv.org Artificial Intelligence

Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably, both in their optimal performance and in how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification. Given these findings, we further dissect these approaches by examining two simplified versions of Adam: a) signed momentum (Signum), which we find recovers both the performance and hyperparameter stability of Adam, and b) Adalayer, a layerwise variant of Adam that we introduce to study Adam's preconditioning. Examining Adalayer leads us to the conclusion that the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters and that, perhaps surprisingly, the remaining layers can be trained with SGD.
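For reference, the signed-momentum (Signum) simplification mentioned above admits a very compact implementation. The sketch below is a generic version with illustrative hyperparameter names, not the authors' exact training configuration:

```python
import torch

def signum_step(param, grad, buf, lr=1e-3, beta=0.9, weight_decay=0.0):
    """Signum: maintain an exponential moving average of gradients and step in
    the direction of its sign; weight decay is applied here in decoupled form."""
    buf.mul_(beta).add_(grad, alpha=1 - beta)   # EMA of gradients
    if weight_decay:
        param.mul_(1 - lr * weight_decay)       # decoupled weight decay
    param.add_(buf.sign(), alpha=-lr)           # param <- param - lr * sign(buf)
    return param, buf
```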


A New Perspective on Shampoo's Preconditioner

arXiv.org Machine Learning

Shampoo, a second-order optimization algorithm that uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss-Newton component of the Hessian or of the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the optimal Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the square of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures, we empirically demonstrate that this is close to the optimal Kronecker product approximation. Additionally, for the Hessian approximation viewpoint, we empirically study the impact of various practical tricks used to make Shampoo more computationally efficient (such as using the batch gradient and the empirical Fisher) on the quality of the Hessian approximation.
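For context, a minimal sketch of Shampoo's Kronecker-factored preconditioner for a single matrix-shaped parameter is given below. Damping, root conventions, and update frequency vary across implementations; this is illustrative only and is not a reference implementation.

```python
import torch

def matrix_power(M, p, eps=1e-12):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = torch.linalg.eigh(M)
    return (vecs * vals.clamp(min=eps) ** p) @ vecs.T

def shampoo_step(G, L, R, lr=1e-2, eps=1e-8):
    """One Shampoo update for a matrix parameter with gradient G.
    L and R accumulate G @ G.T and G.T @ G (the Kronecker factors), and the
    gradient is preconditioned by their inverse fourth roots."""
    L += G @ G.T                                   # left factor,  shape (m, m)
    R += G.T @ G                                   # right factor, shape (n, n)
    precond = matrix_power(L + eps * torch.eye(L.shape[0]), -0.25) @ G \
              @ matrix_power(R + eps * torch.eye(R.shape[0]), -0.25)
    return -lr * precond, L, R
```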


Feature-Learning Networks Are Consistent Across Widths At Realistic Scales

arXiv.org Artificial Intelligence

We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets. Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their pointwise test predictions throughout training. For simple tasks such as CIFAR-5m, this holds throughout training for networks of realistic widths. We also show that structural properties of the models, including internal representations, preactivation distributions, edge-of-stability phenomena, and large-learning-rate effects, are consistent across large widths. This motivates the hypothesis that phenomena seen in realistic models can be captured by infinite-width, feature-learning limits. For harder tasks (such as ImageNet and language modeling) and at later training times, finite-width deviations grow systematically. Two distinct effects cause these deviations across widths. First, the network output has initialization-dependent variance that scales inversely with width, which can be removed by ensembling networks. Second, we observe that ensembles of narrower networks perform worse than a single wide network; we call this the bias of narrower width. We conclude with a spectral perspective on the origin of this finite-width bias.


Feature emergence via margin maximization: case studies in algebraic tasks

arXiv.org Artificial Intelligence

Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there has been significant recent progress in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question: why do networks arrive at particular computational strategies? Our inquiry focuses on the algebraic learning tasks of modular addition, sparse parities, and finite group operations. Our primary theoretical findings analytically characterize the features learned by stylized neural networks for these algebraic tasks. Notably, our main technique demonstrates how the principle of margin maximization alone can be used to fully specify the features learned by the network. Specifically, we prove that the trained networks utilize Fourier features to perform modular addition and employ features corresponding to irreducible group-theoretic representations to perform composition in general groups, aligning closely with the empirical observations of Nanda et al. and Chughtai et al. More generally, we hope our techniques can help foster a deeper understanding of why neural networks adopt specific computational strategies.
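To make the Fourier-feature claim concrete, here is the standard picture for modular addition (a worked identity illustrating the mechanism, not a derivation specific to this paper): if inputs $a, b \in \mathbb{Z}_p$ are embedded as $(\cos(2\pi k a / p), \sin(2\pi k a / p))$ for some frequency $k$, then product-to-sum identities allow the logit for a candidate answer $c$ to take the form

$$\cos\!\Big(\tfrac{2\pi k (a+b)}{p}\Big)\cos\!\Big(\tfrac{2\pi k c}{p}\Big) + \sin\!\Big(\tfrac{2\pi k (a+b)}{p}\Big)\sin\!\Big(\tfrac{2\pi k c}{p}\Big) = \cos\!\Big(\tfrac{2\pi k (a+b-c)}{p}\Big),$$

which is maximized exactly when $c \equiv a + b \pmod{p}$; summing over several frequencies $k$ sharpens this peak into the correct answer.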


Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

arXiv.org Artificial Intelligence

The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by a high learning rate or a small batch size ("SGD noise"). While prior works focused on offline learning (i.e., multi-epoch training), we study the impact of SGD noise on online (i.e., single-epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that large learning rates and small batch sizes do not confer any implicit-bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating larger or more cost-effective gradient steps. Our work suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient flow algorithm. We provide evidence to support this hypothesis by conducting experiments that reduce SGD noise during training and by measuring the pointwise functional distance between models trained with varying SGD noise levels but at equivalent loss values. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.


Simplicity Bias in 1-Hidden Layer Neural Networks

arXiv.org Artificial Intelligence

Recent works have demonstrated that neural networks exhibit extreme simplicity bias (SB): they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST and MNIST-CIFAR, where defining features is relatively easy. In this work, we rigorously define and thoroughly establish SB for one-hidden-layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low-dimensional projection of the inputs; (ii) theoretically, we show that when the data is linearly separable, the network primarily depends only on the linearly separable (1-dimensional) subspace, even in the presence of an arbitrarily large number of other, more complex features that could have led to a significantly more robust classifier; (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low-dimensional projection of the inputs, thereby demonstrating SB on these datasets; and (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
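One simple way to probe definition (i) empirically (a generic diagnostic sketch, not necessarily the procedure used in the paper) is to replace each input with its projection onto a candidate low-dimensional subspace and check whether the model's predictions change:

```python
import torch

def projection_dependence(model, X, U):
    """Fraction of predictions left unchanged when inputs are replaced by their
    projection onto the subspace spanned by the orthonormal columns of U.
    A value near 1.0 suggests the model is essentially a function of U^T x."""
    with torch.no_grad():
        preds_full = model(X).argmax(dim=-1)
        X_proj = (X @ U) @ U.T        # project each row of X onto span(U)
        preds_proj = model(X_proj).argmax(dim=-1)
    return (preds_full == preds_proj).float().mean().item()
```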


Using noise resilience for ranking generalization of deep neural networks

arXiv.org Machine Learning

Recent papers have shown that sufficiently overparameterized neural networks can perfectly fit even random labels. Thus, it is crucial to understand the underlying reasons behind the generalization performance of a network on real-world data. In this work, we propose several measures to predict the generalization error of a network given the training data and its parameters. Using one of these measures, based on the noise resilience of the network, we secured fifth position in the Predicting Generalization in Deep Learning (PGDL) competition at NeurIPS 2020.
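The listing above does not spell out the measure itself, but a generic noise-resilience proxy in this spirit (purely illustrative, with hypothetical helper names) perturbs the trained parameters with Gaussian noise and records how much the training loss degrades:

```python
import copy
import torch

def noise_resilience(model, loss_fn, data_loader, sigma=0.01, n_trials=5):
    """Average increase in training loss when parameters are perturbed by
    Gaussian noise of scale sigma * mean|w|; smaller increases indicate a more
    noise-resilient (and, heuristically, better-generalizing) network."""
    base = _avg_loss(model, loss_fn, data_loader)
    deltas = []
    for _ in range(n_trials):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * p.abs().mean() * torch.randn_like(p))
        deltas.append(_avg_loss(noisy, loss_fn, data_loader) - base)
    return sum(deltas) / len(deltas)

def _avg_loss(model, loss_fn, data_loader):
    total, count = 0.0, 0
    with torch.no_grad():
        for x, y in data_loader:
            total += loss_fn(model(x), y).item() * len(y)
            count += len(y)
    return total / count
```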


Inductive Bias of Gradient Descent for Exponentially Weight Normalized Smooth Homogeneous Neural Nets

arXiv.org Machine Learning

We analyze the inductive bias of gradient descent for weight-normalized smooth homogeneous neural nets trained on exponential or cross-entropy loss. Our analysis focuses on exponential weight normalization (EWN), which encourages weight updates along the radial direction. This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate, and hence causes the weights to be updated in a way that prefers asymptotic relative sparsity. These results can be extended to hold for gradient descent via an appropriate adaptive learning rate. The asymptotic convergence rate of the loss in this setting is given by $\Theta(\frac{1}{t(\log t)^2})$ and is independent of the depth of the network. We contrast these results with the inductive bias of standard weight normalization (SWN) and of unnormalized architectures, and demonstrate their implications on synthetic datasets. Experimental results on simple datasets and architectures support our claim of sparse EWN solutions, even with SGD. This demonstrates EWN's potential application in learning prunable neural networks.
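For readers unfamiliar with the two schemes, a common way to write the parametrizations (an assumption for illustration; the paper's exact formulation may differ in details) is, per neuron with weight vector $w$:

$$\text{SWN:}\quad w = \gamma\,\frac{v}{\lVert v \rVert}, \qquad\qquad \text{EWN:}\quad w = e^{\gamma}\,\frac{v}{\lVert v \rVert},$$

with gradient descent performed on $(\gamma, v)$. Under EWN the norm $\lVert w \rVert = e^{\gamma}$ evolves multiplicatively, which is consistent with the radial (scale-direction) updates and the preference for relative sparsity described above.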