norm test
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Lau, Tim Tsz-Kit, Li, Weijian, Xu, Chenwei, Liu, Han, Kolar, Mladen
An appropriate choice of batch sizes in large-scale model training is crucial, yet it involves an inherent dilemma: large-batch training improves training efficiency in terms of memory utilization, while generalization performance often deteriorates because larger batches reduce gradient noise. Despite this dilemma, the common practice of choosing batch sizes in language model training often prioritizes training efficiency -- employing either constant large sizes with data parallelism or implementing batch size warmup schedules. However, such batch size schedule designs remain heuristic and often fail to adapt to training dynamics, presenting the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of language models, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicas of model parameters, gradients, and optimizer states at each worker, which prohibits training larger models with billions of parameters. To optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose and theoretically principled adaptive batch size schedules compatible with data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of language models of different sizes. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama family, with particular focus on smaller models with up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.
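As a rough illustration of the kind of schedule this abstract describes, the sketch below grows the effective batch size during FSDP training by increasing the number of gradient-accumulation steps whenever a norm-test-style noise estimate is large. This is a minimal sketch, not the authors' implementation: the toy model, synthetic data, the constant eta, and the noise proxy built from micro-batch gradient norms are all assumptions made here for illustration.

```python
# Minimal sketch (an illustration, not the paper's released implementation): the effective
# batch size under PyTorch FSDP is grown by increasing the number of gradient-accumulation
# steps whenever a crude norm-test-style noise estimate is large.
# Launch with: torchrun --nproc_per_node=<num_gpus> adaptive_bs_fsdp.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def global_grad_sq_norm(model, device):
    """Squared l2 norm of the current (sharded) gradient, summed over all FSDP ranks."""
    local = torch.zeros(1, device=device)
    for p in model.parameters():
        if p.grad is not None:
            local += p.grad.detach().pow(2).sum()
    dist.all_reduce(local, op=dist.ReduceOp.SUM)
    return local.item()


def main():
    dist.init_process_group("nccl")
    device = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = FSDP(torch.nn.Linear(512, 512).to(device))   # stand-in for a Llama-style model
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    micro_bs, accum_steps, eta = 8, 2, 0.5                # effective batch = micro_bs * accum_steps * world_size

    for step in range(50):
        optim.zero_grad(set_to_none=True)
        sq_norms = []
        for _ in range(accum_steps):
            x = torch.randn(micro_bs, 512, device=device)                        # placeholder batch
            loss = loss_fn(model(x), torch.zeros(micro_bs, 512, device=device))  # placeholder objective
            (loss / accum_steps).backward()                                      # grads accumulate across micro-batches
            sq_norms.append(global_grad_sq_norm(model, device))                  # norm of the running accumulated grad
        optim.step()

        # Crude norm-test proxy: the first micro-step's gradient (already averaged over ranks
        # by FSDP) is compared with the fully accumulated gradient; if its norm is much larger,
        # gradient noise dominates the signal, so the effective batch size is doubled.
        first_sq = sq_norms[0] * accum_steps ** 2         # undo the 1/accum_steps loss scaling
        mean_sq = sq_norms[-1]
        if first_sq > (1.0 + eta ** 2 * accum_steps) * mean_sq:
            accum_steps = min(2 * accum_steps, 64)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because the noise statistic is all-reduced, every rank reaches the same accumulation-step decision, so the schedule stays synchronized without extra communication beyond one scalar all-reduce per micro-batch.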
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
Lau, Tim Tsz-Kit, Liu, Han, Kolar, Mladen
The choice of batch sizes in stochastic gradient optimizers is critical for model training. However, the practice of varying batch sizes throughout the training process is less explored compared to other hyperparameters. We investigate adaptive batch size strategies derived from adaptive sampling methods, traditionally applied only in stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ for finding a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. Our theoretical claims are supported by numerical experiments on various image classification tasks, highlighting the enhanced adaptability of progressive batching protocols in deep learning and the potential of such adaptive batch size strategies with adaptive gradient optimizers in large-scale model training.
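A minimal sketch of the AdAdaGradNorm idea described above, assuming a toy least-squares objective: model updates follow AdaGradNorm (a single scalar accumulator of squared gradient norms), and the batch size is enlarged whenever a norm-test condition on the sampled gradient fails. The constants and the variance estimator are illustrative choices, not the paper's exact ones, and the coordinate-wise (AdAdaGrad) variant is not shown.

```python
# Hedged sketch of AdAdaGradNorm-style training on a toy least-squares problem:
# AdaGradNorm updates plus a norm-test-driven batch size increase.
import numpy as np

rng = np.random.default_rng(0)
A, y = rng.normal(size=(4096, 20)), rng.normal(size=4096)   # toy least-squares data (illustrative)
x = np.zeros(20)
b, eta, theta = 8, 1.0, 0.9                                  # batch size, step size, norm-test constant
accum_sq = 0.0                                               # AdaGradNorm accumulator of squared grad norms

for k in range(200):
    idx = rng.choice(len(y), size=b, replace=False)
    per_sample = (A[idx] @ x - y[idx])[:, None] * A[idx]     # per-sample gradients, shape (b, d)
    g = per_sample.mean(axis=0)

    # Norm test: the sample variance of the gradient, divided by b, should be at most
    # theta^2 * ||g||^2; otherwise increase the batch size for subsequent iterations.
    var = per_sample.var(axis=0, ddof=1).sum()
    if var / b > theta**2 * np.dot(g, g):
        b = min(int(np.ceil(var / (theta**2 * np.dot(g, g)))), len(y))

    # AdaGradNorm update: accumulate squared gradient norms and scale the step accordingly.
    accum_sq += np.dot(g, g)
    x -= eta / np.sqrt(accum_sq) * g

print("final batch size:", b, " final loss:", 0.5 * np.mean((A @ x - y) ** 2))
```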
On the equivalence of different adaptive batch size selection strategies for stochastic gradient descent methods
Espath, Luis, Krumscheid, Sebastian, Tempone, Raúl, Vilanova, Pedro
In this study, we demonstrate that the norm test and inner product/orthogonality test presented in \cite{Bol18} are equivalent in terms of the convergence rates associated with Stochastic Gradient Descent (SGD) methods if $\epsilon^2=\theta^2+\nu^2$ with specific choices of $\theta$ and $\nu$. Here, $\epsilon$ controls the relative statistical error of the norm of the gradient while $\theta$ and $\nu$ control the relative statistical error of the gradient in the direction of the gradient and in the direction orthogonal to the gradient, respectively. Furthermore, we demonstrate that the inner product/orthogonality test can be as inexpensive as the norm test in the best case scenario if $\theta$ and $\nu$ are optimally selected, but the inner product/orthogonality test will never be more computationally affordable than the norm test if $\epsilon^2=\theta^2+\nu^2$. Finally, we present two stochastic optimization problems to illustrate our results.
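For reference, one common way to write the three conditions this abstract relates, following the notation of Bollapragada, Byrd & Nocedal; the exact normalizations here are assumptions and may differ slightly from the paper above.

```latex
% Norm test (sample S, per-sample gradients \nabla f_i, true gradient \nabla F):
\[
\frac{1}{|S|}\,\mathbb{E}\big[\|\nabla f_i(x)-\nabla F(x)\|^2\big] \;\le\; \epsilon^2\,\|\nabla F(x)\|^2 .
\]
% Inner product test (error of the sampled gradient along the direction of \nabla F(x)):
\[
\frac{1}{|S|}\,\mathbb{E}\big[\big(\nabla f_i(x)^\top \nabla F(x)-\|\nabla F(x)\|^2\big)^2\big] \;\le\; \theta^2\,\|\nabla F(x)\|^4 .
\]
% Orthogonality test (error orthogonal to \nabla F(x)):
\[
\frac{1}{|S|}\,\mathbb{E}\Big[\Big\|\nabla f_i(x)-\frac{\nabla f_i(x)^\top \nabla F(x)}{\|\nabla F(x)\|^2}\,\nabla F(x)\Big\|^2\Big] \;\le\; \nu^2\,\|\nabla F(x)\|^2 .
\]
```

Writing the gradient error as the sum of its component along $\nabla F(x)$ and its orthogonal component shows that the last two conditions together imply the first whenever $\epsilon^2=\theta^2+\nu^2$, which is the equivalence the abstract refers to.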
Constrained and Composite Optimization via Adaptive Sampling Methods
Xie, Yuchen, Bollapragada, Raghu, Byrd, Richard, Nocedal, Jorge
The motivation for this paper stems from the desire to develop an adaptive sampling method for solving constrained optimization problems in which the objective function is stochastic and the constraints are deterministic. The method proposed in this paper is a proximal gradient method that can also be applied to the composite optimization problem min f(x) + h(x), where f is stochastic and h is convex (but not necessarily differentiable). Adaptive sampling methods employ a mechanism for gradually improving the quality of the gradient approximation so as to keep computational cost to a minimum. The mechanism commonly employed in unconstrained optimization is no longer reliable in the constrained or composite optimization settings because it is based on pointwise decisions that cannot correctly predict the quality of the proximal gradient step. The method proposed in this paper measures the result of a complete step to determine if the gradient approximation is accurate enough; otherwise a more accurate gradient is generated and a new step is computed. Convergence results are established both for strongly convex and general convex f. Numerical experiments are presented to illustrate the practical behavior of the method.
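The step-level mechanism described above can be sketched for an $\ell_1$-regularized least-squares problem as follows. The acceptance criterion below (comparing the proximal step against one computed from a larger sample) is a simplified stand-in for the paper's actual test, and the data, step size, and thresholds are illustrative assumptions.

```python
# Hedged illustration: take a complete proximal gradient step with the current sample,
# then check whether a step from a more accurate (larger-sample) gradient would have been
# materially different; if so, enlarge the sample and use the more accurate step.
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(1)
A, y = rng.normal(size=(8192, 50)), rng.normal(size=8192)   # toy data (illustrative)
lam, alpha = 0.05, 0.01                                      # l1 weight and step size
x, b = np.zeros(50), 16                                      # iterate and current sample size

def sample_grad(x, size):
    idx = rng.choice(len(y), size=size, replace=False)
    return A[idx].T @ (A[idx] @ x - y[idx]) / size

for k in range(200):
    g = sample_grad(x, b)
    step = soft_threshold(x - alpha * g, alpha * lam) - x     # complete proximal gradient step

    # Step-level accuracy check (stand-in criterion): compare with a step from a 4x larger sample.
    g_ref = sample_grad(x, min(4 * b, len(y)))
    step_ref = soft_threshold(x - alpha * g_ref, alpha * lam) - x
    if np.linalg.norm(step - step_ref) > 0.5 * np.linalg.norm(step_ref):
        b = min(2 * b, len(y))    # gradient deemed too inaccurate: enlarge the sample
        x = x + step_ref          # and take the step computed from the more accurate gradient
    else:
        x = x + step

print("final sample size:", b, " nnz(x):", int(np.count_nonzero(x)))
```

The point of judging the whole proximal step, rather than the gradient alone, is that the nonsmooth term can change which coordinates the step activates, so a pointwise gradient test would not predict the quality of the resulting iterate.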
Adaptive Sampling Strategies for Stochastic Optimization
Bollapragada, Raghu, Byrd, Richard, Nocedal, Jorge
In this paper, we propose a stochastic optimization method that adaptively controls the sample size used in the computation of gradient approximations. Unlike other variance reduction techniques that either require additional storage or the regular computation of full gradients, the proposed method reduces variance by increasing the sample size as needed. The decision to increase the sample size is governed by an inner product test that ensures that search directions are descent directions with high probability. We show that the inner product test improves upon the well known norm test, and can be used as a basis for an algorithm that is globally convergent on nonconvex functions and enjoys a global linear rate of convergence on strongly convex functions. Numerical experiments on logistic regression problems illustrate the performance of the algorithm.
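A small sketch of an inner-product-test-driven sample size schedule for SGD on logistic regression, the problem class mentioned in the abstract above. As in practical variants of the test, the true gradient is replaced by the current sample gradient; the synthetic data, theta, and learning rate are illustrative assumptions.

```python
# Hedged sketch: SGD on logistic regression where the sample size grows whenever the
# inner product test suggests the sampled direction may not be a descent direction.
import numpy as np

rng = np.random.default_rng(2)
n, d = 10000, 30
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
labels = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w, b, lr, theta = np.zeros(d), 16, 0.5, 0.9   # iterate, sample size, step size, test constant

for k in range(300):
    idx = rng.choice(n, size=b, replace=False)
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    per_sample = (p - labels[idx])[:, None] * X[idx]   # per-sample logistic-loss gradients
    g = per_sample.mean(axis=0)

    # Inner product test: the sample variance of <grad_i, g>, scaled by 1/b, must be small
    # relative to ||g||^4; otherwise increase the sample size proportionally so that the
    # sampled direction is a descent direction with high probability.
    inner = per_sample @ g
    var_inner = inner.var(ddof=1)
    if var_inner / b > theta**2 * np.dot(g, g) ** 2:
        b = min(int(np.ceil(var_inner / (theta**2 * np.dot(g, g) ** 2))), n)

    w -= lr * g

print("final sample size:", b)
```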