
Collaborating Authors: Safaryan, Mher


LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

arXiv.org Machine Learning

We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
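
A minimal NumPy sketch of the mechanism described above (an illustration only, not the authors' implementation; the function name, the SVD-based subspace choice, and the rough second-moment re-mapping are assumptions made for brevity): Adam statistics live in a rank-r subspace, the subspace is refreshed from the current gradient, existing moments are re-expressed in the new basis, and an error-feedback buffer accumulates what the low-rank projection discards.

```python
import numpy as np

def ldadam_like_step(x, grad, state, rank=4, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update for a 2-D parameter matrix `x`."""
    # Generalized error feedback: add back previously discarded information.
    g = grad + state["error"]

    # Pick a fresh rank-r subspace from the corrected gradient (top-r SVD).
    U, _, _ = np.linalg.svd(g, full_matrices=False)
    P_new = U[:, :rank]                          # d x r orthonormal basis

    # Projection-aware state transition: express old moments in the new basis.
    if state["P"] is not None:
        R = P_new.T @ state["P"]                 # r x r change-of-basis matrix
        state["m"] = R @ state["m"]
        state["v"] = (R ** 2) @ state["v"]       # rough re-mapping of the 2nd moment
    state["P"] = P_new

    # Low-dimensional gradient and the residual lost by the projection.
    g_low = P_new.T @ g                          # r x n projected gradient
    state["error"] = g - P_new @ g_low           # kept for the next step

    # Standard Adam moments, maintained only in the r-dimensional subspace.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_low
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_low ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])

    # Map the adaptive step back to the full parameter space.
    return x - lr * (P_new @ (m_hat / (np.sqrt(v_hat) + eps)))

d, n, rank = 32, 16, 4
state = {"m": np.zeros((rank, n)), "v": np.zeros((rank, n)),
         "P": None, "error": np.zeros((d, n)), "t": 0}
x = np.random.randn(d, n)
x = ldadam_like_step(x, 2 * x, state)            # toy gradient of ||x||_F^2
```

Note that the optimizer state here is r x n rather than d x n, which is where the memory savings come from.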


MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

arXiv.org Artificial Intelligence

We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instance of the classical error feedback mechanism from distributed optimization [Seide et al., 2014, Alistarh et al., 2018, Karimireddy et al., 2019], in which the error correction information is itself compressed to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive with those of AMSGrad, while providing good practical performance. Specifically, we show that MicroAdam can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical convergence competitive with that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at https://github.com/IST-DASLab/MicroAdam.
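
A toy NumPy sketch of the compression-plus-error-feedback flow described above (not the reference implementation linked in the abstract; for readability the Adam moments are kept dense here, so the sketch illustrates the information flow rather than the actual memory savings, and the Top-k and uniform-quantization choices are assumptions).

```python
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def quantize(v, levels=256):
    """Toy uniform quantizer standing in for compressed error feedback."""
    scale = np.max(np.abs(v)) + 1e-12
    return np.round(v / scale * (levels // 2)) * scale / (levels // 2)

def microadam_like_step(x, grad, state, k, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    # Error feedback: reinject previously discarded gradient mass.
    acc = grad + state["error"]
    g_c = topk(acc, k)                        # sparse gradient fed to Adam
    state["error"] = quantize(acc - g_c)      # error buffer is itself compressed

    # Standard Adam moments, driven only by the compressed gradient.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_c
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_c ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return x - lr * m_hat / (np.sqrt(v_hat) + eps)

d = 1000
state = {"m": np.zeros(d), "v": np.zeros(d), "error": np.zeros(d), "t": 0}
x = np.random.randn(d)
x = microadam_like_step(x, 2 * x, state, k=d // 10)   # toy gradient of ||x||^2
```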


Knowledge Distillation Performs Partial Variance Reduction

arXiv.org Artificial Intelligence

Knowledge Distillation (KD) [13, 3] is a standard tool for transferring information between a machine learning model of lower representational capacity (usually called the student) and a more accurate and powerful teacher model. In the context of classification using neural networks, it is common to consider the student to be a smaller network [2], whereas the teacher is a network that is larger and more computationally heavy, but also more accurate. Assuming a supervised classification task, distillation consists of training the student to minimize the cross-entropy with respect to the teacher's logits on every given sample, in addition to minimizing the standard cross-entropy loss with respect to the ground-truth labels. Since its introduction [3], distillation has been developed and applied in a wide variety of settings, from obtaining compact high-accuracy encodings of model ensembles [13], to boosting the accuracy of compressed models [49, 38, 31], to reinforcement learning [50, 42, 35, 5, 7, 45] and learning with privileged information [51]. Given its apparent simplicity, there has been significant interest in finding explanations for the effectiveness of distillation [2, 13, 37]. For instance, one hypothesis [2, 13] is that the smoothed labels resulting from distillation present the student with a decision surface that is easier to learn than the one presented by the categorical (one-hot) outputs. Another hypothesis [2, 13, 51] starts from the observation that the teacher's outputs have higher entropy than the ground-truth labels, and therefore higher information content. Despite this work, we still have a limited analytical understanding of why knowledge distillation is so effective [37].
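
A minimal NumPy sketch of the standard distillation objective described above (the weighting scheme, temperature, and names are illustrative, not taken from the paper): the student minimizes a mix of cross-entropy against the ground-truth label and cross-entropy against the teacher's softened distribution.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, T=2.0):
    """alpha balances the hard-label and teacher terms; T softens the logits."""
    n = student_logits.shape[0]
    # Hard-label cross-entropy.
    p_student = softmax(student_logits)
    ce_hard = -np.mean(np.log(p_student[np.arange(n), labels] + 1e-12))
    # Soft-label cross-entropy against the teacher at temperature T.
    p_teacher_T = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    ce_soft = -np.mean(np.sum(p_teacher_T * log_p_student_T, axis=-1))
    return (1 - alpha) * ce_hard + alpha * ce_soft

# Toy batch: 4 samples, 3 classes.
rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(4, 3)), rng.normal(size=(4, 3)),
                         labels=np.array([0, 2, 1, 0]))
print(loss)
```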


AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

arXiv.org Machine Learning

We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds, as well as its own data distribution. In these algorithms, workers compute stochastic gradients on their local data at possibly stale iterates from earlier in the run, and return those gradients to the server without synchronizing with the other workers. We present a unified convergence theory for non-convex smooth functions in the heterogeneous regime. The proposed analysis provides convergence guarantees for pure asynchronous SGD and its various modifications. Moreover, our theory explains what affects the convergence rate and what can be done to improve the performance of asynchronous algorithms. In particular, we introduce a novel asynchronous method based on worker shuffling. As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD. The derived rates match the best-known results for those algorithms, highlighting the tightness of our approach. Finally, our numerical evaluations support the theoretical findings and show the good practical performance of our method.
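
A toy NumPy simulation of the setting analyzed above (an illustration of asynchronous updates with stale gradients, not the AsGrad method itself; the quadratic objectives, per-worker delays, and step size are made up): each worker holds its own local objective, and the server applies whichever gradient arrives, computed at an iterate several steps old.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, steps, lr = 4, 10, 200, 0.05
targets = rng.normal(size=(num_workers, dim))        # heterogeneous local data
delays = rng.integers(1, 6, size=num_workers)        # per-worker staleness

x = np.zeros(dim)
history = [x.copy()]                                 # past iterates, for staleness
for t in range(steps):
    w = t % num_workers                              # worker returning a gradient
    stale_iterate = history[max(0, len(history) - 1 - delays[w])]
    grad = stale_iterate - targets[w]                # grad of 0.5*||x - a_w||^2
    x = x - lr * grad                                # server update, no synchronization
    history.append(x.copy())

print("distance to average optimum:",
      np.linalg.norm(x - targets.mean(axis=0)))
```

Replacing the fixed cyclic worker order above with a fresh random permutation every pass would be a rough analogue of the worker-shuffling idea introduced in the paper.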


GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity

arXiv.org Artificial Intelligence

We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing the clients to perform multiple local gradient-type training steps prior to communication. While methods of this type have been studied for about a decade, the empirically observed acceleration properties of local training eluded all attempts at theoretical understanding. In a recent breakthrough, Mishchenko et al. (ICML 2022) proved that local training, when properly executed, leads to provable communication acceleration, and this holds in the strongly convex regime without relying on any data similarity assumptions. However, their method ProxSkip requires all clients to take the same number of local training steps in each communication round. Inspired by a common-sense intuition, we start our investigation by conjecturing that clients with ``less important'' data should be able to get away with fewer local training steps without this impacting the overall communication complexity of the method. It turns out that this intuition is correct: we managed to redesign the original ProxSkip method to achieve this. In particular, we prove that our modified method, for which we coin the name GradSkip, converges linearly under the same assumptions and has the same accelerated communication complexity, while the number of local gradient steps can be reduced relative to a local condition number. We further generalize our method by extending the randomness of probabilistic alternations to arbitrary unbiased compression operators and considering a generic proximable regularizer. This generalization, which we call GradSkip+, recovers several related methods in the literature as special cases. Finally, we present an empirical study on carefully designed toy problems that confirms our theoretical claims.
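
A loose NumPy illustration of the setting above (a ProxSkip-style skeleton with an added per-client coin flip that reuses a cached local gradient instead of computing a fresh one; this is not the exact GradSkip recursion, and the problem data, probabilities, and step size are all made up).

```python
import numpy as np

rng = np.random.default_rng(1)
n_clients, dim = 5, 8
A = [np.diag(rng.uniform(1.0, 10.0, size=dim)) for _ in range(n_clients)]
b = [rng.normal(size=dim) for _ in range(n_clients)]      # f_i(x) = 0.5 x'A_i x - b_i'x

gamma, p = 0.05, 0.2                                      # step size, communication probability
skip_prob = rng.uniform(0.0, 0.5, size=n_clients)         # chance of reusing an old gradient

x = [np.zeros(dim) for _ in range(n_clients)]             # local iterates
h = [np.zeros(dim) for _ in range(n_clients)]             # ProxSkip control variates
g_cache = [A[i] @ x[i] - b[i] for i in range(n_clients)]  # last computed local gradients

for t in range(2000):
    for i in range(n_clients):
        if rng.random() > skip_prob[i]:                   # compute a fresh local gradient
            g_cache[i] = A[i] @ x[i] - b[i]
        x[i] = x[i] - gamma * (g_cache[i] - h[i])         # local gradient-type step
    if rng.random() < p:                                  # rare communication round
        avg = sum(x) / n_clients
        for i in range(n_clients):
            h[i] = h[i] + (p / gamma) * (avg - x[i])      # control-variate update
            x[i] = avg

x_star = np.linalg.solve(sum(A), sum(b))                  # optimum of the average loss
print("error:", np.linalg.norm(x[0] - x_star))
```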


On Biased Compression for Distributed Learning

arXiv.org Machine Learning

In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that {\em biased} compressors often show superior performance in practice when compared to the much more studied and understood {\em unbiased} compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. Our {\em distributed} SGD method enjoys the ergodic rate $\mathcal{O}\left(\frac{\delta L \exp\left[-\frac{\mu K}{\delta L}\right]}{\mu} + \frac{(C + D)}{K\mu}\right)$, where $\delta$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose a new highly performing biased compressor, a combination of Top-$k$ and natural dithering, which in our experiments outperforms all other compression techniques.
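
A toy NumPy sketch in the spirit of the proposed compressor (Top-$k$ sparsification followed by snapping the surviving magnitudes to powers of two, a simplified deterministic stand-in for natural dithering), paired with error feedback inside plain compressed gradient descent; all constants are illustrative.

```python
import numpy as np

def topk_natural(v, k):
    """Keep the k largest-magnitude entries, snap their magnitudes to powers of two."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    kept = v[idx]
    exps = np.round(np.log2(np.abs(kept) + 1e-30))   # round the exponent in log scale
    out[idx] = np.sign(kept) * (2.0 ** exps)
    return out

# Compressed gradient descent on f(x) = 0.5 * ||x - a||^2 with error feedback,
# since biased compressors are typically paired with such a correction.
rng = np.random.default_rng(2)
dim, k, lr = 100, 10, 0.1
a = rng.normal(size=dim)
x, err = np.zeros(dim), np.zeros(dim)
for _ in range(500):
    g = (x - a) + err                                # gradient plus feedback term
    c = topk_natural(g, k)                           # biased compression
    err = g - c                                      # store what was discarded
    x = x - lr * c
print("error:", np.linalg.norm(x - a))
```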