Lobacheva, Ekaterina
Where Do Large Learning Rates Lead Us?
Sadrtdinov, Ildus, Kodryan, Maxim, Pokonechny, Eduard, Lobacheva, Ekaterina, Vetrov, Dmitry
It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold leads to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of the reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task. In contrast, starting training with too small LRs leads to unstable minima and to attempts to learn all features simultaneously, resulting in poor generalization. Conversely, using initial LRs that are too large prevents the optimization from locating a basin with good solutions and from extracting meaningful patterns from the data.
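To make the protocol concrete, here is a minimal PyTorch-style sketch of the two-stage setup discussed in the abstract: a large-LR stage followed either by small-LR fine-tuning or by weight averaging. The model, data, LR values, and epoch counts are illustrative placeholders, not the paper's configuration.

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch of the two-stage protocol (placeholder model, data, and LRs).
torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()                      # synthetic stand-in task
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

def train(lr, epochs):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    snaps = []
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        snaps.append(copy.deepcopy(model.state_dict()))
    return snaps

snaps = train(lr=0.5, epochs=100)             # stage 1: large initial LR (illustrative value)

# Option A: fine-tune with a small LR starting from the large-LR solution.
train(lr=0.01, epochs=50)

# Option B: average weights collected along the large-LR trajectory instead.
avg = {k: torch.stack([s[k] for s in snaps[-20:]]).mean(0) for k in snaps[0]}
model.load_state_dict(avg)
```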
Large Learning Rates Improve Generalization: But How Large Are We Talking About?
Lobacheva, Ekaterina, Pockonechnyy, Eduard, Kodryan, Maxim, Vetrov, Dmitry
Inspired by recent research that recommends starting neural network training with large learning rates (LRs) to achieve the best generalization, we explore this hypothesis in detail. Our study clarifies the initial LR ranges that provide optimal results for subsequent training with a small LR or weight averaging. We find that these ranges are in fact significantly narrower than generally assumed. We conduct our main experiments in a simplified setup that allows precise control of the learning rate hyperparameter and validate our key findings in a more practical setting.
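A sketch of the kind of sweep such a study implies is shown below; it simply evaluates the two-stage protocol over a grid of initial LRs. The grid, model, and synthetic data are assumptions used only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20); y = (X[:, 0] > 0).long()          # synthetic stand-in data
Xte = torch.randn(512, 20); yte = (Xte[:, 0] > 0).long()

def two_stage_accuracy(init_lr):
    """Large-LR stage followed by a small-LR stage; returns test accuracy."""
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    loss_fn = nn.CrossEntropyLoss()
    for lr, epochs in [(init_lr, 100), (0.01, 50)]:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
    with torch.no_grad():
        return (model(Xte).argmax(1) == yte).float().mean().item()

for init_lr in [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]:              # illustrative grid
    print(f"initial LR {init_lr:>5}: test accuracy {two_stage_accuracy(init_lr):.3f}")
```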
To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning
Sadrtdinov, Ildus, Pozdeev, Dmitrii, Vetrov, Dmitry, Lobacheva, Ekaterina
Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin; however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of Snapshot Ensembles (SSE) for the transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
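The following hedged sketch contrasts chained SSE-style snapshots with a star-shaped alternative in which every member is fine-tuned independently from the same pre-trained checkpoint, and the members are then averaged into a uniform model soup. It is a structural illustration only; the actual StarSSE schedule and hyperparameters are those described in the paper, and `fine_tune` here is a toy stand-in.

```python
import copy
import torch
import torch.nn as nn

def fine_tune(model, lr=0.01, steps=100):
    # Toy stand-in for one fine-tuning run/cycle on a synthetic task.
    X = torch.randn(256, 20)
    y = (X[:, 0] > 0).long()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

pretrained = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in checkpoint

# SSE-style chain: each snapshot continues from the previous one.
sse_members, current = [], copy.deepcopy(pretrained)
for _ in range(4):
    current = fine_tune(current)
    sse_members.append(copy.deepcopy(current))

# Star-shaped alternative: every member starts from the pre-trained checkpoint.
star_members = [fine_tune(copy.deepcopy(pretrained)) for _ in range(4)]

# Uniform model soup: average the members' weights into a single model.
soup_state = {k: torch.stack([m.state_dict()[k] for m in star_members]).mean(0)
              for k in pretrained.state_dict()}
soup = copy.deepcopy(pretrained)
soup.load_state_dict(soup_state)
```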
Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes
Kodryan, Maxim, Lobacheva, Ekaterina, Nakhodnov, Maksim, Vetrov, Dmitry
A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail, both through a theoretical examination of a toy example and through a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural network training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
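As a toy illustration of training on the sphere with a fixed ELR, the sketch below runs projected gradient descent on a scale-invariant loss: project the gradient onto the tangent space, take a step with a fixed effective learning rate, and renormalize. The loss and the ELR value are assumptions for illustration, not the paper's setup.

```python
import torch

torch.manual_seed(0)
target = torch.randn(100)
target = target / target.norm()
w = torch.randn(100)
w = w / w.norm()                     # start on the unit sphere

def loss_fn(w):
    # Toy scale-invariant loss: depends only on the direction of w.
    return 1.0 - torch.dot(w / w.norm(), target)

elr = 0.1                            # fixed effective learning rate
for step in range(1000):
    w.requires_grad_(True)
    (grad,) = torch.autograd.grad(loss_fn(w), w)
    with torch.no_grad():
        grad = grad - torch.dot(grad, w) * w   # tangent projection (a no-op for exactly scale-invariant losses)
        w = w - elr * grad
        w = w / w.norm()                       # project back onto the sphere
print(f"final loss: {float(loss_fn(w)):.4f}")
```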
On the Memorization Properties of Contrastive Learning
Sadrtdinov, Ildus, Chirkova, Nadezhda, Lobacheva, Ekaterina
However, data labeling is often time-consuming and costly, as it involves human expertise. Thus, it is common for computer vision to pretrain DNNs on some large labeled dataset, e.g. ImageNet (Russakovsky et al., 2015), and then to fine-tune the model to a specific downstream task. The self-supervised learning paradigm provides a human labeling-free alternative to the supervised pretraining: recently developed contrastive self-supervised methods show results comparable to ImageNet pretraining [...] motivate improvements to DNN training approaches. A pioneer work of Zhang et al. (2017) showed that the capacity of modern DNNs is sufficient to fit perfectly even randomly labeled data. According to classic learning theory, such a huge capacity should lead to catastrophic overfitting; however, recent works (Nakkiran et al., 2020) show that in practice increasing DNN capacity further improves generalization.
On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay
Lobacheva, Ekaterina, Kodryan, Maxim, Chirkova, Nadezhda, Malinin, Andrey, Vetrov, Dmitry
Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show that their joint usage may cause instabilities at the late stages of training. Other works, in contrast, show convergence to an equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior. That is, the training process regularly exhibits instabilities which, however, do not lead to complete training failure, but cause a new period of training. We rigorously investigate the mechanism underlying this discovered periodic behavior both from an empirical and a theoretical point of view and show that this periodic behavior is indeed caused by the interaction between batch normalization and weight decay.
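A simple way to see this interaction in practice is to log the norm of the parameters preceding a batch normalization layer alongside the training loss. The sketch below does exactly that for a small placeholder network; the architecture and hyperparameters are illustrative, not those used in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 3, 8, 8)
y = (X.mean(dim=(1, 2, 3)) > 0).long()         # synthetic stand-in task

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-3)
loss_fn = nn.CrossEntropyLoss()

conv_weight = model[0].weight                  # made scale-invariant by the following BN layer
for step in range(2000):
    idx = torch.randint(0, len(X), (128,))
    opt.zero_grad()
    loss = loss_fn(model(X[idx]), y[idx])
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step:5d}  loss {loss.item():.3f}  ||w_conv|| {conv_weight.norm().item():.3f}")
```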
On Power Laws in Deep Ensembles
Lobacheva, Ekaterina, Chirkova, Nadezhda, Kodryan, Maxim, Vetrov, Dmitry
Ensembles of deep neural networks are known to achieve state-of-the-art performance in uncertainty estimation and lead to accuracy improvement. In this work, we focus on a classification problem and investigate the behavior of both non-calibrated and calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size. We indicate the conditions under which CNLL follows a power law w.r.t. ensemble size or member network size, and analyze the dynamics of the parameters of the discovered power laws. Our important practical finding is that one large network may perform worse than an ensemble of several medium-size networks with the same total number of parameters (we call this ensemble a memory split). Using the detected power-law-like dependencies, we can predict, based on a relatively small number of trained networks, (1) the possible gain from ensembling networks of a given structure and (2) the optimal memory split for a given memory budget. We describe the memory split advantage effect in more detail in arXiv:2005.07292.
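As an illustration of how such a dependency can be used, the sketch below fits a shifted power law CNLL(n) ≈ c + a·n^(−b) to a few made-up measurements and extrapolates to larger ensembles. The functional form and the numbers are placeholders for the fitting step only, not results or the exact form from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([1, 2, 3, 4, 5])                  # ensemble sizes actually trained
cnll = np.array([0.95, 0.82, 0.77, 0.74, 0.72])    # made-up CNLL measurements

def power_law(n, a, b, c):
    return c + a * n ** (-b)

(a, b, c), _ = curve_fit(power_law, sizes, cnll, p0=(0.3, 1.0, 0.6))
for n in (10, 20, 50):
    print(f"predicted CNLL for an ensemble of {n}: {power_law(n, a, b, c):.3f}")
```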
Deep Ensembles on a Fixed Memory Budget: One Wide Network or Several Thinner Ones?
Chirkova, Nadezhda, Lobacheva, Ekaterina, Vetrov, Dmitry
One of the generally accepted views in modern deep learning is that increasing the number of parameters usually leads to better quality. The two easiest ways to increase the number of parameters are to increase the size of the network, e.g. its width, or to train a deep ensemble; both approaches improve performance in practice. In this work, we consider a fixed memory budget setting and investigate what is more effective: to train a single wide network, or to perform a memory split -- to train an ensemble of several thinner networks with the same total number of parameters? We find that, for large enough budgets, the number of networks in the ensemble corresponding to the optimal memory split is usually larger than one. Interestingly, this effect holds for the commonly used sizes of standard architectures. For example, one WideResNet-28-10 achieves significantly worse test accuracy on CIFAR-100 than an ensemble of sixteen thinner WideResNets: 80.6% and 82.52%, respectively. We call the described effect the Memory Split Advantage and show that it holds for a variety of datasets and model architectures.
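A back-of-the-envelope helper for choosing parameter-matched splits is sketched below, under the assumption that the parameter count of a WideResNet-28-k grows roughly quadratically with the width factor k; the exact widths used in the paper may differ.

```python
# Assumes params(WideResNet-28-k) ~ k**2, which ignores the input/output layers.
def split_width(base_width: float, n_networks: int) -> float:
    # n * k_split**2 ~= base_width**2  =>  k_split ~= base_width / sqrt(n)
    return base_width / n_networks ** 0.5

for n in (1, 2, 4, 16):
    print(f"{n:2d} networks -> width factor ~{split_width(10, n):.1f} each")
# e.g. 16 networks of width ~2.5 roughly match one WideResNet-28-10 in parameters.
```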
Bayesian Sparsification of Gated Recurrent Neural Networks
Lobacheva, Ekaterina, Chirkova, Nadezhda, Vetrov, Dmitry
Bayesian methods have been successfully applied to sparsify weights of neural networks and to remove structural units, e.g. neurons, from the networks. We apply and further develop this approach for gated recurrent architectures. Specifically, in addition to sparsification of individual weights and neurons, we propose to sparsify the preactivations of gates and the information flow in LSTM. This makes some gates and information flow components constant, speeds up the forward pass, and improves compression. Moreover, the resulting structure of gate sparsity is interpretable and depends on the task. Code is available on GitHub: https://github.com/tipt0p/SparseBayesianRNN
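To show where the additional sparsity variables act, the sketch below adds per-component multiplicative scales to the gate preactivations and the candidate information flow of an LSTM cell. The L1 penalty is only a stand-in for the Bayesian sparsification objective used in the paper; the cell structure, not the training criterion, is the point of the illustration.

```python
import torch
import torch.nn as nn

class SparseGateLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        # One multiplicative scale per gate/information-flow preactivation component;
        # driving a component of z to zero makes that component constant.
        self.z = nn.Parameter(torch.ones(4 * hidden_size))

    def forward(self, x, state):
        h, c = state
        pre = self.linear(torch.cat([x, h], dim=-1)) * self.z   # sparsified preactivations
        i, f, o, g = pre.chunk(4, dim=-1)                       # gates i, f, o and candidate g
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

    def sparsity_penalty(self):
        return self.z.abs().sum()       # L1 stand-in for the Bayesian sparsity objective

cell = SparseGateLSTMCell(32, 64)
h = c = torch.zeros(8, 64)
h, c = cell(torch.randn(8, 32), (h, c))
```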
Bayesian Compression for Natural Language Processing
Chirkova, Nadezhda, Lobacheva, Ekaterina, Vetrov, Dmitry
In natural language processing, many tasks are successfully solved with recurrent neural networks (RNNs), but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, whose size grows proportionally to the vocabulary size. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN dozens or hundreds of times without time-consuming hyperparameter tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable.
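The compression step that vocabulary sparsification enables can be sketched as follows: keep only the words whose embedding rows survived, remap token ids, and route every dropped word to a single shared index. The zeroing threshold and the randomly "sparsified" embedding are placeholder assumptions; the paper selects which weights and words to drop with a Bayesian criterion.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 128
emb = nn.Embedding(vocab_size, emb_dim)
with torch.no_grad():                                  # fake a sparsified embedding matrix
    emb.weight[torch.rand(vocab_size) < 0.9] = 0.0

kept = (emb.weight.abs().sum(dim=1) > 1e-8).nonzero(as_tuple=True)[0]
new_id = torch.zeros(vocab_size, dtype=torch.long)     # dropped words all map to index 0
new_id[kept] = torch.arange(1, len(kept) + 1)

compact = nn.Embedding(len(kept) + 1, emb_dim)
with torch.no_grad():
    compact.weight[0].zero_()                          # shared row for dropped words
    compact.weight[1:] = emb.weight[kept]

tokens = torch.randint(0, vocab_size, (4, 16))         # a batch of token ids
vectors = compact(new_id[tokens])                      # same lookup through the compact table
print(f"kept {len(kept)} of {vocab_size} words")
```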