Collaborating Authors

 Chirkova, Nadezhda


On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

arXiv.org Machine Learning

Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show that their joint usage may cause instabilities at the late stages of training. Other works, in contrast, show convergence to an equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior. That is, the training process regularly exhibits instabilities which, however, do not lead to complete training failure but instead start a new period of training. We rigorously investigate the mechanism underlying this periodic behavior from both empirical and theoretical points of view and show that it is indeed caused by the interaction between batch normalization and weight decay.


On Power Laws in Deep Ensembles

arXiv.org Machine Learning

Ensembles of deep neural networks are known to achieve state-of-the-art performance in uncertainty estimation and to improve accuracy. In this work, we focus on a classification problem and investigate the behavior of both the non-calibrated and the calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size. We identify the conditions under which CNLL follows a power law with respect to ensemble size or member network size, and analyze the dynamics of the parameters of the discovered power laws. Our important practical finding is that one large network may perform worse than an ensemble of several medium-size networks with the same total number of parameters (we call this ensemble a memory split). Using the detected power-law-like dependencies, we can predict (1) the possible gain from ensembling networks with a given structure and (2) the optimal memory split given a memory budget, based on a relatively small number of trained networks. We describe the memory split advantage effect in more detail in arXiv:2005.07292.
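As a rough illustration of how a power-law dependence on ensemble size could be fitted from a handful of measurements, the sketch below assumes the functional form c + b * n^(-a), which is an assumption made here for illustration and is not stated in the abstract. The function name `fit_power_law`, the exponent grid, and the synthetic data are all hypothetical; no real CNLL values are used.

```python
import numpy as np

def fit_power_law(sizes, values, exponents=np.linspace(0.05, 3.0, 296)):
    """Fit values ~ c + b * sizes**(-a) by a grid search over the exponent a.

    For a fixed exponent the model is linear in (c, b), so each candidate
    is solved with ordinary least squares and the best fit is kept.
    """
    sizes = np.asarray(sizes, dtype=float)
    values = np.asarray(values, dtype=float)
    best = None
    for a in exponents:
        # Design matrix: a constant column (for c) and sizes**(-a) (for b).
        design = np.column_stack([np.ones_like(sizes), sizes ** (-a)])
        coef, *_ = np.linalg.lstsq(design, values, rcond=None)
        sse = float(np.sum((design @ coef - values) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], a)
    _, c, b, a = best
    return float(c), float(b), float(a)

# Synthetic, noise-free measurements generated from the assumed form
# (illustrative numbers only, not results from the paper).
ensemble_sizes = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
synthetic_cnll = 0.5 + 0.3 * ensemble_sizes ** (-0.8)

c_hat, b_hat, a_hat = fit_power_law(ensemble_sizes, synthetic_cnll)
print(f"asymptote c={c_hat:.3f}, scale b={b_hat:.3f}, exponent a={a_hat:.3f}")

# Extrapolate the fitted curve to a larger ensemble than was "measured".
predicted_64 = c_hat + b_hat * 64.0 ** (-a_hat)
print(f"predicted value at n=64: {predicted_64:.4f}")
```

With real measurements the fit would be noisy, and a finer exponent grid or a nonlinear least-squares routine may be preferable to this simple grid search.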


Deep Ensembles on a Fixed Memory Budget: One Wide Network or Several Thinner Ones?

arXiv.org Machine Learning

One of the generally accepted views of modern deep learning is that increasing the number of parameters usually leads to better quality. The two easiest ways to increase the number of parameters are to increase the size of the network, e.g., its width, or to train a deep ensemble; both approaches improve performance in practice. In this work, we consider a fixed memory budget setting and investigate which is more effective: to train a single wide network, or to perform a memory split, i.e., to train an ensemble of several thinner networks with the same total number of parameters. We find that, for large enough budgets, the number of networks in the ensemble corresponding to the optimal memory split is usually larger than one. Interestingly, this effect holds for the commonly used sizes of the standard architectures. For example, one WideResNet-28-10 achieves significantly worse test accuracy on CIFAR-100 than an ensemble of sixteen thinner WideResNets: 80.6% and 82.52%, respectively. We call the described effect the Memory Split Advantage and show that it holds for a variety of datasets and model architectures.
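To make the fixed-budget comparison concrete, the sketch below counts parameters for a hypothetical fully connected network and finds the widest member that fits when a budget is split across k networks. The architecture, the layer sizes, and the function names `mlp_param_count` and `member_width_for_split` are assumptions chosen for illustration; they are not the WideResNet setup from the paper.

```python
def mlp_param_count(width, depth=4, d_in=32, d_out=10):
    """Weights plus biases of an MLP with `depth` hidden layers of equal width."""
    count = d_in * width + width                    # input -> first hidden layer
    count += (depth - 1) * (width * width + width)  # hidden -> hidden layers
    count += width * d_out + d_out                  # last hidden -> output layer
    return count

def member_width_for_split(budget, k, **arch):
    """Largest width such that k identical members fit within the budget.

    Returns 0 if even a width-1 member does not fit.
    """
    width = 0
    while k * mlp_param_count(width + 1, **arch) <= budget:
        width += 1
    return width

# Budget set by a single wide network (hypothetical width of 512).
budget = mlp_param_count(512)

for k in (1, 2, 4, 8, 16):
    w = member_width_for_split(budget, k)
    total = k * mlp_param_count(w)
    print(f"k={k:2d}  member width={w:3d}  total params={total}")
```

Because the hidden-to-hidden layers dominate the count and grow quadratically with width, splitting the budget into k members shrinks each member's width by roughly a factor of the square root of k rather than k in this toy setting.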


Bayesian Sparsification of Gated Recurrent Neural Networks

arXiv.org Machine Learning

Bayesian methods have been successfully applied to sparsify the weights of neural networks and to remove structural units from the networks, e.g., neurons. We apply and further develop this approach for gated recurrent architectures. Specifically, in addition to sparsifying individual weights and neurons, we propose to sparsify the preactivations of gates and the information flow in LSTM. This makes some gates and information flow components constant, speeds up the forward pass, and improves compression. Moreover, the resulting structure of gate sparsity is interpretable and depends on the task. Code is available on GitHub: https://github.com/tipt0p/SparseBayesianRNN


Bayesian Compression for Natural Language Processing

arXiv.org Machine Learning

In natural language processing, many tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, whose size grows proportionally to the vocabulary size. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN by a factor of dozens or hundreds without time-consuming hyperparameter tuning. We also generalize the model for vocabulary sparsification to filter out unnecessary words and compress the RNN even further. We show that the choice of the kept words is interpretable.

1 Introduction

Recurrent neural networks (RNNs) are among the most powerful models for natural language processing, speech recognition, and question-answering systems (Chan et al., 2016; Ha et al., 2017; Wu et al., 2016; Ren et al., 2015).


Bayesian Sparsification of Recurrent Neural Networks

arXiv.org Machine Learning

Recurrent neural networks show state-of-the-art results in many text analysis tasks but often require a lot of memory to store their weights. The recently proposed Sparse Variational Dropout eliminates the majority of the weights in a feed-forward neural network without significant loss of quality. We apply this technique to sparsify recurrent neural networks. To account for the specifics of recurrent models, we also rely on Binary Variational Dropout for RNNs. We report a 99.5% sparsity level on a sentiment analysis task without a quality drop and up to an 87% sparsity level on a language modeling task with a slight loss of accuracy.
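The pruning step commonly used with Sparse Variational Dropout can be sketched as follows. Each weight has a learned mean and a learned log-variance, and weights whose noise-to-signal ratio alpha = sigma^2 / theta^2 is large are set to zero. The threshold of 3 on log alpha follows the convention from the original Sparse Variational Dropout work and is an assumption here, as are the function name `sparsify` and the toy values; this is not the authors' code and omits training entirely.

```python
import numpy as np

def sparsify(theta, log_sigma2, log_alpha_threshold=3.0):
    """Zero out weights whose log(sigma^2 / theta^2) exceeds the threshold.

    theta       : array of weight means
    log_sigma2  : array of log-variances, same shape as theta
    Returns the pruned weights and the fraction of weights removed.
    """
    theta = np.asarray(theta, dtype=float)
    log_sigma2 = np.asarray(log_sigma2, dtype=float)
    # log alpha = log sigma^2 - log theta^2; a small constant avoids log(0).
    log_alpha = log_sigma2 - np.log(theta ** 2 + 1e-16)
    keep = log_alpha < log_alpha_threshold
    pruned = np.where(keep, theta, 0.0)
    sparsity = 1.0 - float(keep.mean())
    return pruned, sparsity

# Toy values (hypothetical): two confident weights, two noisy ones.
theta = np.array([1.0, 0.01, -2.0, 0.001])
log_sigma2 = np.array([-4.0, -2.0, -6.0, 0.0])

pruned_weights, sparsity = sparsify(theta, log_sigma2)
print("pruned weights:", pruned_weights)
print("sparsity:", sparsity)
```

In a recurrent network the same mask would be applied to the input-to-hidden and hidden-to-hidden matrices once, before unrolling over time steps.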