
Collaborating Authors

 Vetrov, Dmitry


Automating Control of Overestimation Bias for Continuous Reinforcement Learning

arXiv.org Machine Learning

Bias correction techniques are used by most of the high-performing methods for off-policy reinforcement learning. However, these techniques rely on a pre-defined bias correction policy that is either not flexible enough or requires environment-specific tuning of hyperparameters. In this work, we present a simple data-driven approach for guiding bias correction. We demonstrate its effectiveness on Truncated Quantile Critics, a state-of-the-art continuous control algorithm. The proposed technique adjusts the bias correction across environments automatically. As a result, it eliminates the need for an extensive hyperparameter search, significantly reducing the number of environment interactions and the computation required.
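
As a rough illustration of what data-driven bias control could look like, the sketch below adjusts the number of dropped top quantiles in a TQC-style critic from a measured bias signal; the bias estimator, step size, and bounds are illustrative assumptions, not the rule used in the paper.

    # Hypothetical controller: compare value predictions with observed
    # discounted returns and nudge the truncation level accordingly.
    def update_dropped_quantiles(d, pred_values, mc_returns, step=1, d_min=0, d_max=5):
        """d: current number of top quantiles dropped per critic."""
        bias = sum(p - r for p, r in zip(pred_values, mc_returns)) / len(pred_values)
        if bias > 0:      # overestimation detected: truncate more aggressively
            return min(d + step, d_max)
        if bias < 0:      # underestimation: truncate less
            return max(d - step, d_min)
        return d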


Quantization of Generative Adversarial Networks for Efficient Inference: a Methodological Study

arXiv.org Artificial Intelligence

Generative adversarial networks (GANs) have an enormous potential impact on digital content creation, e.g., photo-realistic digital avatars, semantic content editing, and quality enhancement of speech and images. However, the performance of modern GANs comes at the cost of massive amounts of computation during inference and high energy consumption, which complicates, or even prevents, their deployment on edge devices. The problem can be reduced with quantization, a neural network compression technique that facilitates hardware-friendly inference by replacing floating-point computations with low-bit integer ones. While quantization is well established for discriminative models, the performance of modern quantization techniques applied to GANs remains unclear. GANs generate content of a more complex structure than discriminative models, which makes their quantization significantly more challenging. To tackle this problem, we perform an extensive experimental study of state-of-the-art quantization techniques on three diverse GAN architectures, namely StyleGAN, Self-Attention GAN, and CycleGAN. As a result, we discovered practical recipes that allowed us to successfully quantize these models for inference with 4/8-bit weights and 8-bit activations while preserving the quality of the original full-precision models.
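
For readers unfamiliar with the compression technique itself, the following sketch shows standard uniform affine "fake quantization", the basic operation such studies build on; the bit-widths, ranges, and tensor shapes are illustrative assumptions rather than the paper's exact setup.

    import numpy as np

    def fake_quantize(x, num_bits=8, x_min=None, x_max=None):
        # Map values onto a uniform integer grid and back (simulated quantization).
        x_min = float(x.min()) if x_min is None else x_min
        x_max = float(x.max()) if x_max is None else x_max
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
        zero_point = round(qmin - x_min / scale)
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)   # integer grid
        return scale * (q - zero_point)                             # dequantized values

    w8 = fake_quantize(np.random.randn(64, 64), num_bits=8)   # e.g. 8-bit weights
    w4 = fake_quantize(np.random.randn(64, 64), num_bits=4)   # e.g. 4-bit weights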


On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

arXiv.org Machine Learning

Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show that their joint usage may cause instabilities at the late stages of training. Other works, in contrast, report convergence to an equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that, instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior: the training process regularly exhibits instabilities that do not lead to complete training failure but instead start a new period of training. We rigorously investigate the mechanism underlying this periodic behavior from both empirical and theoretical points of view and show that it is indeed caused by the interaction between batch normalization and weight decay.
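
A minimal toy model of the underlying interaction, assuming scale-invariant (batch-normalized) weights so that the gradient norm scales inversely with the weight norm: SGD with weight decay then yields a simple recursion for the squared weight norm. This recursion alone settles to an equilibrium; the paper's point is that real training instead exhibits periodic behavior, so treat the simulation only as the baseline picture, not the full mechanism.

    # ||w_{t+1}||^2 = (1 - lr*wd)^2 * ||w_t||^2 + lr^2 * g^2 / ||w_t||^2,
    # where g is the gradient norm at unit weight norm (scale invariance).
    lr, wd, g = 0.1, 5e-4, 1.0
    norm_sq = 1.0
    for _ in range(50_000):
        norm_sq = (1 - lr * wd) ** 2 * norm_sq + lr ** 2 * g ** 2 / norm_sq
    # Simulated equilibrium norm vs. the analytic value (lr * g^2 / (2 * wd)) ** 0.25
    print(norm_sq ** 0.5, (lr * g ** 2 / (2 * wd)) ** 0.25)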


Towards Practical Credit Assignment for Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Credit assignment, the problem of measuring an action's influence on future rewards, is fundamental to reinforcement learning. Improvements in credit assignment methods have the potential to boost the performance of RL algorithms on many tasks, but thus far they have not seen widespread adoption. Recently, a family of methods called Hindsight Credit Assignment (HCA) was proposed, which explicitly assigns credit to actions in hindsight based on the probability of the action having led to an observed outcome. This approach is appealing as a means to more efficient data usage, but it remains a largely theoretical idea applicable to a limited set of tabular RL tasks, and it is unclear how to extend HCA to deep RL environments. In this work, we explore the use of HCA-style credit in a deep RL context. We first describe the limitations of existing HCA algorithms in deep RL, then propose several theoretically justified modifications to overcome them. Based on this exploration, we present a new algorithm, Credit-Constrained Advantage Actor-Critic (C2A2C), which ignores policy updates for actions that do not affect future outcomes according to credit in hindsight, while updating the policy as normal for those that do. We find that C2A2C outperforms Advantage Actor-Critic (A2C) on the Arcade Learning Environment (ALE) benchmark, showing broad improvements over A2C and motivating further work on credit-constrained update rules for deep RL methods.
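
A hedged sketch of a credit-constrained actor update in the spirit described above: actions whose hindsight credit indicates no influence on the outcome (ratio close to one) are masked out of the policy-gradient loss. The credit estimator, threshold, and loss form are illustrative assumptions, not the paper's exact definitions.

    import torch

    def credit_constrained_pg_loss(log_probs, advantages, credits, eps=1e-3):
        """log_probs, advantages, credits: 1-D tensors over sampled actions.

        credits ~ pi_hindsight(a | x, outcome) / pi(a | x); a value near 1 means
        the action did not influence the observed outcome.
        """
        influenced = (credits - 1.0).abs() > eps
        mask = influenced.float()
        return -(mask * advantages.detach() * log_probs).sum() / mask.sum().clamp(min=1.0)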


On Power Laws in Deep Ensembles

arXiv.org Machine Learning

Ensembles of deep neural networks are known to achieve state-of-the-art performance in uncertainty estimation and to improve accuracy. In this work, we focus on a classification problem and investigate the behavior of both non-calibrated and calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size. We indicate the conditions under which CNLL follows a power law w.r.t. ensemble size or member network size, and analyze the dynamics of the parameters of the discovered power laws. Our important practical finding is that one large network may perform worse than an ensemble of several medium-size networks with the same total number of parameters (we call this ensemble a memory split). Using the detected power-law-like dependencies, we can predict, from a relatively small number of trained networks, (1) the possible gain from ensembling networks with a given structure and (2) the optimal memory split for a given memory budget. We describe the memory split advantage effect in more detail in arXiv:2005.07292.
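
A minimal sketch of fitting the kind of power law described above, CNLL(n) ≈ c + a * n^(-b), to measured ensemble negative log-likelihoods as a function of ensemble size n; the functional form and the data points are illustrative assumptions.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, c, a, b):
        return c + a * n ** (-b)

    sizes = np.array([1, 2, 4, 8, 16])
    cnll = np.array([0.95, 0.83, 0.76, 0.72, 0.70])   # hypothetical measured CNLL per size
    (c, a, b), _ = curve_fit(power_law, sizes, cnll, p0=(0.6, 0.4, 0.5))
    print(f"predicted CNLL for 32 networks: {power_law(32, c, a, b):.3f}")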


Involutive MCMC: a Unifying Framework

arXiv.org Machine Learning

Markov Chain Monte Carlo (MCMC) is a computational approach to fundamental problems such as inference, integration, optimization, and simulation. The field has developed a broad spectrum of algorithms, varying in the way they are motivated, the way they are applied, and how efficiently they sample. Despite all the differences, many of them share the same core principle, which we unify as the Involutive MCMC (iMCMC) framework. Building upon this, we describe a wide range of MCMC algorithms in terms of iMCMC, and formulate a number of "tricks" which one can use as design principles for developing new MCMC algorithms. Thus, iMCMC provides a unified view of many known MCMC algorithms, which facilitates the derivation of powerful extensions. We demonstrate the latter with two examples where we transform known reversible MCMC algorithms into their irreversible counterparts.

Table 1: List of algorithms that we describe by the Involutive MCMC framework. See their descriptions and formulations in terms of iMCMC in the corresponding appendices.

Name & Citation | Appendix
Metropolis-Hastings (Hastings, 1970) | B.1
Mixture Proposal (Habib & Barber, 2018) | B.2
Multiple-Try Metropolis (Liu et al., 2000) | B.3
Sample-Adaptive MCMC (Zhu, 2019) | B.4
Reversible-Jump MCMC (Green, 1995) | B.5
Hybrid Monte Carlo (Duane et al., 1987) | B.6
RMHMC (Girolami & Calderhead, 2011) | B.7
NeuTra (Hoffman et al., 2019) | B.8
A-NICE-MC (Song et al., 2017) | B.9
L2HMC (Levy et al., 2017) | B.10
Persistent HMC (Horowitz, 1991) | B.11
Gibbs (Geman & Geman, 1984) | B.12
Look Ahead (Sohl-Dickstein et al., 2014) | B.13
NRJ (Gagnon & Doucet, 2019) | B.14
Lifted MH (Turitsyn et al., 2011) | B.15
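
A minimal sketch of the involutive MCMC template, with random-walk Metropolis-Hastings recovered as the special case f(x, v) = (x + v, -v), an involution with unit Jacobian; the target distribution and step size below are illustrative choices, not from the paper.

    import numpy as np

    def imcmc_step(x, log_target, rng, sigma=0.5):
        v = rng.normal(0.0, sigma)           # auxiliary variable v ~ q(v | x)
        x_new, v_new = x + v, -v             # involution: applying it twice returns (x, v)
        # Acceptance uses the joint density p(x, v) = target(x) * q(v | x);
        # here |v_new| == |v|, so the auxiliary terms cancel and this reduces to plain MH.
        log_accept = (log_target(x_new) - log_target(x)
                      + (-0.5 * v_new ** 2 / sigma ** 2) - (-0.5 * v ** 2 / sigma ** 2))
        return x_new if np.log(rng.uniform()) < log_accept else x

    rng = np.random.default_rng(0)
    x, samples = 0.0, []
    for _ in range(10_000):
        x = imcmc_step(x, lambda z: -0.5 * z ** 2, rng)  # standard normal target
        samples.append(x)
    print(np.mean(samples), np.std(samples))             # roughly 0 and 1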


MARS: Masked Automatic Ranks Selection in Tensor Decompositions

arXiv.org Machine Learning

Tensor decomposition methods have recently proven to be efficient for compressing and accelerating neural networks. However, the problem of optimal decomposition structure determination is still not well studied while being quite important. Specifically, decomposition ranks present the crucial parameter controlling the compression-accuracy tradeoff. In this paper, we introduce MARS -- a new efficient method for the automatic selection of ranks in tensor decompositions.

For instance, Tucker (Tucker, 1966) and canonical polyadic (CP) (Caroll & Chang, 1970) tensor decompositions are widely known for compressing and accelerating convolutional networks (Lebedev et al., 2015; Kim et al., 2016; Kossaifi et al., 2019), and the Tensor Train (TT) decomposition (Oseledets, 2011) has been successfully applied for compressing fully-connected (FC) (Novikov et al., 2015), convolutional (Garipov et al., 2016), recurrent (Yang et al., 2017; Yu et al., 2017), and embedding (Khrulkov et al., 2019) layers.
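
A hedged sketch of mask-based rank selection in the spirit of the method described above: learnable gates multiply the slices of a decomposition core along a rank dimension, and a sparsity penalty prunes ranks whose gates shut off. The gating scheme and penalty are simplified stand-ins, not the paper's exact procedure.

    import torch

    class RankGatedCore(torch.nn.Module):
        def __init__(self, core_shape, rank_dim):
            super().__init__()
            self.core = torch.nn.Parameter(torch.randn(core_shape) * 0.1)
            self.rank_dim = rank_dim
            self.gate_logits = torch.nn.Parameter(torch.zeros(core_shape[rank_dim]))

        def forward(self):
            gates = torch.sigmoid(self.gate_logits)          # soft 0/1 mask per rank slice
            shape = [1] * self.core.dim()
            shape[self.rank_dim] = -1
            return self.core * gates.view(shape)

        def sparsity_penalty(self):
            return torch.sigmoid(self.gate_logits).sum()     # encourages pruning ranks

    gated = RankGatedCore((16, 3, 3, 32), rank_dim=3)   # e.g. one core of a decomposed layer
    masked_core = gated()                                # use in the reconstruction
    # total_loss = task_loss + 1e-3 * gated.sparsity_penalty()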


Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks

arXiv.org Machine Learning

Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and the difficulty of optimization over discrete weights. Many successful experimental results have recently been achieved using the empirical straight-through estimation approach. This approach has generated a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. We put such methods on a solid basis by obtaining them as viable approximations in the stochastic binary network (SBN) model with Bernoulli weights. In this model, gradients are well-defined and the weight probabilities can be optimized by continuous techniques. By choosing the activation noises in the SBN appropriately and choosing mirror descent (MD) for optimization, we obtain methods that closely resemble several existing straight-through variants, but unlike them, all work reliably and produce equally good results. We further show that variational inference for Bayesian learning of binary weights can be implemented using MD updates with the same simplicity.
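
For context, a minimal sketch of the classic straight-through estimator for a sign activation, one of the empirical rules that such work grounds in the SBN model: the forward pass uses sign(x), while the backward pass lets the gradient flow through as if the activation were the identity.

    import torch

    def binary_activation_ste(x):
        hard = torch.sign(x)
        # (hard - x).detach() + x equals `hard` in the forward pass but has the
        # gradient of `x` in the backward pass.
        return (hard - x).detach() + x

    x = torch.randn(4, requires_grad=True)
    y = binary_activation_ste(x).sum()
    y.backward()
    print(x.grad)   # ones: the gradient passed straight through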


Deep Ensembles on a Fixed Memory Budget: One Wide Network or Several Thinner Ones?

arXiv.org Machine Learning

One of the generally accepted views of modern deep learning is that increasing the number of parameters usually leads to better quality. The two easiest ways to increase the number of parameters are to increase the size of the network, e.g., its width, or to train a deep ensemble; both approaches improve performance in practice. In this work, we consider a fixed memory budget and investigate what is more effective: to train a single wide network, or to perform a memory split, i.e., to train an ensemble of several thinner networks with the same total number of parameters. We find that, for large enough budgets, the number of networks in the optimal memory split is usually larger than one. Interestingly, this effect holds for the commonly used sizes of standard architectures. For example, one WideResNet-28-10 achieves significantly worse test accuracy on CIFAR-100 than an ensemble of sixteen thinner WideResNets: 80.6% vs. 82.52%, respectively. We call the described effect the Memory Split Advantage and show that it holds for a variety of datasets and model architectures.
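
A back-of-the-envelope sketch of the budget arithmetic behind a memory split: for convolutional networks the parameter count grows roughly quadratically with width, so splitting one network of width factor k into m equal-memory members suggests a member width near k / sqrt(m). The quadratic scaling is an approximation, not an exact parameter count for WideResNet-28-10.

    import math

    def split_width(width, num_members):
        # width factor per member so that m members roughly match one network's budget
        return width / math.sqrt(num_members)

    for m in (1, 2, 4, 16):
        print(m, round(split_width(10, m), 2))   # per-member width for a WRN-28-10 budget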


Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

arXiv.org Artificial Intelligence

The overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting. Our method, Truncated Quantile Critics (TQC), blends three ideas: distributional representation of a critic, truncation of the critics' predictions, and ensembling of multiple critics. Distributional representation and truncation allow for arbitrarily fine-grained control of overestimation, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating a 25% improvement on the most challenging Humanoid environment.
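
A hedged sketch of the truncation step at the heart of this approach: pool the quantile predictions of all critics for the next state-action pair, sort them, drop the largest ones, and build the target from what remains. The numbers of critics, quantiles, and dropped atoms below are illustrative, not the paper's tuned values.

    import numpy as np

    def truncated_target(next_quantiles, reward, discount=0.99, drop_per_critic=2):
        """next_quantiles: array of shape (n_critics, n_quantiles_per_critic)."""
        n_critics, _ = next_quantiles.shape
        pooled = np.sort(next_quantiles.reshape(-1))
        kept = pooled[: len(pooled) - drop_per_critic * n_critics]   # truncate the top
        return reward + discount * kept                              # per-quantile targets

    q = np.random.randn(5, 25) + 3.0                  # e.g. 5 critics x 25 quantiles each
    print(truncated_target(q, reward=1.0).shape)      # (115,) remaining atom targets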