
c467978aaae44a0e8054e174bc0da4bb-Supplemental.pdf

Neural Information Processing Systems

In Appendix A.1, we describe the attributes and the comparisons in the VQA-MNIST datasets. A.1 Attributes and Comparisons Visual attributes. This corresponds to a total of 21 attribute instances. The sub-tasks for comparison of spatial relations are in Table App.2. Sub-tasks for attribute comparison between pairs of separated objects.




Counterfactually Comparing Abstaining Classifiers

Choe, Yo Joong, Gangrade, Aditya, Ramdas, Aaditya

arXiv.org Machine Learning

Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about. These classifiers are becoming increasingly popular in high-stakes decision-making problems, as they can withhold uncertain predictions to improve their reliability and safety. When evaluating black-box abstaining classifier(s), however, we lack a principled approach that accounts for what the classifier would have predicted on its abstentions. These missing predictions matter when they can eventually be utilized, either directly or as a backup option in a failure mode. In this paper, we introduce a novel approach and perspective to the problem of evaluating and comparing abstaining classifiers by treating abstentions as missing data. Our evaluation approach is centered around defining the counterfactual score of an abstaining classifier, defined as the expected performance of the classifier had it not been allowed to abstain. We specify the conditions under which the counterfactual score is identifiable: if the abstentions are stochastic, and if the evaluation data is independent of the training data (ensuring that the predictions are missing at random), then the score is identifiable. Note that, if abstentions are deterministic, then the score is unidentifiable because the classifier can perform arbitrarily poorly on its abstentions. Leveraging tools from observational causal inference, we then develop nonparametric and doubly robust methods to efficiently estimate this quantity under identification. Our approach is examined in both simulated and real data experiments.
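
The doubly robust idea mentioned in the abstract can be illustrated with the standard augmented inverse-probability-weighting form from the missing-data literature. This is a minimal sketch, not necessarily the authors' exact estimator: the function name `doubly_robust_score`, its arguments, and the assumption that the prediction probability and the outcome model are already available (rather than estimated, for example with cross-fitting) are all illustrative choices.

```python
def doubly_robust_score(scores, predicted, propensity, outcome_model):
    """Sketch of a doubly robust estimate of the counterfactual score.

    scores        : per-example score; ignored (may be None) where the classifier abstained
    predicted     : 1 if the classifier made a prediction, 0 if it abstained
    propensity    : assumed-known probability of NOT abstaining on each input, in (0, 1]
    outcome_model : assumed-known guess of the expected score for each input
    All four names are hypothetical and chosen for this illustration.
    """
    total = 0.0
    for s, r, p, m in zip(scores, predicted, propensity, outcome_model):
        if not 0.0 < p <= 1.0:
            # Positivity: deterministic abstentions (p == 0) make the score unidentifiable.
            raise ValueError("prediction probability must lie in (0, 1]")
        total += m                  # model-based guess for every example
        if r:
            total += (s - m) / p    # inverse-probability-weighted residual correction
    return total / len(scores)


if __name__ == "__main__":
    estimate = doubly_robust_score(
        scores=[1.0, 0.0, None, 1.0],
        predicted=[1, 1, 0, 1],
        propensity=[0.5, 0.5, 0.5, 0.5],
        outcome_model=[0.5, 0.5, 0.5, 0.5],
    )
    print(estimate)
```

The estimate remains consistent if either the prediction probabilities or the outcome model is correct, which is the sense in which such estimators are called doubly robust.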


A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases

Harrison, James, Metz, Luke, Sohl-Dickstein, Jascha

arXiv.org Artificial Intelligence

Learned optimizers -- neural networks that are trained to act as optimizers -- have the potential to dramatically accelerate training of machine learning models. However, even when meta-trained across thousands of tasks at huge computational expense, blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set. In this paper, we use tools from dynamical systems to investigate the inductive biases and stability properties of optimization algorithms, and apply the resulting insights to designing inductive biases for blackbox optimizers. Our investigation begins with a noisy quadratic model, where we characterize conditions in which optimization is stable, in terms of eigenvalues of the training dynamics. We then introduce simple modifications to a learned optimizer's architecture and meta-training procedure which lead to improved stability, and improve the optimizer's inductive bias. We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state of the art learned optimizer -- at matched optimizer computational overhead -- with regard to optimization performance and meta-training speed, and is capable of generalization to tasks far different from those it was meta-trained on.
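
To make the eigenvalue view of stability concrete, the sketch below uses plain gradient descent on a noisy quadratic, not the learned optimizer studied in the paper. It assumes the Hessian is diagonal so that its eigenvalues can be passed in directly; the function names and default values are hypothetical. Along each eigen-direction one step multiplies the iterate by 1 - lr * eigenvalue, so the iteration is stable when every such multiplier has magnitude below one.

```python
import random


def gd_multipliers(hessian_eigs, lr):
    """Per-direction factor applied to the iterate by one gradient step."""
    return [1.0 - lr * lam for lam in hessian_eigs]


def is_stable(hessian_eigs, lr):
    """Stable when every multiplier has magnitude strictly below one."""
    return all(abs(m) < 1.0 for m in gd_multipliers(hessian_eigs, lr))


def run_noisy_gd(hessian_eigs, lr, steps=200, noise_std=0.01, seed=0):
    """Run gradient descent with additive Gaussian gradient noise; return the final norm."""
    rng = random.Random(seed)
    x = [1.0] * len(hessian_eigs)
    for _ in range(steps):
        x = [
            xi - lr * (lam * xi + rng.gauss(0.0, noise_std))
            for xi, lam in zip(x, hessian_eigs)
        ]
    return sum(xi * xi for xi in x) ** 0.5


if __name__ == "__main__":
    eigs = [1.0, 10.0]
    for lr in (0.15, 0.25):
        print(lr, is_stable(eigs, lr), run_noisy_gd(eigs, lr))
```

With eigenvalues 1 and 10, a step size of 0.15 keeps both multipliers inside the unit interval, while 0.25 gives a multiplier of -1.5 along the sharper direction and the iterate diverges.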


Sensitivity and Generalization in Neural Networks: an Empirical Study

Novak, Roman, Bahri, Yasaman, Abolafia, Daniel A., Pennington, Jeffrey, Sohl-Dickstein, Jascha

arXiv.org Machine Learning

In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that this robustness correlates well with generalization. We further establish that factors associated with poor generalization (such as full-batch training or using random labels) correspond to lower robustness, while factors associated with good generalization (such as data augmentation and ReLU non-linearities) give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
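
As an illustration of the sensitivity metric named in the abstract, the following sketch estimates the Frobenius norm of an input-output Jacobian with central finite differences. The tiny ReLU network `mlp_probs`, the helper `jacobian_frobenius_norm`, and the step size are assumptions made for this example; an automatic-differentiation library would normally be used instead.

```python
import numpy as np


def mlp_probs(x, w1, b1, w2, b2):
    """Hypothetical one-hidden-layer ReLU network returning class probabilities."""
    hidden = np.maximum(0.0, w1 @ x + b1)
    logits = w2 @ hidden + b2
    z = np.exp(logits - logits.max())  # shift logits for numerical stability
    return z / z.sum()


def jacobian_frobenius_norm(f, x, eps=1e-5):
    """Frobenius norm of df/dx at x, estimated with central differences."""
    x = np.asarray(x, dtype=float)
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((f(x + step) - f(x - step)) / (2.0 * eps))
    jacobian = np.stack(columns, axis=1)    # shape: (n_outputs, n_inputs)
    return float(np.linalg.norm(jacobian))  # default matrix norm is Frobenius


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
    w2, b2 = rng.standard_normal((3, 8)), np.zeros(3)
    x = rng.standard_normal(4)
    print(jacobian_frobenius_norm(lambda v: mlp_probs(v, w1, b1, w2, b2), x))
```

A larger value at a given input means the output probabilities change more under small input perturbations there.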


SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability

Raghu, Maithra, Gilmer, Justin, Yosinski, Jason, Sohl-Dickstein, Jascha

arXiv.org Machine Learning

We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less.
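
The two stages named in the title (a singular value decomposition followed by canonical correlation analysis) can be sketched as below. This is an illustrative reading rather than the authors' reference implementation: the array layout (samples by neurons), the 99% variance threshold, the use of QR factors to obtain canonical correlations, and the function names are all assumptions of this example.

```python
import numpy as np


def top_singular_subspace(acts, var_kept=0.99):
    """Project centered activations (n_samples, n_neurons) onto their top singular directions."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of directions whose cumulative variance reaches var_kept.
    k = min(int(np.searchsorted(energy, var_kept)) + 1, len(s))
    return u[:, :k] * s[:k]


def svcca_similarity(acts_a, acts_b, var_kept=0.99):
    """Mean canonical correlation between the two reduced representations."""
    a = top_singular_subspace(acts_a, var_kept)
    b = top_singular_subspace(acts_b, var_kept)
    qa, _ = np.linalg.qr(a)  # orthonormal basis for each reduced subspace
    qb, _ = np.linalg.qr(b)
    # Singular values of qa^T qb are the canonical correlations.
    corrs = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(np.clip(corrs, 0.0, 1.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = rng.standard_normal((200, 5))
    mix = np.eye(5) + np.triu(np.ones((5, 5)), 1)  # an invertible mixing matrix
    print(svcca_similarity(layer, layer @ mix + 2.0, var_kept=1.0))
    print(svcca_similarity(layer, rng.standard_normal((200, 5))))
```

When all directions are kept, an invertible affine transform of the same activations scores close to 1, which matches the invariance property described in the abstract; unrelated random activations score much lower.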


REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models

Tucker, George, Mnih, Andriy, Maddison, Chris J., Lawson, Dieterich, Sohl-Dickstein, Jascha

arXiv.org Machine Learning

Learning in models with discrete latent variables is challenging due to high variance gradient estimators. Generally, approaches have relied on control variates to reduce the variance of the REINFORCE estimator. Recent work (Jang et al. 2016, Maddison et al. 2016) has taken a different approach, introducing a continuous relaxation of discrete variables to produce low-variance, but biased, gradient estimates. In this work, we combine the two approaches through a novel control variate that produces low-variance, unbiased gradient estimates. Then, we introduce a modification to the continuous relaxation and show that the tightness of the relaxation can be adapted online, removing it as a hyperparameter. We show state-of-the-art variance reduction on several benchmark generative modeling tasks, generally leading to faster convergence to a better final log-likelihood.


Unrolled Generative Adversarial Networks

Metz, Luke, Poole, Ben, Pfau, David, Sohl-Dickstein, Jascha

arXiv.org Machine Learning

We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between using the optimal discriminator in the generator's objective, which is ideal but infeasible in practice, and using the current value of the discriminator, which is often unstable and leads to poor solutions. We show how this technique solves the common problem of mode collapse, stabilizes training of GANs with complex recurrent generators, and increases diversity and coverage of the data distribution by the generator.