figure app
c467978aaae44a0e8054e174bc0da4bb-Supplemental.pdf
In Appendix A.1, we describe the attributes and the comparisons in the VQA-MNIST datasets. A.1 Attributes and Comparisons Visual attributes. This corresponds to a total of 21 attribute instances. The sub-tasks for comparison of spatial relations are in Table App.2. Sub-tasks for attribute comparison between pairs of separated objects.
c467978aaae44a0e8054e174bc0da4bb-Supplemental.pdf
In Appendix A.1, we describe the attributes and the comparisons in the VQA-MNIST datasets. A.1 Attributes and Comparisons Visual attributes. This corresponds to a total of 21 attribute instances. The sub-tasks for comparison of spatial relations are in Table App.2. Sub-tasks for attribute comparison between pairs of separated objects.
Counterfactually Comparing Abstaining Classifiers
Choe, Yo Joong, Gangrade, Aditya, Ramdas, Aaditya
Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about. These classifiers are becoming increasingly popular in high-stakes decision-making problems, as they can withhold uncertain predictions to improve their reliability and safety. When evaluating black-box abstaining classifier(s), however, we lack a principled approach that accounts for what the classifier would have predicted on its abstentions. These missing predictions matter when they can eventually be utilized, either directly or as a backup option in a failure mode. In this paper, we introduce a novel approach and perspective to the problem of evaluating and comparing abstaining classifiers by treating abstentions as missing data. Our evaluation approach is centered around defining the counterfactual score of an abstaining classifier, defined as the expected performance of the classifier had it not been allowed to abstain. We specify the conditions under which the counterfactual score is identifiable: if the abstentions are stochastic, and if the evaluation data is independent of the training data (ensuring that the predictions are missing at random), then the score is identifiable. Note that, if abstentions are deterministic, then the score is unidentifiable because the classifier can perform arbitrarily poorly on its abstentions. Leveraging tools from observational causal inference, we then develop nonparametric and doubly robust methods to efficiently estimate this quantity under identification. Our approach is examined in both simulated and real data experiments.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (2 more...)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Transportation > Ground > Road (1.00)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (0.67)
A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases
Harrison, James, Metz, Luke, Sohl-Dickstein, Jascha
Learned optimizers -- neural networks that are trained to act as optimizers -- have the potential to dramatically accelerate training of machine learning models. However, even when meta-trained across thousands of tasks at huge computational expense, blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set. In this paper, we use tools from dynamical systems to investigate the inductive biases and stability properties of optimization algorithms, and apply the resulting insights to designing inductive biases for blackbox optimizers. Our investigation begins with a noisy quadratic model, where we characterize conditions in which optimization is stable, in terms of eigenvalues of the training dynamics. We then introduce simple modifications to a learned optimizer's architecture and meta-training procedure which lead to improved stability, and improve the optimizer's inductive bias. We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state of the art learned optimizer -- at matched optimizer computational overhead -- with regard to optimization performance and meta-training speed, and is capable of generalization to tasks far different from those it was meta-trained on.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Russia (0.04)
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
- Asia > Russia (0.04)
Sensitivity and Generalization in Neural Networks: an Empirical Study
Novak, Roman, Bahri, Yasaman, Abolafia, Daniel A., Pennington, Jeffrey, Sohl-Dickstein, Jascha
In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization $-$ such as full-batch training or using random labels $-$ correspond to lower robustness, while factors associated with good generalization $-$ such as data augmentation and ReLU non-linearities $-$ give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
- North America > Canada > Ontario > Toronto (0.14)
- Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
Raghu, Maithra, Gilmer, Justin, Yosinski, Jason, Sohl-Dickstein, Jascha
We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less.
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models
Tucker, George, Mnih, Andriy, Maddison, Chris J., Lawson, Dieterich, Sohl-Dickstein, Jascha
Learning in models with discrete latent variables is challenging due to high variance gradient estimators. Generally, approaches have relied on control variates to reduce the variance of the REINFORCE estimator. Recent work (Jang et al. 2016, Maddison et al. 2016) has taken a different approach, introducing a continuous relaxation of discrete variables to produce low-variance, but biased, gradient estimates. In this work, we combine the two approaches through a novel control variate that produces low-variance, \emph{unbiased} gradient estimates. Then, we introduce a modification to the continuous relaxation and show that the tightness of the relaxation can be adapted online, removing it as a hyperparameter. We show state-of-the-art variance reduction on several benchmark generative modeling tasks, generally leading to faster convergence to a better final log-likelihood.
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Middle East > Jordan (0.04)
Unrolled Generative Adversarial Networks
Metz, Luke, Poole, Ben, Pfau, David, Sohl-Dickstein, Jascha
We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between using the optimal discriminator in the generator's objective, which is ideal but infeasible in practice, and using the current value of the discriminator, which is often unstable and leads to poor solutions. We show how this technique solves the common problem of mode collapse, stabilizes training of GANs with complex recurrent generators, and increases diversity and coverage of the data distribution by the generator.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)