

The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

Neural Information Processing Systems

We study the relationship between the frequency of a function and the speed at which a neural network learns it. We build on recent results that show that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system. When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions. We derive the corresponding eigenvalues for each frequency after introducing a bias term in the model. This bias term had been omitted from the linear network model without significantly affecting previous theoretical results. However, we show theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low frequency functions with odd frequencies. Our results lead to specific predictions of the time it will take a network to learn functions of varying frequency. These predictions match the empirical behavior of both shallow and deep networks.
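The frequency dependence described above is easy to reproduce in a toy setting. The sketch below (my illustration, not the paper's experimental setup) trains only the output layer of a random-feature ReLU model with a bias term to fit targets cos(k*theta) on the circle, and reports how many full-batch gradient steps each frequency k needs before the error drops below an arbitrary threshold; the width, learning rate, and threshold are assumptions of the toy.

```python
# Toy reproduction of frequency-dependent convergence (assumed setup, not the paper's):
# fit cos(k * theta) on the circle with fixed random ReLU features that include a bias
# term, training only the output layer (the lazy / linearized regime).
import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(0, 2 * np.pi, 256, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)          # inputs on the unit circle

def steps_to_fit(k, width=4096, lr=1.0, max_steps=5000, tol=0.1):
    y = np.cos(k * theta)                                      # target of frequency k
    W = rng.normal(size=(2, width))
    b = rng.normal(size=width)                                 # the bias term matters for odd frequencies
    H = np.maximum(X @ W + b, 0.0) / np.sqrt(width)            # fixed random features
    a = np.zeros(width)                                        # trainable output weights
    for t in range(max_steps):
        resid = H @ a - y
        if np.mean(resid ** 2) < tol:
            return t
        a -= lr * (H.T @ resid) / len(y)                       # full-batch gradient step on the MSE
    return None                                                # threshold not reached within the budget

for k in (1, 2, 4):
    print(f"frequency k={k}: steps until MSE < 0.1 -> {steps_to_fit(k)}")
```

Higher frequencies should require noticeably more steps (or exhaust the budget), matching the slower convergence the abstract predicts.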


Continuously-Tempered PDMP samplers

Neural Information Processing Systems

New sampling algorithms based on simulating continuous-time stochastic processes called piecewise deterministic Markov processes (PDMPs) have shown considerable promise. However, these methods can struggle to sample from multimodal or heavy-tailed distributions. We show how tempering ideas can improve the mixing of PDMPs in such cases. We introduce an extended distribution defined over the state of the posterior distribution and an inverse temperature, which interpolates between a tractable distribution when the inverse temperature is 0 and the posterior when the inverse temperature is 1. The marginal distribution of the inverse temperature is a mixture of a continuous distribution on [0, 1) and a point mass at 1, which means that the samples obtained when the inverse temperature is 1 are draws from the posterior, while the sampler also explores distributions at lower temperatures, which improves mixing. We show how PDMPs, and particularly the Zig-Zag sampler, can be implemented to sample from such an extended distribution. The resulting algorithm is easy to implement and we show empirically that it can outperform existing PDMP-based samplers on challenging multimodal posteriors.
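The tempering construction is specific to the paper, but the underlying PDMP is easy to sketch. Below is a minimal one-dimensional Zig-Zag sampler for a standard Gaussian target U(x) = x^2/2; the closed-form event times come from inverting the integrated switching rate for this particular quadratic energy and are a convenience of the toy, not part of the paper's extended sampler.

```python
# Minimal 1-D Zig-Zag sampler for a standard Gaussian target, U(x) = x^2 / 2.
# The switching rate along a segment is lambda(s) = max(0, v * U'(x + v*s)) = max(0, v*x + s),
# so the next event time can be drawn exactly by inverting the integrated rate.
import numpy as np

def zigzag_gaussian(T=2000.0, dt=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x, v, t = 0.0, 1.0, 0.0
    t_rec, samples = 0.0, []
    while t < T:
        a = v * x
        e = rng.exponential()                       # Exp(1) threshold for the integrated rate
        if a >= 0:
            tau = -a + np.sqrt(a * a + 2.0 * e)     # solve a*tau + tau^2/2 = e
        else:
            tau = -a + np.sqrt(2.0 * e)             # rate is zero until s = -a
        while t_rec <= t + tau:                     # read the piecewise-linear path on a regular grid
            samples.append(x + v * (t_rec - t))
            t_rec += dt
        x, t, v = x + v * tau, t + tau, -v          # move to the event and flip the velocity
    return np.array(samples)

xs = zigzag_gaussian()
print("mean and variance of path samples:", xs.mean(), xs.var())   # should be close to 0 and 1
```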


Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Neural Information Processing Systems

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
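To see what the linearization claim means in code, the sketch below (a rough numerical check, not the paper's infinite-width analysis) trains a small one-hidden-layer network with NTK-style scaling for a few full-batch gradient steps and compares its predictions with the first-order Taylor model f(theta0) + J(theta0)(theta - theta0); the width, step size, and finite-difference Jacobian are all choices of the toy.

```python
# Rough numerical check of the linearization claim at modest width (the theory is exact
# only as width -> infinity): compare a trained network's predictions with the first-order
# Taylor model around initialization.
import numpy as np

rng = np.random.default_rng(0)
width = 2048
x = rng.normal(size=(16, 3))                                   # a small batch of inputs
y = rng.normal(size=16)                                        # arbitrary regression targets

def flatten(W, a):   return np.concatenate([W.ravel(), a])
def unflatten(v):    return v[:3 * width].reshape(3, width), v[3 * width:]

def f(v):                                                      # one hidden layer, NTK-style scaling
    W, a = unflatten(v)
    return np.tanh(x @ W) @ a / np.sqrt(width)

def grad_loss(v):                                              # exact gradient of 0.5 * mean squared error
    W, a = unflatten(v)
    h = np.tanh(x @ W)
    resid = (h @ a / np.sqrt(width) - y) / len(y)
    ga = h.T @ resid / np.sqrt(width)
    gW = x.T @ (np.outer(resid, a) * (1 - h ** 2)) / np.sqrt(width)
    return flatten(gW, ga)

def jacobian(v, eps=1e-4):                                     # slow finite differences, fine for a sketch
    base, J = f(v), np.zeros((16, len(v)))
    for i in range(len(v)):
        dv = np.zeros_like(v); dv[i] = eps
        J[:, i] = (f(v + dv) - base) / eps
    return J

theta0 = flatten(rng.normal(size=(3, width)) / np.sqrt(3), rng.normal(size=width))
f0, J0 = f(theta0), jacobian(theta0)

theta = theta0.copy()
for _ in range(20):                                            # a few full-batch gradient steps
    theta -= 1.0 * grad_loss(theta)

f_true = f(theta)
f_lin = f0 + J0 @ (theta - theta0)                             # linearized model around theta0
print("how far the function moved:   ", np.max(np.abs(f_true - f0)))
print("gap to the linearized model:  ", np.max(np.abs(f_true - f_lin)))
```

The gap to the linearized model should be orders of magnitude smaller than how far the predictions moved, and it shrinks further as the width grows.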


Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations
Colin Wei (Stanford University)

Neural Information Processing Systems

Contrastive learning is a highly effective method for learning representations from unlabeled data. Recent works show that contrastive representations can transfer across domains, leading to simple state-of-the-art algorithms for unsupervised domain adaptation. In particular, a linear classifier trained to separate the representations on the source domain can also predict classes on the target domain accurately, even though the representations of the two domains are far from each other. We refer to this phenomenon as linear transferability.
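A minimal way to picture linear transferability: fit a linear head on source-domain features and apply it unchanged to target-domain features. The snippet below uses synthetic Gaussian features with a large domain shift chosen orthogonal to the class direction (an assumption of the toy, not the paper's contrastive representations), so the two domains are far apart yet the source-trained linear classifier still separates the target classes.

```python
# Synthetic illustration of linear transferability (toy Gaussian features, not real
# contrastive representations): a linear classifier fit on the source domain is applied,
# unchanged, to a far-away target domain whose shift preserves the class geometry.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 32, 2000
class_dirs = rng.normal(size=(2, d))                 # one mean direction per class
diff = class_dirs[1] - class_dirs[0]

shift = rng.normal(size=d)
shift -= (shift @ diff) / (diff @ diff) * diff       # keep the shift off the class direction
shift *= 4.0                                         # target representations end up far from the source

def make_domain(offset):
    labels = rng.integers(0, 2, size=n)
    feats = class_dirs[labels] + 0.5 * rng.normal(size=(n, d)) + offset
    return feats, labels

Xs, ys = make_domain(0.0)                            # "source" representations
Xt, yt = make_domain(shift)                          # shifted "target" representations

clf = LogisticRegression(max_iter=1000).fit(Xs, ys)  # linear head trained on the source only
print("source accuracy:", clf.score(Xs, ys))
print("target accuracy:", clf.score(Xt, yt))         # high despite the large representation shift
```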


bca382c81484983f2d437f97d1e141f3-AuthorFeedback.pdf

Neural Information Processing Systems

We are also thankful for the reviewers' concrete suggestions on improving the draft. We agree with the reviewers that our proposed estimators are not computationally efficient. We work in the high-temperature regime, i.e., max ... We agree with the reviewer that our estimator doesn't recover the true model even in ... AR3: Model width as relaxed parameter. We thank the reviewer for raising this subtle issue.


A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions

Neural Information Processing Systems

We propose an adaptively weighted stochastic gradient Langevin dynamics algorithm (SGLD), so-called contour stochastic gradient Langevin dynamics (CSGLD), for Bayesian learning in big data statistics. The proposed algorithm is essentially a scalable dynamic importance sampler, which automatically flattens the target distribution such that the simulation for a multi-modal distribution can be greatly facilitated. Theoretically, we prove a stability condition and establish the asymptotic convergence of the self-adapting parameter to a unique fixed-point, regardless of the non-convexity of the original energy function; we also present an error analysis for the weighted averaging estimators. Empirically, the CSGLD algorithm is tested on multiple benchmark datasets including CIFAR10 and CIFAR100. The numerical results indicate its superiority over the existing state-of-the-art algorithms in training deep neural networks.
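For context, the snippet below runs plain SGLD (the base dynamics that CSGLD reweights; this is not the CSGLD update itself) on a two-mode toy energy. Started in one mode, the vanilla chain rarely visits the other within the step budget, which is exactly the multimodality problem the contour weighting is designed to address.

```python
# Plain SGLD on a two-mode toy energy, for reference only (this is the base dynamics
# that CSGLD reweights, not the CSGLD update): started in one mode, the chain rarely
# crosses to the other, which is the problem the adaptive flattening targets.
import numpy as np

rng = np.random.default_rng(0)
modes = np.array([-4.0, 4.0])

def grad_U(x):
    # U(x) = -log(0.5 * N(x; -4, 1) + 0.5 * N(x; +4, 1)), gradient via the responsibilities
    logp = -0.5 * (x - modes) ** 2
    p = np.exp(logp - logp.max()); p /= p.sum()
    return np.sum(p * (x - modes))

x, eps = -4.0, 0.05
samples = []
for _ in range(50_000):
    x = x - eps * grad_U(x) + np.sqrt(2.0 * eps) * rng.normal()   # Langevin step at temperature 1
    samples.append(x)

samples = np.array(samples)
print("time spent near the +4 mode:", np.mean(samples > 0.0))     # ~0.5 only if mixing were good
```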


b5b8c484824d8a06f4f3d570bc420313-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the reviewers for the valuable comments.
Advantages of CSGLD over M-SGD: (i) CSGLD belongs to the class of adaptive biasing force algorithms, and ... Empirically, we suggest partitioning the sample space into a moderate number of subregions, e.g. ...
Drawbacks of simulated annealing (SA) and replica exchange SGLD (reSGLD)/parallel tempering: SA can only be ...
Q2. Missing baselines: We further compared CSGLD with CyclicalSGLD and reSGLD on an asymmetric mixture ... We will include the baselines and references in the next version.
The gradient-vanishing problem in SGLD is not clear: Please refer to our reply to Q1 of Reviewer 1.
Q1. Comments on bizarre peaks: A bizarre peak always indicates that there is a local minimum of the same energy in ...
Q3. ...


Differentiable Simulation of Soft Multi-body Systems
Yi-Ling Qiao (University of Maryland, College Park), Vladlen Koltun

Neural Information Processing Systems

We present a method for differentiable simulation of soft articulated bodies. Our work enables the integration of differentiable physical dynamics into gradient-based pipelines. We develop a top-down matrix assembly algorithm within Projective Dynamics and derive a generalized dry friction model for soft continuum using a new matrix splitting strategy. We derive a differentiable control framework for soft articulated bodies driven by muscles, joint torques, or pneumatic tubes. The experiments demonstrate that our designs make soft body simulation more stable and realistic compared to other frameworks. Our method accelerates the solution of system identification problems by more than an order of magnitude, and enables efficient gradient-based learning of motion control with soft robots.
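The Projective Dynamics machinery is beyond a short snippet, but the basic payoff of a differentiable simulator is easy to show. The sketch below (a hypothetical 1-D damped spring, not the paper's soft-body model) rolls out an autograd-tracked simulation in PyTorch and recovers the spring stiffness from an observed trajectory by gradient descent, i.e., a miniature system identification problem.

```python
# Not the paper's Projective Dynamics solver: a hypothetical 1-D damped spring, rolled out
# with autograd-tracked parameters, to show how a differentiable simulator turns system
# identification into gradient descent.
import torch

def rollout(stiffness, damping, steps=100, dt=0.01):
    x, v = torch.tensor(1.0), torch.tensor(0.0)       # initial displacement and velocity
    traj = []
    for _ in range(steps):                            # semi-implicit (symplectic) Euler integration
        a = -stiffness * x - damping * v
        v = v + dt * a
        x = x + dt * v
        traj.append(x)
    return torch.stack(traj)

true_traj = rollout(torch.tensor(25.0), torch.tensor(0.5)).detach()   # "observed" trajectory

k = torch.tensor(5.0, requires_grad=True)             # poor initial guess for the stiffness
opt = torch.optim.Adam([k], lr=0.2)
for _ in range(500):
    opt.zero_grad()
    loss = torch.mean((rollout(k, torch.tensor(0.5)) - true_traj) ** 2)
    loss.backward()                                   # gradients flow through the whole rollout
    opt.step()

print("recovered stiffness:", k.item())               # should move close to the true value of 25
```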


Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques
Zitai Wang, Zhiyong Yang

Neural Information Processing Systems

Diffusion models are powerful generative models, and this capability can also be applied to discriminative tasks. The inner activations of a pre-trained diffusion model can serve as features for such tasks, namely, diffusion features. We discover that diffusion features are hindered by a hidden yet universal phenomenon that we call content shift: there are content differences between the features and the input image, such as the exact shape of a certain object. We trace the cause of content shift to an inherent characteristic of diffusion models, which suggests that this phenomenon is broadly present in diffusion features.
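As a schematic of how inner activations become features, the snippet below registers a forward hook on one block of a stand-in network and stores its output; in actual use the hooked module would be a block of a pre-trained diffusion U-Net evaluated on a noised input at a chosen timestep, and the stored activation would feed a probe or downstream classifier. The module, names, and shapes here are placeholders.

```python
# Schematic of extracting inner activations as features with a forward hook.  The tiny
# stand-in network below is a placeholder; in actual use the hooked module would be a
# block of a pre-trained diffusion U-Net evaluated on a noised input at a chosen timestep.
import torch
import torch.nn as nn

backbone = nn.Sequential(                             # stand-in for a diffusion backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)

features = {}
def save_activation(name):
    def hook(module, inputs, output):
        features[name] = output.detach()              # store the inner activation as the feature
    return hook

backbone[2].register_forward_hook(save_activation("mid_block"))

image = torch.randn(1, 3, 64, 64)                     # a (noised) input image would go here
with torch.no_grad():
    backbone(image)

print("extracted feature map shape:", tuple(features["mid_block"].shape))   # (1, 32, 64, 64)
```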


Supplementary Material to Linear Disentangled Representations and Unsupervised Action Estimation

Neural Information Processing Systems

When we predict post-action latent codes through a linear combination of representations, we lose the guarantee that the gradient will point towards this solution. Since REINFORCE applies exactly one representation exactly once, we are guaranteed (if the policy is accurate and the latent structure is amenable) that the gradient will point towards this solution. We find that the cyclic representation error ||α̂ - α|| = 0.157 is far worse than the 0.012 error of RGrVAE. Furthermore, the independence score is 0.830.