We will report the NLLs in the final version of the paper, in addition to reporting averages and standard deviations in all of our other experiments

Neural Information Processing Systems

We agree with all three reviewers that evaluating the predictive variances is important. Finally, we will clarify that SGPR is due to Titsias (2009) and SVGP to Hensman et al. (2013). This has important ramifications, e.g., ... In contrast, using CG requires exactly 2w exchanges to do a linear solve. We were unaware of Nguyen's paper at submission, and we will add this discussion to the paper. We note that the precomputation, like CG, can be run to a specified tolerance.
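The point that solves can be run to a desired tolerance can be illustrated with a plain conjugate gradient (CG) loop that stops once the residual norm drops below a requested threshold. This is only a generic sketch of CG with a tolerance-based stopping rule, not code from the paper; the matrix, right-hand side, and tolerance names are our own.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-6, max_iters=1000):
        """Solve A x = b for symmetric positive-definite A, stopping once the
        residual norm falls below the requested tolerance (illustrative sketch)."""
        x = np.zeros_like(b, dtype=float)
        r = b - A @ x              # initial residual
        p = r.copy()               # initial search direction
        rs_old = r @ r
        for _ in range(max_iters):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:   # run only to the specified tolerance
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x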



92d1e1eb1cd6f9fba3227870bb6d7f07-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their fruitful comments!

Response to Reviewer 2: We predict characters for LibriSpeech/Libri-light. Thank you for the pointer!

"When the official LibriSpeech LM... is incorporated into decoding, it is not clear whether the experiments still represent..." We will also try to make it more self-contained given the space restrictions.

"I'm not convinced that this training works well conceptually." "... for ASR, we have a lot of transcribed data, and we can make a strong ASR model and perform transfer learning."

"... how to extract K detractors." - The distractors are quantized latent speech representations sampled from masked time-steps. If another masked time-step uses the same quantized latent, then it won't be sampled.

"The paper would have been significantly different in terms of quality had you applied your approach to some standard..." - This follows other recent work on semi-supervised methods for speech, such as "Improved Noisy Student Training" and Synnaeve et al. (2020), which achieve some of the strongest results.
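To make the distractor description above concrete, the following sketch samples distractor time-steps for a single positive from the other masked positions, skipping any position whose quantized latent equals the positive's, as stated in the response. It is our own illustration, not code from the paper; the tensor shapes, the value of K, and sampling without replacement are assumptions.

    import torch

    def sample_distractors(quantized, masked_indices, positive_idx, K=100):
        """Illustrative sketch: pick up to K distractor time-steps for one positive.

        quantized:      (T, D) quantized latent speech representations
        masked_indices: 1-D tensor of masked time-step indices
        positive_idx:   index of the positive (current) masked time-step
        """
        positive = quantized[positive_idx]
        # Candidates are other masked time-steps whose quantized latent differs
        # from the positive's; same-latent positions are not sampled.
        candidates = [
            int(t) for t in masked_indices
            if int(t) != positive_idx and not torch.equal(quantized[int(t)], positive)
        ]
        if not candidates:
            return torch.empty(0, dtype=torch.long)
        num = min(K, len(candidates))
        perm = torch.randperm(len(candidates))[:num]   # uniform, without replacement
        return torch.tensor(candidates)[perm]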


Towards Optimal Communication Complexity in Distributed Non-Convex Optimization
Lingxiao Wang

Neural Information Processing Systems

We study the problem of distributed stochastic non-convex optimization with intermittent communication. We consider the full participation setting where M machines work in parallel over R communication rounds and the partial participation setting where M machines are sampled independently every round from some meta-distribution over machines. We propose and analyze a new algorithm that improves existing methods by requiring fewer and lighter variance reduction operations. We also present lower bounds, showing our algorithm is either optimal or almost optimal in most settings. Numerical experiments demonstrate the superior performance of our algorithm.
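As a rough illustration of the intermittent-communication setting (not the proposed algorithm), the sketch below runs M machines for R rounds, with K local stochastic gradient steps per round and one model exchange at the end of each round; the gradient oracle, step size, and variable names are placeholders of ours.

    import numpy as np

    def intermittent_communication_sgd(grad_oracle, dim, M=8, R=100, K=10, lr=0.1):
        """Sketch of a local-update loop with intermittent communication.

        grad_oracle(x, m) returns a stochastic gradient of machine m's objective at x.
        """
        x = np.zeros(dim)                      # shared model after each communication round
        for _ in range(R):
            local_models = []
            for m in range(M):                 # machines work in parallel (simulated serially here)
                x_local = x.copy()
                for _ in range(K):             # K local stochastic gradient steps
                    x_local -= lr * grad_oracle(x_local, m)
                local_models.append(x_local)
            x = np.mean(local_models, axis=0)  # one communication round: average local models
        return x

A partial-participation variant would simply sample a subset of machines from the meta-distribution at the start of each round before running the inner loop.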


6b7375226d4742ff910618a56ae72b7d-Paper-Conference.pdf

Neural Information Processing Systems

It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold leads to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of the reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task. In contrast, starting training with LRs that are too small leads to unstable minima and to attempts to learn all features simultaneously, resulting in poor generalization. Conversely, initial LRs that are too large fail to locate a basin with good solutions and to extract meaningful patterns from the data.
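A minimal sketch of the two-phase protocol referred to above (a large initial LR followed by fine-tuning with a small LR), written as a generic PyTorch loop; the optimizer choice, LR values, and epoch counts are placeholders of ours, not the paper's settings.

    import torch

    def two_phase_training(model, loss_fn, train_loader,
                           lr_large=0.1, lr_small=0.001,
                           epochs_phase1=50, epochs_phase2=10):
        """Sketch: train with a large initial LR, then fine-tune with a small LR."""
        def run_phase(lr, epochs):
            opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
            for _ in range(epochs):
                for inputs, targets in train_loader:
                    opt.zero_grad()
                    loss = loss_fn(model(inputs), targets)
                    loss.backward()
                    opt.step()

        run_phase(lr_large, epochs_phase1)   # phase 1: large initial LR
        run_phase(lr_small, epochs_phase2)   # phase 2: fine-tune with a small LR
        return model

Weight averaging over the end of the first phase is the alternative second step mentioned in the abstract.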


Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure

Neural Information Processing Systems

In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Interestingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have an inductive bias toward capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case where the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notably strong generalization phenomenon recently observed in real-world diffusion models.
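For reference, the optimal denoiser for a multivariate Gaussian is linear and has a closed form; the sketch below evaluates it from the empirical mean and covariance of a dataset, assuming the additive-noise parameterization x_t = x_0 + sigma * eps. The variable names and this particular parameterization are our assumptions rather than the paper's exact formulation.

    import numpy as np

    def gaussian_optimal_denoiser(x_t, data, sigma):
        """Sketch: posterior-mean denoiser for N(mu, Sigma) fit to `data`.

        For x_t = x_0 + sigma * eps with x_0 ~ N(mu, Sigma) and eps ~ N(0, I):
            E[x_0 | x_t] = mu + Sigma (Sigma + sigma^2 I)^{-1} (x_t - mu)
        """
        mu = data.mean(axis=0)                       # empirical mean of the training set
        Sigma = np.cov(data, rowvar=False)           # empirical covariance
        d = Sigma.shape[0]
        A = Sigma @ np.linalg.inv(Sigma + sigma**2 * np.eye(d))
        return mu + A @ (x_t - mu)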


High-Dimensional Sparse Linear Bandits

Neural Information Processing Systems

Stochastic linear bandits with high-dimensional sparse features are a practical model for a variety of domains, including personalized medicine and online advertising [Bastani and Bayati, 2020].


Stochastic Optimization with Laggard Data Pipelines

Neural Information Processing Systems

State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We provide the first convergence analyses of "data-echoed" extensions of common optimization methods, showing that they exhibit provable improvements over their synchronous counterparts. Specifically, we show that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
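To make the data-echoing idea concrete, here is a minimal sketch (our own simplification, not the paper's analysis or Choi et al.'s implementation) in which each fetched minibatch is reused for a fixed number of echoed gradient steps, standing in for the steps taken while the upstream pipeline prepares the next batch; the echo factor and optimizer are placeholders.

    import torch

    def data_echoed_sgd(model, loss_fn, data_stream, lr=0.01, echo_factor=4, num_batches=1000):
        """Sketch of data echoing: take `echo_factor` gradient steps per fetched batch."""
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _, (inputs, targets) in zip(range(num_batches), data_stream):
            # This batch is "echoed": reused for several gradient steps while the
            # (slow) upstream pipeline would be preparing the next batch.
            for _ in range(echo_factor):
                opt.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                opt.step()
        return model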


... comments on the presentation, which we will address while preparing our final manuscript

Neural Information Processing Systems

We thank all the reviewers for their careful reading and thoughtful comments. For example, ε = 1.1 would permit an algorithm to go from ... Additionally, data analysis pipelines (e.g., model selection) in practice typically contain many ...


Information theoretic limits of learning a sparse rule

Neural Information Processing Systems

We consider generalized linear models in regimes where the number of nonzero components of the signal and accessible data points are sublinear with respect to the size of the signal. We prove a variational formula for the asymptotic mutual information per sample when the system size grows to infinity. This result allows us to derive an expression for the minimum mean-square error (MMSE) of the Bayesian estimator when the signal entries have a discrete distribution with finite support. We find that, for such signals and suitable vanishing scalings of the sparsity and sampling rate, the MMSE is nonincreasing and piecewise constant. In specific instances the MMSE even displays an all-or-nothing phase transition, that is, the MMSE sharply jumps from its maximum value to zero at a critical sampling rate. The all-or-nothing phenomenon has previously been shown to occur in high-dimensional linear regression. Our analysis goes beyond the linear case and applies to learning the weights of a perceptron with a general activation function in a teacher-student scenario. In particular, we discuss an all-or-nothing phenomenon for the generalization error with a sublinear set of training examples.
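As a pointer to the kind of setup involved, a teacher-student generalized linear model with a sparse teacher can be written as below; the notation is ours (with normalizations omitted), not necessarily the paper's.

\[
Y_\mu = \varphi\big(\langle \mathbf{X}_\mu, \mathbf{S}^*\rangle,\; A_\mu\big), \qquad \mu = 1,\dots,m,
\]
where $\mathbf{S}^* \in \mathbb{R}^n$ is the signal with $k$ nonzero components, $\mathbf{X}_\mu$ are i.i.d. feature vectors, $\varphi$ is a general activation (the identity recovers linear regression, a sign-like activation recovers the perceptron), $A_\mu$ is auxiliary noise, and both $k$ and the number of samples $m$ are sublinear in the signal size $n$.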