sigma
Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization
This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and $\mathbf{V}$). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance the $\mathbf{U}$ and $\mathbf{V}$.This paper presents the first proof that shows randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem with a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.
Dimension-free Private Mean Estimation for Anisotropic Distributions
Previous private estimators on distributions over \mathbb{R} d suffer from a curse of dimensionality, as they require \Omega(d {1/2}) samples to achieve non-trivial error, even in cases where O(1) samples suffice without privacy. This rate is unavoidable when the distribution is isotropic, namely, when the covariance is a multiple of the identity matrix. Yet, real-world data is often highly anisotropic, with signals concentrated on a small number of principal components. We develop estimators that are appropriate for such signals---our estimators are (\varepsilon,\delta) -differentially private and have sample complexity that is dimension-independent for anisotropic subgaussian distributions. We show that this is the optimal sample complexity for this task up to logarithmic factors.
Oja's Algorithm for Streaming Sparse PCA
Oja's algorithm for Streaming Principal Component Analysis (PCA) for n data-points in a d dimensional space achieves the same sin-squared error O(r_{\mathsf{eff}}/n) as the offline algorithm in O(d) space and O(nd) time and a single pass through the datapoints. Here r_{\mathsf{eff}} is the effective rank (ratio of the trace and the principal eigenvalue of the population covariance matrix \Sigma). Under this computational budget, we consider the problem of sparse PCA, where the principal eigenvector of \Sigma is s -sparse, and r_{\mathsf{eff}} can be large. In this setting, to our knowledge, *there are no known single-pass algorithms* that achieve the minimax error bound in O(d) space and O(nd) time without either requiring strong initialization conditions or assuming further structure (e.g., spiked) of the covariance matrix.We show that a simple single-pass procedure that thresholds the output of Oja's algorithm (the Oja vector) can achieve the minimax error bound under some regularity conditions in O(d) space and O(nd) time. We present a nontrivial and novel analysis of the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices on a random initial vector.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
I have read the paper fast classification rates for high-dimensional conditional Gaussian models". The paper studies the problem of binary classification using a Gaussian model and provides some theoretical results on the convergence of the classification error rates (compared to the Bayes classifier). The paper presents some nice theoretical results and is interesting to some extent. I am generally positive about the paper but I have the following concerns. First, it is about the practical relevance.
Review for NeurIPS paper: Reinforced Molecular Optimization with Neighborhood-Controlled Grammars
Additional Feedback: I appreciate the authors for addressing most of my concerns. I have updated my score from 4 to 6. i) For the empirical evaluation, I understand that the proposed method performs better than the method I found, when compared in fair settings. I think the experimental setting is sound enough, because the evaluation score is independent of the classifier. I wish the authors mention the existence of such benchmark environments in the main text so that following papers can use them. I would like the authors to clarify that the valency-preserving property comes from the inference algorithm rather than the definition of the molecular NCE grammar, because Definition 1 does not much specify the embedding function phi. For example, if we add phi(1, 6) "..." in the production rule shown in the top of Figure 2, this production rule does not preserve the degree of node 1, while the embedding function with phi(1, 6) "..." is still legal.
Reviews: Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization
The fundamental claim [line 101 & 239] is that asymptotically, for streaming PCA, the delay tau is allowed to scale as (1 - mu) 2 / sqrt(eta), where mu is the step size and mu the momentum parameter. Major Comments Before we discuss the proof, I think the introduction is somewhat misleading. In line 76, the authors point out previous work all focus on analyzing convergence to a first order optimal solution. The readers can be confused that this paper improved the results of previous work. However, the problems studies in those paper and streaming PCA are different.
Reviews: Neural Tangent Kernel: Convergence and Generalization in Neural Networks
The authors prove that networks of infinite width trained with SGD and (infinitely) small step size evolve according to a differential equation, the solution of which depends only on the covariance kernel of the data and, in the case of L2 regression, on the eigenspectrum of the Kernel. I believe this is a breakthrough result in the field of neural network theory. It elevates the analysis of infinitely wide networks from the study of the static initial function to closely predicting the entire training path. There are a plethora of powerful consequences about infinitely wide, fully-connected networks: - They cannot learn information not contained in the covariance matrix - Change to latent representation and parameters tends to zero as width goes to infinity. Therefore choosing nonlinearities in all layers reduces to choosing a single 1d function.
Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
Arora, Sanjeev, Ge, Rong, Moitra, Ankur, Sachdeva, Sushant
We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y Ax \eta$ where $A$ is an unknown $n \times n$ matrix and $x$ is chosen uniformly at random from $\{ 1, -1\} n$, $\eta$ is an $n$-dimensional Gaussian random variable with unknown covariance $\Sigma$: We give an algorithm that provable recovers $A$ and $\Sigma$ up to an additive $\epsilon$ whose running time and sample complexity are polynomial in $n$ and $1 / \epsilon$. To accomplish this, we introduce a novel quasi-whitening'' step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of $A$ one by one via local search. Papers published at the Neural Information Processing Systems Conference.
How to propagate uncertainty into the prediction of a neural network?
I am using them to predict outputs $y_1 \ldots y_m$ on a trained neural network. How can I obtain 1$\sigma$ uncertainties on my predictions? My idea is to randomly perturb each input $x_i$ with normal noise having mean 0 and standard deviation $\epsilon_i$ a large number of times (say, 10000), and then take the median and standard deviation of each prediction $y_i$. I fear that this only takes into account the "random" error (from the measurements) and not the "systematic" error (from the network), i.e., each prediction inherently has some error to it that is not being considered in this approach. How can I properly obtain $1\sigma$ error bars on my predictions?
Generative Adversarial Networks recover features in astrophysical images of galaxies beyond the deconvolution limit
Schawinski, Kevin, Zhang, Ce, Zhang, Hantian, Fowler, Lucas, Santhanam, Gokula Krishnan
Similarly, the observation is limited in angular resolution by the resolving power of the telescope (R λ/D) and, if taken from the ground, by the distortions caused by the moving atmosphere (the "seeing"). The total blurring introduced by the combination of the telescope and the atmosphere is described by the point spread function (PSF). An image taken by a telescope can therefore be thought of as a convolution of the true light distribution with this point spread function plus the addition of various sources of noise. The Shannon-Nyquist sampling theorem (Nyquist 1928; Shannon 1949) limits the ability of deconvolution techniques in removing the effect of the PSF, particularly in the presence of noise (Magain et al. 1998; Courbin 1999; Starck et al. 2002). Deconvolution has long been known as an "ill-posed" ABSTRACT Observations of astrophysical objects such as galaxies are limited by various sources of random and systematic noise from the sky background, the optical system of the telescope and the detector used to record the data. Conventional deconvolution techniques are limited in their ability to recover features in imaging data by the Shannon-Nyquist sampling theorem. Here we train a generative adversarial network (GAN) on a sample of 4, 550 images of nearby galaxies at 0.01 z 0.02 from the Sloan Digital Sky Survey and conduct 10 cross validation to evaluate the results. We present a method using a GAN trained on galaxy images that can recover features from artificially degraded images with worse seeing and higher noise than the original with a performance which far exceeds simple deconvolution. The ability to better recover detailed features such as galaxy morphology from low-signal-to-noise and low angular resolution imaging data significantly increases our ability to study existing data sets of astrophysical objects as well as future observations with observatories such as the Large Synoptic Sky Telescope (LSST) and the Hubble and James Webb space telescopes.