$\log(1/\epsilon)$ Convergence of Gradient-Based Algorithms in Zero-Sum Games without the Condition Number: A Smoothed Analysis
Gradient-based algorithms have shown great promise in solving large (two-player) zero-sum games. However, their success has been mostly confined to the low-precision regime, since the number of iterations grows polynomially in $1/\epsilon$, where $\epsilon > 0$ is the duality gap. While it has been well-documented that linear convergence---an iteration complexity scaling as $\log(1/\epsilon)$---can be attained even with gradient-based algorithms, that comes at the cost of introducing a dependency on certain condition-number-like quantities which can be exponentially large in the description of the game. To address this shortcoming, we examine the iteration complexity of several gradient-based algorithms in the celebrated framework of smoothed analysis, and we show that they have polynomial smoothed complexity, in that their number of iterations grows as a polynomial in the dimensions of the game, $\log(1/\epsilon)$, and $1/\sigma$, where $\sigma$ measures the magnitude of the smoothing perturbation. Our result applies to optimistic gradient and extra-gradient descent/ascent, as well as a certain iterative variant of Nesterov's smoothing technique. From a technical standpoint, the proof proceeds by characterizing and performing a smoothed analysis of a certain error bound, the key ingredient driving linear convergence in zero-sum games. En route, our characterization also makes a natural connection between the convergence rate of such algorithms and perturbation-stability properties of the equilibrium, which is of interest beyond the model of smoothed complexity.
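As an illustration of one of the algorithms covered by the abstract, here is a minimal numpy sketch of extra-gradient descent/ascent on a toy unconstrained bilinear game; the specific matrix, initialization, and step size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy unconstrained bilinear game min_x max_y x^T A y; its unique
# equilibrium is (0, 0). Plain simultaneous gradient descent/ascent
# cycles on this instance, while the extra-gradient look-ahead step
# yields linear, i.e. log(1/eps), convergence.
A = np.array([[2.0, 1.0],
              [0.0, 2.0]])
x = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])
eta = 0.2  # step size (assumed; should stay well below 1/||A||)

for _ in range(500):
    # Extrapolation (look-ahead) step.
    x_half = x - eta * A @ y
    y_half = y + eta * A.T @ x
    # Update step uses gradients evaluated at the extrapolated point.
    x, y = x - eta * A @ y_half, y + eta * A.T @ x_half
```

On this well-conditioned instance the iterates contract geometrically toward the equilibrium, which is exactly the linear-convergence behavior whose smoothed analysis the abstract describes.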
Oja's Algorithm for Streaming Sparse PCA
Oja's algorithm for Streaming Principal Component Analysis (PCA) for $n$ data points in a $d$-dimensional space achieves the same $\sin^2$ error $O(r_{\mathsf{eff}}/n)$ as the offline algorithm in $O(d)$ space and $O(nd)$ time, with a single pass through the data points. Here $r_{\mathsf{eff}}$ is the effective rank (the ratio of the trace to the principal eigenvalue of the population covariance matrix $\Sigma$). Under this computational budget, we consider the problem of sparse PCA, where the principal eigenvector of $\Sigma$ is $s$-sparse and $r_{\mathsf{eff}}$ can be large. In this setting, to our knowledge, *there are no known single-pass algorithms* that achieve the minimax error bound in $O(d)$ space and $O(nd)$ time without either requiring strong initialization conditions or assuming further structure (e.g., spiked) of the covariance matrix. We show that a simple single-pass procedure that thresholds the output of Oja's algorithm (the Oja vector) can achieve the minimax error bound under some regularity conditions in $O(d)$ space and $O(nd)$ time. We present a nontrivial and novel analysis of the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices onto a random initial vector. This is completely different from previous analyses of Oja's algorithm and matrix products, which have been carried out only when $r_{\mathsf{eff}}$ is bounded.
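The single-pass procedure described above can be sketched in a few lines. The following numpy toy runs Oja's update and then thresholds the small entries of the Oja vector; the synthetic spiked-style covariance, step-size schedule, and threshold are all illustrative assumptions, not the paper's exact regularity conditions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, s, lam = 50, 20000, 3, 4.0
# Planted s-sparse principal eigenvector (toy example).
v = np.zeros(d)
v[:s] = 1.0 / np.sqrt(s)

# Oja's algorithm: single pass over the stream, O(d) space.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
for t in range(1, n + 1):
    # x_t ~ N(0, I + lam * v v^T): samples whose principal direction is v.
    x = np.sqrt(lam) * rng.standard_normal() * v + rng.standard_normal(d)
    u += (1.0 / t) * x * (x @ u)   # eta_t = 1/t (illustrative schedule)
    u /= np.linalg.norm(u)

# Threshold the Oja vector: zero out small entries, then renormalize.
tau = 0.1                          # illustrative threshold level
u_hat = np.where(np.abs(u) > tau, u, 0.0)
u_hat /= np.linalg.norm(u_hat)
```

With this spiked toy covariance, the thresholded Oja vector recovers the planted support and aligns closely with the sparse principal eigenvector.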
Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization
Asymmetric low-rank matrix factorization, in which a matrix is approximated by a product $\mathbf{U}\mathbf{V}^\top$, is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to the unbalancedness of $\mathbf{U}$ and $\mathbf{V}$). It is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories explaining this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration or adding a regularizer to balance $\mathbf{U}$ and $\mathbf{V}$. This paper presents the first proof that randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem at a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.
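The empirical phenomenon the paper analyzes is easy to reproduce numerically. Below is a minimal sketch (problem sizes, step size, and iteration count are arbitrary choices for illustration): plain gradient descent on $\frac{1}{2}\|\mathbf{U}\mathbf{V}^\top - \mathbf{M}\|_F^2$ from a small random initialization, with no injected noise and no balancing regularizer.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 8, 6, 2
# Rank-r target matrix (toy instance).
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Small random initialization; no balancing regularizer, no added noise.
U = 0.01 * rng.standard_normal((m, r))
V = 0.01 * rng.standard_normal((n, r))
eta = 0.01

for _ in range(20000):
    R = U @ V.T - M                      # residual U V^T - M
    # Simultaneous gradient step on 0.5 * ||U V^T - M||_F^2.
    U, V = U - eta * R @ V, V - eta * R.T @ U
```

Despite the non-convexity, the unmodified iterates drive the residual to (numerical) zero, matching the global-convergence behavior the paper proves.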
Dimension-free Private Mean Estimation for Anisotropic Distributions
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality, as they require $\Omega(d^{1/2})$ samples to achieve non-trivial error, even in cases where $O(1)$ samples suffice without privacy. This rate is unavoidable when the distribution is isotropic, namely, when the covariance is a multiple of the identity matrix. Yet, real-world data is often highly anisotropic, with signals concentrated on a small number of principal components. We develop estimators that are appropriate for such signals---our estimators are $(\varepsilon,\delta)$-differentially private and have sample complexity that is dimension-independent for anisotropic subgaussian distributions. We show that this is the optimal sample complexity for this task up to logarithmic factors.
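For context, the $d^{1/2}$ cost the abstract refers to is exactly what the standard clip-and-noise baseline pays, because it adds noise of the same magnitude in every coordinate. Here is a minimal sketch of that classical Gaussian-mechanism baseline (this is not the paper's dimension-independent estimator; the clipping radius and privacy parameters are illustrative assumptions).

```python
import numpy as np

def private_mean(X, eps, delta, R, rng):
    """(eps, delta)-DP mean via the classical Gaussian mechanism:
    clip each sample to l2-norm R, average, and add Gaussian noise
    calibrated to the mean's replace-one l2-sensitivity 2R/n."""
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    clipped = X * np.minimum(1.0, R / norms)
    sens = 2.0 * R / n
    sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    # The noise std is identical in every coordinate, so the total
    # error grows like sigma * sqrt(d) -- the curse of dimensionality
    # described above, which the paper's estimators avoid for
    # anisotropic distributions.
    return clipped.mean(axis=0) + sigma * rng.standard_normal(d)

rng = np.random.default_rng(4)
d, n = 20, 100000
mu = np.zeros(d)
X = rng.standard_normal((n, d)) + mu     # toy isotropic subgaussian data
est = private_mean(X, eps=1.0, delta=1e-5, R=10.0, rng=rng)
```

With ample samples the baseline is accurate, but its noise scales with the ambient dimension rather than with the effective (anisotropic) signal dimension.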
I have read the paper "Fast Classification Rates for High-Dimensional Conditional Gaussian Models". The paper studies the problem of binary classification using a Gaussian model and provides some theoretical results on the convergence of the classification error rates (compared to the Bayes classifier). The paper presents some nice theoretical results and is interesting to some extent. I am generally positive about the paper, but I have the following concerns. The first concerns practical relevance.
Review for NeurIPS paper: Reinforced Molecular Optimization with Neighborhood-Controlled Grammars
Additional Feedback: I appreciate the authors for addressing most of my concerns. I have updated my score from 4 to 6. i) For the empirical evaluation, I understand that the proposed method performs better than the method I found when compared in fair settings. I think the experimental setting is sound enough, because the evaluation score is independent of the classifier. I wish the authors would mention the existence of such benchmark environments in the main text so that following papers can use them. I would like the authors to clarify that the valency-preserving property comes from the inference algorithm rather than from the definition of the molecular NCE grammar, because Definition 1 does not fully specify the embedding function phi. For example, if we add phi(1, 6) "..." to the production rule shown at the top of Figure 2, this production rule does not preserve the degree of node 1, while the embedding function with phi(1, 6) "..." is still legal.
Reviews: Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization
The fundamental claim [line 101 & 239] is that asymptotically, for streaming PCA, the delay tau is allowed to scale as (1 - mu)^2 / sqrt(eta), where eta is the step size and mu the momentum parameter. Major comments: Before we discuss the proof, I think the introduction is somewhat misleading. In line 76, the authors point out that previous work all focuses on analyzing convergence to a first-order optimal solution. Readers may be confused into thinking that this paper improves the results of that previous work. However, the problems studied in those papers and streaming PCA are different.
Reviews: Neural Tangent Kernel: Convergence and Generalization in Neural Networks
The authors prove that networks of infinite width trained with SGD and (infinitely) small step size evolve according to a differential equation, the solution of which depends only on the covariance kernel of the data and, in the case of L2 regression, on the eigenspectrum of the kernel. I believe this is a breakthrough result in the field of neural network theory. It elevates the analysis of infinitely wide networks from the study of the static initial function to closely predicting the entire training path. There are a plethora of powerful consequences about infinitely wide, fully-connected networks:
- They cannot learn information not contained in the covariance matrix.
- Changes to the latent representation and to the parameters tend to zero as width goes to infinity. Therefore, choosing nonlinearities in all layers reduces to choosing a single 1d function.
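The infinite-width determinism the review describes can be checked empirically at finite width: the empirical neural tangent kernel of a wide one-hidden-layer ReLU network is nearly the same across independent random initializations. A minimal numpy sketch (the architecture, width, and data are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def ntk_gram(X, m, seed):
    """Empirical NTK Gram matrix of a one-hidden-layer ReLU network
    f(x) = (1/sqrt(m)) * a^T relu(W x), with both layers trained
    (NTK parameterization)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, X.shape[1]))
    a = rng.standard_normal(m)
    Z = X @ W.T                       # (n, m) pre-activations
    act = np.maximum(Z, 0.0)          # relu(Z), gradient wrt a
    der = (Z > 0).astype(float)       # relu'(Z), enters gradient wrt W
    # Sum over units of products of per-parameter gradients:
    # K = sum_j [ a_j^2 relu'(w_j.x) relu'(w_j.x') (x.x') + relu(w_j.x) relu(w_j.x') ] / m
    K_w = ((der * a) @ (der * a).T) * (X @ X.T)
    K_a = act @ act.T
    return (K_w + K_a) / m

X = np.random.default_rng(3).standard_normal((5, 4))
K1 = ntk_gram(X, 50000, seed=10)
K2 = ntk_gram(X, 50000, seed=11)
# At large width the kernel concentrates: two independent draws
# give almost identical Gram matrices.
rel = np.linalg.norm(K1 - K2) / np.linalg.norm(K1)
```

The relative difference between the two kernels shrinks like one over the square root of the width, consistent with the kernel becoming deterministic in the infinite-width limit.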
INTERVIEW WITH A CAUTIOUS ARRAY
The encounter with the Specifying Integrating Generalizing Mass Array took place not in Sigma's supercooled chamber, but through the laptop. My interview with Sigma was full of irony. I hoped I was not being rude. "Anything else I can help you with?" The programming of AIs may have made them smart, I thought, but not polite. "It is a word used specifically to distinguish living beings from inanimate things."