Statistical Learning
The Benefits of Implicit Regularization from SGD in Least Squares Problems
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which have been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than are provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
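The comparison described in this abstract can be set up numerically. The following is a minimal sketch, not the paper's experimental protocol: the Gaussian data model, the polynomially decaying spectrum `eigs`, the noise level, the ridge parameter, the stepsize, and the sample sizes are all illustrative assumptions, and the iterate-averaging convention (average of w_0 through w_{n-1}) is one common choice.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def averaged_sgd(X, y, stepsize):
    """Single-pass, constant-stepsize SGD from w_0 = 0.

    Returns the average of the iterates w_0, ..., w_{n-1} (assumed convention).
    """
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for x_i, y_i in zip(X, y):
        w_sum += w                                   # accumulate before the update
        w = w - stepsize * (x_i @ w - y_i) * x_i     # gradient of 0.5 * (x.w - y)^2
    return w_sum / n

def excess_risk(w, w_star, eigs):
    """0.5 * (w - w*)^T H (w - w*) for a diagonal covariance H = diag(eigs)."""
    return 0.5 * np.sum(eigs * (w - w_star) ** 2)

def sample(rng, n, w_star, eigs, noise_std):
    """Draw x ~ N(0, diag(eigs)) and y = x.w* + Gaussian noise (assumed model)."""
    X = rng.standard_normal((n, eigs.size)) * np.sqrt(eigs)
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y

rng = np.random.default_rng(0)
d = 500                                   # overparameterized: d > n_ridge
eigs = 1.0 / np.arange(1, d + 1) ** 2     # hypothetical spectrum
w_star = np.ones(d)
n_ridge = 100
n_sgd = int(n_ridge * np.log(n_ridge))    # logarithmically more samples for SGD

X_r, y_r = sample(rng, n_ridge, w_star, eigs, noise_std=0.5)
X_s, y_s = sample(rng, n_sgd, w_star, eigs, noise_std=0.5)

w_ridge = ridge(X_r, y_r, lam=1.0)
w_sgd = averaged_sgd(X_s, y_s, stepsize=0.1)

print("ridge excess risk:", excess_risk(w_ridge, w_star, eigs))
print("SGD excess risk:  ", excess_risk(w_sgd, w_star, eigs))
```

In this sketch neither the ridge parameter nor the stepsize is tuned; reproducing the abstract's claims would require sweeping both and comparing the best value of each.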
Revisiting Active Sets for Gaussian Process Decoders
Decoders built on Gaussian processes (GPs) are enticing due to the marginalisation over the non-linear function space. Such models (also known as GP-LVMs) are often expensive and notoriously difficult to train in practice, but can be scaled using variational inference and inducing points. In this paper, we revisit active set approximations. We develop a new stochastic estimate of the log-marginal likelihood based on recently discovered links to cross-validation, and we propose a computationally efficient approximation thereof. We demonstrate that the resulting stochastic active sets (SAS) approximation significantly improves the robustness of GP decoder training, while reducing computational cost. The SAS-GP obtains more structure in the latent space, scales to many datapoints, and learns better representations than variational autoencoders, which is rarely the case for GP decoders.
Supplementary Material: "Compressing Neural Networks: Towards Determining the Optimal Layer-wise Decomposition"
The input tensor shape is $6 \times 3 \times 3$. The corresponding weight matrix has $f = 20$ rows (one row per filter) and 24 columns ($c \times \kappa_1 \times \kappa_2$). The corresponding feature matrix has 24 rows and 4 columns, where 4 is the number of convolution windows (i.e., the number of pixels/entries in each output feature map). After multiplying these matrices, we reshape the product to obtain the desired output feature maps.
In this section, we provide more details pertaining to our method.
A.1 Method Preliminaries
Our layer-wise compression technique hinges upon the insight that any linear layer may be cast as a matrix multiplication, which enables us to rely on SVD as a compression subroutine. Focusing on convolutions, we show how such a layer can be recast as a matrix multiplication. Similar approaches have been used by Denton et al. (2014), Idelbayev and Carreira-Perpiñán (2020), and Wen et al. (2017), among others. The equivalence of Y and Y can be easily established via an appropriate reshaping operation since $p = p_1 p_2$. Equipped with this correspondence between convolution and matrix multiplication, our goal is to decompose the layer via its matrix operator $W \in \mathbb{R}^{f \times c\kappa_1\kappa_2}$. To this end, we compute the rank-$j$ approximation of $W$ using SVD and factor it into a pair of smaller matrices $U \in \mathbb{R}^{f \times j}$ and $V \in \mathbb{R}^{j \times c\kappa_1\kappa_2}$.
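The recasting of a convolution as a matrix multiplication, followed by a truncated SVD, can be sketched with the shapes quoted above. This is an illustrative sketch rather than the authors' implementation: the 2 x 2 kernel size is inferred from 24 = 6 · 2 · 2 columns and 4 windows on a 3 x 3 input, stride 1 with no padding and the cross-correlation convention (no kernel flip) are assumptions, the helper names `im2col` and `conv_direct` are hypothetical, and the target rank `j = 5` is arbitrary.

```python
import numpy as np

def im2col(x, k1, k2):
    """Unfold a (c, h, w) input into a (c*k1*k2, p1*p2) feature matrix.

    Assumes stride 1 and no padding; each column is one flattened window.
    """
    c, h, w = x.shape
    p1, p2 = h - k1 + 1, w - k2 + 1
    cols = np.empty((c * k1 * k2, p1 * p2))
    for i in range(p1):
        for j in range(p2):
            cols[:, i * p2 + j] = x[:, i:i + k1, j:j + k2].reshape(-1)
    return cols, (p1, p2)

def conv_direct(x, kernels):
    """Reference sliding-window convolution (cross-correlation convention)."""
    f, c, k1, k2 = kernels.shape
    _, h, w = x.shape
    p1, p2 = h - k1 + 1, w - k2 + 1
    y = np.zeros((f, p1, p2))
    for n in range(f):
        for i in range(p1):
            for j in range(p2):
                y[n, i, j] = np.sum(kernels[n] * x[:, i:i + k1, j:j + k2])
    return y

rng = np.random.default_rng(0)
c, h, w = 6, 3, 3          # input tensor shape 6 x 3 x 3
f, k1, k2 = 20, 2, 2       # 20 filters; 2 x 2 kernels (inferred, see lead-in)
x = rng.standard_normal((c, h, w))
kernels = rng.standard_normal((f, c, k1, k2))

W = kernels.reshape(f, c * k1 * k2)     # weight matrix, 20 x 24
X, (p1, p2) = im2col(x, k1, k2)         # feature matrix, 24 x 4
Y = (W @ X).reshape(f, p1, p2)          # output feature maps, 20 x 2 x 2

# Rank-j approximation of W via SVD, factored into U (f x j) and V (j x 24).
j = 5
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :j] * s[:j]               # fold singular values into U
V = Vt[:j]
W_j = U @ V

# The compressed layer applies V first, then U: two smaller multiplications.
Y_compressed = (U @ (V @ X)).reshape(f, p1, p2)
```

Folding the singular values into `U` is one possible choice; they could equally be folded into `V` or split between the two factors.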
First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces
From optimal transport to robust dimensionality reduction, a plethora of machine learning applications can be cast as min-max optimization problems over Riemannian manifolds. Though many min-max algorithms have been analyzed in the Euclidean setting, it has proved elusive to translate these results to the Riemannian case. Zhang et al. have recently shown that geodesically convex-concave Riemannian problems always admit saddle-point solutions. Inspired by this result, we study whether a performance gap between Riemannian and optimal Euclidean space convex-concave algorithms is necessary. We answer this question in the negative: we prove that the Riemannian corrected extragradient (RCEG) method achieves last-iterate convergence at a linear rate in the geodesically strongly-convex-concave case, matching the Euclidean result. Our results also extend to the stochastic or non-smooth case, where RCEG and Riemannian gradient ascent descent (RGDA) achieve near-optimal convergence rates up to factors depending on the curvature of the manifold.