Goto

Collaborating Authors

 Rebeschini, Patrick


Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization

arXiv.org Artificial Intelligence

We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q- value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.


Time-independent Generalization Bounds for SGLD in Non-convex Settings

arXiv.org Machine Learning

We establish generalization error bounds for stochastic gradient Langevin dynamics (SGLD) with constant learning rate under the assumptions of dissipativity and smoothness, a setting that has received increased attention in the sampling/optimization literature. Unlike existing bounds for SGLD in non-convex settings, ours are time-independent and decay to zero as the sample size increases. Using the framework of uniform stability, we establish time-independent bounds by exploiting the Wasserstein contraction property of the Langevin diffusion, which also allows us to circumvent the need to bound gradients using Lipschitz-like assumptions. Our analysis also supports variants of SGLD that use different discretization methods, incorporate Euclidean projections, or use non-isotropic noise.


On Optimal Interpolation In Linear Regression

arXiv.org Machine Learning

Understanding when and why interpolating methods generalize well has recently been a topic of interest in statistical learning theory. However, systematically connecting interpolating methods to achievable notions of optimality has only received partial attention. In this paper, we investigate the question of what is the optimal way to interpolate in linear regression using functions that are linear in the response variable (as the case for the Bayes optimal estimator in ridge regression) and depend on the data, the population covariance of the data, the signal-to-noise ratio and the covariance of the prior for the signal, but do not depend on the value of the signal itself nor the noise vector in the training data. We provide a closed-form expression for the interpolator that achieves this notion of optimality and show that it can be derived as the limit of preconditioned gradient descent with a specific initialization. We identify a regime where the minimum-norm interpolator provably generalizes arbitrarily worse than the optimal response-linear achievable interpolator that we introduce, and validate with numerical experiments that the notion of optimality we consider can be achieved by interpolating methods that only use the training data as input in the case of an isotropic prior. Finally, we extend the notion of optimal response-linear interpolation to random features regression under a linear data-generating model that has been previously studied in the literature.


Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?

arXiv.org Machine Learning

Modern methods for learning from data depend on many tuning parameters, such as the stepsize for optimization methods, and the regularization strength for regularized learning methods. Since performance can depend strongly on these parameters, it is important to develop comparisons between \emph{classes of methods}, not just for particularly tuned ones. Here, we take aim to compare classes of estimators via the relative performance of the \emph{best method in the class}. This allows us to rigorously quantify the tuning sensitivity of learning algorithms. As an illustration, we investigate the statistical estimation performance of ridge regression with a uniform grid of regularization parameters, and of gradient descent iterates with a fixed stepsize, in the standard linear model with a random isotropic ground truth parameter. (1) For orthogonal designs, we find the \emph{exact minimax optimal classes of estimators}, showing they are equal to gradient descent with a polynomially decaying learning rate. We find the exact suboptimalities of ridge regression and gradient descent with a fixed stepsize, showing that they decay as either $1/k$ or $1/k^2$ for specific ranges of $k$ estimators. (2) For general designs with a large number of non-zero eigenvalues, we find that gradient descent outperforms ridge regression when the eigenvalues decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we find that ridge regression outperforms gradient descent. Our results highlight the importance of tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our case, it can be outperformed by gradient descent when both are restricted to being tuned over a finite regularization grid.


Implicit Regularization in Matrix Sensing via Mirror Descent

arXiv.org Machine Learning

We study discrete-time mirror descent applied to the unregularized empirical risk in matrix sensing. In both the general case of rectangular matrices and the particular case of positive semidefinite matrices, a simple potential-based analysis in terms of the Bregman divergence allows us to establish convergence of mirror descent -- with different choices of the mirror maps -- to a matrix that, among all global minimizers of the empirical risk, minimizes a quantity explicitly related to the nuclear norm, the Frobenius norm, and the von Neumann entropy. In both cases, this characterization implies that mirror descent, a first-order algorithm minimizing the unregularized empirical risk, recovers low-rank matrices under the same set of assumptions that are sufficient to guarantee recovery for nuclear-norm minimization. When the sensing matrices are symmetric and commute, we show that gradient descent with full-rank factorized parametrization is a first-order approximation to mirror descent, in which case we obtain an explicit characterization of the implicit bias of gradient flow as a by-product.


Nearly Minimax-Optimal Rates for Noisy Sparse Phase Retrieval via Early-Stopped Mirror Descent

arXiv.org Machine Learning

This paper studies early-stopped mirror descent applied to noisy sparse phase retrieval, which is the problem of recovering a $k$-sparse signal $\mathbf{x}^\star\in\mathbb{R}^n$ from a set of quadratic Gaussian measurements corrupted by sub-exponential noise. We consider the (non-convex) unregularized empirical risk minimization problem and show that early-stopped mirror descent, when equipped with the hyperbolic entropy mirror map and proper initialization, achieves a nearly minimax-optimal rate of convergence, provided the sample size is at least of order $k^2$ (modulo logarithmic term) and the minimum (in modulus) non-zero entry of the signal is on the order of $\|\mathbf{x}^\star\|_2/\sqrt{k}$. Our theory leads to a simple algorithm that does not rely on explicit regularization or thresholding steps to promote sparsity. More generally, our results establish a connection between mirror descent and sparsity in the non-convex problem of noisy sparse phase retrieval, adding to the literature on early stopping that has mostly focused on non-sparse, Euclidean, and convex settings via gradient descent. Our proof combines a potential-based analysis of mirror descent with a quantitative control on a variational coherence property that we establish along the path of mirror descent, up to a prescribed stopping time.


A Continuous-Time Mirror Descent Approach to Sparse Phase Retrieval

arXiv.org Machine Learning

We analyze continuous-time mirror descent applied to sparse phase retrieval, which is the problem of recovering sparse signals from a set of magnitude-only measurements. We apply mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements. We provide a convergence analysis of the algorithm in this non-convex setting and prove that, with the hypentropy mirror map, mirror descent recovers any $k$-sparse vector $\mathbf{x}^\star\in\mathbb{R}^n$ with minimum (in modulus) non-zero entry on the order of $\| \mathbf{x}^\star \|_2/\sqrt{k}$ from $k^2$ Gaussian measurements, modulo logarithmic terms. This yields a simple algorithm which, unlike most existing approaches to sparse phase retrieval, adapts to the sparsity level, without including thresholding steps or adding regularization terms. Our results also provide a principled theoretical understanding for Hadamard Wirtinger flow [58], as Euclidean gradient descent applied to the empirical risk problem with Hadamard parametrization can be recovered as a first-order approximation to mirror descent in discrete time.


Decentralised Learning with Random Features and Distributed Gradient Descent

arXiv.org Machine Learning

We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogenous setting where a network of agents are given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation across agents that allows to overcome previous difficulties in implementing Decentralised Kernel Regression. Under standard source and capacity assumptions, we establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. By tuning these parameters, we obtain statistical rates that are minimax optimal with respect to the total number of samples in the network. The algorithm provides a linear improvement over single machine Gradient Descent in memory cost and, when agents hold enough data with respect to the network size and inverse spectral gap, a linear speed-up in computational runtime for any network topology. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.


Hadamard Wirtinger Flow for Sparse Phase Retrieval

arXiv.org Machine Learning

Phase retrieval, the problem of reconstructing a signal from the (squared) magnitude of its Fourier (or any linear) transform, arises in many fields of science and engineering. Such a task is naturally involved in applications such as crystallography (Millane, 1990) and diffraction imaging (Bunk et al., 2007), where optical sensors are able to measure the intensity, but not the phase of a light wave. Due to the loss of phase information, the one-dimensional Fourier phase retrieval problem is ill-posed in general. Common approaches to overcome this ill-posedness include using prior information such as non-negativity, sparsity and the signal's magnitude (Fienup, 1982; Jaganathan et al., 2016), or introducing redundancy into the measurements by oversampling random Gaussian measurements or coded diffraction patterns (Candès et al., 2015; Chen and Candès, 2015). In many applications, the underlying signal is naturally sparse (Jaganathan et al., 2016). A wide range of algorithms has been devised for phase retrieval with a sparse signal, including alternating minimization (SparseAltMinPhase) (Netrapalli et al., 2015), non-convex optimization based approaches such as thresholded Wirtinger flow (TWF) (Cai et al., 2016), sparse truncated amplitude flow (SPARTA) (Wang et al., 2018), compressive reweighted amplitude flow (CRAF) (Zhang et al., 2018) and sparse Wirtinger flow (SWF) (Yuan et al., 2019), and convex relaxation methods such as compressive phase retrieval via lifting (CPRL) (Ohlsson et al., 2012) and SparsePhaseMax (Hand and Voroninski, 2016). Other approaches to sparse phase retrieval include the greedy algorithm GESPAR (Schechtman et al., 2014), an algorithm based on generalized


Decentralised Sparse Multi-Task Regression

arXiv.org Machine Learning

We consider a sparse multi-task regression framework for fitting a collection of related sparse models. Representing models as nodes in a graph with edges between related models, a framework that fuses lasso regressions with the total variation penalty is investigated. Under a form of restricted eigenvalue assumption, bounds on prediction and squared error are given that depend upon the sparsity of each model and the differences between related models. This assumption relates to the smallest eigenvalue restricted to the intersection of two cone sets of the covariance matrix constructed from each of the agents' covariances. We show that this assumption can be satisfied if the constructed covariance matrix satisfies a restricted isometry property. In the case of a grid topology high-probability bounds are given that match, up to log factors, the no-communication setting of fitting a lasso on each model, divided by the number of agents. A decentralised dual method that exploits a convex-concave formulation of the penalised problem is proposed to fit the models and its effectiveness demonstrated on simulations against the group lasso and variants.