Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

Neural Information Processing Systems

In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace.The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically inthe dimension of the instances.


Online PCA with Optimal Regret

#artificialintelligence

When viewed in the right parameterization, this compression loss is linear, i.e. it can be rewritten as \text{tr}(\mathbf{W}_t\x_t\x_t \top), where \mathbf{W}_t is the parameter of the algorithm and the outer product \x_t\x_t \top (with \ \x_t\ \le 1) is the instance matrix. In this paper generalize PCA to arbitrary positive definite instance matrices \mathbf{X}_t with the linear loss \text{tr}(\mathbf{W}_t\X_t) . We evaluate online algorithms in terms of their worst-case regret, which is a bound on the additional total loss of the online algorithm on all instances matrices over the compression loss of the best k -dimensional subspace (chosen in hindsight). We focus on two popular online algorithms for generalized PCA: the Gradient Descent (GD) and Matrix Exponentiated Gradient (MEG) algorithms. We show that if the regret is expressed as a function of the number of trials, then both algorithms are optimal to within a constant factor on worst-case sequences of positive definite instances matrices with trace norm at most one (which subsumes the original PCA problem with outer products).


Learning Eigenvectors for Free

Neural Information Processing Systems

We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of $n$ outcomes is replaced by the set of all dyads, i.e. outer products $\u\u^\top$ where $\u$ is a vector in $\R^n$ of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the $n^2$ parameters of a density matrix is much harder than learning the $n$ parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.


Bandit Principal Component Analysis

arXiv.org Machine Learning

We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment's choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of $\Omega(d\sqrt{T})$.


Stochastic Optimization of PCA with Capped MSG

arXiv.org Machine Learning

We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as "Matrix Stochastic Gradient" (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically.