Koolen, Wouter M., Kotlowski, Wojciech, Warmuth, Manfred K. K.

We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of $n$ outcomes is replaced by the set of all dyads, i.e. outer products $\u\u \top$ where $\u$ is a vector in $\R n$ of unit length. Whereas in the classical case the goal is to learn (i.e. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the $n 2$ parameters of a density matrix is much harder than learning the $n$ parameters of a multinomial distribution.

Kotłowski, Wojciech, Warmuth, Manfred K.

Most of machine learning deals with vector parameters. Ideally we would like to take higher order information into account and make use of matrix or even tensor parameters. However the resulting algorithms are usually inefficient. Here we address on-line learning with matrix parameters. It is often easy to obtain online algorithm with good generalization performance if you eigendecompose the current parameter matrix in each trial (at a cost of $O(n^3)$ per trial). Ideally we want to avoid the decompositions and spend $O(n^2)$ per trial, i.e. linear time in the size of the matrix data. There is a core trade-off between the running time and the generalization performance, here measured by the regret of the on-line algorithm (total gain of the best off-line predictor minus the total gain of the on-line algorithm). We focus on the key matrix problem of rank $k$ Principal Component Analysis in $\mathbb{R}^n$ where $k \ll n$. There are $O(n^3)$ algorithms that achieve the optimum regret but require eigendecompositions. We develop a simple algorithm that needs $O(kn^2)$ per trial whose regret is off by a small factor of $O(n^{1/4})$. The algorithm is based on the Follow the Perturbed Leader paradigm. It replaces full eigendecompositions at each trial by the problem finding $k$ principal components of the current covariance matrix that is perturbed by Gaussian noise.

The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector).The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculationis a quantum measurement where the covariance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, wherethe Kullbach-Leibler version gives the classical Bayes rule and Umegaki's quantum relative entropy the new Bayes rule for density matrices.

Kakade, Sham M., Seeger, Matthias W., Foster, Dean P.

Dean P. Foster University of Pennsylvania We present a competitive analysis of some nonparametric Bayesian algorithms ina worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) andprovide bounds on the regret (under the log loss) for commonly usednon-parametric Bayesian algorithms -- including Gaussian regression and logistic regression -- which show how these algorithms can perform favorably under rather general conditions.

Warmuth, Manfred K., Kuzmin, Dima

In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace.The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically inthe dimension of the instances.