Most machine learning algorithms have been developed and statistically validated for linearly separable data. Popular examples are linear classifiers like Support Vector Machines (SVMs) or the (standard) Principal Component Analysis (PCA) for dimensionality reduction. However, most real world data requires nonlinear methods in order to perform tasks that involve the analysis and discovery of patterns successfully. The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radius basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via BF kernel principal component analysis (kPCA). The main purpose of principal component analysis (PCA) is the analysis of data to identify patterns that represent the data "well."

Kotłowski, Wojciech, Warmuth, Manfred K.

Most of machine learning deals with vector parameters. Ideally we would like to take higher order information into account and make use of matrix or even tensor parameters. However the resulting algorithms are usually inefficient. Here we address on-line learning with matrix parameters. It is often easy to obtain online algorithm with good generalization performance if you eigendecompose the current parameter matrix in each trial (at a cost of $O(n^3)$ per trial). Ideally we want to avoid the decompositions and spend $O(n^2)$ per trial, i.e. linear time in the size of the matrix data. There is a core trade-off between the running time and the generalization performance, here measured by the regret of the on-line algorithm (total gain of the best off-line predictor minus the total gain of the on-line algorithm). We focus on the key matrix problem of rank $k$ Principal Component Analysis in $\mathbb{R}^n$ where $k \ll n$. There are $O(n^3)$ algorithms that achieve the optimum regret but require eigendecompositions. We develop a simple algorithm that needs $O(kn^2)$ per trial whose regret is off by a small factor of $O(n^{1/4})$. The algorithm is based on the Follow the Perturbed Leader paradigm. It replaces full eigendecompositions at each trial by the problem finding $k$ principal components of the current covariance matrix that is perturbed by Gaussian noise.

Figure 1: A two-dimensional convex function represented via contour lines. The function value is constant on the boundary of each such ellipse, and decreases as the ellipse becomes smaller and smaller. Let us assume we want to minimize this function starting from a point \(A\). The red line shows the path followed by a gradient descent optimizer converging to the minimum point \(B\), while the green dashed line represents the direct line joining \(A\) and \(B\). In today's post, we will discuss an interesting property concerning the trajectory of gradient descent iterates, namely the length of the Gradient Descent curve.

To kick off this series, will start with something simple yet foundational: linear regression via ordinary least squares. While not exciting, linear regression finds widespread use both as a standalone learning algorithm and as a building block in more advanced learning algorithms. The output layer of a deep neural network trained for regression with MSE loss, simple AR time series models, and the "local regression" part of LOWESS smoothing are all examples of linear regression being used as an ingredient in a more sophisticated model. Linear regression is also the "simple harmonic oscillator" of machine learning; that is to say, a pedagogical example that allows us to present deep theoretical ideas about machine learning in a context that is not too mathematically taxing. There is also the small matter of it being the most widely used supervised learning algorithm in the world; although how much weight that carries I suppose depends on where you are on the "applied" to "theoretical" spectrum. However, since I can already feel your eyes glazing over from such an introductory topic, we can spice things up a little bit by doing something which isn't often done in introductory machine learning - we can present the algorithm that [your favorite statistical software here] actually uses to fit linear regression models: QR decomposition.