Mathematical & Statistical Methods
Stochastic modified equations and adaptive stochastic gradient algorithms
Li, Qianxiao, Tai, Cheng, E, Weinan
We develop the method of stochastic modified equations (SME), in which stochastic gradient algorithms are approximated in the weak sense by continuous-time stochastic differential equations. We exploit the continuous formulation together with optimal control theory to derive novel adaptive hyper-parameter adjustment policies. Our algorithms have competitive performance with the added benefit of being robust to varying models and datasets. This provides a general methodology for the analysis and design of stochastic gradient algorithms.
SciPy Cheat Sheet: Linear Algebra in Python
By now, you will have already learned that NumPy, one of the fundamental packages for scientific computing, forms at least in part the foundation of other important packages that you might use for data manipulation and machine learning with Python. One of those packages is SciPy, another core package for scientific computing in Python, which provides mathematical algorithms and convenience functions built on the NumPy extension of Python. You might now wonder why this library might come in handy for data science. Well, SciPy has many modules that will help you to understand some of the basic components that you need to master when you're learning data science, namely math, stats and machine learning.
Four Weird Mathematical Objects
Here I discuss four interesting mathematical problems (mostly involving famous unsolved conjectures) that are of considerable interest and that even high school kids can understand. For the data scientist, they give a unique opportunity to test various techniques to either disprove or make progress on these problems. The field itself has been a source of constant innovation -- especially to develop distributed architectures, as well as HPC (high performance computing) and quantum computing to try to solve (to no avail so far) these very difficult yet basic problems. And the data sets involved in these problems are incredibly massive and entirely free: they consist of all the integers and real numbers! The first two problems have been addressed on Data Science Central (DSC) before; the other two are presented here on DSC for the first time.
A Maximum Matching Algorithm for Basis Selection in Spectral Learning
Quattoni, Ariadna, Carreras, Xavier, Gallé, Matthias
We present a solution to scale spectral algorithms for learning sequence functions. We are interested in the case where these functions are sparse (that is, for most sequences they return 0). Spectral algorithms reduce the learning problem to the task of computing an SVD decomposition over a special type of matrix called the Hankel matrix. This matrix is designed to capture the relevant statistics of the training sequences. What is crucial is that to capture long range dependencies we must consider very large Hankel matrices. Thus the computation of the SVD becomes a critical bottleneck. Our solution finds a subset of rows and columns of the Hankel that realizes a compact and informative Hankel submatrix. The novelty lies in the way that this subset is selected: we exploit a maximal bipartite matching combinatorial algorithm to look for a sub-block with full structural rank, and show how computation of this sub-block can be further improved by exploiting the specific structure of Hankel matrices.
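The two ingredients named in the abstract, a Hankel matrix of a sparse sequence function and a bipartite matching over its non-zero entries, can be illustrated with a minimal sketch. This is not the authors' procedure: the toy function `f`, the prefix and suffix lists, and the helper names `hankel` and `maximum_matching` are all assumptions made for illustration. It relies only on the standard fact that the size of a maximum matching between rows and columns over the non-zero pattern equals the structural rank of the matrix.

```python
def hankel(f, prefixes, suffixes):
    """Hankel block with H[p][s] = f(p + s); sequences missing from f map to 0."""
    return [[f.get(p + s, 0.0) for s in suffixes] for p in prefixes]

def maximum_matching(H):
    """Maximum bipartite matching (augmenting paths) between rows and columns.

    An edge (i, j) exists when H[i][j] is non-zero. Returns (row, column) pairs.
    """
    n_rows = len(H)
    n_cols = len(H[0]) if H else 0
    match_col = [-1] * n_cols  # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n_cols):
            if H[i][j] != 0 and not seen[j]:
                seen[j] = True
                # Take a free column, or re-route the row that holds it.
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n_rows):
        try_row(i, [False] * n_cols)
    return [(match_col[j], j) for j in range(n_cols) if match_col[j] != -1]

# Hypothetical sparse sequence function: most strings return 0.
f = {"a": 0.3, "ab": 0.5, "ba": 0.2}
prefixes = ["", "a", "b"]
suffixes = ["", "a", "b"]

H = hankel(f, prefixes, suffixes)
matching = maximum_matching(H)
structural_rank = len(matching)
```

The matched rows and columns index a square sub-block with full structural rank. In this toy example the prefixes "" and "b" both have non-zero entries only in the column for suffix "a", so at most one of them can be matched.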
19 MOOCs on Maths & Statistics for Data Science & Machine Learning
This is an interesting course on applications of linear algebra in data science. The course will first take you through the fundamentals of linear algebra. Then, it will introduce you to applications of linear algebra such as recognizing handwritten numbers and ranking sports teams, along with online code. The course is open for enrollment.
Clojure Linear Algebra Refresher (2) - Eigenvalues and Eigenvectors
If there are a scalar \(\lambda\) and a non-zero vector \(\mathbf{x}\) such that \(A\mathbf{x} = \lambda\mathbf{x}\), we call the scalar an eigenvalue and the vector an eigenvector. There can be more than one eigenvalue for a given matrix, and there is an infinite number of eigenvectors corresponding to one eigenvalue. All eigenvectors that correspond to one eigenvalue lie on the same line, but have different magnitudes. Seems simple, and it is, but so what? It looks like a trivial thing; how come these eigenvectors and eigenvalues are so ubiquitous in linear algebra?
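The definition can be checked numerically with a minimal sketch. The original article works in Clojure; the version below uses plain Python instead, and the helper names (`mat_vec`, `is_eigenpair`, `power_iteration`) and the 2x2 example matrix are assumptions chosen for illustration. Power iteration is used here only as one simple way to find the dominant eigenpair of a small symmetric matrix.

```python
import math

def mat_vec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def is_eigenpair(A, lam, x, tol=1e-9):
    """Check that x is non-zero and A x = lam x holds entrywise."""
    if all(abs(v) < tol for v in x):
        return False
    return all(abs(ax - lam * v) < tol for ax, v in zip(mat_vec(A, x), x))

def power_iteration(A, start, iterations=100):
    """Approximate the dominant eigenvalue and a unit eigenvector."""
    x = list(start)
    for _ in range(iterations):
        y = mat_vec(A, x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient; x has unit length, so no division is needed.
    lam = sum(a * v for a, v in zip(mat_vec(A, x), x))
    return lam, x

A = [[2.0, 1.0], [1.0, 2.0]]  # example matrix with eigenvalues 3 and 1
lam, x = power_iteration(A, start=[1.0, 0.0])

# Any non-zero multiple of an eigenvector is an eigenvector for the same eigenvalue.
scaled = [5.0 * v for v in x]
print(is_eigenpair(A, lam, x), is_eigenpair(A, lam, scaled))
```

The last line shows why one eigenvalue has infinitely many eigenvectors: rescaling `x` changes its magnitude but not the relation \(A\mathbf{x} = \lambda\mathbf{x}\).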
Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory
Richtárik, Peter, Takáč, Martin
We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact. Further, we propose and analyze three stochastic algorithms for solving the reformulated problem---basic, parallel and accelerated methods---with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.
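As a concrete illustration of a stochastic projection method with a fixed relaxation parameter, here is a minimal sketch of the randomized Kaczmarz iteration. Treating it as a stand-in for the basic method is an assumption on our part: it corresponds to taking the identity matrix as the norm-defining matrix and a distribution over single rows sampled with probability proportional to their squared norms. The function name, the parameter `omega`, and the small example system are hypothetical.

```python
import random

def randomized_kaczmarz(A, b, steps=2000, omega=1.0, seed=0):
    """Solve a consistent system A x = b by random row projections.

    Each step samples row i with probability proportional to ||a_i||^2 and
    moves x toward the hyperplane a_i . x = b_i, scaled by the fixed
    relaxation parameter omega (omega = 1 is an exact projection).
    """
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    weights = [sum(v * v for v in row) for row in A]
    rows = list(range(len(A)))
    for _ in range(steps):
        i = rng.choices(rows, weights=weights)[0]
        residual = sum(a * xi for a, xi in zip(A[i], x)) - b[i]
        scale = omega * residual / weights[i]
        x = [xi - scale * a for xi, a in zip(x, A[i])]
    return x

# Consistent overdetermined example whose exact solution is (1, 2).
A_sys = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
b_sys = [4.0, 7.0, -1.0]
solution = randomized_kaczmarz(A_sys, b_sys)
```

Each update touches a single row of the system, which is what makes such methods attractive when the full matrix is too large to factor.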
GAN and VAE from an Optimal Transport Point of View
Genevay, Aude, Peyré, Gabriel, Cuturi, Marco
This short article revisits some of the ideas introduced in [1] and [4] in a simple setup. "pushes forward" each elementary mass of a measure ζ in P(Z) by applying the map g to obtain then a mass in X, to build on aggregate a Because (1) is a linear program, it has a dual formulation, known as the Kantorovich problem [13, Thm. A key remark in Kantorovich's formulation is to notice that the cost of any pair (h, h) can always be improved by replacing h in (2) by the c-transform h As a side-note, and as previously commented in the literature, there is at this point no empirical evidence that supports the idea that using discriminative deep networks that way can result in accurate approximations of Wasserstein distances. These alternative formulations provide instead a very useful proxy for a quantity directly related to the Wasserstein distance. This is advantageous because now π is defined over Z × X, which is lower-dimensional than X × X, and also because, as in Equation (2), θ does not appear in the constraints either.
Sample complexity of population recovery
Polyanskiy, Yury, Suresh, Ananda Theertha, Wu, Yihong
The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability $\epsilon$; (b) in noisy population recovery, a pollee may lie on each question with probability $\epsilon$. Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $\delta$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde\Theta(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved and the result depends at most on the logarithm of the dimension. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $\epsilon$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be more sensitive to dimension and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$ except for the trivial cases of $\epsilon=0,1/2$ or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
Krylov Subspace Recycling for Fast Iterative Least-Squares in Machine Learning
de Roos, Filip, Hennig, Philipp
Solving symmetric positive definite linear problems is a fundamental computational task in machine learning. The exact solution, famously, is cubically expensive in the size of the matrix. To alleviate this problem, several linear-time approximations, such as spectral and inducing-point methods, have been suggested and are now in wide use. These are low-rank approximations that choose the low-rank space a priori and do not refine it over time. While this allows linear cost in the data-set size, it also causes a finite, uncorrected approximation error. Authors from numerical linear algebra have explored ways to iteratively refine such low-rank approximations, at a cost of a small number of matrix-vector multiplications. This idea is particularly interesting in the many situations in machine learning where one has to solve a sequence of related symmetric positive definite linear problems. From the machine learning perspective, such deflation methods can be interpreted as transfer learning of a low-rank approximation across a time-series of numerical tasks. We study the use of such methods for our field. Our empirical results show that, on regression and classification problems of intermediate size, this approach can interpolate between low computational cost and numerical precision.
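For readers unfamiliar with Krylov methods, the following minimal sketch shows plain conjugate gradients, the standard Krylov subspace solver for symmetric positive definite systems, which accesses the matrix only through matrix-vector products. It does not implement the recycling or deflation step the paper studies. The function names, the tolerance, and the 2x2 example system are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(matvec, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only x -> A x.

    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    r = [bi - ai for bi, ai in zip(b, matvec(x))]  # initial residual
    p = list(r)                                    # initial search direction
    rs_old = dot(r, r)
    max_iter = max_iter if max_iter is not None else 10 * n
    iterations = 0
    while rs_old ** 0.5 > tol and iterations < max_iter:
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations

A_spd = [[4.0, 1.0], [1.0, 3.0]]

def matvec(v):
    return [dot(row, v) for row in A_spd]

x_cg, n_iter = conjugate_gradient(matvec, [1.0, 2.0])
```

Passing a previous solution as `x0` when solving a nearby system is a simple warm start. It is much weaker than subspace recycling, but it shows where information from earlier solves could enter.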