Goto

Collaborating Authors

 Mathematical & Statistical Methods


rTop-k: A Statistical Estimation Approach to Distributed SGD

arXiv.org Machine Learning

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.


New Algorithms And Fast Implementations To Approximate Stochastic Processes

arXiv.org Machine Learning

We present new algorithms and fast implementations to find efficient approximations for modelling stochastic processes. For many numerical computations it is essential to develop finite approximations for stochastic processes. While the goal is always to find a finite model, which represents a given knowledge about the real data process as accurate as possible, the ways of estimating the discrete approximating model may be quite different: (i) if the stochastic model is known as a solution of a stochastic differential equation, e.g., one may generate the scenario tree directly from the specified model; (ii) if a simulation algorithm is available, which allows simulating trajectories from all conditional distributions, a scenario tree can be generated by stochastic approximation; (iii) if only some observed trajectories of the scenario process are available, the construction of the approximating process can be based on non-parametric conditional density estimates.


Soft-Robust Algorithms for Handling Model Misspecification

arXiv.org Machine Learning

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these shortcomings, we study the soft-robust criterion, which uses risk measures to balance the mean and percentile criteria better. In this paper, we establish the soft-robust criterion's fundamental properties, show that it is NP-hard to optimize, and propose and analyze two algorithms to optimize it approximately. Our theoretical analyses and empirical evaluations demonstrate that our algorithms compute much less conservative solutions than the existing approximate methods for optimizing the percentile-criterion.


Riemannian Gaussian distributions, random matrix ensembles and diffusion kernels

arXiv.org Machine Learning

We show that the Riemannian Gaussian distributions on symmetric spaces, introduced in recent years, are of standard random matrix type. We exploit this to compute analytically marginals of the probability density functions. This can be done fully, using Stieltjes-Wigert orthogonal polynomials, for the case of the space of Hermitian matrices, where the distributions have already appeared in the physics literature. For the case when the symmetric space is the space of $m \times m$ symmetric positive definite matrices, we show how to efficiently compute by evaluating Pfaffians at specific values of $m$. Equivalently, we can obtain the same result by constructing specific skew orthogonal polynomials with regards to the log-normal weight function (skew Stieltjes-Wigert polynomials). Other symmetric spaces are studied and the same type of result is obtained for the quaternionic case. Moreover, we show how the probability density functions are a particular case of diffusion reproducing kernels of the Karlin-McGregor type, describing non-intersecting Brownian motions, which are also diffusion processes in the Weyl chamber of Lie groups.


Machine Learning for Signal Processing: Data Science, Algorithms, and Computational Statistics: Little, Max A.: 9780198714934: Amazon.com: Books

#artificialintelligence

"This book provides an excellent pathway for gaining first-class expertise in machine learning. It provides both the technical background that explains why certain approaches, but not others, are best practice in real world problems, and a framework for how to think about and approach new problems. I highly recommend it for people with a signal processing background who are seeking to become an expert in machine learning." With this book, Prof. Little has taken an important step in unifying ร‚machine learning and signal processing. As a whole, this book covers many topics, new and old, that are important in their own right and equips the reader with a broader perspective than traditional signal processing textbooks.


Autonomous learning of nonlocal stochastic neuron dynamics

arXiv.org Machine Learning

Neuronal dynamics is driven by externally imposed or internally generated random excitations/noise, and is often described by systems of stochastic ordinary differential equations. A solution to these equations is the joint probability density function (PDF) of neuron states. It can be used to calculate such information-theoretic quantities as the mutual information between the stochastic stimulus and various internal states of the neuron (e.g., membrane potential), as well as various spiking statistics. When random excitations are modeled as Gaussian white noise, the joint PDF of neuron states satisfies exactly a Fokker-Planck equation. However, most biologically plausible noise sources are correlated (colored). In this case, the resulting PDF equations require a closure approximation. We propose two methods for closing such equations: a modified nonlocal large-eddy-diffusivity closure and a data-driven closure relying on sparse regression to learn relevant features. The closures are tested for stochastic leaky integrate-and-fire (LIF) and FitzHugh-Nagumo (FHN) neurons driven by sine-Wiener noise. Mutual information and total correlation between the random stimulus and the internal states of the neuron are calculated for the FHN neuron.


Sparse sketches with small inversion bias

arXiv.org Machine Learning

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, e.g., in statistics and distributed optimization, when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, then after simple rescaling, the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error obtained as a consequence of the subspace embedding guarantee for sub-gaussian sketches. We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which uses ideas from both data-oblivious sparse embeddings as well as data-aware leverage-based row sampling methods, to get $\epsilon$ inversion bias for sketch size $m=O(d\log d+\sqrt d/\epsilon)$ in time $O(\text{nnz}(A)\log n+md^2)$, where nnz is the number of non-zeros. The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality; and anti-concentration of the Binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.


How Goodhart's Law Can Save Machine Learning Research

#artificialintelligence

"When a measure becomes a target, it ceases to be a good measure." Stochastic Gradient Descent (SGD) has been responsible for many of the most outstanding achievements in machine learning. The objective of SGD is to optimise a target in the form of a loss function. But SGD fails in finding'standard' loss functions in a few settings as it converges to the'easy' solutions. As we see above, when classifying sheep, the network learns to use the green background to identify the sheep present.


Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence

arXiv.org Machine Learning

In this paper, we propose a new {\it \underline{R}ecursive} {\it \underline{I}mportance} {\it \underline{S}ketching} algorithm for {\it \underline{R}ank} constrained least squares {\it \underline{O}ptimization} (RISRO). As its name suggests, the algorithm is based on a new sketching framework, recursive importance sketching. Several existing algorithms in the literature can be reinterpreted under the new sketching framework and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is only solving a dimension reduced least squares problem. Different from numerous existing algorithms with locally geometric convergence rate, we establish the local quadratic-linear and quadratic rate of convergence for RISRO under some mild conditions. In addition, we discover a deep connection of RISRO to Riemannian manifold optimization on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.


The Roadmap of Mathematics for Deep Learning

#artificialintelligence

Knowing the mathematics behind machine learning algorithms is a superpower. If you have ever built a model for a real-life problem, you probably experienced that being familiar with the details can go a long way if you want to move beyond baseline performance. This is especially true when you want to push the boundaries of state of the art. However, most of this knowledge is hidden behind layers of advanced mathematics. Understanding methods like stochastic gradient descent might seem difficult since it is built on top of multivariable calculus and probability theory.