AITopics | Mathematical & Statistical Methods

Mathematical & Statistical Methods

rTop-k: A Statistical Estimation Approach to Distributed SGD

Barnes, Leighton Pate, Inan, Huseyin A., Isik, Berivan, Ozgur, Ayfer

arXiv.org Machine Learning, Dec-2-2020

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.
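The abstract above describes the new sparsifier only as a concatenation of random-k and top-k. As a minimal sketch, and assuming the order is "keep the r largest-magnitude coordinates, then pick k of those uniformly at random" (the abstract does not state the order), the idea might look like the following. The function name `rtop_k`, its parameters, and the toy gradient are hypothetical illustrations, and the sketch omits any rescaling or error feedback a real training loop might use.

```python
import random


def rtop_k(grad, r, k, rng):
    """Hypothetical combined sparsifier (assumed order: top-r, then random-k).

    Keeps k coordinates chosen uniformly at random from the r
    largest-magnitude coordinates of grad; all other entries become 0.0.
    """
    if not 0 < k <= r <= len(grad):
        raise ValueError("need 0 < k <= r <= len(grad)")
    # Top-r step: indices of the r largest-magnitude coordinates.
    by_magnitude = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    top_r = by_magnitude[:r]
    # Random-k step: sample k of those r indices without replacement.
    chosen = rng.sample(top_r, k)
    sparse = [0.0] * len(grad)
    for i in chosen:
        sparse[i] = grad[i]
    return sparse


# Toy gradient; the four largest magnitudes sit at indices 1, 3, 6 and 4.
grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0, 0.9, -0.02]
rng = random.Random(0)
sparse = rtop_k(grad, r=4, k=2, rng=rng)
```

Under this assumed ordering, setting r = k reduces the sketch to plain top-k, and setting r to the gradient length reduces it to plain random-k.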

experiment, gradient, node, (14 more...)

arXiv.org Machine Learning

2005.10761

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.70)

New Algorithms And Fast Implementations To Approximate Stochastic Processes

Kirui, Kipngeno Benard, Pflug, Georg Ch., Pichler, Alois

arXiv.org Machine Learning, Dec-1-2020

We present new algorithms and fast implementations to find efficient approximations for modelling stochastic processes. For many numerical computations it is essential to develop finite approximations for stochastic processes. While the goal is always to find a finite model that represents the given knowledge about the real data process as accurately as possible, the ways of estimating the discrete approximating model may be quite different: (i) if the stochastic model is known, e.g., as the solution of a stochastic differential equation, one may generate the scenario tree directly from the specified model; (ii) if a simulation algorithm is available that allows simulating trajectories from all conditional distributions, a scenario tree can be generated by stochastic approximation; (iii) if only some observed trajectories of the scenario process are available, the construction of the approximating process can be based on non-parametric conditional density estimates.

approximation, scenario tree, trajectory, (14 more...)

arXiv.org Machine Learning

2012.01185

Country:

North America > United States > New York (0.04)
Europe > Germany (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
Europe > Austria > Vienna (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Energy (1.00)
Banking & Finance > Trading (1.00)

Technology:

Information Technology > Mathematics of Computing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Soft-Robust Algorithms for Handling Model Misspecification

Lobo, Elita A., Ghavamzadeh, Mohammad, Petrik, Marek

arXiv.org Machine Learning, Nov-29-2020

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these shortcomings, we study the soft-robust criterion, which uses risk measures to better balance the mean and percentile criteria. In this paper, we establish the soft-robust criterion's fundamental properties, show that it is NP-hard to optimize, and propose and analyze two algorithms to optimize it approximately. Our theoretical analyses and empirical evaluations demonstrate that our algorithms compute much less conservative solutions than the existing approximate methods for optimizing the percentile criterion.

ambiguity, criterion, soft-robust criterion, (14 more...)

arXiv.org Machine Learning

2011.14495

Country:

North America > United States > New Hampshire (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)

Riemannian Gaussian distributions, random matrix ensembles and diffusion kernels

Santilli, Leonardo, Tierz, Miguel

arXiv.org Machine Learning, Nov-27-2020

We show that the Riemannian Gaussian distributions on symmetric spaces, introduced in recent years, are of standard random matrix type. We exploit this to compute analytically marginals of the probability density functions. This can be done fully, using Stieltjes-Wigert orthogonal polynomials, for the case of the space of Hermitian matrices, where the distributions have already appeared in the physics literature. For the case when the symmetric space is the space of $m \times m$ symmetric positive definite matrices, we show how to efficiently compute by evaluating Pfaffians at specific values of $m$. Equivalently, we can obtain the same result by constructing specific skew orthogonal polynomials with respect to the log-normal weight function (skew Stieltjes-Wigert polynomials). Other symmetric spaces are studied and the same type of result is obtained for the quaternionic case. Moreover, we show how the probability density functions are a particular case of diffusion reproducing kernels of the Karlin-McGregor type, describing non-intersecting Brownian motions, which are also diffusion processes in the Weyl chamber of Lie groups.

matrix, survey article, upstream oil & gas, (18 more...)

arXiv.org Machine Learning

2011.1368

Country:

North America > United States (0.67)
Europe > Portugal (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report (0.63)

Industry: Energy > Oil & Gas > Upstream (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.46)

Machine Learning for Signal Processing: Data Science, Algorithms, and Computational Statistics: Little, Max A.: 9780198714934: Amazon.com: Books

#artificialintelligence, Nov-22-2020, 21:25:38 GMT

"This book provides an excellent pathway for gaining first-class expertise in machine learning. It provides both the technical background that explains why certain approaches, but not others, are best practice in real world problems, and a framework for how to think about and approach new problems. I highly recommend it for people with a signal processing background who are seeking to become an expert in machine learning." With this book, Prof. Little has taken an important step in unifying machine learning and signal processing. As a whole, this book covers many topics, new and old, that are important in their own right and equips the reader with a broader perspective than traditional signal processing textbooks.

computational statistic, machine learning, signal processing, (10 more...)

#artificialintelligence

Country:

North America > United States > Massachusetts (0.07)
Europe > Denmark > North Jutland > Aalborg (0.07)

Genre: Summary/Review (0.83)

Industry:

Retail > Online (0.40)
Health & Medicine (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)

Autonomous learning of nonlocal stochastic neuron dynamics

Maltba, Tyler E., Zhao, Hongli, Tartakovsky, Daniel M.

arXiv.org Machine Learning, Nov-22-2020

Neuronal dynamics is driven by externally imposed or internally generated random excitations/noise, and is often described by systems of stochastic ordinary differential equations. A solution to these equations is the joint probability density function (PDF) of neuron states. It can be used to calculate such information-theoretic quantities as the mutual information between the stochastic stimulus and various internal states of the neuron (e.g., membrane potential), as well as various spiking statistics. When random excitations are modeled as Gaussian white noise, the joint PDF of neuron states satisfies exactly a Fokker-Planck equation. However, most biologically plausible noise sources are correlated (colored). In this case, the resulting PDF equations require a closure approximation. We propose two methods for closing such equations: a modified nonlocal large-eddy-diffusivity closure and a data-driven closure relying on sparse regression to learn relevant features. The closures are tested for stochastic leaky integrate-and-fire (LIF) and FitzHugh-Nagumo (FHN) neurons driven by sine-Wiener noise. Mutual information and total correlation between the random stimulus and the internal states of the neuron are calculated for the FHN neuron.

closure, health & medicine, upstream oil & gas, (20 more...)

arXiv.org Machine Learning

2011.10955

Country:

Europe > United Kingdom (0.14)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Industry:

Energy > Oil & Gas > Upstream (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.45)

Sparse sketches with small inversion bias

Dereziński, Michał, Liao, Zhenyu, Dobriban, Edgar, Mahoney, Michael W.

arXiv.org Machine Learning, Nov-20-2020

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, e.g., in statistics and distributed optimization, when averaging multiple independently constructed estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, then after simple rescaling, the estimator $(\frac m{m-d}\tilde A^\top\tilde A)^{-1}$ is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m=O(d+\sqrt d/\epsilon)$. This implies that for $m=O(d)$, the inversion bias of this estimator is $O(1/\sqrt d)$, which is much smaller than the $\Theta(1)$ approximation error obtained as a consequence of the subspace embedding guarantee for sub-gaussian sketches. We then propose a new sketching technique, called LEverage Score Sparsified (LESS) embeddings, which uses ideas from both data-oblivious sparse embeddings as well as data-aware leverage-based row sampling methods, to get $\epsilon$ inversion bias for sketch size $m=O(d\log d+\sqrt d/\epsilon)$ in time $O(\text{nnz}(A)\log n+md^2)$, where nnz is the number of non-zeros. The key techniques enabling our analysis include an extension of a classical inequality of Bai and Silverstein for random quadratic forms, which we call the Restricted Bai-Silverstein inequality; and anti-concentration of the Binomial distribution via the Paley-Zygmund inequality, which we use to prove a lower bound showing that leverage score sampling sketches generally do not achieve small inversion bias.
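The inversion bias described above can be illustrated numerically in the simplest possible setting. The sketch below is a toy, not the paper's experiment: it assumes $d=1$ (so $A$ is a single column and $\tilde A^\top\tilde A$ is a scalar), a dense Gaussian sketch with i.i.d. $N(0, 1/m)$ entries (a scaling chosen here so that $E[S^\top S]=I$), and hypothetical helper names. It averages many independent sketched inverses and compares the plain average with the average after the $\frac{m}{m-d}$ rescaling quoted in the abstract.

```python
import random


def sketched_inverse(a, m, rng):
    """Return 1 / (a~^T a~) with a~ = S a, for an m x n Gaussian sketch S.

    Assumption: entries of S are i.i.d. N(0, 1/m), so E[S^T S] = I.
    """
    std = 1.0 / m ** 0.5
    # Each entry of S a is a fresh Gaussian combination of the entries of a.
    sa = [sum(rng.gauss(0.0, std) * aj for aj in a) for _ in range(m)]
    return 1.0 / sum(x * x for x in sa)


def average_estimates(a, m, trials, seed=0):
    """Monte Carlo averages of the plain and rescaled sketched inverses."""
    rng = random.Random(seed)
    d = 1  # a is a single column in this toy setting
    exact = 1.0 / sum(aj * aj for aj in a)
    naive = sum(sketched_inverse(a, m, rng) for _ in range(trials)) / trials
    # ((m / (m - d)) * a~^T a~)^{-1} equals ((m - d) / m) * (a~^T a~)^{-1}.
    rescaled = naive * (m - d) / m
    return exact, naive, rescaled


exact, naive, rescaled = average_estimates([1.0, 2.0, 2.0], m=10, trials=5000)
print(f"exact={exact:.4f} naive={naive:.4f} rescaled={rescaled:.4f}")
```

In this toy case the plain average overshoots the exact inverse, and the rescaled average lands closer to it without matching it exactly, which is in line with the abstract's notion of an approximately unbiased estimator rather than an exactly unbiased one.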

inversion bias, matrix, sketch, (17 more...)

arXiv.org Machine Learning

2011.10695

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Pennsylvania (0.04)
(6 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science (0.92)

How Goodhart's Law Can Save Machine Learning Research

#artificialintelligence, Nov-17-2020, 06:40:24 GMT

"When a measure becomes a target, it ceases to be a good measure." Stochastic Gradient Descent (SGD) has been responsible for many of the most outstanding achievements in machine learning. The objective of SGD is to optimise a target in the form of a loss function. But SGD fails in finding 'standard' loss functions in a few settings as it converges to the 'easy' solutions. As we see above, when classifying sheep, the network learns to use the green background to identify the sheep present.

algorithm, goodhart, save machine learning research, (12 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.59)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.36)

Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence

Luo, Yuetian, Huang, Wen, Li, Xudong, Zhang, Anru R.

arXiv.org Machine Learning, Nov-16-2020

In this paper, we propose a new Recursive Importance Sketching algorithm for Rank constrained least squares Optimization (RISRO). As its name suggests, the algorithm is based on a new sketching framework, recursive importance sketching. Several existing algorithms in the literature can be reinterpreted under the new sketching framework and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is only solving a dimension reduced least squares problem. Different from numerous existing algorithms with locally geometric convergence rate, we establish the local quadratic-linear and quadratic rate of convergence for RISRO under some mild conditions. In addition, we discover a deep connection of RISRO to Riemannian manifold optimization on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.

algorithm, phase retrieval, risro, (13 more...)

arXiv.org Machine Learning

2011.0836

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Fujian Province > Xiamen (0.04)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.67)

The Roadmap of Mathematics for Deep Learning

#artificialintelligence, Nov-15-2020, 21:45:59 GMT

Knowing the mathematics behind machine learning algorithms is a superpower. If you have ever built a model for a real-life problem, you probably experienced that being familiar with the details can go a long way if you want to move beyond baseline performance. This is especially true when you want to push the boundaries of state of the art. However, most of this knowledge is hidden behind layers of advanced mathematics. Understanding methods like stochastic gradient descent might seem difficult since it is built on top of multivariable calculus and probability theory.

deep learning, mathematics, roadmap, (3 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.44)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)
