 Mathematical & Statistical Methods


Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

Communications of the ACM

Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplication operations that need to be parallelized across multiple nodes. The presence of straggling nodes, that is, computing nodes that unpredictably slow down or fail, is a major bottleneck in such distributed computations. Ideal load balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds as well as the ability to quickly move data. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they ignore partial work done by straggling nodes, thus resulting in a lot of redundant computation. We propose a rateless fountain coding strategy that achieves the best of both worlds: we prove that its latency is asymptotically equal to ideal load balancing, and it performs asymptotically zero redundant computations. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively finished by the nodes. Evaluation on parallel and distributed computing yields up to a threefold speedup over uncoded schemes. Matrix-vector multiplications form the core of a plethora of scientific computing and machine learning applications, including solving partial differential equations, forward and back propagation in neural networks, and computing the PageRank of graphs. In the age of Big Data, most of these applications involve multiplying extremely large matrices and vectors, and the computations cannot be performed efficiently on a single machine. This has motivated the development of several algorithms that seek to speed up matrix-vector multiplication by distributing the computation across multiple computing nodes.
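The encode, compute, decode idea in the abstract can be illustrated with a small single-process sketch. It is not the authors' fountain code: it swaps in dense random integer combinations and decodes by Gauss-Jordan elimination over the rationals, which is an assumption made only to keep the example short. The names `encode_rows`, `solve`, and `coded_matvec` are hypothetical, and worker completion order is simulated by a shuffle.

```python
from fractions import Fraction
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def encode_rows(A, num_encoded, rng):
    """Build encoded rows as random integer combinations of the rows of A.

    Returns (coeffs, encoded), where encoded[k] = sum_i coeffs[k][i] * A[i].
    Dense random coefficients are an assumption of this sketch.
    """
    m, n = len(A), len(A[0])
    coeffs = [[rng.randint(-3, 3) for _ in range(m)] for _ in range(num_encoded)]
    encoded = [[sum(c[i] * A[i][j] for i in range(m)) for j in range(n)]
               for c in coeffs]
    return coeffs, encoded


def solve(G, y):
    """Solve G b = y exactly; return None if G does not have full column rank."""
    m = len(G[0])
    M = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(G, y)]
    pivot_row = 0
    for col in range(m):
        pr = next((r for r in range(pivot_row, len(M)) if M[r][col] != 0), None)
        if pr is None:
            return None  # not decodable yet, more products are needed
        M[pivot_row], M[pr] = M[pr], M[pivot_row]
        pv = M[pivot_row][col]
        M[pivot_row] = [v / pv for v in M[pivot_row]]
        for r in range(len(M)):
            if r != pivot_row and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[pivot_row])]
        pivot_row += 1
    return [M[i][m] for i in range(m)]


def coded_matvec(A, x, num_encoded=None, seed=0):
    """Recover A @ x from encoded row-vector products as they 'finish'.

    Returns (product, number_of_encoded_products_used).
    """
    rng = random.Random(seed)
    m = len(A)
    num_encoded = num_encoded or 3 * m
    coeffs, encoded = encode_rows(A, num_encoded, rng)
    order = list(range(num_encoded))
    rng.shuffle(order)  # stand-in for unpredictable worker speeds
    G, y = [], []
    for idx in order:
        G.append(coeffs[idx])
        y.append(dot(encoded[idx], x))  # one worker's finished product
        if len(G) >= m:
            b = solve(G, y)
            if b is not None:
                return b, len(G)
    raise ValueError("not enough independent encoded rows to decode")
```

Calling `coded_matvec([[1, 2], [3, 4], [5, 6]], [1, 1])` returns the product as exact fractions together with the number of encoded products that were needed. Because any sufficiently large set of finished products can be decoded, no single slow worker blocks the result.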


Linear Algebra: Linear combination of Vectors - Master Data Science

#artificialintelligence

Highlights: In this post we continue our story about vectors. We will talk more about basis vectors, linear combinations of vectors, and the span of vectors. We provide code examples to demonstrate how to work with vectors in Python. Let's talk about vectors in more detail. Vectors are related to pairs of numbers that we call coordinates.
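As a minimal sketch of the ideas named above (not the post's own code), a linear combination scales each vector and adds the results, and two 2D vectors span the whole plane exactly when they are not parallel. The function names `linear_combination` and `spans_plane` are hypothetical.

```python
def linear_combination(scalars, vectors):
    """Return sum_k scalars[k] * vectors[k], computed component by component."""
    dim = len(vectors[0])
    return [sum(c * v[i] for c, v in zip(scalars, vectors)) for i in range(dim)]


def spans_plane(v, w):
    """Two 2D vectors span the plane iff their determinant is nonzero."""
    return v[0] * w[1] - v[1] * w[0] != 0


# Standard basis vectors of the plane.
i_hat, j_hat = [1, 0], [0, 1]

# The coordinates (3, -2) are the scalars of a combination of the basis vectors.
print(linear_combination([3, -2], [i_hat, j_hat]))  # [3, -2]
print(spans_plane([1, 2], [2, 4]))                  # False: the vectors are parallel
```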


Local optimisation of Nyström samples through stochastic gradient descent

arXiv.org Machine Learning

We study a relaxed version of the column-sampling problem for the Nyström approximation of kernel matrices, where approximations are defined from multisets of landmark points in the ambient space; such multisets are referred to as Nyström samples. We consider an unweighted variation of the radial squared-kernel discrepancy (SKD) criterion as a surrogate for the classical criteria used to assess the Nyström approximation accuracy; in this setting, we discuss how Nyström samples can be efficiently optimised through stochastic gradient descent. We perform numerical experiments which demonstrate that the local minimisation of the radial SKD yields Nyström samples with improved Nyström approximation accuracy.


Bounds on Wasserstein distances between continuous distributions using independent samples

arXiv.org Machine Learning

The plug-in estimator of the Wasserstein distance is known to be conservative; however, its usefulness is severely limited when the distributions are similar, as its bias does not decay to zero with the true Wasserstein distance. We propose a linear combination of plug-in estimators for the squared 2-Wasserstein distance with a reduced bias that decays to zero with the true distance. The new estimator is provably conservative provided one distribution is appropriately overdispersed with respect to the other, and is unbiased when the distributions are equal. We apply it to approximately bound from above the 2-Wasserstein distance between the target and current distribution in Markov chain Monte Carlo, running multiple identically distributed chains which start, and remain, overdispersed with respect to the target. Our bound consistently outperforms the current state-of-the-art bound, which uses coupling, improving mixing time bounds by up to an order of magnitude.


Top 3 Free Resources to Learn Linear Algebra for Machine Learning - KDnuggets

#artificialintelligence

Mathematics is the core of all machine learning algorithms. And while a formal math education isn't a prerequisite for becoming a data scientist, you need to understand the principles of the subject well enough to successfully build models that add value. In an article I wrote previously, I explained the three branches of mathematics that are essential to gaining a deeper understanding of ML algorithms: statistics, calculus, and linear algebra. This article focuses solely on linear algebra, as it forms the backbone of machine learning model implementation. Linear algebra concepts like vectorization allow for faster computation, and are implemented in libraries like pandas, SciPy, and scikit-learn.


Linear Algebra Mathematics for Machine Learning Data Science

#artificialintelligence

A common mistake data scientists make is applying tools without an intuition for how they work and behave. A solid foundation in mathematics will help you understand how each algorithm works, its limitations, and its underlying assumptions. With this, you will have an edge over your peers and be more confident in all applications of Machine Learning, Data Science, and Deep Learning. It always pays to know the machinery under the hood, rather than being someone who is just behind the wheel with no knowledge of the car. Linear Algebra is widely agreed to be a starting point on the learning curve of Machine Learning, Data Science, and Deep Learning. Its basic elements, vectors and matrices, are where we store our data for input as well as output.


Random Graph Matching in Geometric Models: the Case of Complete Graphs

arXiv.org Machine Learning

This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models. Specifically, given a random permutation $\pi^*$ on $[n]$ and $n$ iid pairs of correlated Gaussian vectors $\{X_{\pi^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, the edge weights are given by $A_{ij}=\kappa(X_i,X_j)$ and $B_{ij}=\kappa(Y_i,Y_j)$ for some link function $\kappa$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $\kappa(x,y)=\langle x, y \rangle$ and Euclidean distance model with $\kappa(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $\pi^*$ when $\sigma=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $\sigma=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of [DCK19] and [KNW22] in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of [Ume88] emerges as a further approximation to the maximum likelihood in the geometric model.


Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch

arXiv.org Artificial Intelligence

Linear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said kernels; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks and provide end users with guidelines on how to achieve performance gains.
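To make the matrix chain optimization mentioned above concrete, here is a sketch of the classic dynamic program that finds the cheapest parenthesization. It is not taken from the paper's benchmarks; the function name `matrix_chain_order` and the example dimensions are assumptions chosen for illustration, and cost is counted as scalar multiplications only.

```python
import math


def matrix_chain_order(dims):
    """Find the cheapest way to parenthesize a chain of matrix products.

    dims has length n + 1; matrix A_i has shape dims[i] x dims[i + 1].
    Returns (minimum scalar multiplications, parenthesization string).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = math.inf
            for k in range(i, j):           # split between A_k and A_{k+1}
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j] = c
                    split[i][j] = k

    def build(i, j):
        if i == j:
            return f"A{i}"
        k = split[i][j]
        return f"({build(i, k)} {build(k + 1, j)})"

    return cost[0][n - 1], build(0, n - 1)


# A0 is 10x100, A1 is 100x5, A2 is 5x50.
print(matrix_chain_order([10, 100, 5, 50]))  # (7500, '((A0 A1) A2)')
```

For these shapes the other ordering, `A0 (A1 A2)`, costs 75000 scalar multiplications, ten times more, which is the kind of gap a framework leaves on the table if it evaluates a chain strictly left to right or right to left.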


#007 Linear Algebra - Change of basis - Master Data Science

#artificialintelligence

In the following image we can see an alternative basis for one coordinate system, given by the basis vectors \(\vec{b}_{1} \) and \(\vec{b}_{2} \). In this alternative coordinate system the vector is represented with coordinates \(-1 \) and \(2 \): \(-1 \) is how much we have to scale \(\vec{b}_{1} \), and \(2 \) is how much we have to scale \(\vec{b}_{2} \).
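A short sketch of the translation between the two coordinate systems follows. The post's actual basis vectors are not given in this excerpt, so the values of `b1` and `b2` below are assumptions chosen for illustration, and the function names `to_standard` and `to_basis` are hypothetical.

```python
from fractions import Fraction

# Assumed basis vectors, written in standard coordinates (illustrative values).
b1 = [2, 1]
b2 = [-1, 1]


def to_standard(coords, b1, b2):
    """Coordinates (c1, c2) in the basis {b1, b2} -> standard coordinates."""
    c1, c2 = coords
    return [c1 * b1[0] + c2 * b2[0], c1 * b1[1] + c2 * b2[1]]


def to_basis(v, b1, b2):
    """Standard coordinates -> coordinates in {b1, b2}, using Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are parallel, so they do not form a basis")
    c1 = Fraction(v[0] * b2[1] - b2[0] * v[1], det)
    c2 = Fraction(b1[0] * v[1] - v[0] * b1[1], det)
    return [c1, c2]


# Scale b1 by -1 and b2 by 2, then add the results.
v = to_standard([-1, 2], b1, b2)
print(v)                 # [-4, 1]
print(to_basis(v, b1, b2) == [-1, 2])  # True: the round trip recovers (-1, 2)
```

With these assumed basis vectors, the coordinates \(-1 \) and \(2 \) describe the same arrow that the standard basis calls \((-4, 1)\).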


Single Trajectory Nonparametric Learning of Nonlinear Dynamics

arXiv.org Machine Learning

Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE). More precisely, we give nonasymptotic expected $l^2$-distance bounds between the LSE and the true regression function, where expectation is evaluated on a fresh, counterfactual trajectory. We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypothesis classes in terms of supremum norm metric entropy and a subgaussian parameter. Next, we relate this subgaussian parameter to the stability of the underlying process using notions from dynamical systems theory. When combined, these developments lead to rate-optimal error bounds that scale as $T^{-1/(2+q)}$ for suitably stable processes and hypothesis classes with metric entropy growth of order $\delta^{-q}$. Here, $T$ is the length of the observed trajectory, $\delta \in \mathbb{R}_+$ is the packing granularity and $q\in (0,2)$ is a complexity term. Finally, we specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS).