Mathematical & Statistical Methods
On the Universality of Graph Neural Networks on Large Random Graphs
Keriven, Nicolas, Bietti, Alberto, Vaiter, Samuel
We study the approximation power of Graph Neural Networks (GNNs) on latent position random graphs. In the large graph limit, GNNs are known to converge to certain "continuous" models known as c-GNNs, which directly enables a study of their approximation power on random graph models. In the absence of input node features however, just as GNNs are limited by the Weisfeiler-Lehman isomorphism test, c-GNNs will be severely limited on simple random graph models. For instance, they will fail to distinguish the communities of a well-separated Stochastic Block Model (SBM) with constant degree function. Thus, we consider recently proposed architectures that augment GNNs with unique node identifiers, referred to as Structural GNNs here (SGNNs). We study the convergence of SGNNs to their continuous counterpart (c-SGNNs) in the large random graph limit, under new conditions on the node identifiers. We then show that c-SGNNs are strictly more powerful than c-GNNs in the continuous limit, and prove their universality on several random graph models of interest, including most SBMs and a large class of random geometric graphs. Our results cover both permutation-invariant and permutation-equivariant architectures.
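The limitation described above has a simple finite-graph analogue, sketched here under stated assumptions. The graph (two triangles joined by a perfect matching, so every node has degree 3), the `message_passing` layer, and the use of random vectors as node identifiers are all illustrative choices, not the architectures analysed in the paper.

```python
import numpy as np

def message_passing(adj, feats, weights):
    """Toy sum-aggregation GNN: h <- tanh(h W_self + A h W_neigh) per layer."""
    h = feats
    for w_self, w_neigh in weights:
        h = np.tanh(h @ w_self + adj @ h @ w_neigh)
    return h

# Two "communities" {0,1,2} and {3,4,5}; every node has degree 3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (1, 4), (2, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

rng = np.random.default_rng(0)
weights = [(0.3 * rng.normal(size=(4, 4)), 0.3 * rng.normal(size=(4, 4)))
           for _ in range(3)]

# Without node features (constant input), all nodes get the same embedding.
out_plain = message_passing(adj, np.ones((6, 4)), weights)

# With per-node identifiers (here: random vectors, an assumption), they differ.
out_ids = message_passing(adj, rng.normal(size=(6, 4)), weights)

print(np.allclose(out_plain, out_plain[0]), np.allclose(out_ids, out_ids[0]))
```

Because every node sees the same constant input and the same number of neighbours, the featureless network cannot separate the two triangles, whatever its weights are.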
Hashing embeddings of optimal dimension, with applications to linear least squares
Cartis, Coralia, Fiala, Jan, Shao, Zhen
The aim of this paper is two-fold: firstly, to present subspace embedding properties for $s$-hashing sketching matrices, with $s\geq 1$, that are optimal in the projection dimension $m$ of the sketch, namely, $m=\mathcal{O}(d)$, where $d$ is the dimension of the subspace. A diverse set of results is presented that addresses the case when the input matrix has sufficiently low coherence (thus removing the $\log^2 d$ factor dependence in $m$ in the low-coherence result of Bourgain et al. (2015), at the expense of a smaller coherence requirement); how this coherence changes with the number $s$ of column nonzeros (allowing a scaling of $\sqrt{s}$ of the coherence bound); and how it is reduced through suitable transformations (when considering hashed, instead of subsampled, coherence-reducing transformations such as the randomised Hadamard transform). Secondly, we apply these general hashing sketching results to the special case of Linear Least Squares (LLS), and develop Ski-LLS, a generic software package for these problems that builds upon and improves the Blendenpik solver on dense input and the (sequential) LSRN performance on sparse problems. In addition to the hashing sketching improvements, we add suitable linear algebra tools for rank-deficient and for sparse problems that lead Ski-LLS to outperform not only sketching-based routines on randomly generated input, but also the state-of-the-art direct solver SPQR and the iterative code HSL on certain subsets of the sparse Florida matrix collection; namely, on least squares problems that are significantly overdetermined, or moderately sparse, or difficult.
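As a rough illustration of the sketching matrices discussed above, the following builds an $s$-hashing matrix (each column has exactly $s$ nonzeros, equal to $\pm 1/\sqrt{s}$, in rows drawn uniformly without replacement) and uses it in plain sketch-and-solve least squares. This is a minimal dense NumPy sketch: the function names and problem sizes are assumptions, and sketch-and-solve is a simpler stand-in, not the Ski-LLS solver.

```python
import numpy as np

def s_hashing_matrix(m, n, s, rng):
    """m x n matrix with s nonzeros per column, each equal to +-1/sqrt(s)."""
    sketch = np.zeros((m, n))
    for col in range(n):
        rows = rng.choice(m, size=s, replace=False)   # distinct rows per column
        signs = rng.choice([-1.0, 1.0], size=s)
        sketch[rows, col] = signs / np.sqrt(s)
    return sketch

def sketch_and_solve(a, b, m, s, rng):
    """Solve min ||S(Ax - b)|| instead of min ||Ax - b|| (illustrative only)."""
    sketch = s_hashing_matrix(m, a.shape[0], s, rng)
    x, *_ = np.linalg.lstsq(sketch @ a, sketch @ b, rcond=None)
    return x

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 5))      # overdetermined: n = 200 rows, d = 5 columns
x_true = rng.normal(size=5)
b = a @ x_true                     # consistent system, so the sketch loses nothing
x_hat = sketch_and_solve(a, b, m=40, s=3, rng=rng)
```

Here the sketched problem has only 40 rows instead of 200; for a consistent system the sketched solution coincides with the true one whenever the sketched matrix keeps full column rank.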
Top Stories, May 10-16: Essential Linear Algebra for Data Science and Machine Learning - KDnuggets
Increase your data science and machine learning productivity with these Chrome extensions. Data Science Books You Should Start Reading in 2021, by Przemek Chojecki; How to organize your data science project in 2021, by Benjamin Obi Tayo; Data Preparation in SQL, with Cheat Sheet!, by Stan Pugsley; Charticulator: Microsoft Research open-sourced a game-changing Data Visualization platform, by Josh Taylor; Data Scientist vs Machine Learning Engineer – what are their skills?, by Matthew Przybyla; Essential Linear Algebra for Data Science and Machine Learning, by Benjamin Obi Tayo; Top 10 Must-Know Machine Learning Algorithms for Data Scientists – Part 1, by Matthew Mayo.
A Short Machine Learning Explanation -- in terms of Linear Algebra, Probability and Calculus
In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. Strictly speaking, tensors and multidimensional arrays are different types of object: the first is a type of function, while the second is a data structure suitable for representing a tensor in a coordinate system. A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. In tensor terms, a tensor that contains only one number is called a scalar (or scalar tensor, 0-dimensional tensor, or 0D tensor).
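The number of axes described above can be inspected directly on a multidimensional array. A small sketch using NumPy (the variable names are arbitrary):

```python
import numpy as np

scalar = np.array(3.5)                        # 0 axes: a single number
vector = np.array([1.0, 2.0, 3.0])            # 1 axis
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2 axes
tensor3 = np.zeros((2, 3, 4))                 # 3 axes

# ndim reports the number of axes; shape reports the size along each axis.
ranks = [arr.ndim for arr in (scalar, vector, matrix, tensor3)]
shapes = [arr.shape for arr in (scalar, vector, matrix, tensor3)]
print(ranks)
print(shapes)
```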
Gaussian processes (1/3) - From scratch
This post explores some concepts behind Gaussian processes, such as stochastic processes and the kernel function. We will build up a deeper understanding of Gaussian process regression by implementing it from scratch using Python and NumPy. This post is followed by a second post demonstrating how to fit a Gaussian process kernel with TensorFlow Probability. In what follows we assume familiarity with basic probability and linear algebra, especially in the context of multivariate Gaussian distributions. Have a look at this post if you need a refresher on the Gaussian distribution.
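To give a flavour of what a from-scratch implementation might look like, here is a minimal NumPy sketch of Gaussian process regression. The exponentiated quadratic kernel, the function names, and the small jitter term `noise_var` are assumptions made for illustration and are not taken from the post itself.

```python
import numpy as np

def exponentiated_quadratic(xa, xb, length_scale=1.0):
    """Kernel k(a, b) = exp(-(a - b)^2 / (2 * length_scale^2)) for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-8):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    k11 = exponentiated_quadratic(x_train, x_train) + noise_var * np.eye(len(x_train))
    k12 = exponentiated_quadratic(x_train, x_test)
    k22 = exponentiated_quadratic(x_test, x_test)
    solved = np.linalg.solve(k11, k12).T      # equals K12^T K11^{-1} (K11 is symmetric)
    mean = solved @ y_train
    cov = k22 - solved @ k12
    return mean, cov

x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)
mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # pointwise predictive std
```

With almost no observation noise, the posterior mean passes through the training points and the predictive standard deviation shrinks towards zero there.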
Escaping Saddle Points with Compressed SGD
Avdiukhin, Dmitrii, Yaroslavtsev, Grigory
Stochastic Gradient Descent (SGD) and its variants are the main workhorses of modern machine learning. Distributed implementations of SGD on a cluster of machines with a central server and a large number of workers are frequently used in practice due to the massive size of the data. In distributed SGD each machine holds a copy of the model and the computation proceeds in rounds. In every round, each worker finds a stochastic gradient based on its batch of examples, the server averages these stochastic gradients to obtain the gradient of the entire batch, makes an SGD step, and broadcasts the updated model parameters to the workers. With a large number of workers, computation parallelizes efficiently while communication becomes the main bottleneck [Chilimbi et al., 2014, Strom, 2015], since each worker needs to send its gradients to the server and receive the updated model parameters. Common solutions for this problem include: local SGD and its variants, in which each machine performs multiple local steps before communication [Stich, 2018]; decentralized architectures, which allow pairwise communication between the workers [McMahan et al., 2017]; and gradient compression, in which a compressed version of the gradient is communicated instead of the full gradient [Bernstein et al., 2018, Stich et al., 2018, Karimireddy et al., 2019]. In this work, we consider the last of these approaches, which we refer to as compressed SGD. Most machine learning models can be described by a d-dimensional vector of parameters x, and the model quality can be estimated as a function f(x).
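One round of the scheme described above can be sketched as follows. The top-k sparsifier is just one possible compressor and is an assumption here, as are the least-squares loss, the function names, and the problem sizes; the excerpt does not specify which compressor the paper analyses.

```python
import numpy as np

def top_k_compress(grad, k):
    """Keep the k largest-magnitude coordinates and zero out the rest."""
    compressed = np.zeros_like(grad)
    keep = np.argpartition(np.abs(grad), -k)[-k:]
    compressed[keep] = grad[keep]
    return compressed

def worker_gradient(x, features, targets):
    """Gradient of the mean squared error 0.5 * mean((features @ x - targets)^2)."""
    residual = features @ x - targets
    return features.T @ residual / len(targets)

def compressed_sgd_round(x, worker_batches, step_size, k):
    """Workers send compressed gradients; the server averages them and steps."""
    messages = [top_k_compress(worker_gradient(x, f, t), k)
                for f, t in worker_batches]
    return x - step_size * np.mean(messages, axis=0)

rng = np.random.default_rng(0)
d = 5
x = np.zeros(d)                                   # model parameters
worker_batches = [(rng.normal(size=(8, d)), rng.normal(size=8))
                  for _ in range(4)]              # one batch per worker
x_next = compressed_sgd_round(x, worker_batches, step_size=0.1, k=2)
```

Each worker transmits only k of its d gradient coordinates, which is where the communication saving comes from.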
Learning a Latent Simplex in Input-Sparsity Time
Bakshi, Ainesh, Bhattacharyya, Chiranjib, Kannan, Ravi, Woodruff, David P., Zhou, Samson
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^d$, given access to $A\in\mathbb{R}^{d\times n}$, which can be viewed as a data matrix with $n$ points that are obtained by randomly perturbing latent points in the simplex $K$ (potentially beyond $K$). A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast as learning a latent simplex. Bhattacharyya and Kannan (SODA, 2020) give an algorithm for learning such a latent simplex in time roughly $O(k\cdot\textrm{nnz}(A))$, where $\textrm{nnz}(A)$ is the number of non-zeros in $A$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply an algorithmic breakthrough for spectral low rank approximation. At a high level, Bhattacharyya and Kannan provide an adaptive algorithm that makes $k$ matrix-vector product queries to $A$ and each query is a function of all queries preceding it. Since each matrix-vector product requires $\textrm{nnz}(A)$ time, their overall running time appears unavoidable. Instead, we obtain a low-rank approximation to $A$ in input-sparsity time and show that the column space thus obtained has small $\sin\Theta$ (angular) distance to the right top-$k$ singular space of $A$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $\Theta(k\cdot\textrm{nnz}(A))$ running time.
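The selection step described above (project onto a low-rank subspace, then pick the points with the largest inner product with random vectors) can be illustrated with a toy sketch. It makes several simplifying assumptions: a full SVD stands in for the input-sparsity-time low-rank approximation, plain Gaussian vectors stand in for the paper's carefully chosen ones, and the data are noiseless. The function name is hypothetical.

```python
import numpy as np

def pick_extreme_points(data, k, rng):
    """data is d x n (columns are points); returns k column indices."""
    u, _, _ = np.linalg.svd(data, full_matrices=False)
    basis = u[:, :k]                      # top-k subspace (full SVD, for illustration)
    projected = basis.T @ data            # k x n coordinates inside the subspace
    directions = rng.normal(size=(k, k))  # one random direction per selected point
    return np.argmax(directions @ projected, axis=1)

rng = np.random.default_rng(0)
d, k, n = 5, 3, 50
vertices = rng.normal(size=(d, k))                  # latent simplex vertices
mix = rng.dirichlet(np.ones(k), size=n).T           # k x n convex weights
data = np.hstack([vertices, vertices @ mix])        # columns 0..k-1 are the vertices
chosen = pick_extreme_points(data, k, rng)
```

In this noiseless setting a linear function over the points is maximised at a vertex, so every selected index falls among the first k columns; the same vertex may be picked more than once.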
Automatic Sudoku (Number Place) Solver with Digit Recognition and Integer Linear Programming
Sudoku is a logic-based number placement puzzle that consists of 81 cells, which are divided into 9 columns, 9 rows and 9 blocks. The goal of the game is to fill each cell with a number from 1–9 so that no number repeats within any row, column or block. In this post, I introduce an automatic sudoku solver based on digit recognition and integer linear programming. It uses Keras (with a model trained on the MNIST database [1]) and OpenCV for digit recognition, and PuLP for integer linear programming. The MNIST database is also widely used for training and testing in the field of machine learning. In this section, I give an overview of the image processing used for digit recognition.
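For reference, the puzzle rules above are commonly written as an integer linear program along the following lines. This is a standard textbook formulation, not necessarily the exact model used in the post; the binary variable $x_{r,c,v}$ (equal to 1 when cell $(r,c)$ holds value $v$), the block index sets $B_b$, and the set $G$ of recognised givens are notation introduced here.

```latex
\begin{align*}
& x_{r,c,v} \in \{0,1\}, && r,c,v \in \{1,\dots,9\} \\
& \sum_{v=1}^{9} x_{r,c,v} = 1 && \text{for every cell } (r,c) \\
& \sum_{c=1}^{9} x_{r,c,v} = 1 && \text{for every row } r \text{ and value } v \\
& \sum_{r=1}^{9} x_{r,c,v} = 1 && \text{for every column } c \text{ and value } v \\
& \sum_{(r,c)\in B_b} x_{r,c,v} = 1 && \text{for every block } b \text{ and value } v \\
& x_{r,c,v} = 1 && \text{for every given } (r,c,v) \in G
\end{align*}
```

Since any feasible assignment solves the puzzle, the objective can be an arbitrary constant; the solver is only asked for feasibility.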
Hello World of Programming with Linear Algebra
Imagine this simplified code for inventory modeling. (We're using floats for prices (bad), we store the data in global state, and the architecture is far from even a simple web application. But the model is familiar enough to a typical software developer.) This (imaginary) application exists to track sales. A customer puts the desired products into a cart, we calculate the total price, and, later, perform the delivery.
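The cart-total step lends itself to a linear algebra reading: the total is a dot product of quantities and prices. The snippet below is a hypothetical stand-in, not the post's own code; the product prices and cart contents are invented, and prices are kept in integer cents to sidestep the float issue mentioned above.

```python
import numpy as np

# Hypothetical catalogue of three products, prices in integer cents.
prices_cents = np.array([250, 99, 1200])

# One cart: quantity of each product, in catalogue order.
cart = np.array([2, 0, 1])
total_cents = int(cart @ prices_cents)        # dot product = total price

# Several carts at once: one row per cart, one matrix-vector product.
carts = np.array([[2, 0, 1],
                  [0, 3, 0]])
totals_cents = carts @ prices_cents
print(total_cents, totals_cents.tolist())
```

Stacking carts as rows turns the per-cart loop into a single matrix-vector product.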