 Mathematical & Statistical Methods


Data Structures Related to Machine Learning Algorithms

#artificialintelligence

The Statsbot team has invited Peter Mills to tell you about data structures for machine learning approaches. So you've decided to move beyond canned algorithms and start to code your own machine learning methods. Maybe you've got an idea for a cool new way of clustering data, or maybe you are frustrated by the limitations in your favorite statistical classification package. In either case, the better your knowledge of data structures and algorithms, the easier it will be when the time comes to write the code. I don't think the data structures used in machine learning are significantly different from those used in other areas of software development.


Four Weird Mathematical Objects

@machinelearnbot

Here I discuss four interesting mathematical problems (mostly involving famous unsolved conjectures) of considerable interest, and that even high school kids can understand. For the data scientist, they give a unique opportunity to test various techniques to either disprove or make progress on these problems. The field itself has been a source of constant innovation -- especially to develop distributed architectures, as well as HPC (high performance computing) and quantum computing to try to solve (to no avail so far) these very difficult yet basic problems. And the data sets involved in these problems are incredibly massive and entirely free: they consist of all the integers and real numbers! The first two problems have been addressed on Data Science Central (DSC) before; the other two are presented here on DSC for the first time.


A Gentle Introduction to Linear Algebra - Machine Learning Mastery

#artificialintelligence

Linear algebra is a field of mathematics that is universally agreed to be a prerequisite to a deeper understanding of machine learning. Although linear algebra is a large field with many esoteric theories and findings, the nuts and bolts tools and notations taken from the field are practical for machine learning practitioners. With a solid foundation of what linear algebra is, it is possible to focus on just the good or relevant parts. In this tutorial, you will discover what exactly linear algebra is from a machine learning perspective.


Basics of Linear Algebra for Machine Learning - Machine Learning Mastery

@machinelearnbot

This book was designed around major data structures, operations, and techniques in linear algebra that are directly relevant to machine learning algorithms. There are a lot of things you could learn about linear algebra, from theory to abstract concepts to APIs. My goal is to take you straight to developing an intuition for the elements you must understand with laser-focused tutorials. I designed the tutorials to focus on how to get things done with linear algebra. They give you the tools to both rapidly understand and apply each technique or operation. Each tutorial is designed to take you about one hour to read through and complete, excluding the extensions and further reading. You can choose to work through the lessons one per day, one per week, or at your own pace. I think momentum is critically important, and this book is intended to be read and used, not to sit idle. I would recommend picking a schedule and sticking to it.


The Geometric Block Model

arXiv.org Machine Learning

To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model generalizes the random geometric graphs in the same way that the well-studied stochastic block model generalizes the Erdős-Rényi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancement in community detection. While the model is a topic of fundamental theoretical interest, our main contribution is to show that many practical community structures are better explained by the geometric block model. We also show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. Indeed, even in the regime where the average degree of the graph grows only logarithmically with the number of vertices (sparse-graph), we show that this algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from optimal for the stochastic block model. We simulate our results on both real and synthetic datasets to show superior performance of both the new model and our algorithm.
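The abstract does not spell out the triangle-counting rule, so the following is only a minimal sketch of the general idea under stated assumptions: count, for every edge, how many triangles it lies in (the common neighbors of its endpoints), keep edges whose count reaches a hypothetical `threshold`, and read communities off the connected components that remain. The function names, the threshold rule, and the toy graph are all illustrative and are not taken from the paper.

```python
from itertools import combinations

def common_neighbor_counts(edges):
    """Map each edge (u, v) to the number of triangles that contain it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {(u, v): len(adj[u] & adj[v]) for u, v in edges}

def keep_edges_with_triangles(edges, threshold):
    """Keep edges lying in at least `threshold` triangles (assumed rule)."""
    counts = common_neighbor_counts(edges)
    return [e for e in edges if counts[e] >= threshold]

def connected_components(vertices, edges):
    """Return the connected components as a list of vertex sets."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    return components

# Toy graph: two 4-cliques joined by a single bridge edge (3, 4).
vertices = list(range(8))
edges = (list(combinations(range(4), 2))
         + list(combinations(range(4, 8), 2))
         + [(3, 4)])

kept = keep_edges_with_triangles(edges, threshold=1)
communities = connected_components(vertices, kept)
print(communities)
```

In this toy graph the bridge edge lies in no triangle, so it is dropped and the two cliques separate. Choosing the threshold for a real graph is exactly the part the paper analyzes and this sketch does not.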


Variational Inference for Gaussian Process Models with Linear Complexity

arXiv.org Machine Learning

Large-scale Gaussian process inference has long faced practical challenges due to time and space complexity that is superlinear in dataset size. While sparse variational Gaussian process models are capable of learning from large-scale data, standard strategies for sparsifying the model can prevent the approximation of complex functions. In this work, we propose a novel variational Gaussian process model that decouples the representation of mean and covariance functions in reproducing kernel Hilbert space. We show that this new parametrization generalizes previous models. Furthermore, it yields a variational inference problem that can be solved by stochastic gradient ascent with time and space complexity that is only linear in the number of mean function parameters, regardless of the choice of kernels, likelihoods, and inducing points. This strategy makes the adoption of large-scale expressive Gaussian process models possible. We run several experiments on regression tasks and show that this decoupled approach greatly outperforms previous sparse variational Gaussian process inference procedures.


Programming Statistical Applications in R Udemy

@machinelearnbot

Programming Statistical Applications in R is an introductory course teaching the basics of programming mathematical and statistical applications using the R language. The course makes extensive use of the Introduction to Scientific Programming and Simulation using R (spuRs) package from the Comprehensive R Archive Network (CRAN). The course is a scientific-programming foundations course and is a useful complement and precursor to the more simulation-application oriented R Programming for Simulation and Monte-Carlo Methods Udemy course. The two courses were originally developed as a two-course sequence (although they do share some exercises in common). Together, the two courses provide a useful body of instruction on how to create your own mathematical and statistical functions and applications using R software.


Curious Mathematical Object: Hyperlogarithms

@machinelearnbot

Logarithms turn a product of numbers into a sum of numbers: log(xy) = log(x) + log(y). Hyperlogarithms generalize the concept as follows: Hlog(XY) = Hlog(X) + Hlog(Y), where X and Y are any kind of objects, and the product and sum are replaced by operators in some arbitrary space. Here we focus exclusively on operations on sets: XY becomes the intersection of the sets X and Y, and X + Y the union of X and Y. The question is: which functions satisfy Hlog(XY) = Hlog(X) + Hlog(Y)? We assume here that the argument for Hlog is a set X, and the returned value Hlog(X) = Y is another set Y from the same set of sets. Let E = {X, Y, ...} be the set of all potential arguments for Hlog.
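With the product read as intersection and the sum as union, one function that satisfies the identity is the complement within a fixed universe, since De Morgan's law says the complement of an intersection is the union of the complements. The sketch below checks this exhaustively on a small universe; `UNIVERSE`, `powerset`, `hlog`, and `satisfies_hlog_identity` are illustrative names and the choice of universe is an assumption, not something taken from the article.

```python
from itertools import chain, combinations

# Hypothetical small universe, chosen only so the check is exhaustive.
UNIVERSE = frozenset(range(4))

def powerset(s):
    """All subsets of s, as frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in subsets]

def hlog(x):
    """Candidate hyperlogarithm: the complement of x within UNIVERSE."""
    return UNIVERSE - x

def satisfies_hlog_identity(f, sets):
    """Check f(X & Y) == f(X) | f(Y) for every pair of sets."""
    return all(f(x & y) == f(x) | f(y) for x in sets for y in sets)

E = powerset(UNIVERSE)  # all potential arguments
complement_ok = satisfies_hlog_identity(hlog, E)
identity_ok = satisfies_hlog_identity(lambda x: x, E)
print(complement_ok, identity_ok)
```

The complement passes, while the identity map fails as soon as X and Y differ, because the intersection of X and Y is then not equal to their union.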


Estimating the Number of Connected Components in a Graph via Subgraph Sampling

arXiv.org Machine Learning

Learning properties of large graphs from samples has been an important problem in statistical network analysis since the early work of Goodman (1949) and Frank (1978). We revisit a problem formulated by Frank (1978) of estimating the number of connected components in a large graph based on the subgraph sampling model, in which we randomly sample a subset of the vertices and observe the induced subgraph. The key question is whether accurate estimation is achievable in the sublinear regime where only a vanishing fraction of the vertices are sampled. We show that it is impossible if the parent graph is allowed to contain high-degree vertices or long induced cycles. For the class of chordal graphs, where induced cycles of length four or above are forbidden, we characterize the optimal sample complexity within constant factors and construct linear-time estimators that provably achieve these bounds. This significantly expands the scope of previous results which have focused on unbiased estimators and special classes of graphs such as forests or cliques. Both the construction and the analysis of the proposed methodology rely on combinatorial properties of chordal graphs and identities of induced subgraph counts. They, in turn, also play a key role in proving minimax lower bounds based on construction of random instances of graphs with matching structures of small subgraphs.
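For the simplest special case mentioned in the abstract, forests, an unbiased estimator is easy to write down, and a short sketch may help make the sampling model concrete. A forest has (number of vertices) minus (number of edges) components. If each vertex is kept independently with probability p (an assumed Bernoulli version of the sampling model), a vertex survives with probability p and an edge with probability p^2, so rescaling the sampled counts gives an unbiased estimate. This is not the paper's estimator for chordal graphs, and all names below are illustrative.

```python
import random

def sample_induced_subgraph(vertices, edges, p, rng):
    """Keep each vertex independently with probability p (assumed model)
    and return the kept vertices with the edges they induce."""
    kept = {v for v in vertices if rng.random() < p}
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, kept_edges

def estimate_components_forest(num_sampled_vertices, num_sampled_edges, p):
    """Unbiased estimate of the component count of a forest:
    rescale sampled vertices by 1/p and sampled edges by 1/p**2."""
    return num_sampled_vertices / p - num_sampled_edges / p**2

# Toy forest: a path 0-1-2, an edge 3-4, and an isolated vertex 5,
# so the true number of connected components is 3.
vertices = list(range(6))
edges = [(0, 1), (1, 2), (3, 4)]

rng = random.Random(0)
p = 0.5
kept, kept_edges = sample_induced_subgraph(vertices, edges, p, rng)
estimate = estimate_components_forest(len(kept), len(kept_edges), p)
print(estimate)
```

A single estimate can be far from 3, or even negative; only its average over many samples equals the true count. That variance is why the sublinear regime discussed in the abstract is hard.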


Beautiful Number Theory Problem and Sandbox for Data Scientists

@machinelearnbot

The Waring conjecture - actually a problem associated with a number of conjectures, many now solved - is one of the most fascinating mathematical problems. This article covers new aspects of this problem, with a generalization and new conjectures, some with a tentative solution, and a new framework to tackle the problem. Yet it is written in simple English and accessible to the layman. I also review a number of famous related mathematical conjectures, including one with a $1 million award still waiting for a solution, as well as Goldbach's conjecture, still unproved as of today. Many curious properties of the Floor function are also listed, and the emphasis is on machine learning and efficient computer-intensive algorithms to try to find surprising results, which then need to be formally proved or disproved.
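As a small sandbox in the spirit of the summary above, the sketch below computes by dynamic programming the fewest k-th powers of positive integers that sum to n, and compares the worst case over a range with the classical closed form 2^k + floor((3/2)^k) - 2, which is where the floor function enters. The formula and the values for k = 2, 3, 4 are standard results about Waring's problem and are not taken from the article; the function names are illustrative.

```python
def min_kth_powers(n, k):
    """Fewest k-th powers of positive integers summing to n (n >= 1)."""
    powers = []
    base = 1
    while base**k <= n:
        powers.append(base**k)
        base += 1
    # best[m] holds the fewest k-th powers summing to m; best[0] = 0.
    best = [0] + [n + 1] * n
    for m in range(1, n + 1):
        best[m] = 1 + min(best[m - q] for q in powers if q <= m)
    return best[n]

def waring_g_formula(k):
    """Classical closed form 2**k + floor((3/2)**k) - 2, in exact integers."""
    return 2**k + (3**k // 2**k) - 2

for k, limit in [(2, 100), (3, 300), (4, 100)]:
    worst = max(min_kth_powers(n, k) for n in range(1, limit + 1))
    print(k, worst, waring_g_formula(k))
```

The integers 7, 23, and 79 are the smallest ones that need the full 4 squares, 9 cubes, and 19 fourth powers respectively. A finite search like this can only suggest a pattern; it proves nothing about all integers.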