Goto

Collaborating Authors

 Mathematical & Statistical Methods


Bellman equation explained

#artificialintelligence

In this article, I am going to explain the Bellman equation, which is one of the fundamental elements of reinforcement learning. The equation tells us what long-term reward can we expect, given the state we are in and assuming that we take the best possible action now and at each subsequent step. Obviously, the goal of reinforcement learning is to maximize the long-term reward, so the Bellman equation can be used to calculate whether we have achieved the goal. The equation below is the Bellman equation for deterministic environments. It gives the value of the current state when the best possible action is chosen in this (and all following steps).


Surprising Uses of Synthetic Random Data Sets

#artificialintelligence

I have used synthetic data sets many times for simulation purposes, most recently in my articles Six degrees of Separations between any two Datasets and How to Lie with p-values. Many applications (including the data sets themselves) can be found in my books Applied Stochastic Processes and New Foundations of Statistical Science. For instance, these data sets can be used to benchmark some statistical tests of hypothesis (the null hypothesis known to be true or false in advance) and to assess the power of such tests or confidence intervals. In other cases, it is used to simulate clusters and test cluster detection / pattern detection algorithms, see here. I also used such data sets to discover two new deep conjectures in number theory (see here), to design new Fintech models such as bounded Brownian motions, and find new families of statistical distributions (see here).


On the Complexity of Approximating Multimarginal Optimal Transport

arXiv.org Machine Learning

We study the complexity of approximating the multimarginal optimal transport (OT) problem, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the multimarginal OT problem is not a minimum-cost flow problem when $m \geq 3$. This implies that many of the combinatorial algorithms developed for classical OT are not applicable to multimarginal OT, and therefore the standard interior-point algorithm bounds result in an intractable complexity bound of $\widetilde{\mathcal{O}}(n^{3m})$. Second, we propose and analyze two simple algorithms for approximating the multimarginal OT problem. The first algorithm, which we refer to as multimarginal Sinkhorn, improves upon previous multimarginal generalizations of the celebrated Sinkhorn algorithm. We show that it achieves a near-linear time complexity bound of $\widetilde{\mathcal{O}}(m^3 n^m / \varepsilon^2)$ for a tolerance $\varepsilon \in (0, 1)$. This matches the best known complexity bound for the Sinkhorn algorithm when $m = 2$ for approximating the classical OT distance. The second algorithm, which we refer to as multimarginal Randkhorn, accelerates the first algorithm by incorporating a randomized estimate sequence and achieves a complexity bound of $\widetilde{\mathcal{O}}(m^{8/3} n^{m+1/3}/\varepsilon)$. This improves on the complexity bound of the first algorithm by $1/\varepsilon$ and matches the best known complexity bound for the Randkhorn algorithm when $m=2$ for approximating the classical OT distance.


Data Smashing 2.0: Sequence Likelihood (SL) Divergence For Fast Time Series Comparison

arXiv.org Machine Learning

Abstract--Recognizing subtle historical patterns is central to modeling and forecasting problems in time series analysis. Here we introduce and develop a new approach to quantify deviations in the underlying hidden generators of observed data streams, resulting in a new efficiently computable universal metric for time series. The proposed metric is universal in the sense that we can compare and contrast data streams regardless of where and how they are generated, and without any feature engineering step. The approach proposed in this paper is conceptually distinct from our previous work on data smashing [4], and vastly improves discrimination performance and computing speed. The core idea here is the generalization of the notion of KL divergence often used to compare probability distributions to a notion of divergence in time series. We call this the sequence likelihood (SL) divergence, which may be used to measure deviations within a well-defined class of discrete-valued stochastic processes. We devise efficient estimators of SL divergence from finite sample paths, and subsequently formulate a universal metric useful for computing distance between time series produced by hidden stochastic generators. We illustrate the superior performance of the new smash2.0 Pattern disambiguation in two distinct applications involving electroencephalogram data and gait recognition is also illustrated. We are hopeful that the smash2.0 Effi ciently learning stochastic processes is a key challenge in analyzing time-dependency in domains where randomness cannot be ignored. For such learning to occur, we need todefine a distance metric to compare and contrast time series. However these distance metrics mentioned all have either or both of the following limitations: first, dimensionality reduction and feature selection heavily relies on domain knowledge and inevitably incur trade-o ff between precision and computability. Secondly, when dealing with data from nontrivial stochastic process dynamics, state of the art techniques might fail to correctly estimate the similarity or lack thereof between exemplars.


Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

arXiv.org Machine Learning

Linear algebra algorithms are used widely in a variety of domains, e.g machine learning, numerical physics and video games graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop assigned to a thread as one task. In this paper, we study the applications of supervised learning models to predict the chunk-size which yields maximum performance on multiple parallel linear algebra operations using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and tests sets by measuring performance of the application with different chunk-sizes for multiple linear algebra operations; vector-addition, matrix-vector-multiplication, matrix-matrix addition and matrix-matrix-multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision tree based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations.


Data-driven approximation of the Koopman generator: Model reduction, system identification, and control

arXiv.org Machine Learning

We derive a data-driven method for the approximation of the Koopman generator called gEDMD, which can be regarded as a straightforward extension of EDMD (extended dynamic mode decomposition). This approach is applicable to deterministic and stochastic dynamical systems. It can be used for computing eigenvalues, eigenfunctions, and modes of the generator and for system identification. In addition to learning the governing equations of deterministic systems, which then reduces to SINDy (sparse identification of nonlinear dynamics), it is possible to identify the drift and diffusion terms of stochastic differential equations from data. Moreover, we apply gEDMD to derive coarse-grained models of high-dimensional systems, and also to determine efficient model predictive control strategies. We highlight relationships with other methods and demonstrate the efficacy of the proposed methods using several guiding examples and prototypical molecular dynamics problems.


PDE-Inspired Algorithms for Semi-Supervised Learning on Point Clouds

arXiv.org Machine Learning

Given a data set and a subset of labels the problem of semi-supervised learning on point clouds is to extend the labels to the entire data set. In this paper we extend the labels by minimising the constrained discrete $p$-Dirichlet energy. Under suitable conditions the discrete problem can be connected, in the large data limit, with the minimiser of a weighted continuum $p$-Dirichlet energy with the same constraints. We take advantage of this connection by designing numerical schemes that first estimate the density of the data and then apply PDE methods, such as pseudo-spectral methods, to solve the corresponding Euler-Lagrange equation. We prove that our scheme is consistent in the large data limit for two methods of density estimation: kernel density estimation and spline kernel density estimation.


Inference of modes for linear stochastic processes

arXiv.org Machine Learning

For dynamical systems that can be modelled as asymptotically stable linear systems forced by Gaussian noise, this paper develops methods to infer their modes from observations in real time. The modes can be real or complex. For a real mode, we infer its damping rate, mode shape and amplitude. For a complex mode, we infer its frequency, damping rate, (complex) mode shape and (complex) amplitude. The work is motivated and illustrated by the problem of detection of oscillations in power flow in AC electrical networks. Suggestions of other applications are given.


A generalization of regularized dual averaging and its dynamics

arXiv.org Machine Learning

Excessive computational cost for learning large data and streaming data can be alleviated by using stochastic algorithms, such as stochastic gradient descent and its variants. Recent advances improve stochastic algorithms on convergence speed, adaptivity and structural awareness. However, distributional aspects of these new algorithms are poorly understood, especially for structured parameters. To develop statistical inference in this case, we propose a class of generalized regularized dual averaging (gRDA) algorithms with constant step size, which improves RDA (Xiao, 2010; Flammarion and Bach, 2017). Weak convergence of gRDA trajectories are studied, and as a consequence, for the first time in the literature, the asymptotic distributions for online l1 penalized problems become available. These general results apply to both convex and non-convex differentiable loss functions, and in particular, recover the existing regret bound for convex losses (Nemirovski et al., 2009). As important applications, statistical inferential theory on online sparse linear regression and online sparse principal component analysis are developed, and are supported by extensive numerical analysis. Interestingly, when gRDA is properly tuned, support recovery and central limiting distribution (with mean zero) hold simultaneously in the online setting, which is in contrast with the biased central limiting distribution of batch Lasso (Knight and Fu, 2000). Technical devices, including weak convergence of stochastic mirror descent, are developed as by-products with independent interest. Preliminary empirical analysis of modern image data shows that learning very sparse deep neural networks by gRDA does not necessarily sacrifice testing accuracy.


Neanderthal - Fast Native Matrix and Linear Algebra in Clojure

#artificialintelligence

Built with Clojure in mind, and implemented in Clojure. Sane interface and functions that fit into functional style while still respecting the reality of number crunching. Comes with fast primitive versions of map and reduce. Pluggable engine implementations - even a pure Java engine could be plugged in for the cases when speed is not the priority.