Goto

Collaborating Authors

 Mathematical & Statistical Methods


Inference for Change Points in High Dimensional Mean Shift Models

arXiv.org Machine Learning

Detection of change points constitutes a canonical statistical problem due to numerous applications in diverse areas, including economics and finance (Basseville et al. [1993], Frisén [2008]), quality process control (Qiu [2013]), functional genomics and neuroscience (Koepcke et al. [2016]). The offline version of the problem, wherein one examines the data retrospectively and aims to detect the presence and/or location of change points has been studied extensively for a variety of statistical models, including signal plus noise, regression, graphical, random graph, factor and time series models and various algorithms have been developed to accomplish this task -dynamic programming, regularized cost functions, binary segmentation, multiscale methods, etc., see, e.g. the review article Niu et al. [2016]. In the presence of multiple change points, consistency of the estimated location of the change points under certain regularity assumptions on the temporal spacing between change points and on the magnitude of the changes in the underlying model parameters have been established, see, e.g. Fryzlewicz [2014], Frick et al. [2014] and Wang and Samworth [2018] amongst several others, here the former two are under a fixed p framework and the latter under a high dimensional framework. Further, when a single change point has been assumed, the asymptotic distribution of the change point estimator has been established for various statistical models, see, e.g., [Bai, 1994, 1997], Csorgo and Horváth [1997], under fixed p setting, and [Bhattacharjee et al., 2017, 2019], [Kaul et al., 2020, 2021], under diverging dimensionality, where the last two articles allow potential high dimensionality.


Learning Mixed-Integer Linear Programs from Contextual Examples

arXiv.org Artificial Intelligence

Mixed-integer linear programs (MILPs) are widely used in artificial intelligence and operations research to model complex decision problems like scheduling and routing. Designing such programs however requires both domain and modelling expertise. In this paper, we study the problem of acquiring MILPs from contextual examples, a novel and realistic setting in which examples capture solutions and non-solutions within a specific context. The resulting learning problem involves acquiring continuous parameters -- namely, a cost vector and a feasibility polytope -- but has a distinctly combinatorial flavor. To solve this complex problem, we also contribute MISSLE, an algorithm for learning MILPs from contextual examples. MISSLE uses a variant of stochastic local search that is guided by the gradient of a continuous surrogate loss function. Our empirical evaluation on synthetic data shows that MISSLE acquires better MILPs faster than alternatives based on stochastic local search and gradient descent.


NumPy: Linear Algebra on Images

#artificialintelligence

In this instructional exercise, we will use a matrix decomposition from linear algebra, the Singular Value Decomposition, to generate a compressed approximation of an image. Let's start with What is Singular Value Decomposition? Singular Value Decomposition, or SVD, has an extensive variety of applications. These include dimensionality reduction, compressing images, and denoising data. Fundamentally, SVD says that a matrix can be represented as the product of three other matrices.


Rates of Estimation of Optimal Transport Maps using Plug-in Estimators via Barycentric Projections

arXiv.org Machine Learning

Optimal transport maps between two probability distributions $\mu$ and $\nu$ on $\mathbb{R}^d$ have found extensive applications in both machine learning and statistics. In practice, these maps need to be estimated from data sampled according to $\mu$ and $\nu$. Plug-in estimators are perhaps most popular in estimating transport maps in the field of computational optimal transport. In this paper, we provide a comprehensive analysis of the rates of convergences for general plug-in estimators defined via barycentric projections. Our main contribution is a new stability estimate for barycentric projections which proceeds under minimal smoothness assumptions and can be used to analyze general plug-in estimators. We illustrate the usefulness of this stability estimate by first providing rates of convergence for the natural discrete-discrete and semi-discrete estimators of optimal transport maps. We then use the same stability estimate to show that, under additional smoothness assumptions of Besov type or Sobolev type, wavelet based or kernel smoothed plug-in estimators respectively speed up the rates of convergence and significantly mitigate the curse of dimensionality suffered by the natural discrete-discrete/semi-discrete estimators. As a by-product of our analysis, we also obtain faster rates of convergence for plug-in estimators of $W_2(\mu,\nu)$, the Wasserstein distance between $\mu$ and $\nu$, under the aforementioned smoothness assumptions, thereby complementing recent results in Chizat et al. (2020). Finally, we illustrate the applicability of our results in obtaining rates of convergence for Wasserstein barycenters between two probability distributions and obtaining asymptotic detection thresholds for some recent optimal-transport based tests of independence.


Online Matching in Sparse Random Graphs: Non-Asymptotic Performances of Greedy Algorithm

arXiv.org Machine Learning

Motivated by sequential budgeted allocation problems, we investigate online matching problems where connections between vertices are not i.i.d., but they have fixed degree distributions -- the so-called configuration model. We estimate the competitive ratio of the simplest algorithm, GREEDY, by approximating some relevant stochastic discrete processes by their continuous counterparts, that are solutions of an explicit system of partial differential equations. This technique gives precise bounds on the estimation errors, with arbitrarily high probability as the problem size increases. In particular, it allows the formal comparison between different configuration models. We also prove that, quite surprisingly, GREEDY can have better performance guarantees than RANKING, another celebrated algorithm for online matching that usually outperforms the former.


Why Graph Theory Is Cooler Than You Thought

#artificialintelligence

Talk to a scientist in just about any discipline, and ask them the question -- based on their discipline -- "how does that stuff work?" You'll likely find that there are systems and networks that you have to consider before you can really understand how any given thing works: whether that's the human body, a food chain in an ecosystem, a chemical reaction, or a society as a whole. Without understanding the relationship between two animals in an ecosystem, two atoms in a molecule, or cells and tissues in our body, you just have a bunch of data: a list of cells, a readout of animals and what they eat, etc. Traditional machine learning models often take data this way: they take lists or tables of data and do some stuff (the details of which depend on the algorithm being used as well as a few other parameters) to make predictions about a thing. If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material. Graphs are data structures that represent networks of, or relationships between the data they contain. Typically, they're represented as "nodes" and lines, or "edges".


Applications of the Free Energy Principle to Machine Learning and Neuroscience

arXiv.org Artificial Intelligence

In this thesis, we explore and apply methods inspired by the free energy principle to two important areas in machine learning and neuroscience. The free energy principle is a general mathematical theory of the necessary information-theoretic behaviours of systems which maintain a separation from their environment. A core postulate of the theory is that complex systems can be seen as performing variational Bayesian inference and minimizing an information-theoretic quantity called the variational free energy. The free energy principle originated in, and has been extremely influential in theoretical neuroscience, having spawned a number of neurophysiologically realistic process theories, and maintaining close links with Bayesian Brain viewpoints. The thesis is split into three main parts where we apply methods and insights from the free energy principle to understand questions first in perception, then action, and finally learning. Specifically, in the first section, we focus on the theory of predictive coding, a neurobiologically plausible process theory derived from the free energy principle under certain assumptions, which argues that the primary function of the brain is to minimize prediction errors. We focus on scaling up predictive coding architectures and simulate large-scale predictive coding networks for perception on machine learning benchmarks; we investigate predictive coding's relationship to other classical filtering algorithms, and we demonstrate that many biologically implausible aspects of current models of predictive coding can be relaxed without unduly harming the performance of predictive coding models which allows for a potentially more literal translation of predictive coding theory into cortical microcircuits. In the second part of the thesis, we focus on the application of methods deriving from the free energy principle to action. We study the extension of methods of'active inference', a neurobiologically grounded account of action through variational message passing, to utilize deep artificial neural networks, allowing these methods to'scale up' to be competitive with state of the art deep reinforcement learning methods.


Affine-Invariant Integrated Rank-Weighted Depth: Definition, Properties and Finite Sample Analysis

arXiv.org Machine Learning

Because it determines a center-outward ordering of observations in $\mathbb{R}^d$ with $d\geq 2$, the concept of statistical depth permits to define quantiles and ranks for multivariate data and use them for various statistical tasks (\textit{e.g.} inference, hypothesis testing). Whereas many depth functions have been proposed \textit{ad-hoc} in the literature since the seminal contribution of \cite{Tukey75}, not all of them possess the properties desirable to emulate the notion of quantile function for univariate probability distributions. In this paper, we propose an extension of the \textit{integrated rank-weighted} statistical depth (IRW depth in abbreviated form) originally introduced in \cite{IRW}, modified in order to satisfy the property of \textit{affine-invariance}, fulfilling thus all the four key axioms listed in the nomenclature elaborated by \cite{ZuoS00a}. The variant we propose, referred to as the Affine-Invariant IRW depth (AI-IRW in short), involves the covariance/precision matrices of the (supposedly square integrable) $d$-dimensional random vector $X$ under study, in order to take into account the directions along which $X$ is most variable to assign a depth value to any point $x\in \mathbb{R}^d$. The accuracy of the sampling version of the AI-IRW depth is investigated from a nonasymptotic perspective. Namely, a concentration result for the statistical counterpart of the AI-IRW depth is proved. Beyond the theoretical analysis carried out, applications to anomaly detection are considered and numerical results are displayed, providing strong empirical evidence of the relevance of the depth function we propose here.


On Contrastive Representations of Stochastic Processes

arXiv.org Machine Learning

Learning representations of stochastic processes is an emerging problem in machine learning with applications from meta-learning to physical object models to time series. Typical methods rely on exact reconstruction of observations, but this approach breaks down as observations become high-dimensional or noise distributions become complex. To address this, we propose a unifying framework for learning contrastive representations of stochastic processes (CRESP) that does away with exact reconstruction. We dissect potential use cases for stochastic process representations, and propose methods that accommodate each. Empirically, we show that our methods are effective for learning representations of periodic functions, 3D objects and dynamical processes. Our methods tolerate noisy high-dimensional observations better than traditional approaches, and the learned representations transfer to a range of downstream tasks.


Non-PSD Matrix Sketching with Applications to Regression and Optimization

arXiv.org Machine Learning

A variety of dimensionality reduction techniques have been applied for computations involving large matrices. The underlying matrix is randomly compressed into a smaller one, while approximately retaining many of its original properties. As a result, much of the expensive computation can be performed on the small matrix. The sketching of positive semidefinite (PSD) matrices is well understood, but there are many applications where the related matrices are not PSD, including Hessian matrices in non-convex optimization and covariance matrices in regression applications involving complex numbers. In this paper, we present novel dimensionality reduction methods for non-PSD matrices, as well as their ``square-roots", which involve matrices with complex entries. We show how these techniques can be used for multiple downstream tasks. In particular, we show how to use the proposed matrix sketching techniques for both convex and non-convex optimization, $\ell_p$-regression for every $1 \leq p \leq \infty$, and vector-matrix-vector queries.