Goto

Collaborating Authors

 Mathematical & Statistical Methods


Local identifiability of $l_1$-minimization dictionary learning: a sufficient and almost necessary condition

arXiv.org Machine Learning

We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf x_i\in \mathbb R^K$ for $i=1,...,N$ via $l_1$-minimization. We assume that $\mathbf x_i$'s are $i.i.d.$ random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf D_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.


From Dependence to Causation

arXiv.org Machine Learning

Machine learning is the science of discovering statistical dependencies in data, and the use of those dependencies to perform predictions. During the last decade, machine learning has made spectacular progress, surpassing human performance in complex tasks such as object recognition, car driving, and computer gaming. However, the central role of prediction in machine learning avoids progress towards general-purpose artificial intelligence. As one way forward, we argue that causal inference is a fundamental component of human intelligence, yet ignored by learning algorithms. Causal inference is the problem of uncovering the cause-effect relationships between the variables of a data generating system. Causal structures provide understanding about how these systems behave under changing, unseen environments. In turn, knowledge about these causal dynamics allows to answer "what if" questions, describing the potential responses of the system under hypothetical manipulations and interventions. Thus, understanding cause and effect is one step from machine learning towards machine reasoning and machine intelligence. But, currently available causal inference algorithms operate in specific regimes, and rely on assumptions that are difficult to verify in practice. This thesis advances the art of causal inference in three different ways. First, we develop a framework for the study of statistical dependence based on copulas and random features. Second, we build on this framework to interpret the problem of causal inference as the task of distribution classification, yielding a family of novel causal inference algorithms. Third, we discover causal structures in convolutional neural network features using our algorithms. The algorithms presented in this thesis are scalable, exhibit strong theoretical guarantees, and achieve state-of-the-art performance in a variety of real-world benchmarks.


The Oracle of Arithmetic Works Best Without Writing Down a Thing

WIRED

In 2010, a startling rumor filtered through the number theory community and reached Jared Weinstein. Apparently, some graduate student at the University of Bonn in Germany had written a paper that redid "Harris-Taylor"--a 288-page book dedicated to a single impenetrable proof in number theory--in only 37 pages. The 22-year-old student, Peter Scholze, had found a way to sidestep one of the most complicated parts of the proof, which deals with a sweeping connection between number theory and geometry. "It was just so stunning for someone so young to have done something so revolutionary," said Weinstein, a 34-year-old number theorist now at Boston University. Mathematicians at the University of Bonn, who made Scholze a full professor just two years later, were already aware of his extraordinary mathematical mind.


Expectation propagation for continuous time stochastic processes

arXiv.org Machine Learning

Physical and technological processes frequently exhibit intrinsic stochasticity. The main mathematical framework to describe and reason about such systems is provided by the theory of continuous time (Markovian) stochastic processes. Such processes have been well studied in chemical physics for several decades as models of chemical reactions at very low concentrations [Gardiner, 1985, e.g.]. More recently, the theory has found novel and diverse areas of application including systems biology at the single cell level [Wilkinson, 2011], ecology [Volkov et al., 2007] and performance modelling in computer systems [Hillston, 2005], to name but a few. The popularity of the approach has been greatly enhanced by the availability of efficient and accurate simulation algorithms [Gillespie, 1977, Gillespie et al., 2013], which permit a numerical solution of medium-sized systems within a reasonable time frame. As with most of science, many of the application domains of continuous time stochastic processes are becoming increasingly data-rich, creating a critical demand for inference algorithms which can use data to calibrate the models and analyse the uncertainty in the predictions. This raises new challenges and opportunities for statistics and machine learning, and has motivated the development of several algorithms for efficient inference in these systems. In this paper, we focus on the Bayesian approach, and formulate the inverse problem in terms of obtaining an approximation to a posterior distribution over the stochastic process, given observations of the system and using existing scientific information to build a prior model of the process.


Linear Algebra Mathematics MIT OpenCourseWare

#artificialintelligence

This course covers matrix theory and linear algebra, emphasizing topics useful in other disciplines such as physics, economics and social sciences, natural sciences, and engineering. It parallels the combination of theory and applications in Professor Strang's textbook Introduction to Linear Algebra. This course has been designed for independent study. It provides everything you will need to understand the concepts covered in the course.


Network analytics: more than pretty pictures

@machinelearnbot

Network analysis is a rapidly growing analytics domain propelled by the explosion of interest in social networking. The methods rest upon much older foundations in the realms of statistics and social science. Euler's graph theory was proposed in the early 18th century and Moreno established the foundations for social network analysis (SNA) in the 1930's. One of the exciting aspects of network analysis is the ability to generate elaborate and insightful visualizations. Indeed this can be a valuable tool for discovery and pattern identification.


Selection of resources to learn Artificial Intelligence / Machine Learning / Statistical Inferenceโ€ฆ

#artificialintelligence

This is a very incomplete and subjective selection of resources to learn about the algorithms and maths of Artificial Intelligence (AI) / Machine Learning (ML) / Statistical Inference (SI) / Deep Learning (DL) / Reinforcement Learning (RL) -- for beginners. It is not an exhaustive list and only contains some of the learning materials that I have personally completed so that I can include brief personal comments on them. It is also by no means the best path to follow (nowadays most MOOCs have full paths all the way from basic statistics and linear algebra to ML/DL). But this is the path I took and in a sense it's a partial documentation of my personal journey into DL (actually I bounced around all of these back and forth like crazy). As someone who has no formal background in Computer Science (but has been programming for many years), the language, notation and concepts of ML/SI/DL and even CS was completely alien to me, and the learning curve was not only steep, but vertical, treacherous and slippery like ice.


A Sharp Bound on the Computation-Accuracy Tradeoff for Majority Voting Ensembles

arXiv.org Machine Learning

When random forests are used for binary classification, an ensemble of $t=1,2,\dots$ randomized classifiers is generated, and the predictions of the classifiers are aggregated by majority vote. Due to the randomness in the algorithm, there is a natural tradeoff between statistical performance and computational cost. On one hand, as $t$ increases, the (random) prediction error of the ensemble tends to decrease and stabilize. On the other hand, larger ensembles require greater computational cost for training and making new predictions. The present work offers a new approach for quantifying this tradeoff: Given a fixed training set $\mathcal{D}$, let the random variables $\text{Err}_{t,0}$ and $\text{Err}_{t,1}$ denote the class-wise prediction error rates of a randomly generated ensemble of size $t$. As $t\to\infty$, we provide a general bound on the "algorithmic variance", $\text{var}(\text{Err}_{t,l}|\mathcal{D})\leq \frac{f_l(1/2)^2}{4t}+o(\frac{1}{t})$, where $l\in\{0,1\}$, and $f_l$ is a density function that arises from the ensemble method. Conceptually, this result is somewhat surprising, because $\text{var}(\text{Err}_{t,l}|\mathcal{D})$ describes how $\text{Err}_{t,l}$ varies over repeated runs of the algorithm, and yet, the formula leads to a method for bounding $\text{var}(\text{Err}_{t,l}|\mathcal{D})$ with a single ensemble. The bound is also sharp in the sense that it is attained by an explicit family of randomized classifiers. With regard to the task of estimating $f_l(1/2)$, the presence of the ensemble leads to a unique twist on the classical setup of non-parametric density estimation --- wherein the effects of sample size and computational cost are intertwined. In particular, we propose an estimator for $f_l(1/2)$, and derive an upper bound on its MSE that matches "standard optimal non-parametric rates" when $t$ is sufficiently large.


Recycling Randomness with Structure for Sublinear time Kernel Expansions

arXiv.org Machine Learning

We propose a scheme for recycling Gaussian random vectors into structured matrices to approximate various kernel functions in sublin-ear time via random embeddings. Our framework includes the Fastfood construction of Le et al. (2013) as a special case, but also extends to Circulant, Toeplitz and Hankel matrices, and the broader family of structured matrices that are characterized by the concept of low-displacement rank. We introduce notions of coherence and graph-theoretic structural constants that control the approximation quality, and prove unbiasedness and low-variance properties of random feature maps that arise within our framework. For the case of low-displacement matrices, we show how the degree of structure and randomness can be controlled to reduce statistical variance at the cost of increased computation and storage requirements. Empirical results strongly support our theory and justify the use of a broader family of structured matrices for scaling up kernel methods using random features.


A simple and provable algorithm for sparse diagonal CCA

arXiv.org Machine Learning

Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement, and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees, that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity. We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.