 Bayesian Inference


ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

arXiv.org Machine Learning

We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora, and that it achieves better performance than both standard group AD and individual AD techniques. All code required to reproduce our experiments is available from https://github.com/hsoleimani/ATD


On the estimation of initial conditions in kernel-based system identification

arXiv.org Machine Learning

Recent developments in system identification have brought attention to regularized kernel-based methods, where, adopting the recently introduced stable spline kernel, prior information on the unknown process is enforced. This reduces the variance of the estimates and thus makes kernel-based methods particularly attractive when few input-output data samples are available. In such cases, however, the system initial conditions may have a significant impact on the output dynamics. In this paper, we specifically address this point. We propose three methods that deal with the estimation of initial conditions using different types of information. The methods consist of various mixed maximum likelihood--a posteriori estimators which estimate the initial conditions and tune the hyperparameters characterizing the stable spline kernel. To solve the related optimization problems, we resort to the expectation-maximization method, showing that the solutions can be attained by iterating among simple update steps. Numerical experiments show the advantages, in terms of accuracy in reconstructing the system impulse response, of the proposed strategies, compared to other kernel-based schemes not accounting for the effect of initial conditions.
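For orientation, the kind of baseline this abstract compares against can be sketched as follows: a kernel-based impulse response estimate that assumes zero initial conditions. This is a minimal sketch and not the authors' method. It assumes the first-order stable spline (TC) kernel `K[i, j] = lam * beta**max(i, j)`, and the function names and hyperparameter values are hypothetical; in the paper the hyperparameters are tuned rather than fixed.

```python
import numpy as np

def stable_spline_kernel(n, beta, lam=1.0):
    """First-order stable spline (TC) kernel: K[i, j] = lam * beta**max(i, j)."""
    idx = np.arange(1, n + 1)
    return lam * beta ** np.maximum.outer(idx, idx)

def convolution_matrix(u, n):
    """Lower-triangular Toeplitz matrix U with U[t, k] = u[t - k].

    Inputs before time 0 are taken to be zero, i.e. zero initial
    conditions are ASSUMED here. Requires n <= len(u).
    """
    N = len(u)
    U = np.zeros((N, n))
    for k in range(n):
        U[k:, k] = u[:N - k]
    return U

def estimate_impulse_response(u, y, n, beta, lam, sigma2):
    """Posterior mean of the impulse response g under g ~ N(0, K),
    y = U g + noise, with noise variance sigma2."""
    K = stable_spline_kernel(n, beta, lam)
    U = convolution_matrix(u, n)
    S = U @ K @ U.T + sigma2 * np.eye(len(y))
    return K @ U.T @ np.linalg.solve(S, y)
```

When the true system does not start at rest, the zero-padding in `convolution_matrix` is wrong for the first samples, which is the effect the abstract says matters most when few data are available.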


Blind system identification using kernel-based methods

arXiv.org Machine Learning

We propose a new method for blind system identification. Resorting to a Gaussian regression framework, we model the impulse response of the unknown linear system as a realization of a Gaussian process. The structure of the covariance matrix (or kernel) of such a process is given by the stable spline kernel, which has been recently introduced for system identification purposes and depends on an unknown hyperparameter. We assume that the input can be linearly described by few parameters. We estimate these parameters, together with the kernel hyperparameter and the noise variance, using an empirical Bayes approach. The related optimization problem is efficiently solved with a novel iterative scheme based on the Expectation-Maximization method. In particular, we show that each iteration consists of a set of simple update rules. We show, through some numerical experiments, very promising performance of the proposed method.


Variational Gaussian Copula Inference

arXiv.org Machine Learning

We utilize copulas to constitute a unified framework for constructing and optimizing variational proposals in hierarchical Bayesian models. For models with continuous and non-Gaussian hidden variables, we propose a semiparametric and automated variational Gaussian copula approach, in which the parametric Gaussian copula family is able to preserve multivariate posterior dependence, and the nonparametric transformations based on Bernstein polynomials provide ample flexibility in characterizing the univariate marginal posteriors.


Recurrent Exponential-Family Harmoniums without Backprop-Through-Time

arXiv.org Machine Learning

Exponential-family harmoniums (EFHs), which extend restricted Boltzmann machines (RBMs) from Bernoulli random variables to other exponential families (Welling et al., 2005), are generative models that can be trained with unsupervised-learning techniques, like contrastive divergence (Hinton et al., 2006; Hinton, 2002), as density estimators for static data. Methods for extending RBMs--and likewise EFHs--to data with temporal dependencies have been proposed previously (Sutskever and Hinton, 2007; Sutskever et al., 2009), the learning procedure being validated by qualitative assessment of the generative model. Here we propose and justify, from a very different perspective, an alternative training procedure, proving sufficient conditions for optimal inference under that procedure. The resulting algorithm can be learned with only forward passes through the data--backprop-through-time is not required, as it is in previous approaches. The proof exploits a recent result about information retention in density estimators (Makin and Sabes, 2015), and applies it to a "recurrent EFH" (rEFH) by induction. Finally, we demonstrate optimality by simulation, testing the rEFH: (1) as a filter on training data generated with a linear dynamical system, the position of which is noisily reported by a population of "neurons" with Poisson-distributed spike counts; and (2) with the qualitative experiments proposed by Sutskever et al. (2009).


Classification of Big Data with Application to Imaging Genetics

arXiv.org Machine Learning

Recent technological achievements and globalization have increased data acquisition capability in almost all corners of human activities, ranging from scientific and engineering endeavors such as genomics, medical imaging, remote sensing, economics and finance, and all the way to people's personal lives with the emergence of social media through the world wide web and mobile networks. The enormous growth of data creates daunting challenges, not only in finding out how to store and access the data, but more importantly, how to process and make sense of it. Also, since data collection is expensive, we are somehow obliged to make good use of the data at hand, so it is obvious that for further progress, the development of efficient algorithms for processing big data is very important. Big data is usually considered in terms of the number of observations n and the number of variables p measured on each observation. In many branches of science such as genetics and medical imaging, the number of variables is very large and is often much larger than the number of observations. This scenario is often denoted as p ≫ n.


Arimo Predictive Engine (tm) Shows Opportunity to Improve Investor Returns in Peer-to-Peer Lending - Arimo

#artificialintelligence

Random forest model using Lending Club public dataset shows opportunity to improve adjusted return by 2.75%. Arimo recently performed a study using a public dataset provided by Lending Club with the goal of showing how machine learning could improve investor returns. To do this we used the PredictiveEngine component of our Data Intelligence Platform, which makes it easy to build a variety of predictive machine learning models that scale transparently when deployed on distributed parallel computing platforms. Lending Club is an online peer-to-peer lending company that connects borrowers with investors who have capital to lend. When a loan application is submitted by a borrower, Lending Club reviews it and decides whether to offer a loan at a risk-adjusted rate or to reject the application. As of the 3rd quarter of 2015, more than $12 billion in loans have been issued through Lending Club.


How To Think Real Good

#artificialintelligence

First, it is a brain dump: too long, epsilon-baked, and unpolished. Second, it is not obviously relevant to the topic of this site. Third, parts are more technical than most readers would want. However, a quick, bad post may be better than none. This post was prompted by discussions about Bayesianism and the LessWrong rationalist community, with Scott Alexander, Catharine G. Evans, muflax, and St. Rev. (among others). They are each brilliant, quirky, articulate, and fascinating; consider following them online! They might disagree with much of this post, though, and are not implicated in its defects. This site concerns ways of thinking about some particularly important things: purpose, self, ethics, authority, and meaning, for instance. My aim is to point out common mistakes in thinking about those things, and how to do better. I enjoy thinking about thinking. That's one reason I spent a dozen years in artificial intelligence research. To make a computer think, you'd need to understand how you think. So AI research is a way of thinking about thinking that forces you to be specific. It calls your bluff if you think you understand thinking, but don't. I thought a lot about how to do AI. In 1988, I put together "How to do research at the MIT AI Lab," a guide for graduate students. Although I edited it, it was a collaboration of many people. There are now many similar guides, some of them better, but this was the first.


Fast methods for training Gaussian processes on large data sets

arXiv.org Machine Learning

Gaussian process regression (GPR) is a non-parametric Bayesian technique for interpolating or fitting data. The main barrier to further uptake of this powerful tool rests in the computational costs associated with the matrices which arise when dealing with large data sets. Here, we derive some simple results which we have found useful for speeding up the learning stage in the GPR algorithm, and especially for performing Bayesian model comparison between different covariance functions. We apply our techniques to both synthetic and real data and quantify the speed-up relative to using nested sampling to numerically evaluate model evidences.
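To make the cost mentioned above concrete, here is a minimal sketch of the standard GPR log marginal likelihood computed through a Cholesky factorization, the O(n^3) step that dominates learning and that has to be repeated for every candidate covariance function. It is the textbook computation, not the speed-up derived in the paper. The squared-exponential kernel and all function and parameter names are assumptions chosen for illustration.

```python
import numpy as np

def sq_exp_kernel(x1, x2, amplitude, length_scale):
    """Squared-exponential covariance between two 1-D input arrays."""
    d = x1[:, None] - x2[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / length_scale) ** 2)

def log_marginal_likelihood(x, y, kernel, noise_var):
    """log p(y | x, hyperparameters) for a zero-mean GP with Gaussian noise.

    `kernel` is any callable (x1, x2) -> covariance matrix.
    """
    n = len(x)
    K = kernel(x, x) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                      # O(n^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))           # equals 0.5 * log det K
            - 0.5 * n * np.log(2.0 * np.pi))
```

Two covariance functions can be compared at fixed hyperparameters by calling `log_marginal_likelihood` with each kernel. Full Bayesian model comparison, as discussed in the abstract, additionally integrates over the hyperparameters.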


Unbiased Bayesian Inference for Population Markov Jump Processes via Random Truncations

arXiv.org Machine Learning

We consider continuous time Markovian processes where populations of individual agents interact stochastically according to kinetic rules. Despite the increasing prominence of such models in fields ranging from biology to smart cities, Bayesian inference for such systems remains challenging, as these are continuous time, discrete state systems with potentially infinite state-space. Here we propose a novel efficient algorithm for joint state / parameter posterior sampling in population Markov Jump processes. We introduce a class of pseudo-marginal sampling algorithms based on a random truncation method which enables a principled treatment of infinite state spaces. Extensive evaluation on a number of benchmark models shows that this approach achieves considerable savings compared to state of the art methods, retaining accuracy and fast convergence. We also present results on a synthetic biology data set showing the potential for practical usefulness of our work.
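The general idea behind a random truncation can be illustrated on a plain infinite sum: stop at a random index and reweight each kept term by the probability of reaching it, so that the estimate is unbiased even though only finitely many terms are evaluated. The sketch below is a generic illustration under stated assumptions, not the paper's pseudo-marginal algorithm. The geometric stopping rule and the names `term` and `r` are hypothetical choices.

```python
import numpy as np

def random_truncation_estimate(term, r, rng):
    """Unbiased one-sample estimate of sum_{k>=0} term(k).

    The truncation index n_max is geometric with P(n_max >= k) = r**k
    (0 < r < 1). Dividing term(k) by that survival probability makes
    the expectation of the truncated sum equal to the full sum.
    """
    # rng.geometric has support {1, 2, ...}, so shift it to start at 0.
    n_max = rng.geometric(1.0 - r) - 1
    return sum(term(k) / r**k for k in range(n_max + 1))
```

For example, averaging many calls with `term = lambda k: 0.5**k` and `r = 0.7` approaches the exact value 2. The variance depends on how fast the terms decay relative to `r`, so the stopping distribution has to be matched to the series.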