Goto

Collaborating Authors

 Statistical Learning


Distributed Estimation, Information Loss and Exponential Families

arXiv.org Machine Learning

Distributed learning of probabilistic models from multiple data repositories with minimum communication is increasingly important. We study a simple communication-efficient learning framework that first calculates the local maximum likelihood estimates (MLE) based on the data subsets, and then combines the local MLEs to achieve the best possible approximation to the global MLE given the whole dataset. We study this framework's statistical properties, showing that the efficiency loss compared to the global setting relates to how much the underlying distribution families deviate from full exponential families, drawing connection to the theory of information loss by Fisher, Rao and Efron. We show that the "full-exponential-family-ness" represents the lower bound of the error rate of arbitrary combinations of local MLEs, and is achieved by a KL-divergence-based combination method but not by a more common linear combination method. We also study the empirical properties of both methods, showing that the KL method significantly outperforms linear combination in practical settings with issues such as model misspecification, non-convexity, and heterogeneous data partitions.


Bayesian CP Factorization of Incomplete Tensors with Automatic Rank Determination

arXiv.org Machine Learning

--CANDECOMP/P ARAFAC (CP) tensor factorization of incomplete data is a powerful technique for tensor completion through explicitly capturing the multilinear latent factors. The existing CP algorithms require the tensor rank to be manually specified, however, the determination of tensor rank remains a challenging problem especially for CP rank . In addition, existing approaches do not take into account uncertainty information of latent factors, as well as missing entries. T o address these issues, we formulate CP factorization using a hierarchical probabilistic model and employ a fully Bayesian treatment by incorporating a sparsity-inducing prior over multiple latent factors and the appropriate hyperpriors over all hyperparameters, resulting in automatic rank determination. T o learn the model, we develop an efficient deterministic Bayesian inference algorithm, which scales linearly with data size. Our method is characterized as a tuning parameter-free approach, which can effectively infer underlying multilinear factors with a low-rank constraint, while also providing predictive distributions over missing entries. Extensive simulations on synthetic data illustrate the intrinsic capability of our method to recover the ground-truth of CP rank and prevent the overfitting problem, even when a large amount of entries are missing. Moreover, the results from real-world applications, including image inpainting and facial image synthesis, demonstrate that our method outperforms state-of-the-art approaches for both tensor factorization and tensor completion in terms of predictive performance. For instance, a video sequence can be represented by a third-order tensor with dimensionality of height width time; an image ensemble measured under multiple conditions can be represented by a higher order tensor with dimensionality ofpixel person pose illumination . T ensor factorization enables us to explicitly take into account the structure information by effectively capturing the multilinear interactions among multiple latent factors. Therefore, its theory and algorithms have been an active area of study during the past decade (see e.g., [1], [2]), and have been successfully applied to various application fields, such as face recognition, social network analysis, image and video completion, and brain signal processing. This issue has attracted a great deal of research interest in tensor completion in recent years. The objective of tensor factorization of incomplete data is to capture the underlying multilinear factors from only partially observed entries, which can in turn predict the missing entries.


Bayesian tracking and parameter learning for non-linear multiple target tracking models

arXiv.org Machine Learning

We propose a new Bayesian tracking and parameter learning algorithm for non-linear non-Gaussian multiple target tracking (MTT) models. We design a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution of the target states, birth and death times, and association of observations to targets, which constitutes the solution to the tracking problem, as well as the model parameters. In the numerical section, we present performance comparisons with several competing techniques and demonstrate significant performance improvements in all cases.


Projecting Ising Model Parameters for Fast Mixing

arXiv.org Machine Learning

Inference in general Ising models is difficult, due to high treewidth making tree-based algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.


Joint Estimation of Multiple Graphical Models from High Dimensional Time Series

arXiv.org Machine Learning

In this manuscript we consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from n subjects, each of which consists of T possibly dependent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of closeness between subjects. We propose a kernel based method for jointly estimating all graphical models. Theoretically, under a double asymptotic framework, where both (T,n) and the dimension d can increase, we provide the explicit rate of convergence in parameter estimation. It characterizes the strength one can borrow across different individuals and impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method.


Analysis of corporate environmental reports using statistical techniques and data mining

arXiv.org Artificial Intelligence

Measuring the effectiveness of corporate environmental reports, it being highly qualitative and less regulated, is often considered as a daunting task. The task becomes more complex if comparisons are to be performed. This study is undertaken to overcome the physical verification problems by implementing data mining technique. It further explores on the effectiveness by performing exploratory analysis and structural equation model to bring out the significant linkages between the selected 10 variables. Samples of five hundred and thirty nine reports across various countries are used from an international directory to perform the statistical analysis like: One way ANOVA (Analysis of Variance), MDA (Multivariate Discriminant Analysis) and SEM (Structural Equation Modeling). The results indicate the significant differences among the various types of industries in their environmental reporting, and the exploratory factors like stakeholder, organization strategy and industrial oriented factors, proved significant. The major accomplishment is that the findings correlate with the conceptual frame work of GRI.


Generalization Bounds for Learning with Linear, Polygonal, Quadratic and Conic Side Knowledge

arXiv.org Machine Learning

Mach Learn manuscript No. (will be inserted by the editor) Abstract In this paper, we consider a supervised learning setting where side knowledge is provided about the labels of unlabeled examples. The side knowledge has the effect of reducing the hypothesis space, leading to tighter generalization bounds, and thus possibly better generalization. We consider several types of side knowledge, the first leading to linear and polygonal constraints on the hypothesis space, the second leading to quadratic constraints, and the last leading to conic constraints. We show how different types of domain knowledge can lead directly to these kinds of side knowledge. We prove bounds on complexity measures of the hypothesis space for quadratic and conic side knowledge, and show that these bounds are tight in a specific sense for the quadratic case. Keywords statistical learning theory ยท generalization bounds ยท Rademacher complexity ยท covering numbers, constrained linear function classes ยท side knowledge 1 Introduction Surely, for many applications the amount of domain knowledge we could potentially use within our learning processes is vastly larger than the amount of domain knowledge we actually use. One reason for this is that domain knowledge may be nontrivial to incorporate into algorithms or analysis. For example, researchers in NLP (Natural Language Processing) have long figured out various exotic domain specific knowledge that one can use while performing a learning task [Chang et al., 2008a,b]. The present work aims to provide theoretical guarantees for a large class of problems with a general type of domain knowledge that goes beyond sparsity and smoothness. To define this large class of problems, we will keep the usual supervised learning assumption that the training examples are drawn i.i.d.


On Classification with Bags, Groups and Sets

arXiv.org Machine Learning

In recent years, the field of pattern recognition has seen many problems that are difficult to formulate as regular supervised classification problems where (feature vector, label) pairs are available to train a classifier that, in turn, can predict labels for previously unseen feature vectors. A subset of these problems contains learning scenarios where (part of) the objects are represented by sets or bags of feature vectors or instances. Such learning scenarios include multiple instance learning [11], set classification [42], group-based classification [47] and many others. In this paper we review these learning scenarios. There are several reasons why a bag representation might be chosen in a pattern recognition problem. The first reason is that a single feature vector is often too restrictive to describe an object. For example, in drug activity prediction, we are interested in classifying molecules as having the desired effect (active) or not. However, a molecule is not just a list of its elements: most molecules can fold into different shapes or conformations, which can influence the activity of that molecule.


Gamma Processes, Stick-Breaking, and Variational Inference

arXiv.org Artificial Intelligence

While most Bayesian nonparametric models in machine learning have focused on the Dirichlet process, the beta process, or their variants, the gamma process has recently emerged as a useful nonparametric prior in its own right. Current inference schemes for models involving the gamma process are restricted to MCMC-based methods, which limits their scalability. In this paper, we present a variational inference framework for models involving gamma process priors. Our approach is based on a novel stick-breaking constructive definition of the gamma process. We prove correctness of this stick-breaking process by using the characterization of the gamma process as a completely random measure (CRM), and we explicitly derive the rate measure of our construction using Poisson process machinery. We also derive error bounds on the truncation of the infinite process required for variational inference, similar to the truncation analyses for other nonparametric models based on the Dirichlet and beta processes. Our representation is then used to derive a variational inference algorithm for a particular Bayesian nonparametric latent structure formulation known as the infinite Gamma-Poisson model, where the latent variables are drawn from a gamma process prior with Poisson likelihoods. Finally, we present results for our algorithms on nonnegative matrix factorization tasks on document corpora, and show that we compare favorably to both sampling-based techniques and variational approaches based on beta-Bernoulli priors.


Fast Prediction with SVM Models Containing RBF Kernels

arXiv.org Machine Learning

We present an approximation scheme for support vector machine models that use an RBF kernel. A second-order Maclaurin series approximation is used for exponentials of inner products between support vectors and test instances. The approximation is applicable to all kernel methods featuring sums of kernel evaluations and makes no assumptions regarding data normalization. The prediction speed of approximated models no longer relates to the amount of support vectors but is quadratic in terms of the number of input dimensions. If the number of input dimensions is small compared to the amount of support vectors, the approximated model is significantly faster in prediction and has a smaller memory footprint. An optimized C++ implementation was made to assess the gain in prediction speed in a set of practical tests. We additionally provide a method to verify the approximation accuracy, prior to training models or during run-time, to ensure the loss in accuracy remains acceptable and within known bounds.