Goto

Collaborating Authors

 Statistical Learning


Bayesian Masking: Sparse Bayesian Estimation with Weaker Shrinkage Bias

arXiv.org Machine Learning

A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by letting the corresponding weights be zeros. However, regularization often shrinks the estimator for relevant features, which leads to incorrect feature selection. Motivated by the above-mentioned issue, we propose Bayesian masking (BM), a sparse estimation method which imposes no regularization on the weights. The key concept of BM is to introduce binary latent variables that randomly mask features. Estimating the masking rates determines the relevance of the features automatically. We derive a variational Bayesian inference algorithm that maximizes the lower bound of the factorized information criterion (FIC), which is a recently developed asymptotic criterion for evaluating the marginal log-likelihood. In addition, we propose reparametrization to accelerate the convergence of the derived algorithm. Finally, we show that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsity-shrinkage trade-off.


Jointly Learning Multiple Measures of Similarities from Triplet Comparisons

arXiv.org Artificial Intelligence

Similarity between objects is multi-faceted and it can be easier for human annotators to measure it when the focus is on a specific aspect. We consider the problem of mapping objects into view-specific embeddings where the distance between them is consistent with the similarity comparisons of the form "from the t-th view, object A is more similar to B than to C". Our framework jointly learns view-specific embeddings exploiting correlations between views. Experiments on a number of datasets, including one of multi-view crowdsourced comparison on bird images, show the proposed method achieves lower triplet generalization error when compared to both learning embeddings independently for each view and all views pooled into one view. Our method can also be used to learn multiple measures of similarity over input features taking class labels into account and compares favorably to existing approaches for multi-task metric learning on the ISOLET dataset.


Bayesian Inference via Approximation of Log-likelihood for Priors in Exponential Family

arXiv.org Machine Learning

In this paper, a Bayesian inference technique based on Taylor series approximation of the logarithm of the likelihood function is presented. The proposed approximation is devised for the case, where the prior distribution belongs to the exponential family of distributions. The logarithm of the likelihood function is linearized with respect to the sufficient statistic of the prior distribution in exponential family such that the posterior obtains the same exponential family form as the prior. Similarities between the proposed method and the extended Kalman filter for nonlinear filtering are illustrated. Furthermore, an extended target measurement update for target models where the target extent is represented by a random matrix having an inverse Wishart distribution is derived. The approximate update covers the important case where the spread of measurement is due to the target extent as well as the measurement noise in the sensor.


Improved Estimation of Class Prior Probabilities through Unlabeled Data

arXiv.org Machine Learning

Work in the classification literature has shown that in computing a classification function, one need not know the class membership of all observations in the training set; the unlabeled observations still provide information on the marginal distribution of the feature set, and can thus contribute to increased classification accuracy for future observations. The present paper will show that this scheme can also be used for the estimation of class prior probabilities, which would be very useful in applications in which it is difficult or expensive to determine class membership. Both parametric and nonparametric estimators are developed. Asymptotic distributions of the estimators are derived, and it is proven that the use of the unlabeled observations does reduce asymptotic variance. This methodology is also extended to the estimation of subclass probabilities.


Boosting in the presence of outliers: adaptive classification with non-convex loss functions

arXiv.org Artificial Intelligence

This paper examines the role and efficiency of the non-convex loss functions for binary classification problems. In particular, we investigate how to design a simple and effective boosting algorithm that is robust to the outliers in the data. The analysis of the role of a particular non-convex loss for prediction accuracy varies depending on the diminishing tail properties of the gradient of the loss -- the ability of the loss to efficiently adapt to the outlying data, the local convex properties of the loss and the proportion of the contaminated data. In order to use these properties efficiently, we propose a new family of non-convex losses named $\gamma$-robust losses. Moreover, we present a new boosting framework, {\it Arch Boost}, designed for augmenting the existing work such that its corresponding classification algorithm is significantly more adaptable to the unknown data contamination. Along with the Arch Boosting framework, the non-convex losses lead to the new class of boosting algorithms, named adaptive, robust, boosting (ARB). Furthermore, we present theoretical examples that demonstrate the robustness properties of the proposed algorithms. In particular, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, we present new theoretical results, based only on local curvatures, which may be used to establish statistical and optimization properties of the proposed Arch boosting algorithms with highly non-convex loss functions. Extensive numerical calculations are used to illustrate these theoretical properties and reveal advantages over the existing boosting methods when data exhibits a number of outliers.


Distributed Parameter Map-Reduce

arXiv.org Machine Learning

This paper describes how to convert a machine learning problem into a series of map-reduce tasks. We study logistic regression algorithm. In logistic regression algorithm, it is assumed that samples are independent and each sample is assigned a probability. Parameters are obtained by maxmizing the product of all sample probabilities. Rapid expansion of training samples brings challenges to machine learning method. Training samples are so many that they can be only stored in distributed file system and driven by map-reduce style programs. The main step of logistic regression is inference. According to map-reduce spirit, each sample makes inference through a separate map procedure. But the premise of inference is that the map procedure holds parameters for all features in the sample. In this paper, we propose Distributed Parameter Map-Reduce, in which not only samples, but also parameters are distributed in nodes of distributed filesystem. Through a series of map-reduce tasks, we assign each sample parameters for its features, make inference for the sample and update paramters of the model. The above processes are excuted looply until convergence. We test the proposed algorithm in actual hadoop production environment. Experiments show that the acceleration of the algorithm is in linear relationship with the number of cluster nodes.


Convex Modeling of Interactions with Strong Heredity

arXiv.org Machine Learning

We consider the task of fitting a regression model involving interactions among a potentially large set of covariates, in which we wish to enforce strong heredity. We propose FAMILY, a very general framework for this task. Our proposal is a generalization of several existing methods, such as VANISH [Radchenko and James, 2010], hierNet [Bien et al., 2013], the all-pairs lasso, and the lasso using only main effects. It can be formulated as the solution to a convex optimization problem, which we solve using an efficient alternating directions method of multipliers (ADMM) algorithm. This algorithm has guaranteed convergence to the global optimum, can be easily specialized to any convex penalty function of interest, and allows for a straightforward extension to the setting of generalized linear models. We derive an unbiased estimator of the degrees of freedom of FAMILY, and explore its performance in a simulation study and on an HIV sequence data set.


Online Tensor Methods for Learning Latent Variable Models

arXiv.org Machine Learning

We introduce an online tensor decomposition based approach for two latent variable modeling problems namely, (1) community detection, in which we learn the latent communities that the social actors in social networks belong to, and (2) topic modeling, in which we infer hidden topics of text articles. We consider decomposition of moment tensors using stochastic gradient descent. We conduct optimization of multilinear operations in SGD and avoid directly forming the tensors, to save computational and storage costs. We present optimized algorithm in two platforms. Our GPU-based implementation exploits the parallelism of SIMD architectures to allow for maximum speed-up by a careful optimization of storage and data transfer, whereas our CPU-based implementation uses efficient sparse matrix computations and is suitable for large sparse datasets. For the community detection problem, we demonstrate accuracy and computational efficiency on Facebook, Yelp and DBLP datasets, and for the topic modeling problem, we also demonstrate good performance on the New York Times dataset. We compare our results to the state-of-the-art algorithms such as the variational method, and report a gain of accuracy and a gain of several orders of magnitude in the execution time.


Distributed Multitask Learning

arXiv.org Machine Learning

We consider the problem of distributed multi-task learning, where each machine learns a separate, but related, task. Specifically, each machine learns a linear predictor in high-dimensional space,where all tasks share the same small support. We present a communication-efficient estimator based on the debiased lasso and show that it is comparable with the optimal centralized method.


Distinguishing short and long $Fermi$ gamma-ray bursts

arXiv.org Machine Learning

Two classes of gamma-ray bursts (GRBs), short and long, have been determined without any doubts, and are usually ascribed to different progenitors, yet these classes overlap for a variety of descriptive parameters. A subsample of 46 long and 22 short $Fermi$ GRBs with estimated Hurst Exponents (HEs), complemented by minimum variability time-scales (MVTS) and durations ($T_{90}$) is used to perform a supervised Machine Learning (ML) and Monte Carlo (MC) simulation using a Support Vector Machine (SVM) algorithm. It is found that while $T_{90}$ itself performs very well in distinguishing short and long GRBs, the overall success ratio is higher when the training set is complemented by MVTS and HE. These results may allow to introduce a new (non-linear) parameter that might provide less ambiguous classification of GRBs.