Goto

Collaborating Authors

 Statistical Learning


An Improved EM algorithm

arXiv.org Artificial Intelligence

In this paper, we firstly give a brief introduction of expectation maximization (EM) algorithm, and then discuss the initial value sensitivity of expectation maximization algorithm. Subsequently, we give a short proof of EM's convergence. Then, we implement experiments with the expectation maximization algorithm (We implement all the experiments on Gaussion mixture model (GMM)). Our experiment with expectation maximization is performed in the following three cases: initialize randomly; initialize with result of K-means; initialize with result of K-medoids. The experiment result shows that expectation maximization algorithm depend on its initial state or parameters. And we found that EM initialized with K-medoids performed better than both the one initialized with K-means and the one initialized randomly.


Sparse/Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

arXiv.org Machine Learning

We introduce a class of quadratic support (QS) functions, many of which play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and Kalman smoothing. Well known examples include the l2, Huber, l1 and Vapnik losses. We build on a dual representation for QS functions using convex analysis, revealing the structure necessary for a QS function to be interpreted as the negative log of a probability density, and providing the foundation for statistical interpretation and analysis of QS loss functions. For a subclass of QS functions called piecewise linear quadratic (PLQ) penalties, we also develop efficient numerical estimation schemes. These components form a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for inference. In particular, for PLQ densities, interior point (IP) methods can be used. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaussian errors are assumed in the process and measurement models. The extended framework allows arbitrary PLQ densities to be used, and the proposed IP approach solves the generalized Kalman smoothing problem while maintaining the linear complexity in the size of the time series, just as in the Gaussian case. This extends the computational efficiency of classic algorithms to a much broader nonsmooth setting, and includes many recently proposed robust and sparse smoothers as special cases.


Tensor Decompositions: A New Concept in Brain Data Analysis?

arXiv.org Machine Learning

Matrix factorizations and their extensions to tensor factorizations and decompositions have become prominent techniques for linear and multilinear blind source separation (BSS), especially multiway Independent Component Analysis (ICA), NonnegativeMatrix and Tensor Factorization (NMF/NTF), Smooth Component Analysis (SmoCA) and Sparse Component Analysis (SCA). Moreover, tensor decompositions have many other potential applications beyond multilinear BSS, especially feature extraction, classification, dimensionality reduction and multiway clustering. In this paper, we briefly overview new and emerging models and approaches for tensor decompositions in applications to group and linked multiway BSS/ICA, feature extraction, classification andMultiway Partial Least Squares (MPLS) regression problems. Keywords: Multilinear BSS, linked multiway BSS/ICA, tensor factorizations and decompositions, constrained Tucker and CP models, Penalized Tensor Decompositions (PTD), feature extraction, classification, multiway PLS and CCA.


Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

arXiv.org Machine Learning

In the high-dimensional regression model a response variable is linearly related to $p$ covariates, but the sample size $n$ is smaller than $p$. We assume that only a small subset of covariates is `active' (i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ($\ell_1$-regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so called `irrepresentability' condition. In this paper we study the `Gauss-Lasso' selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate `generalized irrepresentability condition' (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set.


Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction

arXiv.org Machine Learning

We introduce 'mixed LICORS', an algorithm for learning nonlinear, high-dimensional dynamics from spatio-temporal data, suitable for both prediction and simulation. Mixed LICORS extends the recent LICORS algorithm (Goerg and Shalizi, 2012) from hard clustering of predictive distributions to a non-parametric, EM-like soft clustering. This retains the asymptotic predictive optimality of LICORS, but, as we show in simulations, greatly improves out-of-sample forecasts with limited data. The new method is implemented in the publicly-available R package "LICORS" (http://cran.r-project.org/web/packages/LICORS/).


Semi-Supervised Information-Maximization Clustering

arXiv.org Machine Learning

Semi-supervised clustering aims to introduce prior knowledge in the decision process of a clustering algorithm. In this paper, we propose a novel semi-supervised clustering algorithm based on the information-maximization principle. The proposed method is an extension of a previous unsupervised information-maximization clustering algorithm based on squared-loss mutual information to effectively incorporate must-links and cannot-links. The proposed method is computationally efficient because the clustering solution can be obtained analytically via eigendecomposition. Furthermore, the proposed method allows systematic optimization of tuning parameters such as the kernel width, given the degree of belief in the must-links and cannot-links. The usefulness of the proposed method is demonstrated through experiments.


Revealing social networks of spammers through spectral clustering

arXiv.org Machine Learning

Previous studies on spam have mostly focused on studying its content or its source. Likewise, currently used anti-spam methods mostly involve filtering emails based on their content or by their email server IP address. More recently, there have been studies on the network-level behavior of spammers [1], [2]. However, very little attention has been devoted to studying how spammers acquire the email addresses that they send spam to, a process commonly referred to as harvesting. Harvesting is the first phase of the spam cycle; sending the spam emails to the acquired addresses is the second phase. Spammers send spam emails using spam servers, which are typically compromised computers or open proxies, both of which allow spammers to hide their identities. On the other hand, it has been observed that spammers do not make the same effort to conceal their identities during the harvesting phase [3], indicating that harvesters, which are individuals or bots that collect email addresses, are closely related to the spammers who are sending the spam emails. The harvester and spam server are the two intermediaries in the path of spam, illustrated in Figure 1. In this paper we try to reveal social networks of spammers by identifying communities of harvesters using data from both phases of the spam cycle.


Inferring ground truth from multi-annotator ordinal data: a probabilistic approach

arXiv.org Machine Learning

A popular approach for large scale data annotation tasks is crowdsourcing, wherein each data point is labeled by multiple noisy annotators. We consider the problem of inferring ground truth from noisy ordinal labels obtained from multiple annotators of varying and unknown expertise levels. Annotation models for ordinal data have been proposed mostly as extensions of their binary/categorical counterparts and have received little attention in the crowdsourcing literature. We propose a new model for crowdsourced ordinal data that accounts for instance difficulty as well as annotator expertise, and derive a variational Bayesian inference algorithm for parameter estimation. We analyze the ordinal extensions of several state-of-the-art annotator models for binary/categorical labels and evaluate the performance of all the models on two real world datasets containing ordinal query-URL relevance scores, collected through Amazon's Mechanical Turk. Our results indicate that the proposed model performs better or as well as existing state-of-the-art methods and is more resistant to'spammy' annotators (i.e., annotators who assign labels randomly without actually looking at the instance) than popular baselines such as mean, median, and majority vote which do not account for annotator expertise. Part of the work was done while at Yandex Labs.


Clustering processes

arXiv.org Machine Learning

The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. (again, no parametric or independence assumptions). In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.


Semi-supervised Eigenvectors for Large-scale Locally-biased Learning

arXiv.org Machine Learning

In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks "nearby" that prespecified target region. For example, one might be interested in the clustering structure of a data graph near a prespecified "seed set" of nodes, or one might be interested in finding partitions in an image that are near a prespecified "ground truth" set of pixels. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities, thus limiting the applicability of eigenvector-based methods in situations where one is interested in very local properties of the data. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We show that these semi-supervised eigenvectors can be computed quickly as the solution to a system of linear equations; and we also describe several variants of our basic method that have improved scaling properties. We provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning; and we discuss the relationship between our results and recent machine learning algorithms that use global eigenvectors of the graph Laplacian.