 Statistical Learning


Multi-view Laplacian Support Vector Machines

arXiv.org Machine Learning

We propose a new approach, multi-view Laplacian support vector machines (SVMs), for semi-supervised learning under the multi-view scenario. It integrates manifold regularization and multi-view regularization into the usual formulation of SVMs and is a natural extension of SVMs from supervised learning to multi-view semi-supervised learning. The function optimization problem in a reproducing kernel Hilbert space is converted to an optimization in a finite-dimensional Euclidean space. After providing a theoretical bound for the generalization performance of the proposed method, we further give a formulation of the empirical Rademacher complexity, which affects the bound significantly. From this bound and the empirical Rademacher complexity, we can gain insights into the roles that the different regularization terms play in the generalization performance. Experimental results on synthetic and real-world data sets are presented, which validate the effectiveness of the proposed multi-view Laplacian SVMs approach.


Does generalization performance of $l^q$ regularization learning depend on $q$? A negative example

arXiv.org Machine Learning

$l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling. It attempts to improve the generalization (prediction) capability of a machine (model) through appropriately shrinking its coefficients. The shape of an $l^q$ estimator differs with the choice of the regularization order $q$. In particular, $l^1$ leads to the LASSO estimate, while $l^{2}$ corresponds to the smooth ridge regression. This makes the order $q$ a potential tuning parameter in applications. To facilitate the use of $l^{q}$-regularization, we seek a modeling strategy in which an elaborate selection of $q$ is avoidable. In this spirit, we place our investigation within a general framework of $l^{q}$-regularized kernel learning under a sample dependent hypothesis space (SDHS). For a designated class of kernel functions, we show that all $l^{q}$ estimators for $0< q < \infty$ attain similar generalization error bounds. These estimated bounds are almost optimal in the sense that, up to a logarithmic factor, the upper and lower bounds are asymptotically identical. This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact on the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other non-generalization criteria such as smoothness, computational complexity, and sparsity.
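As a small illustration of how the order $q$ changes the shape of the estimator, the sketch below compares the $l^1$ and $l^2$ cases in the simplest possible setting: a single coefficient $b$ chosen to minimize $\frac{1}{2}(z-b)^2 + \lambda |b|^q$. This scalar setting and the function names `shrink_l1` and `shrink_l2` are assumptions made for illustration; they are not the kernel-based SDHS framework studied in the paper.

```python
import numpy as np

def shrink_l1(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*|b| (soft-thresholding, LASSO-like).

    Coefficients with |z| <= lam are set exactly to zero (sparsity).
    """
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def shrink_l2(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*b**2 (ridge-like).

    Every coefficient is scaled by 1/(1 + 2*lam); none becomes exactly zero.
    """
    z = np.asarray(z, dtype=float)
    return z / (1.0 + 2.0 * lam)

if __name__ == "__main__":
    z = np.array([3.0, -0.5, 1.5])
    print("l1:", shrink_l1(z, 1.0))
    print("l2:", shrink_l2(z, 1.0))
```

The $l^1$ rule produces exact zeros while the $l^2$ rule only rescales, which is the kind of non-generalization criterion (sparsity versus smoothness) the abstract mentions as a possible basis for choosing $q$.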


Cluster Trees on Manifolds

arXiv.org Machine Learning

In this paper we investigate the problem of estimating the cluster tree for a density $f$ supported on or near a smooth $d$-dimensional manifold $M$ isometrically embedded in $\mathbb{R}^D$. We analyze a modified version of a $k$-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta. The main results of this paper show that under mild assumptions on $f$ and $M$, we obtain rates of convergence that depend on $d$ only but not on the ambient dimension $D$. We also show that similar (albeit non-algorithmic) results can be obtained for kernel density estimators. We sketch a construction of a sample complexity lower bound instance for a natural class of manifold oblivious clustering algorithms. We further briefly consider the known manifold case and show that in this case a spatially adaptive algorithm achieves better rates.


Scaling the Indian Buffet Process via Submodular Maximization

arXiv.org Machine Learning

Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)'s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model.
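The abstract does not say which greedy algorithm is used. One well-known procedure with a one-third guarantee for unconstrained maximization of a nonnegative submodular function is the deterministic "double greedy" algorithm of Buchbinder et al.; the sketch below assumes that variant. The function name `double_greedy` and the toy graph-cut objective are illustrative assumptions, not the paper's IBP objective.

```python
from itertools import combinations

def double_greedy(ground, f):
    """Deterministic double greedy for unconstrained submodular maximization.

    `ground` is an ordered iterable of elements and `f` maps a frozenset to a
    nonnegative number. For nonnegative submodular f, the returned set has
    value at least one third of the optimum.
    """
    lower = frozenset()          # grows from the empty set
    upper = frozenset(ground)    # shrinks from the full ground set
    for u in ground:
        gain_add = f(lower | {u}) - f(lower)      # benefit of adding u
        gain_remove = f(upper - {u}) - f(upper)   # benefit of dropping u
        if gain_add >= gain_remove:
            lower = lower | {u}
        else:
            upper = upper - {u}
    return lower                 # lower == upper once every element is decided

# Toy objective (an assumption for illustration): the cut size of a path graph,
# which is nonnegative and submodular.
EDGES = [(0, 1), (1, 2), (2, 3)]

def cut_value(subset):
    return sum((a in subset) != (b in subset) for a, b in EDGES)

if __name__ == "__main__":
    chosen = double_greedy([0, 1, 2, 3], cut_value)
    print(sorted(chosen), cut_value(chosen))
```

Each element is examined once with a constant number of function evaluations, so the number of evaluations of `f` is linear in the size of the ground set.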


Supervised Metric Learning with Generalization Guarantees

arXiv.org Machine Learning

The crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions, an area of research known as metric learning. When data consist of feature vectors, a large body of work has focused on learning a Mahalanobis distance. Less work has been devoted to metric learning from structured objects (such as strings or trees), most of it focusing on optimizing a notion of edit distance. We identify two important limitations of current metric learning approaches. First, they can improve the performance of local algorithms such as k-nearest neighbors, but metric learning for global algorithms (such as linear classifiers) has not been studied so far. Second, the question of the generalization ability of metric learning methods has been largely ignored. In this thesis, we propose theoretical and algorithmic contributions that address these limitations. Our first contribution is the derivation of a new kernel function built from learned edit probabilities. Our second contribution is a novel framework for learning string and tree edit similarities inspired by the recent theory of $(\epsilon,\gamma,\tau)$-good similarity functions. Using uniform stability arguments, we establish theoretical guarantees for the learned similarity that give a bound on the generalization error of a linear classifier built from that similarity. In our third contribution, we extend these ideas to metric learning from feature vectors by proposing a bilinear similarity learning method that efficiently optimizes the $(\epsilon,\gamma,\tau)$-goodness. Generalization guarantees are derived for our approach, highlighting that our method minimizes a tighter bound on the generalization error of the classifier. Our last contribution is a framework for establishing generalization bounds for a large class of existing metric learning algorithms based on a notion of algorithmic robustness.


Topic Segmentation and Labeling in Asynchronous Conversations

Journal of Artificial Intelligence Research

Topic segmentation and labeling is often considered a prerequisite for higher-level conversation analysis and has been shown to be useful in many Natural Language Processing (NLP) applications. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the segmentation and labeling tasks in these asynchronous conversations. We propose a complete computational framework for topic segmentation and labeling in asynchronous conversations. Our approach extends state-of-the-art methods by considering a fine-grained structure of an asynchronous conversation, along with other conversational features by applying recent graph-based methods for NLP. For topic segmentation, we propose two novel unsupervised models that exploit the fine-grained conversational structure, and a novel graph-theoretic supervised model that combines lexical, conversational and topic features. For topic labeling, we propose two novel (unsupervised) random walk models that respectively capture conversation specific clues from two different sources: the leading sentences and the fine-grained conversational structure. Empirical evaluation shows that the segmentation and the labeling performed by our best models beat the state-of-the-art, and are highly correlated with human annotations.


Bayesian inference for logistic models using Polya-Gamma latent variables

arXiv.org Machine Learning

We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Polya-Gamma distributions, which are constructed in detail. A variety of examples are presented to show the versatility of the method, including logistic regression, negative binomial regression, nonlinear mixed-effects models, and spatial models for count data. In each case, our data-augmentation strategy leads to simple, effective methods for posterior inference that: (1) circumvent the need for analytic approximations, numerical integration, or Metropolis-Hastings; and (2) outperform other known data-augmentation strategies, both in ease of use and in computational efficiency. All methods, including an efficient sampler for the Polya-Gamma distribution, are implemented in the R package BayesLogit. In the technical supplement appended to the end of the paper, we provide further details regarding the generation of Polya-Gamma random variables; the empirical benchmarks reported in the main manuscript; and the extension of the basic data-augmentation framework to contingency tables and multinomial outcomes.


Performance comparison of State-of-the-art Missing Value Imputation Algorithms on Some Bench mark Datasets

arXiv.org Machine Learning

The presence of missing values affects the selection of an appropriate set of attributes and degrades the classification accuracy of classifiers. Missing values are a common problem in almost all real-world data sets [1] used in knowledge discovery and data mining (KDD) applications. In particular, they are more frequent in clinical databases [2, 3, 4] and temporal climate databases [5, 6]. Their presence can greatly affect the performance of classifiers [7]. Missing values in databases may arise for various reasons, such as the value being lost (erased or deleted) or not recorded, incorrect measurements, equipment errors, or possibly an expert not attaching any importance to a particular procedure. Incomplete data can be identified by looking for null values in the data set. However, this is not always sufficient, since missing values can appear in the form of outliers or even wrong data (i.e.


Sparse Factor Analysis for Learning and Content Analytics

arXiv.org Machine Learning

We develop a new model and algorithms for machine learning-based learning analytics, which estimate a learner's knowledge of the concepts underlying a domain, and content analytics, which estimate the relationships among a collection of questions and those concepts. Our model represents the probability that a learner provides the correct response to a question in terms of three factors: their understanding of a set of underlying concepts, the concepts involved in each question, and each question's intrinsic difficulty. We estimate these factors given the graded responses to a collection of questions. The underlying estimation problem is ill-posed in general, especially when only a subset of the questions are answered. The key observation that enables a well-posed solution is the fact that typical educational domains of interest involve only a small number of key concepts. Leveraging this observation, we develop both a bi-convex maximum-likelihood and a Bayesian solution to the resulting SPARse Factor Analysis (SPARFA) problem. We also incorporate user-defined tags on questions to facilitate the interpretability of the estimated factors. Experiments with synthetic and real-world data demonstrate the efficacy of our approach. Finally, we make a connection between SPARFA and noisy, binary-valued (1-bit) dictionary learning that is of independent interest.


The Cluster Graphical Lasso for improved estimation of Gaussian graphical models

arXiv.org Machine Learning

We consider the task of estimating a Gaussian graphical model in the high-dimensional setting. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to an $\ell_1$ penalty, is a well-studied approach for this task. We begin by introducing a surprising connection between the graphical lasso and hierarchical clustering: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) an $\ell_1$-penalized log likelihood is maximized on the subset of variables within each connected component. In other words, the graphical lasso determines the connected components of the estimated network via single linkage clustering. Unfortunately, single linkage clustering is known to perform poorly in certain settings. Therefore, we propose the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster. We establish model selection consistency for this technique, and demonstrate its improved performance relative to the graphical lasso in a simulation study, as well as in applications to an equities data set, a university webpage data set, and a gene expression data set.
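Step (1) of the two-step view can be sketched in a few lines. The sketch relies on a known property that the abstract itself does not spell out, so treat it as background assumed here: the connected components of the graphical lasso estimate with penalty $\lambda$ coincide with those of the graph that joins variables $i$ and $j$ whenever $|S_{ij}| > \lambda$, where $S$ is the empirical covariance matrix. This is the same partition obtained by cutting a single linkage dendrogram built on the similarities $|S_{ij}|$. The function name `thresholded_components` and the toy matrix are hypothetical.

```python
import numpy as np

def thresholded_components(S, lam):
    """Label the connected components of the graph with an edge (i, j)
    whenever |S[i, j]| > lam, using a simple union-find structure."""
    p = S.shape[0]
    parent = list(range(p))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(S[i, j]) > lam:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(p)]

if __name__ == "__main__":
    S = np.array([
        [1.00, 0.80, 0.05, 0.00],
        [0.80, 1.00, 0.02, 0.01],
        [0.05, 0.02, 1.00, 0.60],
        [0.00, 0.01, 0.60, 1.00],
    ])
    print(thresholded_components(S, 0.1))
```

Step (2) would then fit a penalized model separately within each component. The cluster graphical lasso described above replaces this thresholding step with a different clustering method, which the sketch does not attempt to reproduce.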