Goto

Collaborating Authors

 Performance Analysis


Robust Logistic Regression using Shift Parameters (Long Version)

arXiv.org Artificial Intelligence

Almost any large dataset has annotation errors, especially those complex, nuanced datasets commonly used in natural language processing. Low-quality annotations have become even more common in recent years with the rise of Amazon Mechanical Turk, as well as methods like distant supervision and co-training that involve automatically generating training data. Although small amounts of noise may not be detrimental, in some applications the level can be high: upon manually inspecting a relation extraction corpus commonly used in distant supervision, Riedel et al. (2010) report a 31% false positive rate. In cases like these, annotation errors have frequently been observed to hurt performance. Dingare et al. (2005), for example, conduct error analysis on a system to extract relations from biomedical text, and observe that over half of the system's errors could be attributed to inconsistencies in how the data was annotated. Similarly, in a case study on co-training for natural language tasks, Pierce and Cardie (2001) find that the degradation in data quality from automatic labelling prevents these systems from performing comparably to their fully-supervised counterparts.


Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

arXiv.org Machine Learning

We establish optimal convergence rates for a decomposition-based scalable approach to kernel ridge regression. The method is simple to describe: it randomly partitions a dataset of size N into m subsets of equal size, computes an independent kernel ridge regression estimator for each subset, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all N samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples. As concrete examples, our theory guarantees that the number of processors m may grow nearly linearly for finite-rank kernels and Gaussian kernels and polynomially in N for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach.


High Dimensional Semiparametric Latent Graphical Model for Mixed Data

arXiv.org Machine Learning

Graphical models are commonly used tools for modeling multivariate random variables. While there exist many convenient multivariate distributions such as Gaussian distribution for continuous data, mixed data with the presence of discrete variables or a combination of both continuous and discrete variables poses new challenges in statistical modeling. In this paper, we propose a semiparametric model named latent Gaussian copula model for binary and mixed data. The observed binary data are assumed to be obtained by dichotomizing a latent variable satisfying the Gaussian copula distribution or the nonparanormal distribution. The latent Gaussian model with the assumption that the latent variables are multivariate Gaussian is a special case of the proposed model. A novel rank-based approach is proposed for both latent graph estimation and latent principal component analysis. Theoretically, the proposed methods achieve the same rates of convergence for both precision matrix estimation and eigenvector estimation, as if the latent variables were observed. Under similar conditions, the consistency of graph structure recovery and feature selection for leading eigenvectors is established. The performance of the proposed methods is numerically assessed through simulation studies, and the usage of our methods is illustrated by a genetic dataset.


Classifying pairs with trees for supervised biological network inference

arXiv.org Machine Learning

Networks are ubiquitous in biology and computational approaches have been largely investigated for their inference. In particular, supervised machine learning methods can be used to complete a partially known network by integrating various measurements. Two main supervised frameworks have been proposed: the local approach, which trains a separate model for each network node, and the global approach, which trains a single model over pairs of nodes. Here, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of these two approaches for biological network inference. We first formalize the problem of network inference as classification of pairs, unifying in the process homogeneous and bipartite graphs and discussing two main sampling schemes. We then present the global and the local approaches, extending the later for the prediction of interactions between two unseen network nodes, and discuss their specializations to tree-based ensemble methods, highlighting their interpretability and drawing links with clustering techniques. Extensive computational experiments are carried out with these methods on various biological networks that clearly highlight that these methods are competitive with existing methods.


A Study on Stroke Rehabilitation through Task-Oriented Control of a Haptic Device via Near-Infrared Spectroscopy-Based BCI

arXiv.org Machine Learning

This paper presents a study in task-oriented approach to stroke rehabilitation by controlling a haptic device via near-infrared spectroscopy-based brain-computer interface (BCI). The task is to command the haptic device to move in opposing directions of leftward and rightward movement. Our study consists of data acquisition, signal preprocessing, and classification. In data acquisition, we conduct experiments based on two different mental tasks: one on pure motor imagery, and another on combined motor imagery and action observation. The experiments were conducted in both offline and online modes. In the signal preprocessing, we use localization method to eliminate channels that are irrelevant to the mental task, as well as perform feature extraction for subsequent classification. We propose multiple support vector machine classifiers with a majority-voting scheme for improved classification results. And lastly, we present test results to demonstrate the efficacy of our proposed approach to possible stroke rehabilitation practice.


Supersparse Linear Integer Models for Interpretable Classification

arXiv.org Machine Learning

Scoring systems are classification models that only require users to add, subtract and multiply a few meaningful numbers to make a prediction. These models are often used because they are practical and interpretable. In this paper, we introduce an off-the-shelf tool to create scoring systems that both accurate and interpretable, known as a Supersparse Linear Integer Model (SLIM). SLIM is a discrete optimization problem that minimizes the 0-1 loss to encourage a high level of accuracy, regularizes the L0-norm to encourage a high level of sparsity, and constrains coefficients to a set of interpretable values. We illustrate the practical and interpretable nature of SLIM scoring systems through applications in medicine and criminology, and show that they are are accurate and sparse in comparison to state-of-the-art classification models using numerical experiments.


Efficiency of conformalized ridge regression

arXiv.org Machine Learning

Conformal prediction is a method of producing prediction sets that can be applied on top of a wide range of prediction algorithms. The method has a guaranteed coverage probability under the standard IID assumption regardless of whether the assumptions (often considerably more restrictive) of the underlying algorithm are satisfied. However, for the method to be really useful it is desirable that in the case where the assumptions of the underlying algorithm are satisfied, the conformal predictor loses little in efficiency as compared with the underlying algorithm (whereas being a conformal predictor, it has the stronger guarantee of validity). In this paper we explore the degree to which this additional requirement of efficiency is satisfied in the case of Bayesian ridge regression; we find that asymptotically conformal prediction sets differ little from ridge regression prediction intervals when the standard Bayesian assumptions are satisfied.


A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection

arXiv.org Machine Learning

The analysis of high dimensional data, in which the number of measured predictors is large and can exceed the number of samples, is an important and common problem in statistical applications. When samples are accompanied by a real or categorical response, data analysis typically includes model fitting with the aim of doing prediction or variable selection, or both. The goal of prediction is to derive a rule capable of accurately predicting the response of a new, unlabeled sample. The goal of variable selection is to select a (small) subset of the measured predictors whose individual or coordinated activity is significantly related to the response. In both cases, it is common to assume that the observed data arise from an underlying model that is sparse, in the sense that only a small subset of the predictors are related to the response. Whether sparsity is assumed, or viewed as a desirable feature of a model, analysis of high dimensional data is often carried out by penalized methods that produce models in which a relatively small subset of the available predictors are included. Popular penalized methods include the LASSO (Tibshirani, 1996), its numerous variations, and SCAD (Fan and Li, 2001). In what follows, we focus our attention on the LASSO. The LASSO and its variants require specification of a penalty/tuning parameter that controls the tradeoff between model fit and model size.


GRADE: Machine Learning Support for Graduate Admissions

AI Magazine

This article describes GRADE, a statistical machine learning system developed to support the work of the graduate admissions committee at the University of Texas at Austin Department of Computer Science (UTCS). In recent years, the number of applications to the UTCS PhD program has become too large to manage with a traditional review process. GRADE uses historical admissions data to predict how likely the committee is to admit each new applicant. It reports each prediction as a score similar to those used by human reviewers, and accompanies each by an explanation of what applicant features most influenced its prediction. GRADE makes the review process more efficient by enabling reviewers to spend most of their time on applicants near the decision boundary and by focusing their attention on parts of each applicant’s file that matter the most. An evaluation over two seasons of PhD admissions indicates that the system leads to dramatic time savings, reducing the total time spent on reviews by at least 74 percent.


Information-Theoretic Multi-view Domain Adaptation: A Theoretical and Empirical Study

Journal of Artificial Intelligence Research

Multi-view learning aims to improve classification performance by leveraging the consistency among different views of data. The incorporation of multiple views was paid little attention in the studies of domain adaptation, where the view consistency based on source data is largely violated in the target domain due to the distribution gap between different domain data. In this paper, we leverage multiple views for cross-domain document classification. The central idea is to strengthen the views' consistency on target data by identifying the associations of domain-specific features from different domains. We present an Information-theoretic Multi-view Adaptation Model (IMAM) using a multi-way clustering scheme, where word and link clusters can draw together seemingly unrelated features across domains, which boosts the consistency between document clusterings that are based on the respective word and link views. Moreover, we demonstrate that IMAM can always find the document clustering with the minimal disagreement rate to the overlap of view-based clusterings. We provide both theoretical and empirical justifications of the proposed method. Our experiments show that IMAM significantly outperforms traditional multi-view algorithm co-training, the co-training-based adaptation algorithm CODA, the single-view transfer model CoCC and the large-margin-based multi-view transfer model MVTL-LM.