Performance Analysis
Efficient Regularized Regression for Variable Selection with L0 Penalty
Variable (feature, gene, model, which we use interchangeably) selections for regression with high-dimensional BIGDATA have found many applications in bioinformatics, computational biology, image processing, and engineering. One appealing approach is the L0 regularized regression which penalizes the number of nonzero features in the model directly. L0 is known as the most essential sparsity measure and has nice theoretical properties, while the popular L1 regularization is only a best convex relaxation of L0. Therefore, it is natural to expect that L0 regularized regression performs better than LASSO. However, it is well-known that L0 optimization is NP-hard and computationally challenging. Instead of solving the L0 problems directly, most publications so far have tried to solve an approximation problem that closely resembles L0 regularization. In this paper, we propose an efficient EM algorithm (L0EM) that directly solves the L0 optimization problem. $L_0$EM is efficient with high dimensional data. It also provides a natural solution to all Lp p in [0,2] problems. The regularized parameter can be either determined through cross-validation or AIC and BIC. Theoretical properties of the L0-regularized estimator are given under mild conditions that permit the number of variables to be much larger than the sample size. We demonstrate our methods through simulation and high-dimensional genomic data. The results indicate that L0 has better performance than LASSO and L0 with AIC or BIC has similar performance as computationally intensive cross-validation. The proposed algorithms are efficient in identifying the non-zero variables with less-bias and selecting biologically important genes and pathways with high dimensional BIGDATA.
Classification of Online Health Discussions with Text and Health Feature Sets
Zhang, Mi (Drexel University) | Yang, Christopher C (Drexel University)
Nowadays, many health groups and forums are established on the Internet, where health consumers discuss health issues and interact with each other. Although there is a large amount of user generated content about healthcare on different social media sites, few studies have applied data mining or artificial intelligence techniques for knowledge discovery on a large scale of data in this particular emerging area. In online health forums, it is difficult for users to find relevant topics or peers due to the large amount of information. Traditional recommendation systems may not work well for health online forums, because health consumers have different intentions of participation or may be interest in different types of supports even if the content matches their interest. To help solving this problem, we apply Naïve Bayes methods in this study to classify posts and comments on QuitStop forum, which is an online community for smoking cessation intervention. Classifiers are built on different text features and health features of user quit status. Two different classification tasks are investigated: (1) classification of user intentions, and (2) classification of types of social support exchanged in interactions. We developed classifiers for posts and comments separately, and conducted experiments to compare classifiers with different text and health feature sets. It is found that using thread title or post content can achieve the highest classification accuracy on both posts and comments for user intention classification with text features. On the other hand, using the content of post or comment itself performs the best for the classification of social support types. In particular for the post, integrating health features of the post author can boost the text classifications of user intention and support type. However, user health features cannot help in improving text classifiers for the comments.
On Dataless Hierarchical Text Classification
Song, Yangqiu (University of Illinois at Urbana-Champaign) | Roth, Dan (University of Illinois at Urbana-Champaign)
In this paper, we systematically study the problem of dataless hierarchical text classification. Unlike standard text classification schemes that rely on supervised training, dataless classification depends on understanding the labels of the sought after categories and requires no labeled data. Given a collection of text documents and a set of labels, we show that understanding the labels can be used to accurately categorize the documents. This is done by embedding both labels and documents in a semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. We show that this scheme can be used to support accurate multiclass classification without any supervision. We study several semantic representations and show how to improve the classification using bootstrapping. Our results show that bootstrapped dataless classification is competitive with supervised classification with thousands of labeled examples.
On Hair Recognition in the Wild by Machine
Roth, Joseph (Michigan State University) | Liu, Xiaoming (Michigan State University)
We present an algorithm for identity verification using only information from the hair. Face recognition in the wild (i.e., unconstrained settings) is highly useful in a variety of applications, but performance suffers due to many factors, e.g., obscured face, lighting variation, extreme pose angle, and expression. It is well known that humans utilize hair for identification under many of these scenarios due to either the consistent hair appearance of the same subject or obvious hair discrepancy of different subjects, but little work exists to replicate this intelligence artificially. We propose a learned hair matcher using shape, color, and texture features derived from localized patches through an AdaBoost technique with abstaining weak classifiers when features are not present in the given location. The proposed hair matcher achieves 71.53% accuracy on the LFW View 2 dataset. Hair also reduces the error of a Commercial Off-The-Shelf (COTS) face matcher through simple score-level fusion by 5.7%.
Online Classification Using a Voted RDA Method
Xu, Tianbing (Facebook) | Gao, Jianfeng (Microsoft Research) | Xiao, Lin (Microsoft Research) | Regan, Amelia C. (University of Califorina, Irvine)
We propose a voted dual averaging method for on- line classification problems with explicit regularization. This method employs the update rule of the regularized dual averaging (RDA) method proposed by Xiao, but only on the subsequence of training examples where a classification error is made. We derive a bound on the number of mistakes made by this method on the training set, as well as its generalization error rate. We also intro- duce the concept of relative strength of regularization, and show how it affects the mistake bound and gener- alization performance. We examine the method using l1-regularization on a large-scale natural language pro- cessing task, and obtained state-of-the-art classification performance with fairly sparse models.
Efficient Generalized Fused Lasso and its Application to the Diagnosis of Alzheimer’s Disease
Xin, Bo (Peking University) | Kawahara, Yoshinobu (Osaka University) | Wang, Yizhou (Peking University) | Gao, Wen (Peking University)
Generalized fused lasso (GFL) penalizes variables with L1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and they do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that fusion penalty is the Lov'asz extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving parametric graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrated a significant speed-up compared with the existing GFL algorithms. By exploiting the scalability of the proposed algorithm, we formulated the diagnosis of Alzheimer's disease as GFL. Our experimental evaluations demonstrated that the diagnosis performance was promising and that the selected critical voxels were well structured i.e., connected, consistent according to cross-validation and in agreement with prior clinical knowledge.
A Hybrid Grammar-Based Approach for Learning and Recognizing Natural Hand Gestures
Sadeghipour, Amir (Bielefeld University) | Kopp, Stefan (Bielefeld University)
In this paper, we present a hybrid grammar formalism designed to learn structured models of natural iconic gesture performances that allow for compressed representation and robust recognition. We analyze a dataset of iconic gestures and show how the proposed Feature-based Stochastic Context-Free Grammar (FSCFG) can generalize over both structural and feature-based variations among different gesture performances.
Bagging by Design (on the Suboptimality of Bagging)
Papakonstantinou, Periklis (Tsinghua University) | Xu, Jia (Tsinghua University) | Cao, Zhu (Tsinghua University)
Bagging (Breiman 1996) and its variants is one of the most popular methods in aggregating classifiers and regressors. Originally, its analysis assumed that the bootstraps are built from an unlimited, independent source of samples, therefore we call this form of bagging ideal-bagging. However in the real world, base predictors are trained on data subsampled from a limited number of training samples and thus they behave very differently. We analyze the effect of intersections between bootstraps, obtained by subsampling, to train different base predictors. Most importantly, we provide an alternative subsampling method called design-bagging based on a new construction of combinatorial designs, and prove it universally better than bagging. Methodologically, we succeed at this level of generality because we compare the prediction accuracy of bagging and design-bagging relative to the accuracy ideal-bagging. This finds potential applications in more involved bagging-based methods. Our analytical results are backed up by experiments on classification and regression settings.
Instance-Based Domain Adaptation in NLP via In-Target-Domain Logistic Approximation
Xia, Rui (Nanjing University of Science and Technology) | Yu, Jianfei (Nanjing University of Science and Technology) | Xu, Feng (Nanjing University of Science and Technology) | Wang, Shumei (Nanjing University of Science and Technology)
In the field of NLP, most of the existing domain adaptation studies belong to the feature-based adaptation, while the research of instance-based adaptation is very scarce. In this work, we propose a new instance-based adaptation model, called in-target-domain logistic approximation (ILA). In ILA, we adapt the source-domain data to the target domain by a logistic approximation. The normalized in-target-domain probability is assigned as an instance weight to each of the source-domain training data. An instance-weighted classification model is trained finally for the cross-domain classification problem. Compared to the previous techniques, ILA conducts instance adaptation in a dimensionality-reduced linear feature space to ensure efficiency in high-dimensional NLP tasks. The instance weights in ILA are learnt by leveraging the criteria of both maximum likelihood and minimum statistical distance. The empirical results on two NLP tasks including text categorization and sentiment classification show that our ILA model beats the state-of-the-art instance adaptation methods significantly, in cross-domain classification accuracy, parameter stability and computational efficiency.
Calibration-Free BCI Based Control
Grizou, Jonathan (INRIA - Ensta ParisTech) | Iturrate, Iñaki (CBNI, EPFL) | Montesano, Luis (I3A, University of Zaragoza) | Oudeyer, Pierre-Yves (INRIA - Ensta ParisTech) | Lopes, Manuel (INRIA - Ensta ParisTech)
Recent works have explored the use of brain signals to directly control virtual and robotic agents in sequential tasks. So far in such brain-computer interfaces (BCI), an explicit calibration phase was required to build a decoder that translates raw electroencephalography (EEG) signals from the brain of each user into meaningful instructions. This paper proposes a method that removes the calibration phase, and allows a user to control an agent to solve a sequential task. The proposed method assumes a distribution of possible tasks, and infers the interpretation of EEG signals and the task by selecting the hypothesis which best explains the history of interaction. We introduce a measure of uncertainty on the task and on the EEG signal interpretation to act as an exploratory bonus for a planning strategy. This speeds up learning by guiding the system to regions that better disambiguate among task hypotheses. We report experiments where four users use BCI to control an agent on a virtual world to reach a target without any previous calibration process.