Accuracy
An Evasion and Counter-Evasion Study in Malicious Websites Detection
Xu, Li, Zhan, Zhenxin, Xu, Shouhuai, Ye, Keyin
Malicious websites are a major cyber attack vector, and effective detection of them is an important cyber defense task. The main defense paradigm in this regard is that the defender uses some kind of machine learning algorithms to train a detection model, which is then used to classify websites in question. Unlike other settings, the following issue is inherent to the problem of malicious websites detection: the attacker essentially has access to the same data that the defender uses to train its detection models. This 'symmetry' can be exploited by the attacker, at least in principle, to evade the defender's detection models. In this paper, we present a framework for characterizing the evasion and counter-evasion interactions between the attacker and the defender, where the attacker attempts to evade the defender's detection models by taking advantage of this symmetry. Within this framework, we show that an adaptive attacker can make malicious websites evade powerful detection models, but proactive training can be an effective counter-evasion defense mechanism. The framework is geared toward the popular detection model of decision tree, but can be adapted to accommodate other classifiers.
Sparse and Low-Rank Covariance Matrices Estimation
Zhou, Shenglong, Xiu, Naihua, Luo, Ziyan, Kong, Lingchen
Estimation of population covariance matrices from samples of multivariate data has draw many attentions in the last decade owing to its fundamental importance in multivariate analysis. With dramatic advances in technology in recent years, various research fields, such as genetic data, brain imaging, spectroscopic imaging, climate data and so on, have been used to deal with massive highdimensional data sets, whose sample sizes can be very small relative to dimension. In such settings, the standard and the most usual sample covariance matrices often performs poorly [1, 2, 11]. Fortunately, regularization as a class of new methods to estimate covariance matrices has recently emerged to overcome those shortages of using traditional sample covariance matrices. These methods encompass several specified forms, banding [1, 6, 17], tapering [4, 10] and thresholding [2, 5, 8, 16] for instance.
Sure Screening for Gaussian Graphical Models
Luo, Shikai, Song, Rui, Witten, Daniela
We propose {graphical sure screening}, or GRASS, a very simple and computationally-efficient screening procedure for recovering the structure of a Gaussian graphical model in the high-dimensional setting. The GRASS estimate of the conditional dependence graph is obtained by thresholding the elements of the sample covariance matrix. The proposed approach possesses the sure screening property: with very high probability, the GRASS estimated edge set contains the true edge set. Furthermore, with high probability, the size of the estimated edge set is controlled. We provide a choice of threshold for GRASS that can control the expected false positive rate. We illustrate the performance of GRASS in a simulation study and on a gene expression data set, and show that in practice it performs quite competitively with more complex and computationally-demanding techniques for graph estimation.
Efficient Regularized Regression for Variable Selection with L0 Penalty
Variable (feature, gene, model, which we use interchangeably) selections for regression with high-dimensional BIGDATA have found many applications in bioinformatics, computational biology, image processing, and engineering. One appealing approach is the L0 regularized regression which penalizes the number of nonzero features in the model directly. L0 is known as the most essential sparsity measure and has nice theoretical properties, while the popular L1 regularization is only a best convex relaxation of L0. Therefore, it is natural to expect that L0 regularized regression performs better than LASSO. However, it is well-known that L0 optimization is NP-hard and computationally challenging. Instead of solving the L0 problems directly, most publications so far have tried to solve an approximation problem that closely resembles L0 regularization. In this paper, we propose an efficient EM algorithm (L0EM) that directly solves the L0 optimization problem. $L_0$EM is efficient with high dimensional data. It also provides a natural solution to all Lp p in [0,2] problems. The regularized parameter can be either determined through cross-validation or AIC and BIC. Theoretical properties of the L0-regularized estimator are given under mild conditions that permit the number of variables to be much larger than the sample size. We demonstrate our methods through simulation and high-dimensional genomic data. The results indicate that L0 has better performance than LASSO and L0 with AIC or BIC has similar performance as computationally intensive cross-validation. The proposed algorithms are efficient in identifying the non-zero variables with less-bias and selecting biologically important genes and pathways with high dimensional BIGDATA.
Classification of Online Health Discussions with Text and Health Feature Sets
Zhang, Mi (Drexel University) | Yang, Christopher C (Drexel University)
Nowadays, many health groups and forums are established on the Internet, where health consumers discuss health issues and interact with each other. Although there is a large amount of user generated content about healthcare on different social media sites, few studies have applied data mining or artificial intelligence techniques for knowledge discovery on a large scale of data in this particular emerging area. In online health forums, it is difficult for users to find relevant topics or peers due to the large amount of information. Traditional recommendation systems may not work well for health online forums, because health consumers have different intentions of participation or may be interest in different types of supports even if the content matches their interest. To help solving this problem, we apply Naïve Bayes methods in this study to classify posts and comments on QuitStop forum, which is an online community for smoking cessation intervention. Classifiers are built on different text features and health features of user quit status. Two different classification tasks are investigated: (1) classification of user intentions, and (2) classification of types of social support exchanged in interactions. We developed classifiers for posts and comments separately, and conducted experiments to compare classifiers with different text and health feature sets. It is found that using thread title or post content can achieve the highest classification accuracy on both posts and comments for user intention classification with text features. On the other hand, using the content of post or comment itself performs the best for the classification of social support types. In particular for the post, integrating health features of the post author can boost the text classifications of user intention and support type. However, user health features cannot help in improving text classifiers for the comments.
On Dataless Hierarchical Text Classification
Song, Yangqiu (University of Illinois at Urbana-Champaign) | Roth, Dan (University of Illinois at Urbana-Champaign)
In this paper, we systematically study the problem of dataless hierarchical text classification. Unlike standard text classification schemes that rely on supervised training, dataless classification depends on understanding the labels of the sought after categories and requires no labeled data. Given a collection of text documents and a set of labels, we show that understanding the labels can be used to accurately categorize the documents. This is done by embedding both labels and documents in a semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. We show that this scheme can be used to support accurate multiclass classification without any supervision. We study several semantic representations and show how to improve the classification using bootstrapping. Our results show that bootstrapped dataless classification is competitive with supervised classification with thousands of labeled examples.
On Hair Recognition in the Wild by Machine
Roth, Joseph (Michigan State University) | Liu, Xiaoming (Michigan State University)
We present an algorithm for identity verification using only information from the hair. Face recognition in the wild (i.e., unconstrained settings) is highly useful in a variety of applications, but performance suffers due to many factors, e.g., obscured face, lighting variation, extreme pose angle, and expression. It is well known that humans utilize hair for identification under many of these scenarios due to either the consistent hair appearance of the same subject or obvious hair discrepancy of different subjects, but little work exists to replicate this intelligence artificially. We propose a learned hair matcher using shape, color, and texture features derived from localized patches through an AdaBoost technique with abstaining weak classifiers when features are not present in the given location. The proposed hair matcher achieves 71.53% accuracy on the LFW View 2 dataset. Hair also reduces the error of a Commercial Off-The-Shelf (COTS) face matcher through simple score-level fusion by 5.7%.
Online Classification Using a Voted RDA Method
Xu, Tianbing (Facebook) | Gao, Jianfeng (Microsoft Research) | Xiao, Lin (Microsoft Research) | Regan, Amelia C. (University of Califorina, Irvine)
We propose a voted dual averaging method for on- line classification problems with explicit regularization. This method employs the update rule of the regularized dual averaging (RDA) method proposed by Xiao, but only on the subsequence of training examples where a classification error is made. We derive a bound on the number of mistakes made by this method on the training set, as well as its generalization error rate. We also intro- duce the concept of relative strength of regularization, and show how it affects the mistake bound and gener- alization performance. We examine the method using l1-regularization on a large-scale natural language pro- cessing task, and obtained state-of-the-art classification performance with fairly sparse models.
A Hybrid Grammar-Based Approach for Learning and Recognizing Natural Hand Gestures
Sadeghipour, Amir (Bielefeld University) | Kopp, Stefan (Bielefeld University)
In this paper, we present a hybrid grammar formalism designed to learn structured models of natural iconic gesture performances that allow for compressed representation and robust recognition. We analyze a dataset of iconic gestures and show how the proposed Feature-based Stochastic Context-Free Grammar (FSCFG) can generalize over both structural and feature-based variations among different gesture performances.
Instance-Based Domain Adaptation in NLP via In-Target-Domain Logistic Approximation
Xia, Rui (Nanjing University of Science and Technology) | Yu, Jianfei (Nanjing University of Science and Technology) | Xu, Feng (Nanjing University of Science and Technology) | Wang, Shumei (Nanjing University of Science and Technology)
In the field of NLP, most of the existing domain adaptation studies belong to the feature-based adaptation, while the research of instance-based adaptation is very scarce. In this work, we propose a new instance-based adaptation model, called in-target-domain logistic approximation (ILA). In ILA, we adapt the source-domain data to the target domain by a logistic approximation. The normalized in-target-domain probability is assigned as an instance weight to each of the source-domain training data. An instance-weighted classification model is trained finally for the cross-domain classification problem. Compared to the previous techniques, ILA conducts instance adaptation in a dimensionality-reduced linear feature space to ensure efficiency in high-dimensional NLP tasks. The instance weights in ILA are learnt by leveraging the criteria of both maximum likelihood and minimum statistical distance. The empirical results on two NLP tasks including text categorization and sentiment classification show that our ILA model beats the state-of-the-art instance adaptation methods significantly, in cross-domain classification accuracy, parameter stability and computational efficiency.