Statistical Learning
Learning Parameters of the K-Means Algorithm From Subjective Human Annotation
Dutta, Haimonti (Columbia University) | Passonneau, Rebecca J. (Columbia University) | Lee, Austin (Columbia University) | Radeva, Axinia (Columbia University) | Xie, Boyi (Columbia University) | Waltz, David (Columbia University)
The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high-resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary, and a large number of the articles are labeled "editorial" without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters, such as the number of clusters and the seeding mechanism, to keep the search from being caught in a local minimum. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare the performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled "editorial" by the OCR engine.
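The seeding idea in the abstract above can be illustrated with a minimal sketch. It assumes the common formulation of Seeded K-Means, in which each initial centroid is the mean of the annotator-provided seeds for that cluster and ordinary K-Means iterations follow. The function name `seeded_kmeans`, the two-dimensional toy points, and the cluster names "A" and "B" are hypothetical and are not taken from the paper, which works with newspaper text.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def seeded_kmeans(points, seeds, max_iter=100):
    """K-Means whose initial centroids come from labeled seed points.

    points: list of numeric tuples to cluster.
    seeds:  dict mapping a cluster name to a list of seed points (assumed input).
    Returns (labels, centroids), with one label per point.
    """
    names = sorted(seeds)
    # Initialization: one centroid per cluster, the mean of its seeds.
    centroids = [mean(seeds[name]) for name in names]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return [names[a] for a in assignment], centroids


# Toy data: two well-separated groups, one seed per cluster.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = {"A": [(0.0, 0.0)], "B": [(10.0, 10.0)]}
labels, centroids = seeded_kmeans(points, seeds)
print(labels)
```

In this formulation the seeds only fix the starting centroids, and the number of clusters is implied by the number of seeded categories. A variant that also keeps the seed points' labels fixed during the iterations would be a different, constrained algorithm.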
Robustness of Filter-Based Feature Ranking: A Case Study
Altidor, Wilker (Florida Atlantic University) | Khoshgoftaar, Taghi M. (Florida Atlantic University) | Van Hulse, Jason (Florida Atlantic University)
The filter model of feature selection has been well studied. In previous studies, classification performance has traditionally been proposed as a way to evaluate filter solutions. In this study, a new method of comparing feature ranking techniques is presented, providing a straightforward approach for quantifying individual filters' robustness to class noise. Six commonly used filters, plus one that is rarely used, are investigated regarding their ability to retain strong classification performance in the presence of class noise. Three classifiers and one classification performance metric are considered. The experimental results of this study show that Gain Ratio, one of the well-known and widely used filters, is very sensitive to class noise. ReliefF offers the best results with both the NB and kNN learners, while Signal-to-noise, though not as widely used in the literature as the others, outperforms all the other filters with the SVM learner.
Evaluation of Ontology Knowledge in Chinese Classical Poetry Classification
Fang, Chengyu Alex (The City University of Hong Kong) | Li, Wan Yin Claie (The City University of Hong Kong)
This paper describes preliminary research in the use of ontological knowledge for the task of automatically classifying classical Chinese poetry (CCCP) according to authorship. Based on a collection of poems written by Liu Yong (987–1053 AD) and Su Shi (1037–1101 AD), which have been analyzed according to a taxonomy of ontological entities at the lexical level, the research looks into whether characteristic features can be automatically extracted as important stylistic differences between the two poets. This paper examines the efficiency of different ontological concepts as features in CCCP using Support Vector Machines (SVMs). The experiment shows that an integration of ontological knowledge and bag-of-words (BoW) features produces better results for CCCP than BoW alone, with overall increases of 2.1% and 2.2% in precision and F-score, respectively.
Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization
Villena-Román, Julio (Universidad Carlos III de Madrid) | Collada-Pérez, Sonia (Daedalus - Data, Decisions and Language, S.A.) | Lana-Serrano, Sara (Universidad Politécnica de Madrid) | González-Cristóbal, José Carlos (Universidad Politécnica de Madrid)
This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to train.
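The rule-based post-processing stage described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' rule language: the base classifier is represented only by its predicted category set, a "negative" term is assumed to veto a category, and a "positive" term is assumed to add it. The "relevant" term list is left out because its semantics are not described in the abstract. The names `apply_rules` and `rules`, and the example categories, are hypothetical.

```python
def apply_rules(text, predicted, rules):
    """Adjust a base classifier's predicted categories with term rules.

    text:      the input document as a string.
    predicted: set of categories proposed by the base model (e.g. a k-NN classifier).
    rules:     dict mapping category -> {"positive": [...], "negative": [...]}.
               Terms may be multiword; matching is a plain lowercase
               substring test, which is an assumption made for brevity.
    """
    text_lc = text.lower()
    final = set(predicted)
    for category, rule in rules.items():
        if any(term in text_lc for term in rule.get("negative", [])):
            final.discard(category)  # filter a likely false positive
        elif any(term in text_lc for term in rule.get("positive", [])):
            final.add(category)      # recover a likely false negative
    return final


# Hypothetical rules for one noisy category.
rules = {"sports": {"positive": ["world cup"], "negative": ["stock market"]}}

# The base model wrongly proposed "sports"; the negative term removes it.
filtered = apply_rules(
    "Shares fell on the stock market after the cup final",
    {"sports", "finance"},
    rules,
)

# The base model missed "sports"; the positive term adds it.
recovered = apply_rules("The World Cup draw was announced", set(), rules)
print(filtered, recovered)
```

Giving negative terms precedence over positive ones is a design choice of this sketch. A real rule language would need to define how conflicting terms in the same text are resolved.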
Automated Assessment of Paragraph Quality: Introduction, Body, and Conclusion Paragraphs
Roscoe, Rod (University of Memphis) | Crossley, Scott (Georgia State University) | Weston, Jennifer (University of Memphis) | McNamara, Danielle (University of Memphis)
Natural language processing and statistical methods were used to identify linguistic features associated with the quality of student-generated paragraphs. Linguistic features were assessed using Coh-Metrix. The resulting computational models demonstrated small to medium effect sizes for predicting paragraph quality: introduction quality r² = .25, body quality r² = .10, and conclusion quality r² = .11. Although the variance explained was somewhat low, the linguistic features identified were consistent with the rhetorical goals of paragraph types. Avenues for bolstering this approach by considering individual writing styles and techniques are discussed.
Dissimilarity Kernels for Paraphrase Identification
Lintean, Mihai (University of Memphis) | Rus, Vasile (University of Memphis)
We present in this paper a novel solution to the problem of paraphrase identification based on lexical dissimilarity kernels. Lexical kernels in conjunction with Support Vector Machines are preferred over other learning methods, e.g. decision trees, due to their ability to handle a high number of features. Dissimilarity-based kernels emphasize dissimilarities among text fragments and therefore are appropriate for text similarity tasks characterized by high lexical overlap. We conducted experiments with our kernels on the Microsoft Research (MSR) Paraphrase Corpus, a standardized data set used for assessing approaches to paraphrase identification. Our reported accuracy results are competitive and robust when compared to state-of-the-art single-model approaches. The results were obtained using 10-fold cross-validation over the entire corpus. We also report competitive results on the test portion of the MSR Paraphrase Corpus, which is the standard way to report results on this corpus.
Simulating Human Ratings on Word Concreteness
Feng, Shi (University of Memphis) | Cai, Zhiqiang (University of Memphis) | Crossley, Scott (Georgia State University) | McNamara, Danielle S. (University of Memphis)
A single word in the human language has many complex dimensions such as semantics, parts of speech, lexical type, imagability, concreteness, familiarity, etc. It is important to know the dimensions of words in languages so that we can develop a better theoretical understanding of language and also to build tools that simulate human intelligence and [...] However, word concreteness is not an attribute that a computer can directly compute. One means of assessing the characteristics of words is by having humans rate them on the dimensions of interest. Humans are proficient in categorizing words into linguistic dimensions, but it is impractical to have humans rating tens of thousands of words that we would need for psycholinguistic research.
Automatic Natural Language Processing and the Detection of Reading Skills and Reading Comprehension
Boonthum-Denecke, Chutima (Hampton University) | McCarthy, Philip (University of Memphis) | Lamkin, Travis (University of Memphis) | Jackson, G. Tanner (University of Memphis) | Magliano, Joseph P. (Northern Illinois University) | McNamara, Danielle S. (University of Memphis)
The primary goal of this study is to assess two approaches for detecting comprehension processes in R-SAT (Reading Strategy Assessment Tool). One approach is based on Latent Semantic Analysis (LSA), while the other is a combination of literal word matching and Soundex. A secondary goal is to assess the potential for detecting specific reading comprehension strategies, either in isolation or in combination. Participants typed "think-aloud" protocols while reading texts presented on computers. Human judges rated these protocols for the presence of the various reading comprehension strategies. The LSA, word-based, and combined algorithms were compared, and the results showed that a combination of both approaches yielded the best results. However, performance of the combined algorithm varied with the type of process and the grain size of the human coding system. Lastly, the use of reading strategies (either in isolation or in combination) is positively related to students' Gates–MacGinitie reading comprehension scores, which illustrates the merit of this approach for assessing comprehension skill.
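The word-based approach mentioned above pairs literal word matching with Soundex, a phonetic code that lets misspelled words in typed protocols still match the source text. The sketch below implements standard American Soundex. The scoring function `match_score` (the fraction of protocol words found in a sentence, either literally or by Soundex code) is a hypothetical illustration and is not R-SAT's actual scoring scheme.

```python
# Letter-to-digit table for American Soundex; vowels, h, w, and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word, or "" if it has no letters."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal adjacent codes
            prev = digit
    return (code + "000")[:4]  # pad with zeros, then truncate to 4 characters


def match_score(protocol, sentence):
    """Fraction of protocol words matching a sentence word literally or by Soundex."""
    sentence_words = set(sentence.lower().split())
    sentence_codes = {soundex(w) for w in sentence_words}
    protocol_words = protocol.lower().split()
    if not protocol_words:
        return 0.0
    hits = 0
    for w in protocol_words:
        code = soundex(w)
        if w in sentence_words or (code and code in sentence_codes):
            hits += 1
    return hits / len(protocol_words)


# The misspelled "temperture" still matches "temperature" through its Soundex code.
score = match_score("temperture goes up", "the temperature rises")
print(soundex("temperature"), soundex("temperture"), score)
```

Soundex tolerates many vowel-level typos but not errors in the first letter, which is one reason a system might combine it with other signals such as LSA.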
Consensus Clustering + Meta Clustering = Multiple Consensus Clustering
Zhang, Yi (Florida International University) | Li, Tao (Florida International University)
Consensus clustering and meta clustering are two important extensions of the classical clustering problem. Given a set of input clusterings of a dataset, consensus clustering aims to find a single final clustering that is a better fit in some sense than the existing clusterings, and meta clustering aims to group similar input clusterings together so that users only need to examine a small number of different clusterings. In this paper, we present a new approach, MCC (multiple consensus clustering), to explore multiple clustering views of a given dataset from the input clusterings by combining consensus clustering and meta clustering. In particular, given a set of input clusterings of a particular dataset, MCC employs meta clustering to cluster the input clusterings and then uses consensus clustering to generate a consensus for each cluster of the input clusterings. Extensive experimental results on 11 real-world datasets demonstrate the effectiveness of our proposed method.
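The two-stage pipeline described above (group the input clusterings, then build one consensus per group) can be sketched as follows. The abstract does not say which meta clustering or consensus algorithms MCC uses, so this sketch substitutes simple stand-ins: the Rand index as the similarity between clusterings, a greedy threshold grouping for the meta clustering step, and majority-vote co-association with connected components for the consensus step. All function names and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of item pairs on which two clusterings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def meta_cluster(clusterings, threshold=0.8):
    """Greedily group clusterings: join the first group whose first member is similar enough."""
    groups = []
    for idx, clustering in enumerate(clusterings):
        for group in groups:
            if rand_index(clusterings[group[0]], clustering) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def consensus(clusterings):
    """Link items that share a cluster in a strict majority of clusterings;
    the connected components of those links form the consensus clusters."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        votes = sum(c[i] == c[j] for c in clusterings)
        if 2 * votes > len(clusterings):
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def multiple_consensus(clusterings, threshold=0.8):
    """One consensus clustering per group of similar input clusterings."""
    groups = meta_cluster(clusterings, threshold)
    return [consensus([clusterings[i] for i in group]) for group in groups]


# Four input clusterings of four items: two distinct views, each given twice
# with permuted label names.
inputs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
views = multiple_consensus(inputs)
print(views)
```

A single consensus over all four inputs would blur the two views together, which is the motivation for grouping the clusterings first.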
How Many Software Metrics Should be Selected for Defect Prediction?
Wang, Huanjing (Western Kentucky University) | Khoshgoftaar, Taghi M. (Florida Atlantic University) | Seliya, Naeem (University of Michigan, Dearborn)
A software practitioner is interested in the answer to the question "for a given project, what is the minimum number of software metrics that should be considered for building an effective defect prediction model?" During the development life cycle, various software metrics are collected for different reasons. In the case of a metrics-based defect prediction model, an intelligent selection of software metrics prior to building defect predictors is likely to improve model performance. This study utilizes the proposed threshold-based feature selection technique to remove irrelevant and redundant software metrics (a.k.a. features or attributes). A comparative investigation is presented for evaluating the size of the selected feature subsets. The case study is based on software measurement data obtained from a real-world project, and the defect predictors are trained using three commonly used classifiers. The empirical case study results demonstrate that an effective defect predictor can be built with only three metrics; moreover, model performance improved when over 98.5% of the software metrics were eliminated.