Incorporating semantic features from the WordNet lexical database is among one of the many approaches that have been tried to improve the predictive performance of text classification models. The intuition behind this is that keywords in the training set alone may not be extensive enough to enable generation of a universal model for a category, but if we incorporate the word relationships in WordNet, a more accurate model may be possible. Other researchers have previously evaluated the effectiveness of incorporating WordNet synonyms, hypernyms, and hyponyms into text classification models. Generally, they have found that improvements in accuracy using features derived from these relationships are dependent upon the nature of the text corpora from which the document collections are extracted. In this paper, we not only reconsider the role of WordNet synonyms, hypernyms, and hyponyms in text classification models, we also consider the role of WordNet meronyms and holonyms. Incorporating these WordNet relationships into a Coordinate Matching classifier, a Naive Bayes classifier, and a Support Vector Machine classifier, we evaluate our approach on six document collections extracted from the Reuters-21578, USENET, and Digi-Trad text corpora. Experimental results show that none of the WordNet relationships were effective at increasing the accuracy of the Naive Bayes classifier. Synonyms, hypernyms, and holonyms were effective at increasing the accuracy of the Coordinate Matching classifier, and hypernyms were effective at increasing the accuracy of the SVM classifier.
George, Clint Pazhayidam (University of Florida) | Puri, Sahil (University of Florida) | Wang, Daisy Zhe (University of Florida) | Wilson, Joseph N. (University of Florida) | Hamilton, William F. (University of Florida)
Electronic discovery is an interesting subproblem of information retrieval in which one identifies documents that are potentially relevant to issues and facts of a legal case from an electronically stored document collection (a corpus). In this paper, we consider representing documents in a topic space using the well-known topic models such as latent Dirichlet allocation and latent semantic indexing, and solving the information retrieval problem via finding document similarities in the topic space rather doing it in the corpus vocabulary space. We also develop an iterative SMART ranking and categorization framework including human-in-the-loop to label a set of seed (training) documents and using them to build a semi-supervised binary document classification model based on Support Vector Machines. To improve this model, we propose a method for choosing seed documents from the whole population via an active learning strategy. We report the results of our experiments on a real dataset in the electronic discovery domain.
Conversations in social media often contain the use of irony or sarcasm, when the users say the opposite of what they really mean. Irony markers are the meta-communicative clues that inform the reader that an utterance is ironic. We propose a thorough analysis of theoretically grounded irony markers in two social media platforms: Twitter and Reddit. Classification and frequency analysis shows that for Twitter, typographic markers such as emoticons and emojis are the most discriminative markers to recognize ironic utterances, while for Reddit the morphological markers (e.g., interjections, tag questions) are the most discriminative.
Mining opinions and sentiment from social networking sites is a popular application for social media systems. Common approaches use a machine learning system with a bag of words feature set. We present Delta TFIDF, an intuitive general purpose technique to efficiently weight word scores before classification. Delta TFIDF is easy to compute, implement, and understand. We use Support Vector Machines to show that Delta TFIDF significantly improves accuracy for sentiment analysis problems using three well known data sets.