Incorporating semantic features from the WordNet lexical database is among one of the many approaches that have been tried to improve the predictive performance of text classification models. The intuition behind this is that keywords in the training set alone may not be extensive enough to enable generation of a universal model for a category, but if we incorporate the word relationships in WordNet, a more accurate model may be possible. Other researchers have previously evaluated the effectiveness of incorporating WordNet synonyms, hypernyms, and hyponyms into text classification models. Generally, they have found that improvements in accuracy using features derived from these relationships are dependent upon the nature of the text corpora from which the document collections are extracted. In this paper, we not only reconsider the role of WordNet synonyms, hypernyms, and hyponyms in text classification models, we also consider the role of WordNet meronyms and holonyms. Incorporating these WordNet relationships into a Coordinate Matching classifier, a Naive Bayes classifier, and a Support Vector Machine classifier, we evaluate our approach on six document collections extracted from the Reuters-21578, USENET, and Digi-Trad text corpora. Experimental results show that none of the WordNet relationships were effective at increasing the accuracy of the Naive Bayes classifier. Synonyms, hypernyms, and holonyms were effective at increasing the accuracy of the Coordinate Matching classifier, and hypernyms were effective at increasing the accuracy of the SVM classifier.
The inherent nature of social media content poses serious challenges to practical applications of sentiment analysis. We present VADER, a simple rule-based model for general sentiment analysis, and compare its effectiveness to eleven typical state-of-practice benchmarks including LIWC, ANEW, the General Inquirer, SentiWordNet, and machine learning oriented techniques relying on Naive Bayes, Maximum Entropy, and Support Vector Machine (SVM) algorithms. Using a combination of qualitative and quantitative methods, we first construct and empirically validate a gold-standard list of lexical features (along with their associated sentiment intensity measures) which are specifically attuned to sentiment in microblog-like contexts. We then combine these lexical features with consideration for five general rules that embody grammatical and syntactical conventions for expressing and emphasizing sentiment intensity. Interestingly, using our parsimonious rule-based model to assess the sentiment of tweets, we find that VADER outperforms individual human raters (F1 Classification Accuracy = 0.96 and 0.84, respectively), and generalizes more favorably across contexts than any of our benchmarks.
In the literature, various approaches have been proposedto address the domain adaptation problem in sentiment classification (also called cross-domainsentiment classification). However, the adaptation performance normally much suffers when the data distributionsin the source and target domains differ significantly. In this paper, we suggest to perform activelearning for cross-domain sentiment classification by actively selecting a smallamount of labeled data in the target domain. Accordingly, we propose an novel activelearning approach for cross-domain sentiment classification. First, we traintwo individual classifiers, i.e., the source and target classifiers with thelabeled data from the source and target respectively. Then, the two classifiersare employed to select informative samples with the selection strategy of QueryBy Committee (QBC). Third, the two classifier is combined to make theclassification decision. Importantly, the two classifiers are trained by fullyexploiting the unlabeled data in the target domain with the label propagation(LP) algorithm. Empirical studies demonstrate the effectiveness of our active learning approach for cross-domainsentiment classification over some strong baselines.
Ruiz, Camille (Nara Institute of Science and Technology) | Ito, Kaoru (Nara Institute of Science and Technology) | Wakamiya, Shoko (Nara Institute of Science and Technology) | Aramaki, Eiji (Nara Institute of Science and Technology)
Although loneliness is a very familiar emotion, little is known about it. An aspect to explore is the prevalence of loneliness in the connected world that social media sites like Twitter provide. In light of this, this study investigates the Twitter data of users that have expressed loneliness to understand the phenomenon. Since our primary material are tweets, we developed various indices that can measure social activities reflected in online relationships and real life relationship solely through online Twitter data. Through these indices, the relations between social activity and loneliness were investigated. The results show that high lonely users seem to have low online activity, high positive expressions on real life relationships, and narrow ingroups.
Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise.