Text Classification
TweetTrader.net: Leveraging Crowd Wisdom in a Stock Microblogging Forum
Sprenger, Timm Oliver (Technische Universität München)
TweetTrader.net is a stock microblogging forum that leverages the wisdom of crowds to aggregate the information contained in stock-related tweets. Based on insights from academic research on stock microblogs, the application integrates inputs from text classification, user voting and a proprietary Stock Game in order to extract the sentiment (i.e., the bullishness) of online investors with respect to all publicly traded companies of the S&P 500.
Cause Identification from Aviation Safety Incident Reports via Weakly Supervised Semantic Lexicon Construction
Abedin, M. A., Ng, V., Khan, L.
The Aviation Safety Reporting System collects voluntarily submitted reports on aviation safety incidents to facilitate research work aiming to reduce such incidents. To effectively reduce these incidents, it is vital to accurately identify why these incidents occurred. More precisely, given a set of possible causes, or shaping factors, this task of cause identification involves identifying all and only those shaping factors that are responsible for the incidents described in a report. We investigate two approaches to cause identification. Both approaches exploit information provided by a semantic lexicon, which is automatically constructed via Thelen and Riloff's Basilisk framework augmented with our linguistic and algorithmic modifications. The first approach labels a report using a simple heuristic, which looks for the words and phrases acquired during the semantic lexicon learning process in the report. The second approach recasts cause identification as a text classification problem, employing supervised and transductive text classification algorithms to learn models from incident reports labeled with shaping factors and using the models to label unseen reports. Our experiments show that both the heuristic-based approach and the learning-based approach (when given sufficient training data) outperform the baseline system significantly.
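The first, heuristic approach described above can be illustrated with a minimal sketch, assuming the semantic lexicon is already available as a mapping from shaping factor to learned words and phrases. The function name `label_report`, the lexicon entries, and the sample report are hypothetical and are not taken from the paper; the Basilisk-based lexicon construction itself is not shown.

```python
import re

def label_report(report, lexicon):
    """Return every shaping factor that has at least one lexicon phrase in the report."""
    # Normalize to lowercase word tokens so phrases match on whole words only.
    tokens = re.findall(r"[a-z0-9]+", report.lower())
    padded = " " + " ".join(tokens) + " "
    labels = set()
    for factor, phrases in lexicon.items():
        if any(f" {phrase.lower()} " in padded for phrase in phrases):
            labels.add(factor)
    return labels

# Hypothetical lexicon; a real one would come from the lexicon learning step.
LEXICON = {
    "Physical Environment": ["fog", "snow", "low visibility"],
    "Fatigue": ["tired", "long duty day"],
}

report = "Captain was tired after a long duty day; low visibility on approach."
labels = label_report(report, LEXICON)
```

Because a report may receive several labels or none, the function returns a set, which matches the task's goal of identifying all and only the responsible shaping factors.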
Multi-Task Active Learning with Output Constraints
Zhang, Yi (Carnegie Mellon University)
Many problems in information extraction, text mining, natural language processing and other fields exhibit the same property: multiple prediction tasks are related in the sense that their outputs (labels) satisfy certain constraints. In this paper, we propose an active learning framework exploiting such relations among tasks. Intuitively, with task outputs coupled by constraints, active learning can utilize not only the uncertainty of the prediction in a single task but also the inconsistency of predictions across tasks. We formalize this idea as a cross-task value of information criterion, in which the reward of a labeling assignment is propagated and measured over all relevant tasks reachable through constraints. A specific example of our framework leads to the cross entropy measure on the predictions of coupled tasks, which generalizes the entropy in classical single-task uncertainty sampling. We conduct experiments on two real-world problems: web information extraction and document classification. Empirical results demonstrate the effectiveness of our framework in actively collecting labeled examples for multiple related tasks.
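The relationship between single-task entropy and cross entropy across coupled tasks can be illustrated with a small sketch. It assumes the simplest possible coupling, two tasks that predict over the same label set and are constrained to agree; the paper's general value of information criterion, which propagates reward through arbitrary constraints, is not reproduced here. The function names and the example pool are hypothetical.

```python
import math

def entropy(p):
    """Single-task uncertainty: H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q, eps=1e-12):
    """Cross-task score: H(p, q) = -sum_i p_i log q_i; equals H(p) when q == p."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def pick_most_informative(pool):
    """pool holds (example_id, task A prediction, task B prediction) triples."""
    return max(pool, key=lambda item: cross_entropy(item[1], item[2]))[0]

# Hypothetical unlabeled pool (assumed equality constraint between the two tasks).
pool = [
    ("doc1", [0.9, 0.1], [0.9, 0.1]),  # both tasks confident and consistent
    ("doc2", [0.5, 0.5], [0.5, 0.5]),  # both tasks uncertain
    ("doc3", [0.9, 0.1], [0.1, 0.9]),  # both confident, but they disagree
]

chosen = pick_most_informative(pool)
```

In this toy pool, entropy on task A alone would prefer `doc2`, whereas the cross entropy score prefers `doc3`, where each task is confident but the two predictions are inconsistent.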
Generating Domain-Specific Clues Using News Corpus for Sentiment Classification
Kim, Youngho (University of Massachusetts Amherst) | Choi, Yoonjung (KAIST) | Myaeng, Sung-Hyon (KAIST)
This paper addresses the problem of automatically generating domain-specific sentiment clues. The main idea is to bootstrap from a small seed set and generate new clues by using dependencies and collocation information between sentiment clues and sentence-level topics that would be a primary subject of sentiment expression (e.g., event, company, and person). The experiments show that the aggregated clues are effective for sentiment classification.
Document Classification for Focused Topics
Power, Russell (New York University) | Chen, Jay (New York University) | Karthik, Trishank (New York University) | Subramanian, Lakshminarayanan (New York University)
Feature extraction is one of the fundamental challenges in improving the accuracy of document classification. While there is a large body of research literature on document classification, most existing approaches either do not achieve high classification accuracy or require massive training sets. In this paper, we propose a simple feature extraction algorithm that can achieve high document classification accuracy in the context of development-centric topics. Our feature extraction algorithm exploits two distinct aspects of development-centric topics: (i) most of these topics tend to be very focused (unlike semantically hard classification topics such as chemistry or banks); (ii) due to local language and cultural underpinnings in these topics, the authentic pages tend to use several region-specific features. Our algorithm uses popularity and rarity as two separate metrics to extract features that describe a topic. Given a topic, our output feature set comprises: (i) a list of popular keywords closely related to the topic; (ii) a list of rare keywords closely related to the topic. We show that a simple joint classifier based on these two feature sets can achieve high classification accuracy, while each feature subset by itself is insufficient. We have tested our algorithm across a wide range of development-centric topics.
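The abstract does not define the popularity and rarity metrics, so the following is only an illustrative sketch under stated assumptions: popularity is taken to be a high document frequency within the topic pages (and higher than in a background corpus), rarity is taken to be complete absence from the background corpus, and the joint classifier is assumed to require at least one keyword of each kind. All names, thresholds, and sample documents are hypothetical.

```python
from collections import Counter

def doc_freq(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def extract_features(topic_docs, background_docs, min_popular_share=0.6):
    """Return (popular, rare) keyword sets under the assumed definitions."""
    topic_df, bg_df = doc_freq(topic_docs), doc_freq(background_docs)
    popular, rare = set(), set()
    for term, count in topic_df.items():
        topic_share = count / len(topic_docs)
        bg_share = bg_df[term] / len(background_docs)
        if topic_share >= min_popular_share and topic_share > bg_share:
            popular.add(term)   # assumed: frequent in topic pages, less so elsewhere
        elif bg_df[term] == 0:
            rare.add(term)      # assumed: never seen outside the topic pages
    return popular, rare

def is_on_topic(doc, popular, rare):
    """Assumed joint rule: need at least one popular AND one rare keyword."""
    terms = set(doc.lower().split())
    return bool(terms & popular) and bool(terms & rare)

# Hypothetical sample data.
topic_docs = [
    "malaria treatment in kisumu clinics",
    "malaria nets distributed in siaya",
    "malaria treatment with artemisinin",
]
background_docs = [
    "stock markets in europe",
    "treatment of steel with heat",
    "football results in spain",
]
popular, rare = extract_features(topic_docs, background_docs)
```

With this data, a page such as "heat treatment of steel" matches a popular keyword only and is rejected, which mirrors the abstract's observation that neither feature set suffices on its own.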
Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction--where the probability of class membership increases monotonically with the MF's value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic "20 Newsgroups" data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data.
Neuronal Spectral Analysis of EEG and Expert Knowledge Integration for Automatic Classification of Sleep Stages
Kerkeni, Nizar, Alexandre, Frederic, Bedoui, Mohamed Hedi, Bougrain, Laurent, Dogui, Mohamed
Being able to analyze and interpret signals coming from electroencephalogram (EEG) recordings can be of high interest for many applications, including medical diagnosis and Brain-Computer Interfaces. Indeed, human experts are today able to extract from these signals many hints related to the physiological as well as cognitive states of the recorded subject, and it would be very interesting to perform such a task automatically, but today no completely automatic system exists. In previous studies, we have compared human expertise and automatic processing tools, including artificial neural networks (ANN), to better understand the competences of each and determine which aspects are difficult to integrate in a fully automatic system. In this paper, we bring more elements to that study by reporting the main results of a practical experiment which was carried out in a hospital for sleep pathology study. An EEG recording was studied and labeled by a human expert and an ANN. We describe here the characteristics of the experiment and both the human and neuronal analysis procedures, compare their performances, and point out the main limitations which arise from this study.
Can Computers Create Humor?
Ritchie, Graeme (University of Aberdeen)
Despite the fact that AI has always been adventurous in trying to elucidate complex aspects of human behaviour, only recently has there been research into computational modelling of humor. One obstacle to progress is the lack of a precise and detailed theory of how humor operates. Nevertheless, since the early 1990s, there have been a number of small programs that create simple verbal humor, and more recently there have been studies of the automatic classification of the humorous status of texts. In addition, there are a number of advocates of the practical uses of computational humor: in user-interfaces, in education, and in advertising. Computer-generated humor is still quite basic, but it could be viewed as a form of exploratory creativity. For computational humor to improve, some hard problems in AI will have to be addressed.