Goto

Collaborating Authors

 Information Extraction


What Is an Opinion About? Exploring Political Standpoints Using Opinion Scoring Model

AAAI Conferences

In this paper, we propose a generative model to automatically discover the hidden associations between topics words and opinion words. By applying those discovered hidden associations, we construct the opinion scoring models to extract statements which best express opinionists’ standpoints on certain topics. For experiments, we apply our model to the political area. First, we visualize the similarities and dissimilarities between Republican and Democratic senators with respect to various topics. Second, we compare the performance of the opinion scoring models with 14 kinds of methods to find the best ones. We find that sentences extracted by our opinion scoring models can effectively express opinionists’ standpoints.


Constructing Reference Sets from Unstructured, Ungrammatical Text

Journal of Artificial Intelligence Research

Vast amounts of text on the Web are unstructured and ungrammatical, such as classified ads, auction listings, forum postings, etc. We call such text posts. Despite their inconsistent structure and lack of grammar, posts are full of useful information. This paper presents work on semi-automatically building tables of relational information, called reference sets, by analyzing such posts directly. Reference sets can be applied to a number of tasks such as ontology maintenance and information extraction. Our reference-set construction method starts with just a small amount of background knowledge, and constructs tuples representing the entities in the posts to form a reference set. We also describe an extension to this approach for the special case where even this small amount of background knowledge is impossible to discover and use. To evaluate the utility of the machine-constructed reference sets, we compare them to manually constructed reference sets in the context of reference-set-based information extraction. Our results show the reference sets constructed by our method outperform manually constructed reference sets. We also compare the reference-set-based extraction approach using the machine-constructed reference set to supervised extraction approaches using generic features. These results demonstrate that using machine-constructed reference sets outperforms the supervised methods, even though the supervised methods require training data.


From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series

AAAI Conferences

We connect measures of public opinion measured from polls with sentiment measured from text. We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contempora- neous Twitter messages. While our results vary across datasets, in several cases the correlations are as high as 80%, and capture important large-scale trends. The re- sults highlight the potential of text streams as a substi- tute and supplement for traditional polling. consumer confidence and political opinion, and can also pre- dict future movements in the polls. We find that temporal smoothing is a critically important issue to support a suc- cessful model.


Classifier Calibration for Multi-Domain Sentiment Classification

AAAI Conferences

Textual sentiment classifiers classify texts into a fixed number of affective classes, such as positive, negative or neutral sentiment, or subjective versus objective information. It has been observed that sentiment classifiers suffer from a lack of generalization capability: a classifier trained on a certain domain generally performs worse on data from another domain. This phenomenon has been attributed to domain-specific affective vocabulary. In this paper, we propose a voting-based thresholding approach, which calibrates a number of existing single-domain classifiers with respect to sentiment data from a new domain. The approach presupposes only a small amount of annotated data from the new domain. We evaluate three criteria for estimating thresholds, and discuss the ramifications of these criteria for the trade-off between classifier performance and manual annotation effort.


Generating Domain-Specific Clues Using News Corpus for Sentiment Classification

AAAI Conferences

This paper addresses the problem of automatically generating domain-specific sentiment clues. The main idea is to bootstrap from a small seed set and generate new clues by using dependencies and collocation information between sentiment clues and sentence-level topics that would be a primary subject of sentiment expression (e.g., event, company, and person). The experiments show that the aggregated clues are effective for sentiment classification.


The Wisdom of Bookies? Sentiment Analysis Versus. the NFL Point Spread

AAAI Conferences

The American Football betting market provides a particularly attractive domain to study the nexus between public sentiment and the wisdom of crowds. In this paper, we present the first substantial study of the relationship between the NFL betting line and public opinion expressed in blogs and microblogs (Twitter). We perform a large-scale study of four distinct text streams: LiveJournal blogs, RSS blog feeds captured by Spinn3r, Twitter, and traditional news media. Our results show interesting disparities between the first and second halves of each season. We present evidence showing usefulness of sentiment on NFL betting. We demonstrate that a strategy betting roughly 30 games per year identified winner roughly 60% of the time from 2006 to 2009, well beyond what is needed to overcome the bookie's typical commission(53%).


Socio-Legal Analysis of Criminal Sentences: A Preliminary Study

AAAI Conferences

This paper discusses a research based on analyzing criminal sentences on criminal trials on organized crime activity in Sicily pronounced from 2000 through 2006. Large criminal sentences related dataset collection activity in Italy is severely constrained for various reasons such as difficulty of data collection at the courthouses, unavailability of data in digital format, and classification criteria used in the public archives. Thus, in general, judicial statistics suffer from lack of reliability and informativeness. The objective of this research is to analyze the text of criminal sentences in a revisable and verifiable way, so that information is extracted on the trial leading to the sentence, the socio-economic environment in which the relevant events occurred, and the differences between the various districts conducting the trials. The purpose is to elaborate a tool of automated analysis of the text of the sentences that is generalizable to other areas of jurisprudence, and, outside of jurisprudence, to other temporal and geographical contexts. The 726 criminal sentences that have been converted into text files have been pronounced at all judicial levels in the four Sicilian districts for mafia-related crimes. This research is relevant because, for the first time in Italy, we aim to empirically describe the juridical response to the phenomenon of organized crime, by using a large and extendable database of criminal sentences that can be analyzed with data mining techniques, rather than deriving general conclusions from a focused small set of sentences.


“How Incredibly Awesome!” — Click Here to Read More

AAAI Conferences

We investigate the impact of a discussion snippet's overall sentiment on a user's willingness to read more of a discussion. Using sentiment analysis, we constructed positive, neutral, and negative discussion snippets using the discussion topic and a sample comment from discussions taking place around content on an enterprise social networking site. We computed personalized snippet recommendations for a subset of users and conducted a survey to test how these recommendations were perceived. Our experimental results show that snippets with high sentiments are better discussion "teasers."


Regularized Learning with Networks of Features

Neural Information Processing Systems

For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, or when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should therefore be expected to have similar weights in a good model. Here we present a framework for regularized learning in settings where one has prior knowledge about which features are expected to have similar and dissimilar weights. This prior knowledge is encoded as a graph whose vertices represent features and whose edges represent similarities and dissimilarities between them. During learning, each feature's weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using graphs of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature graphs constructed from declarative human knowledge, as well as from auxiliary task learning, significantly improve prediction accuracy.


Automatic annotation of multilingual text collections with a conceptual thesaurus

arXiv.org Artificial Intelligence

Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. The mapping of texts onto the same thesaurus furthermore allows to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper presents an almost language-independent system that maps documents written in different languages onto the same multilingual conceptual thesaurus, EUROVOC. Conceptual thesauri differ from Natural Language Thesauri in that they consist of relatively small controlled lists of words or phrases with a rather abstract meaning. To automatically identify which thesaurus descriptors describe the contents of a document best, we developed a statistical, associative system that is trained on texts that have previously been indexed manually. In addition to describing the large number of empirically optimised parameters of the fully functional application, we present the performance of the software according to a human evaluation by professional indexers.