Information Extraction
Visualizing Community Resilience Metrics from Twitter Data
Patton, Robert (Oak Ridge National Laboratory) | Steed, Chad (Oak Ridge National Laboratory) | Stahl, Chris (Oak Ridge National Laboratory)
The recent explosive growth of smart phones and social media creates a unique opportunity to view events from various unique perspectives. Unfortunately, this relatively new form of communication lacks the structural integrity, accuracy, and reduced noise of other forms of communication. Nevertheless, social media increasingly plays a vital role in the observation of societal actions before, during, and after significant events. Hurricane Sandy's landfall on the northeastern coast of the United States in October 2012 demonstrated this role. This work provides a preliminary view into how social media could be used to monitor and gauge community resilience to such natural disasters. We observe, evaluate, and visualize how Twitter data evolves over time before, during, and after a natural disaster such as Hurricane Sandy and what opportunities there may be to leverage social media for situational awareness and emergency response.
Mining Facebook Data for Predictive Personality Modeling
Markovikj, Dejan (Saints Cyril and Methodius University in Skopje) | Gievska, Sonja (Saints Cyril and Methodius University in Skopje) | Kosinski, Michal (University of Cambridge) | Stillwell, David J. (University of Cambridge)
Beyond being facilitators of human interactions, social networks have become an interesting target of research, providing rich information for studying and modeling user behavior. Identification of personality-related indicators encrypted in Facebook profiles and activities is of special concern in our current research efforts. This paper explores the feasibility of modeling user personality based on a proposed set of features extracted from Facebook data. The encouraging results of our study, exploring the suitability and performance of several classification techniques, will also be presented.
Analyzing Political Sentiment on Twitter
Ringsquandl, Martin (University of Applied Sciences Rosenheim) | Petkovic, Dusan (University of Applied Sciences Rosenheim)
Due to the vast amount of user-generated content in the emerging Web 2.0, there is a growing need for computational processing of sentiment analysis in documents. Most of the current research in this field is devoted to product reviews from websites. Microblogs and social networks pose an even greater challenge to sentiment classification. However, marketing and political campaigns in particular benefit from opinions expressed on Twitter or other social communication platforms. The objects of interest in this paper are the presidential candidates of the Republican Party in the USA and their campaign topics. In this paper we introduce the combination of the noun phrases’ frequency and their PMI measure as a constraint on aspect extraction. This compensates for sparse phrases receiving a higher score than those composed of high-frequency words. Evaluation shows that the meronymy relationship between politicians and their topics holds and improves accuracy of aspect extraction.
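The interaction between phrase frequency and PMI described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the helper names `pmi_scores` and `select_aspects`, the restriction to adjacent word pairs, and the thresholds `min_count` and `min_pmi` are all assumptions made for illustration. The abstract does not state how the two measures are combined, so a simple joint threshold is used here.

```python
import math
from collections import Counter


def pmi_scores(tokenized_docs):
    """Return {(w1, w2): (count, pmi)} for adjacent word pairs.

    PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities
    estimated from raw unigram and bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = (count, math.log2(p_xy / (p_x * p_y)))
    return scores


def select_aspects(scores, min_count=2, min_pmi=0.0):
    """Keep pairs that are both frequent enough and strongly associated.

    The joint threshold is a hypothetical way to combine the two signals.
    """
    return sorted(
        pair
        for pair, (count, pmi) in scores.items()
        if count >= min_count and pmi >= min_pmi
    )


# Hypothetical toy corpus: "tax reform" recurs, "rare phrase" occurs once.
docs = [
    ["tax", "reform", "is", "needed"],
    ["tax", "reform", "matters"],
    ["health", "care", "and", "tax", "reform"],
    ["rare", "phrase", "here"],
]

scores = pmi_scores(docs)
aspects = select_aspects(scores, min_count=2, min_pmi=1.0)
```

In this toy corpus the one-off pair ("rare", "phrase") gets a higher PMI than the recurring pair ("tax", "reform"), which is the sparsity effect the abstract mentions; the frequency constraint is what removes it from the candidate aspects.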
A CCG-Based Approach to Fine-Grained Sentiment Analysis in Microtext
Smith, Phillip (University of Birmingham) | Lee, Mark (University of Birmingham)
In this paper, we present a Combinatory Categorial Grammar (CCG) based approach to the classification of emotion in microtext. We develop a method that makes use of the notion put forward by Ortony, Clore, and Collins (1988), that emotions are valenced reactions. This hypothesis sits central to our system, in which we adapt contextual valence shifters to infer the emotional content of a text. We integrate this with an augmented version of WordNet-Affect, which acts as our lexicon. Finally, we experiment with a corpus of headlines proposed in the 2007 SemEval Affective Task (Strapparava and Mihalcea 2007) as our microtext corpus, and by taking the other competing systems as a baseline, demonstrate that our approach to emotion categorisation performs favourably.
Automatic Aggregation by Joint Modeling of Aspects and Values
We present a model for aggregation of product review snippets by joint aspect identification and sentiment analysis. Our model simultaneously identifies an underlying set of ratable aspects presented in the reviews of a product (e.g., sushi and miso for a Japanese restaurant) and determines the corresponding sentiment of each aspect. This approach directly enables discovery of highly-rated or inconsistent aspects of a product. Our generative model admits an efficient variational mean-field inference algorithm. It is also easily extensible, and we describe several modifications and their effects on model structure and inference. We test our model on two tasks, joint aspect identification and sentiment analysis on a set of Yelp reviews and aspect identification alone on a set of medical summaries. We evaluate the performance of the model on aspect identification, sentiment analysis, and per-word labeling accuracy. We demonstrate that our model outperforms applicable baselines by a considerable margin, yielding up to 32% relative error reduction on aspect identification and up to 20% relative error reduction on sentiment analysis.
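For readers unfamiliar with the metric, relative error reduction is conventionally the drop in error rate divided by the baseline error rate. The sketch below assumes that standard definition; the function name and the error rates are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_error, model_error):
    """Fraction of the baseline's error that the model removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical error rates: a baseline at 25% error and a model at 17% error.
reduction = relative_error_reduction(0.25, 0.17)
```

With these illustrative numbers the model removes about 32% of the baseline's errors, even though the absolute error rate falls by only 8 percentage points.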
Discovering Basic Emotion Sets via Semantic Clustering on a Twitter Corpus
A plethora of words are used to describe the spectrum of human emotions, but how many emotions are there really, and how do they interact? Over the past few decades, several theories of emotion have been proposed, each based around the existence of a set of 'basic emotions', and each supported by an extensive variety of research including studies in facial expression, ethology, neurology and physiology. Here we present research based on a theory that people transmit their understanding of emotions through the language they use surrounding emotion keywords. Using a labelled corpus of over 21,000 tweets, six of the basic emotion sets proposed in existing literature were analysed using Latent Semantic Clustering (LSC), evaluating the distinctiveness of the semantic meaning attached to the emotional label. We hypothesise that the more distinct the language used to express a certain emotion, the more distinct the perception (including proprioception) of that emotion is, and thus the more 'basic' the emotion. This allows us to select the dimensions best representing the entire spectrum of emotion. We find that Ekman's set, arguably the most frequently used for classifying emotions, is in fact the most semantically distinct overall. Next, taking all analysed (that is, previously proposed) emotion terms into account, we determine the optimal semantically irreducible basic emotion set using an iterative LSC algorithm. Our newly-derived set (Accepting, Ashamed, Contempt, Interested, Joyful, Pleased, Sleepy, Stressed) generates a 6.1% increase in distinctiveness over Ekman's set (Angry, Disgusted, Joyful, Sad, Scared). We also demonstrate how using LSC data can help visualise emotions. We introduce the concept of an Emotion Profile and briefly analyse compound emotions both visually and mathematically.
Learning to Predict from Textual Data
Radinsky, K., Davidovich, S., Markovitch, S.
Given a current news event, we tackle the problem of generating plausible predictions of future events it might cause. We present a new methodology for modeling and predicting such future news events using machine learning and data mining techniques. Our Pundit algorithm generalizes examples of causality pairs to infer a causality predictor. To obtain precisely labeled causality examples, we mine 150 years of news articles and apply semantic natural language modeling techniques to headlines containing certain predefined causality patterns. For generalization, the model uses a vast number of world knowledge ontologies. Empirical evaluation on real news articles shows that our Pundit algorithm performs as well as non-expert humans.
Learning with Scope, with Application to Information Extraction and Classification
Blei, David, Bagnell, J Andrew, McCallum, Andrew
In probabilistic approaches to classification and information extraction, one typically builds a statistical model of words under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data. For example, in information extraction from web pages, word formatting may be indicative of extraction category in different ways on different web pages. The difficulty with using such features is capturing and exploiting the new regularities encountered in previously unseen data. In this paper, we propose a hierarchical probabilistic model that uses both local/scope-limited features, such as word formatting, and global features, such as word content. The local regularities are modeled as an unobserved random parameter which is drawn once for each local data set. This random parameter is estimated during the inference process and then used to perform classification with both the local and global features, a procedure which is akin to automatically retuning the classifier to the local regularities on each newly encountered web page. Exact inference is intractable and we present approximations via point estimates and variational methods. Empirical results on large collections of web data demonstrate that this method significantly improves performance over traditional models of global features alone.
Notes about the OntoGene Pipeline
Rinaldi, Fabio (University of Zurich) | Clematide, Simon (University of Zurich) | Schneider, Gerold (University of Zurich) | Grigonyte, Gintare (University of Zurich)
In this paper we describe the architecture of the OntoGene Relation mining pipeline and some of its recent applications. With this research overview paper we intend to contribute to the recently started discussion on standards for information extraction architectures in the biomedical domain. Our approach delivers domain entities mentioned in each input document, as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation.
Discovering Health Beliefs in Twitter
Bhattacharya, Sanmitra (The University of Iowa) | Tran, Hung (The University of Iowa) | Srinivasan, Padmini (The University of Iowa)
Social networking websites such as Twitter have invigorated a wide range of studies in recent years, ranging from consumer opinions on products to tracking the spread of diseases. While sentiment analysis and opinion mining from tweets have been studied extensively, surveillance of beliefs, especially those related to public health, has received considerably less attention. In our previous work, we proposed a model for surveillance of health beliefs on Twitter relying on the use of hand-picked probe statements expressing various health-related propositions. In this work we extend our model to automatically discover various probes related to public health beliefs. We present a data-driven approach based on two distinct datasets and study the prevalence of public belief, disbelief or doubt for newly discovered probe statements.