Goto

Collaborating Authors

 Discourse & Dialogue


Learning Topical Translation Model for Microblog Hashtag Suggestion

AAAI Conferences

Hashtags can be viewed as an indication to the context of the tweet or as the core idea expressed in the tweet. They provide valuable information for many applications, such as information retrieval, opinion mining, text classification, and so on. However, only a small number of microblogs are manually tagged. To address this problem, in this work, we propose a topical translation model for microblog hashtag suggestion. We assume that the content and hashtags of the tweet are talking about the same themes but written in different languages. Under the assumption, hashtag suggestion is modeled as a translation process from content to hashtags. Moreover, in order to cover the topic of tweets, the proposed model regards the translation probability to be topic-specific. It uses topic-specific word trigger to bridge the vocabulary gap between the words in tweets and hashtags, and discovers the topics of tweets by a topic model designed for microblogs. Experimental results on the dataset crawled from real world microblogging service demonstrate that the proposed method outperforms state-of-the-art methods.


Leveraging Multi-Domain Prior Knowledge in Topic Models

AAAI Conferences

Topic models have been widely used to identify topics in text corpora. It is also known that purely unsupervised models often result in topics that are not comprehensible in applications. In recent years, a number of knowledge-based models have been proposed, which allow the user to input prior knowledge of the domain to produce more coherent and meaningful topics. In this paper, we go one step further to study how the prior knowledge from other domains can be exploited to help topic modeling in the new domain. This problem setting is important from both the application and the learning perspectives because knowledge is inherently accumulative. We human beings gain knowledge gradually and use the old knowledge to help solve new problems. To achieve this objective, existing models have some major difficulties. In this paper, we propose a novel knowledge-based model, called MDK-LDA, which is capable of using prior knowledge from multiple domains. Our evaluation results will demonstrate its effectiveness.


Discovering Different Types of Topics: Factored Topic Models

AAAI Conferences

In traditional topic models such as LDA, a word is generated by choosing a topic from a collection. However, existing topic models do not identify different types of topics in a document, such as topics that represent the content and topics that represent the sentiment. In this paper, our goal is to discover such different types of topics, if they exist. We represent our model as several parallel topic models (called topic factors), where each word is generated from topics from these factors jointly. Since the latent membership of the word is now a vector, the learning algorithms become challenging. We show that using a variational approximation still allows us to keep the algorithm tractable. Our experiments over several datasets show that our approach consistently outperforms many classic topic models while also discovering fewer, more meaningful, topics.


Generalized Relational Topic Models with Data Augmentation

AAAI Conferences

Relational topic models have shown promise on analyzing document network structures and discovering latent topic representations. This paper presents three extensions: 1) unlike the common link likelihood with a diagonal weight matrix that allows the-same-topic interactions only, we generalize it to use a full weight matrix that captures all pairwise topic interactions and is applicable to asymmetric networks; 2) instead of doing standard Bayesian inference, we perform regularized Bayesian inference with a regularization parameter to deal with the imbalanced link structure issue in common real networks; and 3) instead of doing variational approximation with strict mean-field assumptions, we present a collapsed Gibbs sampling algorithm for the generalized relational topic models without making restricting assumptions. Experimental results demonstrate the significance of these extensions on improving the prediction performance, and the time efficiency can be dramatically improved with a simple fast approximation method.


Topic Segmentation and Labeling in Asynchronous Conversations

Journal of Artificial Intelligence Research

Topic segmentation and labeling is often considered a prerequisite for higher-level conversation analysis and has been shown to be useful in many Natural Language Processing (NLP) applications. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the segmentation and labeling tasks in these asynchronous conversations. We propose a complete computational framework for topic segmentation and labeling in asynchronous conversations. Our approach extends state-of-the-art methods by considering a fine-grained structure of an asynchronous conversation, along with other conversational features by applying recent graph-based methods for NLP. For topic segmentation, we propose two novel unsupervised models that exploit the fine-grained conversational structure, and a novel graph-theoretic supervised model that combines lexical, conversational and topic features. For topic labeling, we propose two novel (unsupervised) random walk models that respectively capture conversation specific clues from two different sources: the leading sentences and the fine-grained conversational structure. Empirical evaluation shows that the segmentation and the labeling performed by our best models beat the state-of-the-art, and are highly correlated with human annotations.


Multiple Outcome Supervised Latent Dirichlet Allocation for Expert Discovery in Online Forums

AAAI Conferences

This paper presents a supervised bayesian approach to model expertise in online forums with application to question routing. The proposed method extends the well-known sLDA model to the multi-task case, accounting for a supervised stage with multiple outputs per document corresponding to the users of the system. A study of the characteristics of real world data revealed a number of challenges in the practical application of this model, relevant to the research community.


From Semantic to Emotional Space in Probabilistic Sense Sentiment Analysis

AAAI Conferences

This paper proposes an effective approach to model the emotional space of words to infer their Sense Sentiment Similarity (SSS). SSS reflects the distance between the words regarding their senses and underlying sentiments. We propose a probabilistic approach that is built on a hidden emotional model in which the basic human emotions are considered as hidden. This leads to predict a vector of emotions for each sense of the words, and then to infer the sense sentiment similarity. The effectiveness of the proposed approach is investigated in two Natural Language Processing tasks: Indirect yes/no Question Answer Pairs Inference and Sentiment Orientation Prediction.


Supervised Topic Model with Consideration of User and Item

AAAI Conferences

In this paper, we propose a new supervised topic model by incorporating the user and the item information. The proposed model can simultaneously utilize the textual topic and user-item factors for label prediction. We conduct prediction experiment with a public review dataset. The results demonstrate the advantages of our model. It shows clear improvement compared with traditional supervised topic model and recommendation method.


Co-Training Based Bilingual Sentiment Lexicon Learning

AAAI Conferences

In this paper, we address the issue of bilingual sentiment lexicon learning(BSLL) which aims to automatically and simultaneously generate sentiment words for two languages. The underlying motivation is that sentiment information from two languages can perform iterative mutual-teaching in the learning procedure. We propose to develop two classifiers to determine the sentiment polarities of words under a co-training framework, which makes full use of the two-view sentiment information from the two languages. The word alignment derived from the parallel corpus is leveraged to design effective features and to bridge the learning of the two classifiers. The experimental results on English and Chinese languages show the effectiveness of our approach in BSLL.


Exploring the Contribution of Unlabeled Data in Financial Sentiment Analysis

AAAI Conferences

With the proliferation of its applications in various industries, sentiment analysis by using publicly available web data has become an active research area in text classification during these years. It is argued by researchers that semi-supervised learning is an effective approach to this problem since it is capable to mitigate the manual labeling effort which is usually expensive and time-consuming. However, there was a long-term debate on the effectiveness of unlabeled data in text classification. This was partially caused by the fact that many assumptions in theoretic analysis often do not hold in practice. We argue that this problem may be further understood by adding an additional dimension in the experiment. This allows us to address this problem in the perspective of bias and variance in a broader view. We show that the well-known performance degradation issue caused by unlabeled data can be reproduced as a subset of the whole scenario. We argue that if the bias-variance trade-off is to be better balanced by a more effective feature selection method unlabeled data is very likely to boost the classification performance. We then propose a feature selection framework in which labeled and unlabeled training samples are both considered. We discuss its potential in achieving such a balance. Besides, the application in financial sentiment analysis is chosen because it not only exemplifies an important application, the data possesses better illustrative power as well. The implications of this study in text classification and financial sentiment analysis are both discussed.