Discourse & Dialogue
Prior-Based Dual Additive Latent Dirichlet Allocation for User-Item Connected Documents
Zhang, Wei (Tsinghua University and Tsinghua National Laboratory for Information Science and Technology) | Wang, Jianyong (Tsinghua University and Tsinghua National Laboratory for Information Science and Technology)
User-item connected documents, such as customer reviews for specific items in online shopping website and user tips in location-based social networks, have become more and more prevalent recently. Inferring the topic distributions of user-item connected documents is beneficial for many applications, including document classification and summarization of users and items. While many different topic models have been proposed for modeling multiple text, most of them cannot account for the dual role of user-item connected documents (each document is related to one user and one item simultaneously) in topic distribution generation process. In this paper, we propose a novel probabilistic topic model called Prior-based Dual Additive Latent Dirichlet Allocation (PDA-LDA). It addresses the dual role of each document by associating its Dirichlet prior for topic distribution with user and item topic factors, which leads to a document-level asymmetric Dirichlet prior. In the experiments, we evaluate PDA-LDA on several real datasets and the results demonstrate that our model is effective in comparison to several other models, including held-out perplexity on modeling text and document classification application.
Target-Dependent Twitter Sentiment Classification with Rich Automatic Features
Vo, Duy-Tin (Singapore University of Technology and Design) | Zhang, Yue (Singapore University of Technology and Design)
Target-dependent sentiment analysis on Twitter has attracted increasing research attention. Most previous work relies on syntax, such as automatic parse trees, which are subject to noise for informal text such as tweets. In this paper, we show that competitive results can be achieved without the use of syntax, by extracting a rich set of automatic features. In particular, we split a tweet into a left context and a right context according to a given target, using distributed word representations and neural pooling functions to extract features. Both sentiment-driven and standard embeddings are used, and a rich set of neural pooling functions are explored. Sentiment lexicons are used as an additional source of information for feature extraction. In standard evaluation, the conceptually simple method gives a 4.8% absolute improvement over the state-of-the-art on three-way targeted sentiment classification, achieving the best reported results for this task.
Positive, Negative, or Neutral: Learning an Expanded Opinion Lexicon from Emoticon-Annotated Tweets
Bravo-Marquez, Felipe (The University of Waikato) | Frank, Eibe (The University of Waikato) | Pfahringer, Bernhard (The University of Waikato)
We present a supervised framework for expanding an opinion lexicon for tweets. The lexicon contains part-of-speech (POS) disambiguated entries with a three-dimensional probability distribution for positive, negative, and neutral polarities. To obtain this distribution using machine learning, we propose word-level attributes based on POS tags and information calculated from streams of emoticon-annotated tweets. Our experimental results show that our method outperforms the three-dimensional word-level polarity classification performance obtained by semantic orientation, a state-of-the-art measure for establishing world-level sentiment.
Efficient Learning for Undirected Topic Models
Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel estimator to speed up the learning based on Noise Contrastive Estimate, extended for documents of variant lengths and weighted inputs. Experiments on two benchmarks show that the new estimator achieves great learning efficiency and high accuracy on document retrieval and classification.
Linguistic Harbingers of Betrayal: A Case Study on an Online Strategy Game
Niculae, Vlad, Kumar, Srijan, Boyd-Graber, Jordan, Danescu-Niculescu-Mizil, Cristian
Interpersonal relations are fickle, with close friendships often dissolving into enmity. In this work, we explore linguistic cues that presage such transitions by studying dyadic interactions in an online strategy game where players form alliances and break those alliances through betrayal. We characterize friendships that are unlikely to last and examine temporal patterns that foretell betrayal. We reveal that subtle signs of imminent betrayal are encoded in the conversational patterns of the dyad, even if the victim is not aware of the relationship's fate. In particular, we find that lasting friendships exhibit a form of balance that manifests itself through language. In contrast, sudden changes in the balance of certain conversational attributes---such as positive sentiment, politeness, or focus on future planning---signal impending betrayal.
Continuous Time Dynamic Topic Models
Wang, Chong, Blei, David, Heckerman, David
In this paper, we develop the continuous time dynamic topic model (cDTM). The cDTM is a dynamic topic model that uses Brownian motion to model the latent topics through a sequential collection of documents, where a "topic" is a pattern of word use that we expect to evolve over the course of the collection. We derive an efficient variational approximate inference algorithm that takes advantage of the sparsity of observations in text, a property that lets us easily handle many time points. In contrast to the cDTM, the original discrete-time dynamic topic model (dDTM) requires that time be discretized. Moreover, the complexity of variational inference for the dDTM grows quickly as time granularity increases, a drawback which limits fine-grained discretization. We demonstrate the cDTM on two news corpora, reporting both predictive perplexity and the novel task of time stamp prediction.
Infinite Author Topic Model based on Mixed Gamma-Negative Binomial Process
Xuan, Junyu, Lu, Jie, Zhang, Guangquan, Da Xu, Richard Yi, Luo, Xiangfeng
Incorporating the side information of text corpus, i.e., authors, time stamps, and emotional tags, into the traditional text mining models has gained significant interests in the area of information retrieval, statistical natural language processing, and machine learning. One branch of these works is the so-called Author Topic Model (ATM), which incorporates the authors's interests as side information into the classical topic model. However, the existing ATM needs to predefine the number of topics, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability on fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. To be specific, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for the multi-author's contribution towards this document. An efficient Gibbs sampling inference algorithm with each conditional distribution being closed-form is developed for the IAT model. Experiments on several real-world datasets show the capabilities of our IAT model to learn the hidden topics, authors' interests on these topics and the number of topics simultaneously.
Nonparametric Relational Topic Models through Dependent Gamma Processes
Xuan, Junyu, Lu, Jie, Zhang, Guangquan, Da Xu, Richard Yi, Luo, Xiangfeng
Traditional Relational Topic Models provide a way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensional reduction, document clustering, link prediction, benefit from this revealed knowledge. However, existing relational topic models are based on an assumption that the number of hidden topics is known in advance, and this is impractical in many real-world applications. Therefore, in order to relax this assumption, we propose a nonparametric relational topic model in this paper. Instead of using fixed-dimensional probability distributions in its generative model, we use stochastic processes. Specifically, a gamma process is assigned to each document, which represents the topic interest of this document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics are shared by all the documents. In order to resolve these challenges, we use a subsampling strategy to assign each document a different gamma process from the global gamma process, and the subsampling probabilities of documents are assigned with a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and its number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capabilities of learning the hidden topics and, more importantly, the number of topics.
TickTock: A Non-Goal-Oriented Multimodal Dialog System with Engagement Awareness
Yu, Zhou (Carnegie Mellon University) | Papangelis, Alexandros (Carnegie Mellon University) | Rudnicky, Alexander (Carnegie Mellon University)
We describe TickTock, a conversational agent designed to engage humans on topics of its choosing and to carry on an interaction for as long as possible. Our prototype uses a database of talk show transcripts featuring guests from the film industry. To be an interesting companion Tick Tock uses immediate context from the last two turns to formulate queries into a database of utterances. The process is automatic. TickTock monitors user engagement and performs certain moves, such as topic shifts, based on its assessment of user state. Initially we used utterance content for monitoring and subsequently we begun to investigate non-language cues, such as prosody and visual cues to create a more robust engagement model based on multiple human communication channels.