 Discourse & Dialogue


Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions

AAAI Conferences

Clinical research on new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of multi-dimensional latent text models, such as factorial LDA, that capture orthogonal factors of corpora, creating structured output that helps researchers better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover the specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative results and more interpretable output.


Discovering Health Beliefs in Twitter

AAAI Conferences

Social networking websites such as Twitter have invigorated a wide range of studies in recent years, ranging from consumer opinions on products to tracking the spread of diseases. While sentiment analysis and opinion mining from tweets have been studied extensively, surveillance of beliefs, especially those related to public health, has received considerably less attention. In our previous work, we proposed a model for surveillance of health beliefs on Twitter that relies on hand-picked probe statements expressing various health-related propositions. In this work we extend our model to automatically discover various probes related to public health beliefs. We present a data-driven approach based on two distinct datasets and study the prevalence of public belief, disbelief or doubt for newly discovered probe statements.


Factorized Multi-Modal Topic Model

arXiv.org Machine Learning

Multi-modal data collections, such as corpora of paired images and text snippets, require analysis methods beyond single-view component and topic models. For continuous observations the current dominant approach is based on extensions of canonical correlation analysis, factorizing the variation into components shared by the different modalities and those private to each of them. For count data, multiple variants of topic models attempting to tie the modalities together have been presented. All of these, however, lack the ability to learn components private to one modality, and consequently will try to force dependencies even between minimally correlating modalities. In this work we combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics. The model is shown to be especially useful for querying the contents of one domain given samples of the other.


Latent Dirichlet Allocation Uncovers Spectral Characteristics of Drought Stressed Plants

arXiv.org Machine Learning

Understanding the adaptation process of plants to drought stress is essential for improving management practices and breeding strategies, as well as for engineering viable crops for sustainable agriculture in the coming decades. Hyper-spectral imaging provides a particularly promising approach to gaining such understanding, since it allows the non-destructive discovery of spectral characteristics of plants that are governed primarily by the scattering and absorption characteristics of the leaf's internal structure and biochemical constituents. Several drought stress indices have been derived using hyper-spectral imaging. However, they are typically based on only a few hyper-spectral images, rely on expert interpretation, and consider only a few wavelengths. In this study, we present the first data-driven approach to discovering spectral drought stress indices, treating it as an unsupervised labeling problem at massive scale. To make use of short-range dependencies between spectral wavelengths, we develop an online variational Bayes algorithm for latent Dirichlet allocation with a convolved Dirichlet regularizer. This approach scales to massive datasets and, hence, provides a more objective complement to plant physiological practices. The spectral topics found conform to plant physiological knowledge and can be computed in a fraction of the time required by existing LDA approaches.


Opinion Mining for Relating Subjective Expressions and Annual Earnings in US Financial Statements

arXiv.org Artificial Intelligence

Financial statements contain quantitative information and managers' subjective evaluations of a firm's financial status. We use information released in U.S. 10-K filings. Both qualitative and quantitative appraisals are crucial for quality financial decisions. To extract such opinionated statements from the reports, we built tagging models based on conditional random field (CRF) techniques, considering a variety of combinations of linguistic factors including morphology, orthography, predicate-argument structure, syntax, and simple semantics. Our results show that the CRF models are reasonably effective at finding opinion holders in experiments in which we adopted the popular MPQA corpus for training and testing. The contribution of our paper is to identify opinion patterns in the form of multiword expressions (MWEs) rather than single words. We find that the managers of corporations attempt to use more optimistic words to obfuscate negative financial performance and to accentuate positive financial performance. Our results also show that decreasing earnings were often accompanied by ambiguous and mild statements in the reporting year and that increasing earnings were stated in an assertive and positive way.


A non-parametric mixture model for topic modeling over time

arXiv.org Machine Learning

A single, stationary topic model such as latent Dirichlet allocation is inappropriate for modeling corpora that span long time periods, as the popularity of topics is likely to change over time. A number of models that incorporate time have been proposed, but in general they either exhibit limited forms of temporal variation or require computationally expensive inference methods. In this paper we propose nonparametric Topics over Time (npTOT), a model for time-varying topics that allows an unbounded number of topics and a flexible distribution over the temporal variations in those topics' popularity. We develop a collapsed Gibbs sampler for the proposed model and compare against existing models on synthetic and real document sets.


Toward Habitable Assistance from Spoken Dialogue Systems

AAAI Conferences

Spoken dialogue is increasingly central to systems that assist people. As the tasks that people and machines speak about together become more complex, however, users' dissatisfaction with those systems is an important concern. This paper presents a novel approach to learning for spoken dialogue systems. It describes embedded wizardry, a methodology for learning from skilled people, and applies it to a library whose patrons order books by telephone. To address the challenges inherent in this application, we introduce RFW+, a domain-independent, feature-selection method that considers feature categories. Models learned with RFW+ on embedded-wizard data improve the performance of a traditional spoken dialogue system.


Heart Rate Topic Models

AAAI Conferences

A key challenge in reducing the burden of cardiovascular disease is matching patients to the treatments that are most appropriate for them. Different cardiac assessment tools have been developed to address this goal. Recent research has focused on heart rate motifs, i.e., short-term heart rate sequences that are over- or under-represented in long-term electrocardiogram (ECG) recordings of patients experiencing cardiovascular outcomes, which provide novel and valuable information for risk stratification. However, this approach can leverage only a small number of motifs for prediction and results in models that are difficult to interpret. We address these limitations by identifying latent structure in the large numbers of motifs found in long-term ECG recordings. In particular, we explore the application of topic models to heart rate time series to identify functional sets of heart rate sequences and to concisely describe patients using task-independent features for various cardiovascular outcomes. We evaluate the approach on a large collection of real-world ECG data, and investigate the performance of topic mixture features for the prediction of cardiovascular mortality. The topics provided an interpretable representation of the recordings and maintained valuable information for clinical assessment when compared with motif frequencies, even after accounting for commonly used clinical risk scores.


Sentiment Classification Using the Meaning of Words

AAAI Conferences

Sentiment Classification (SC) is the task of assigning a positive, negative or neutral label to a piece of text based on its overall opinion. This paper describes our in-progress work on extracting the meaning of words for SC. In particular, we investigate the utility of sense-level polarity information for SC. We first show that methods based on common classification features are not robust and that their performance varies widely across different domains. We then show that sense-level polarity features can significantly improve the performance of SC. We use datasets in different domains to study the robustness of the designated features. Our preliminary results show that using the most common sense of each word yields the most robust results across different domains. In addition, we observe that sense-level polarity information is useful for producing a set of high-quality seed words which can be used to further improve the SC task.


Emoticon Smoothed Language Models for Twitter Sentiment Analysis

AAAI Conferences

Twitter sentiment analysis (TSA) has become a hot research topic in recent years. The goal of this task is to discover the attitude or opinion expressed in tweets, which is typically formulated as a machine-learning-based text classification problem. Some methods use manually labeled data to train fully supervised models, while others use noisy labels, such as emoticons and hashtags, for model training. In general, only a limited amount of training data is available for the fully supervised models because manually labeling tweets is very labor-intensive and time-consuming. Models trained with noisy labels, in contrast, have plenty of training data but struggle to achieve satisfactory performance because of the noise in the labels. Hence, the best strategy is to utilize both manually labeled data and noisily labeled data for training. However, how to seamlessly integrate these two different kinds of data into the same learning framework is still a challenge. In this paper, we present a novel model, called the emoticon smoothed language model (ESLAM), to handle this challenge. The basic idea is to train a language model on the manually labeled data, and then use the noisy emoticon data for smoothing. Experiments on real data sets demonstrate that ESLAM can effectively integrate both kinds of data to outperform methods that use only one of them.
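
The basic idea stated above (a language model trained on manually labeled data, smoothed with noisy emoticon data) can be illustrated with a minimal sketch. This is not the ESLAM implementation: the abstract does not say how the smoothing is done, so the sketch assumes per-class unigram models combined by linear interpolation with a hypothetical weight `alpha`. The toy tweets, the add-one smoothing, and the function names `unigram_probs`, `interpolate` and `classify` are likewise illustrative assumptions.

```python
import math
from collections import Counter


def unigram_probs(docs, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for doc in docs for w in doc.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_manual, p_noisy, alpha):
    """Assumed smoothing step: linear mix of the manual and noisy models."""
    return {w: alpha * p_manual[w] + (1 - alpha) * p_noisy[w] for w in p_manual}


def classify(text, models):
    """Pick the class whose model gives the text the highest log-likelihood."""
    scores = {
        label: sum(math.log(probs[w]) for w in text.split() if w in probs)
        for label, probs in models.items()
    }
    return max(scores, key=scores.get)


# Toy data (hypothetical): a few manually labeled tweets and more
# emoticon-labeled ones, grouped by sentiment class.
manual = {"pos": ["good phone"], "neg": ["bad phone"]}
noisy = {
    "pos": ["love this great phone", "great day"],
    "neg": ["hate this awful phone", "awful day"],
}
vocab = {
    w
    for group in (manual, noisy)
    for docs in group.values()
    for doc in docs
    for w in doc.split()
}

# One smoothed language model per class.
models = {
    label: interpolate(
        unigram_probs(manual[label], vocab),
        unigram_probs(noisy[label], vocab),
        alpha=0.5,
    )
    for label in manual
}

print(classify("great", models))
print(classify("awful", models))
```

In this toy setup the words "great" and "awful" never occur in the manually labeled tweets, so a model built from the manual data alone cannot separate them; the emoticon-labeled data supplies that evidence through the interpolation.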