Sparse Stochastic Inference for Latent Dirichlet allocation

arXiv.org Machine Learning

We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. We used our algorithm to analyze a corpus of 1.2 million books (33 billion words) with thousands of topics. Our approach reduces the bias of variational inference and generalizes to many Bayesian hidden-variable models.
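As a rough illustration of how Gibbs sampling and online stochastic updates can be combined, the following is a minimal sketch and not the authors' implementation. It assumes documents are lists of word ids, samples topic assignments for each minibatch document with the topics held fixed, and blends the resulting sparse counts into the topic-word parameters with a decaying step size. The names `step_size`, `sample_doc_counts`, `online_update`, and all hyperparameter defaults are hypothetical. As a simplification, it keeps only the final Gibbs sample and uses normalized topic-word weights.

```python
import random


def step_size(t, tau=1.0, kappa=0.7):
    """Decaying Robbins-Monro step size; kappa in (0.5, 1] is the usual range."""
    return (tau + t) ** (-kappa)


def sample_doc_counts(doc, topic_word, alpha, num_sweeps, rng):
    """Gibbs-sample topic assignments for one document with topics held fixed.

    Returns sparse counts {(topic, word): n}; only sampled pairs appear.
    """
    num_topics = len(topic_word)
    assignments = [rng.randrange(num_topics) for _ in doc]
    doc_topic = [0] * num_topics
    for z in assignments:
        doc_topic[z] += 1
    for _ in range(num_sweeps):
        for i, w in enumerate(doc):
            # Remove the current token, then resample its topic.
            doc_topic[assignments[i]] -= 1
            weights = [(doc_topic[k] + alpha) * topic_word[k][w]
                       for k in range(num_topics)]
            z = rng.choices(range(num_topics), weights=weights)[0]
            assignments[i] = z
            doc_topic[z] += 1
    counts = {}
    for w, z in zip(doc, assignments):
        counts[(z, w)] = counts.get((z, w), 0) + 1
    return counts


def online_update(lam, minibatch, t, corpus_size, eta=0.01, alpha=0.1,
                  num_sweeps=5, rng=None):
    """One stochastic update of the topic-word parameters `lam`."""
    rng = rng or random.Random(0)
    topic_word = [[v / sum(row) for v in row] for row in lam]
    stats = {}
    for doc in minibatch:
        doc_counts = sample_doc_counts(doc, topic_word, alpha, num_sweeps, rng)
        for key, n in doc_counts.items():
            stats[key] = stats.get(key, 0) + n
    rho = step_size(t)
    scale = corpus_size / len(minibatch)  # rescale minibatch to corpus size
    # Blend the old parameters with the prior, then add the sparse counts.
    new_lam = [[(1 - rho) * v + rho * eta for v in row] for row in lam]
    for (k, w), n in stats.items():
        new_lam[k][w] += rho * scale * n
    return new_lam
```

Because only the sampled (topic, word) pairs are touched in the final loop, the cost of folding in a minibatch depends on the number of distinct pairs actually sampled rather than on the full topic-by-vocabulary table.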


Scalable Inference for Logistic-Normal Topic Models

Neural Information Processing Systems

Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restrictive mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise of the approach.



Time-Aware Latent Concept Expansion for Microblog Search

AAAI Conferences

Incorporating the temporal property of words into query expansion methods based on relevance feedback has been shown to have a significant positive effect on microblog search. In contrast to such word-based query expansion methods, we propose a concept-based query expansion method based on a temporal relevance model that uses the temporal variation of concepts (e.g., terms and phrases) on microblogs. Our model naturally extends an extremely effective existing concept-based relevance model by tracking the concept frequency over time. Moreover, the proposed model produces important concepts that are frequently used within a particular time period associated with a given topic, which better discriminate between relevant and non-relevant microblog documents than words. Our experiments using a corpus of microblog data (Tweets2011 corpus) show that the proposed concept-based query expansion method improves search performance significantly, especially for highly relevant documents.
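To make the idea of time-aware concept expansion concrete, here is a toy sketch and not the paper's model. It assumes each pseudo-relevant document arrives as a tuple of (timestamp in seconds, retrieval score, list of concepts), and it weights a concept's contribution by how much retrieval score mass falls in the same time bin as the document. The function names, the one-day bin width, and the scoring formula are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def temporal_concept_scores(feedback_docs, bin_seconds=86400):
    """Score concepts from feedback documents with a time-aware weight.

    feedback_docs: list of (timestamp_seconds, retrieval_score, [concepts]).
    """
    # Estimate how "active" each time bin is for this query from the
    # share of retrieval score mass that falls into it.
    bin_mass = defaultdict(float)
    for ts, score, _ in feedback_docs:
        bin_mass[ts // bin_seconds] += score
    total = sum(bin_mass.values())
    if total == 0:
        return {}
    scores = defaultdict(float)
    for ts, score, concepts in feedback_docs:
        if not concepts:
            continue
        time_weight = bin_mass[ts // bin_seconds] / total
        for concept, n in Counter(concepts).items():
            # Relative concept frequency in the document, weighted by the
            # document's retrieval score and its time bin's activity.
            scores[concept] += (n / len(concepts)) * score * time_weight
    return dict(scores)


def expand_query(query_terms, feedback_docs, k=2):
    """Append the top-k scored concepts that are not already in the query."""
    scores = temporal_concept_scores(feedback_docs)
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    extra = [c for c in ranked if c not in query_terms][:k]
    return list(query_terms) + extra
```

In this sketch a multi-word phrase such as "tsunami warning" is treated as a single concept, so concepts that cluster in a busy time period outrank those spread thinly across the timeline.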


Artificial Intelligence and Risk Communication

AAAI Conferences

The challenges of effective health risk communication are well known. This paper provides pointers to the health communication literature that discusses these problems. Tailoring printed information, visual displays, and interactive multimedia have been proposed in the health communication literature as promising approaches. On-line risk communication applications are increasing on the internet. However, the potential effectiveness of applications using conventional computer technology is limited. We propose that the use of artificial intelligence, building upon research in Intelligent Tutoring Systems, might be able to overcome these limitations.