Discourse & Dialogue
Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling
The beta-negative binomial process (BNBP), an integer-valued stochastic process, is employed to partition a count vector into a latent random count matrix. As the marginal probability distribution of the BNBP that governs the exchangeable random partitions of grouped data has not yet been developed, current inference for the BNBP has to truncate the number of atoms of the beta process. This paper introduces an exchangeable partition probability function to explicitly describe how the BNBP clusters the data points of each group into a random number of exchangeable partitions, which are shared across all the groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a novel nonparametric Bayesian topic model that is distinct from existing ones, with simple implementation, fast convergence, good mixing, and state-of-the-art predictive performance.
LightLDA: Big Topic Models on Modest Compute Clusters
Yuan, Jinhui, Gao, Fei, Ho, Qirong, Dai, Wei, Wei, Jinliang, Zheng, Xun, Xing, Eric P., Liu, Tie-Yan, Ma, Wei-Ying
When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.
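The abstract does not spell out how the sampler reaches O(1) cost per token. One standard building block for constant-time draws from a fixed discrete distribution is Walker's alias method, sketched below as an illustration only. The function names (`build_alias_table`, `alias_draw`, `implied_distribution`) are hypothetical, and this is not the authors' implementation.

```python
import random


def build_alias_table(weights):
    """Build Walker alias tables in O(n) from non-negative weights."""
    n = len(weights)
    total = float(sum(weights))
    # Rescale so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        # Slot s keeps its own mass and borrows the remainder from l.
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover slots hold (up to rounding) exactly mass 1.
    for i in small + large:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a slot, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def implied_distribution(prob, alias):
    """Recover the distribution the tables encode (useful for checking)."""
    n = len(prob)
    mass = [0.0] * n
    for i in range(n):
        mass[i] += prob[i] / n
        mass[alias[i]] += (1.0 - prob[i]) / n
    return mass
```

Building the table costs O(n), but each subsequent draw costs O(1) regardless of n. That makes the method a natural fit for proposal distributions that are reused across many Metropolis-Hastings steps.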
Graph-Sparse LDA: A Topic Model with Structured Sparsity
Doshi-Velez, Finale, Wallace, Byron, Adams, Ryan
Originally designed to model text, topic modeling has become a powerful tool for uncovering latent structure in domains including medicine, finance, and vision. The goals for the model vary depending on the application: in some cases, the discovered topics may be used for prediction or some other downstream task. In other cases, the content of the topic itself may be of intrinsic scientific interest. Unfortunately, even using modern sparse techniques, the discovered topics are often difficult to interpret due to the high dimensionality of the underlying space. To improve topic interpretability, we introduce Graph-Sparse LDA, a hierarchical topic model that leverages knowledge of relationships between words (e.g., as encoded by an ontology). In our model, topics are summarized by a few latent concept-words from the underlying graph that explain the observed words. Graph-Sparse LDA recovers sparse, interpretable summaries on two real-world biomedical datasets while matching state-of-the-art prediction performance.
Model-Parallel Inference for Big Topic Models
Zheng, Xun, Kim, Jin Kyu, Ho, Qirong, Xing, Eric P.
In real-world industrial applications of topic modeling, the ability to capture a gigantic conceptual space by learning an ultra-high-dimensional topical representation, i.e., the so-called "big model", is becoming the next desideratum after the enthusiasm for "big data", especially for fine-grained downstream tasks such as online advertising, where good performance is usually achieved by regression-based predictors built on millions if not billions of input features. The conventional data-parallel approach for training gigantic topic models turns out to be rather inefficient in utilizing the power of parallelism, due to its heavy dependency on a centralized image of the "model". Big model size also poses a storage challenge, since the feasible model size is bounded by the smallest RAM among the nodes. To address these issues, we explore another type of parallelism, namely model-parallelism, which enables training of disjoint blocks of a big topic model in parallel. By integrating data-parallelism with model-parallelism, we show that dependencies between distributed elements can be handled seamlessly, achieving not only faster convergence but also the ability to tackle significantly bigger model sizes. We describe an architecture for model-parallel inference of LDA, and present a variant of the collapsed Gibbs sampling algorithm tailored for it. Experimental results demonstrate the ability of this system to handle topic modeling with an unprecedented 200 billion model variables on only a low-end cluster with very limited computational resources and bandwidth.
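The idea of training disjoint blocks of the model in parallel can be illustrated with a small scheduling sketch. It assumes, purely for illustration, that the model is split by vocabulary into as many blocks as there are workers and that blocks rotate among workers round by round. The abstract does not state the actual schedule, and the names `partition_vocabulary` and `rotation_schedule` are hypothetical.

```python
def partition_vocabulary(vocab_size, num_blocks):
    """Split word ids 0..vocab_size-1 into contiguous, disjoint blocks."""
    base, extra = divmod(vocab_size, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        # The first `extra` blocks absorb one additional word each.
        size = base + (1 if b < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def rotation_schedule(num_workers):
    """Return schedule[r][w]: the block index worker w handles in round r.

    Within a round no two workers share a block, and after num_workers
    rounds every worker has handled every block exactly once.
    """
    return [[(w + r) % num_workers for w in range(num_workers)]
            for r in range(num_workers)]
```

Under these assumptions each worker only needs the word-topic counts of its current block in memory, so the per-node footprint is roughly the model size divided by the number of workers.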
A provable SVD-based algorithm for learning topics in dominant admixture corpus
Bansal, Trapit, Bhattacharyya, Chiranjib, Kannan, Ravindran
Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from admixtures is NP-hard. Assuming separability, a strong assumption, [4] gave the first provable algorithm for inference. For the LDA model, [6] gave a provable algorithm using tensor methods. But [4,6] do not learn topic vectors with bounded $l_1$ error (a natural measure for probability vectors). Our aim is to develop a model which makes intuitive and empirically supported assumptions and to design an algorithm with natural, simple components such as SVD, which provably solves the inference problem for the model with bounded $l_1$ error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords, groups of words which occur with strictly greater frequency in a topic than in any other topic individually and are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial pre-processing step of thresholding can provably recover the topics from a collection of documents drawn from Dominant admixtures. Dominant admixtures are convex combinations of distributions in which one distribution has a significantly higher contribution than the others. Apart from the simplicity of the algorithm, the sample complexity has near-optimal dependence on $w_0$, the lowest probability that a topic is dominant, and is better than [4]. Empirical evidence shows that on several real-world corpora, both the Catchwords and Dominant admixture assumptions hold and the proposed algorithm substantially outperforms the state of the art [5].
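The notion of a dominant admixture can be made concrete with a small check on a document's topic weights. The thresholds `alpha` (minimum weight of the dominant topic) and `beta` (maximum weight of every other topic) are assumptions introduced here for illustration, since the abstract only says that one distribution contributes "significantly" more than the others. The function name `dominant_topic` is likewise hypothetical.

```python
def dominant_topic(weights, alpha, beta, tol=1e-9):
    """Return the index of the dominant topic, or None if there is none.

    `weights` must be a convex combination: non-negative and summing to 1.
    A topic counts as dominant here if its weight is at least `alpha`
    while every other topic has weight at most `beta`.
    """
    if any(w < -tol for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    top = max(range(len(weights)), key=lambda k: weights[k])
    others = [w for k, w in enumerate(weights) if k != top]
    if weights[top] >= alpha and all(w <= beta for w in others):
        return top
    return None
```

For example, with `alpha=0.6` and `beta=0.25`, the weights `[0.7, 0.2, 0.1]` have topic 0 as dominant, whereas `[0.4, 0.35, 0.25]` have no dominant topic.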
Minimal Narrative Annotation Schemes and Their Applications
Rahimtoroghi, Elahe (University of California, Santa Cruz) | Corcoran, Thomas (University of California, Santa Cruz) | Swanson, Reid (University of California, Santa Cruz) | Walker, Marilyn A. (University of California, Santa Cruz) | Sagae, Kenji (Institute for Creative Technologies, University of Southern California) | Gordon, Andrew (Institute for Creative Technologies, University of Southern California)
The increased use of large corpora in narrative research has created new opportunities for empirical research and intelligent narrative technologies. To best exploit the value of these corpora, several research groups are eschewing complex discourse analysis techniques in favor of high-level minimalist narrative annotation schemes that can be quickly applied, achieve high inter-rater agreement, and are amenable to automation using machine-learning techniques. In this paper we compare different annotation schemes that have been employed by two groups of researchers to annotate large corpora of narrative text. Using a dual-annotation methodology, we investigate the correlation between narrative clauses distinguished by their structural role (orientation, action, evaluation), their subjectivity, and their narrative level within the discourse. We find that each simple narrative annotation scheme captures a structurally distinct characteristic of real-world narratives, and each combination of labels is evident in a corpus of 19 weblog narratives (951 narrative clauses). We discuss several potential applications of minimalist narrative annotation schemes, noting the combinations of labels across these two annotation schemes that best support each task.
Temporal and Object Relations in Plan and Activity Recognition for Robots Using Topic Models
Freedman, Richard Gabriel (University of Massachusetts Amherst) | Jung, Hee-Tae (University of Massachusetts Amherst) | Zilberstein, Shlomo (University of Massachusetts Amherst)
For robots to effectively interact with human users, it is necessary that they recognize what people in the environment are doing. This is especially the case when robots are performing complementary tasks since the human users are not following any specific process. There is much uncertainty in how people act and the duration of time they need to perform their actions. In this work, we discuss the use of topic models for such plan and activity recognition tasks. We begin with the development of a domain-independent representation of human postural information obtained from RGB-D sensor data. This representation may be used with Latent Dirichlet Allocation (LDA) topic models as an integration of plan and activity recognition. This is followed by a proposition of extensions to LDA that allow temporal and object relational information to also be used in plan and activity recognition tasks.
Humanoid Robots and Spoken Dialog Systems for Brief Health Interventions
Abeyruwan, Saminda (University of Miami) | Baral, Ramesh (Florida International University) | Yasavur, Ugan (Florida International University) | Lisetti, Christine (Florida International University) | Visser, Ubbo (University of Miami)
We combined a spoken dialog system that we developed to deliver brief health interventions with the fully autonomous humanoid robot (NAO). The dialog system is based on a framework facilitating Markov decision processes (MDP). It is optimized using reinforcement learning (RL) algorithms with data we collected from real user interactions. The system begins to learn optimal dialog strategies for initiative selection and for the type of confirmations that it uses during the interaction. The health intervention, delivered by a 3D character instead of the NAO, has already been evaluated, with positive results in terms of task completion, ease of use, and future intention to use the system. The current spoken dialog system for the humanoid robot is a novelty and exists so far as a proof of concept.
Combining Non-Expert and Expert Crowd Work to Convert Web APIs to Dialog Systems
Huang, Ting-Hao K. (Carnegie Mellon University) | Lasecki, Walter S. (University of Rochester) | Ritter, Alan L. (The Ohio State University) | Bigham, Jeffrey P. (Carnegie Mellon University)
Thousands of web APIs expose data and services that would be useful to access with natural dialog, from weather and sports to Twitter and movies. The process of adapting each API to a robust dialog system is difficult and time-consuming, as it requires not only programming but also anticipating what is most likely to be asked and how it is likely to be asked. We present a crowd-powered system able to generate a natural language interface for arbitrary web APIs from scratch without domain-dependent training data or knowledge. Our approach combines two types of crowd workers: non-expert Mechanical Turk workers interpret the functions of the API and elicit information from the user, and expert oDesk workers provide a minimal sufficient scaffolding around the API to allow us to make general queries. We describe our multi-stage process and present results for each stage.
Zero-Shot Object Recognition System based on Topic Model
Object recognition systems usually require complete, manually labeled training data to train the classifier. In this paper, we study the problem of object recognition where the training samples are missing during the classifier learning stage, a task also known as zero-shot learning. We propose a novel zero-shot learning strategy that utilizes the topic model and hierarchical class concept. Our proposed method has the advantage that the cumbersome human annotation stage (i.e., attribute-based classification) is eliminated. We achieve comparable performance with state-of-the-art algorithms on four public datasets: PubFig (67.09%), Cifar-100 (54.85%), Caltech-256 (52.14%), and Animals with Attributes (49.65%) when unseen classes exist in the classification task.