Collaborating Authors

Smith, David A.


Contrastive Training for Models of Information Cascades

arXiv.org Machine Learning (also in AAAI Conferences)

This paper proposes a model of information cascades as directed spanning trees (DSTs) over observed documents. In addition, we propose a contrastive training procedure that exploits the partial temporal ordering of node infections in lieu of labeled training links. This combination of model and unsupervised training makes it possible to improve on models that use infection times alone and to exploit arbitrary features of the nodes and of the text content of messages in information cascades. With only basic node and time-lag features similar to those of previous models, the unsupervised DST model achieves performance comparable to strong baselines on a blog network inference task. Unsupervised training with additional content features achieves significantly better results, reaching half the accuracy of a fully supervised model.
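
As a rough illustration of the contrastive idea, the sketch below scores candidate parent-child links with a log-linear model and nudges a temporally admissible parent above an inadmissible one. The feature set, document structure, and hinge-style update are assumptions for exposition, not the paper's exact model or features.

```python
import math

# Minimal sketch, assuming a log-linear edge model over candidate cascade
# links. Features (time lag, word overlap) and the hinge-style update are
# illustrative assumptions, not the paper's formulation.

def edge_score(weights, feats):
    """Log-linear score for a candidate edge (parent -> child)."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def edge_features(parent, child):
    """Toy time-lag and content-overlap features for one candidate edge."""
    lag = max(child["time"] - parent["time"], 0.0)
    overlap = len(set(parent["words"]) & set(child["words"]))
    return {"log_lag": math.log1p(lag), "overlap": float(overlap)}

def contrastive_update(weights, child, admissible, inadmissible, lr=0.1, margin=1.0):
    """Require a temporally admissible parent (infected before the child) to
    outscore an inadmissible candidate by a margin, using only the partial
    order of infection times rather than labeled links."""
    pos = edge_features(admissible, child)
    neg = edge_features(inadmissible, child)
    if edge_score(weights, pos) - edge_score(weights, neg) < margin:
        for k in set(pos) | set(neg):
            weights[k] = weights.get(k, 0.0) + lr * (pos.get(k, 0.0) - neg.get(k, 0.0))
    return weights

# Example: a document infected earlier is pushed above one infected later.
w = {}
child = {"time": 5.0, "words": ["cascade", "model"]}
earlier = {"time": 3.0, "words": ["cascade", "tree"]}
later = {"time": 7.0, "words": ["unrelated"]}
w = contrastive_update(w, child, earlier, later)
```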


Multilingual Topic Models

arXiv.org Machine Learning

Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview of latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides a means to discover where different concept vocabularies require improvement.
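
To make the shared-topic-space idea concrete, the sketch below folds a bag of words into a fixed topic space and compares two representations of the same article by cosine similarity there. The plain EM fold-in and the topic-word matrix are illustrative stand-ins, not the paper's multilingual inference procedure.

```python
import numpy as np

# Minimal sketch, assuming access to a fitted topic-word matrix of shape
# (topics x vocabulary). The EM fold-in below is an illustrative assumption.

def fold_in(topic_word, doc_counts, iters=50):
    """Infer a document-topic distribution for a bag of words while holding
    the topic-word distributions fixed (plain EM fold-in)."""
    n_topics = topic_word.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)
    for _ in range(iters):
        # E-step: topic responsibilities for each word type
        resp = theta[:, None] * topic_word                    # (topics, vocab)
        resp /= resp.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate theta from expected topic counts
        theta = (resp * doc_counts[None, :]).sum(axis=1)
        theta /= theta.sum()
    return theta

def topical_similarity(topic_word, counts_a, counts_b):
    """Cosine similarity of two representations of the same article (e.g.
    its abstract versus its assigned keywords) in the shared topic space."""
    a = fold_in(topic_word, counts_a)
    b = fold_in(topic_word, counts_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```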


Automatic Construction of Evaluation Sets and Evaluation of Document Similarity Models in Large Scholarly Retrieval Systems

AAAI Conferences

Retrieval systems for scholarly literature offer the scientific community the ability to search, explore, and download scholarly articles across various scientific disciplines. Used mostly by experts in a particular field, these systems record user community logs, including information on the articles each user downloads. In this paper we present a novel approach for automatically evaluating document similarity models in large collections of scholarly publications. Unlike typical evaluation settings that use test collections consisting of query documents and human-annotated relevance judgments, we use download logs to automatically generate a pseudo-relevant set of similar document pairs. More specifically, we show that consecutively downloaded document pairs, extracted from a scholarly information retrieval (IR) system, can be used as a test collection for evaluating document similarity models. Another novel aspect of our approach lies in the method we employ for evaluating model performance: comparing the distributions of consecutively downloaded document pairs and random document pairs in log space. Across two families of similarity models, which represent documents in the term-vector and topic spaces, we show that our evaluation approach achieves very high correlation with traditional performance metrics such as Mean Average Precision (MAP), while being more efficient to compute.
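
The sketch below illustrates the evaluation recipe: consecutively downloaded pairs from session logs serve as pseudo-relevant pairs, and a similarity model is scored by how far their log-similarity distribution sits from that of random pairs. The session format, the model_sim interface, and the mean-difference statistic are assumptions for illustration, not the paper's exact distributional comparison.

```python
import math
import random

# Minimal sketch, assuming sessions are per-user lists of downloaded
# document IDs and model_sim returns a positive similarity score.

def consecutive_pairs(session):
    """Adjacent downloads in one session become pseudo-relevant pairs."""
    return list(zip(session, session[1:]))

def log_space_separation(model_sim, sessions, all_docs, n_random=10000, seed=0):
    """Difference of mean log-similarities between consecutively downloaded
    pairs and random pairs; larger values suggest a better model."""
    rng = random.Random(seed)
    pos = [p for s in sessions for p in consecutive_pairs(s)]
    neg = [(rng.choice(all_docs), rng.choice(all_docs)) for _ in range(n_random)]
    log_pos = [math.log(model_sim(a, b) + 1e-9) for a, b in pos]
    log_neg = [math.log(model_sim(a, b) + 1e-9) for a, b in neg]
    return sum(log_pos) / len(log_pos) - sum(log_neg) / len(log_neg)
```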


Inference by Minimizing Size, Divergence, or their Sum

arXiv.org Machine Learning

We speed up marginal inference by ignoring factors that do not significantly contribute to overall accuracy. In order to pick a suitable subset of factors to ignore, we propose three schemes: minimizing the number of model factors under a bound on the KL divergence between pruned and full models; minimizing the KL divergence under a bound on factor count; and minimizing the weighted sum of KL divergence and factor count. All three problems are solved using an approximation of the KL divergence that can be calculated in terms of marginals computed on a simple seed graph. Applied to synthetic image denoising and to three different types of NLP parsing models, this technique performs marginal inference up to 11 times faster than loopy BP, with graph sizes reduced by up to 98%, at comparable error in marginals and parsing accuracy. We also show that minimizing the weighted sum of divergence and size is substantially faster than minimizing either of the other objectives, based on the approximation to divergence presented here.
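
As a toy rendering of the third scheme (the weighted sum of divergence and factor count), the sketch below greedily drops any factor whose weighted approximate KL contribution is smaller than the unit size saving from removing it. The approx_kl callback stands in for the paper's seed-graph approximation; the greedy threshold rule is an assumption, not the paper's exact solver.

```python
# Minimal sketch, assuming a per-factor divergence estimate is available
# (e.g. from marginals computed on a seed graph).

def prune_factors(factors, approx_kl, lam=1.0):
    """Drop a factor when the unit size saving outweighs its weighted
    approximate contribution to KL divergence.

    factors   -- iterable of factor identifiers
    approx_kl -- maps a factor to its estimated KL cost if removed
    lam       -- trade-off between divergence and factor count
    """
    kept, dropped = [], []
    for f in factors:
        # Removing f reduces the size term by 1 but adds lam * approx_kl(f)
        # to the divergence term of the objective.
        if lam * approx_kl(f) < 1.0:
            dropped.append(f)
        else:
            kept.append(f)
    return kept, dropped
```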