Methods for analysis of principal components in discrete data have existed for some time under various names such as grade of membership modelling, probabilistic latent semantic analysis, and genotype inference with admixture. In this paper we explore a number of extensions to the common theory, and present applications of these methods to several common statistical tasks. We show that these methods can be interpreted as a discrete version of ICA. We develop a hierarchical version yielding components at different levels of detail, and additional techniques for Gibbs sampling. We compare the algorithms on a text prediction task using support vector machines, and on information retrieval.
George, Clint Pazhayidam (University of Florida) | Puri, Sahil (University of Florida) | Wang, Daisy Zhe (University of Florida) | Wilson, Joseph N. (University of Florida) | Hamilton, William F. (University of Florida)
Electronic discovery is an interesting subproblem of information retrieval in which one identifies documents that are potentially relevant to issues and facts of a legal case from an electronically stored document collection (a corpus). In this paper, we consider representing documents in a topic space using well-known topic models such as latent Dirichlet allocation and latent semantic indexing, and solving the information retrieval problem by finding document similarities in the topic space rather than in the corpus vocabulary space. We also develop an iterative SMART ranking and categorization framework in which a human in the loop labels a set of seed (training) documents, which are then used to build a semi-supervised binary document classification model based on Support Vector Machines. To improve this model, we propose a method for choosing seed documents from the whole population via an active learning strategy. We report the results of our experiments on a real dataset in the electronic discovery domain.
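As an illustration of retrieval in a topic space rather than the vocabulary space, the following minimal sketch ranks documents by cosine similarity between topic-proportion vectors. It is not the authors' implementation: the function names, the document identifiers, and the three-topic vectors are all hypothetical, and in practice the vectors would come from a fitted LDA or LSI model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_topic_similarity(query_topics, doc_topics):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine(query_topics, vec)) for doc_id, vec in doc_topics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic proportions over three topics; a topic model would
# normally produce these vectors for each document and for the query.
doc_topics = {
    "memo_a": [0.80, 0.15, 0.05],
    "email_b": [0.10, 0.70, 0.20],
    "contract_c": [0.05, 0.05, 0.90],
}
query_topics = [0.70, 0.20, 0.10]

ranking = rank_by_topic_similarity(query_topics, doc_topics)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because the comparison happens on topic proportions, two documents can score as similar even when they share few literal terms, which is the motivation the abstract gives for leaving the vocabulary space.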
A web page may be relevant to multiple topics; even when nominally on a single topic, the page may attract attention (and thus links) from multiple communities. Instead of indiscriminately summing the authority provided by all pages, we decompose a web page into separate subnodes with respect to each community pointing to it. Utilizing the relevance of such communities allows us to better model the semantic structure of the Web, leading to better estimates of authority for a given query. We apply a total of eighty queries over two real-world datasets to demonstrate that the use of community decomposition can consistently and significantly improve upon PageRank's top-ten results.
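For context on the baseline mentioned above, here is a minimal sketch of standard PageRank by power iteration, in which each page sums authority from all pages linking to it. It does not implement the community decomposition the abstract proposes; the function name, the damping value of 0.85, the iteration count, and the toy link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each page to the list of pages it links to. Pages that appear
    only as link targets are included with no outgoing links.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among its out-links.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: a, b and d all link to c, and c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this baseline every in-link to page c contributes to a single score regardless of which community the linking page belongs to, which is the indiscriminate summing that the abstract argues against.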
Cimiano, Philipp (Delft University of Technology) | Schultz, Antje (University of Koblenz-Landau) | Sizov, Sergej (University of Koblenz-Landau) | Sorg, Philipp (Technical University of Karlsruhe) | Staab, Steffen (University of Koblenz-Landau)
The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-words models. Many approaches aim at concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, through latent topics derived from the data itself, as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model, for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.