Due to recent explosion of text data, researchers have been overwhelmed by ever-increasing volume of articles produced by different research communities. Various scholarly search websites, citation recommendation engines, and research databases have been created to simplify the text search tasks. However, it is still difficult for researchers to be able to identify potential research topics without doing intensive reviews on a tremendous number of articles published by journals, conferences, meetings, and workshops. In this paper, we consider a novel topic diffusion discovery technique that incorporates sparseness-constrained Non-negative Matrix Factorization with generalized Jensen-Shannon divergence to help understand term-topic evolutions and identify topic diffusions. Our experimental result shows that this approach can extract more prominent topics from large article databases, visualize relationships between terms of interest and abstract topics, and further help researchers understand whether given terms/topics have been widely explored or whether new topics are emerging from literature.
This article is a comprehensive overview of Topic Modeling and its associated techniques. In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning -- from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling. In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.
We develop necessary and sufficient conditions and a novel provably consistent and efficient algorithm for discovering topics (latent factors) from observations (documents) that are realized from a probabilistic mixture of shared latent factors that have certain properties. Our focus is on the class of topic models in which each shared latent factor contains a novel word that is unique to that factor, a property that has come to be known as separability. Our algorithm is based on the key insight that the novel words correspond to the extreme points of the convex hull formed by the row-vectors of a suitably normalized word co-occurrence matrix. We leverage this geometric insight to establish polynomial computation and sample complexity bounds based on a few isotropic random projections of the rows of the normalized word co-occurrence matrix. Our proposed random-projections-based algorithm is naturally amenable to an efficient distributed implementation and is attractive for modern web-scale distributed data mining applications.
Latent Dirichlet Allocation (LDA) is a algorithms used to discover the topics that are present in a corpus. A few open source libraries exist, but if you are using Python then the main contender is Gensim. Gensim is an awesome library and scales really well to large text corpuses. Gensim, however does not include Non-negative Matrix Factorization (NMF), which can also be used to find topics in text. The mathematical basis underpinning NMF is quite different from LDA. I have found it interesting to compare the results of both of the algorithms and have found that NMF sometimes produces more meaningful topics for smaller datasets. NMF has been included in Scikit Learn for quite a while but LDA has only recently (late 2015) been included. The great thing about using Scikit Learn is that it brings API consistency which makes it almost trivial to perform Topic Modeling using both LDA and NMF. Scikit Learn also includes seeding options for NMF which greatly helps with algorithm convergence and offers both online and batch variants of LDA.
The simplicial condition and other stronger conditions that imply it have recently played a central role in developing polynomial time algorithms with provable asymptotic consistency and sample complexity guarantees for topic estimation in separable topic models. Of these algorithms, those that rely solely on the simplicial condition are impractical while the practical ones need stronger conditions. In this paper, we demonstrate, for the first time, that the simplicial condition is a fundamental, algorithm-independent, information-theoretic necessary condition for consistent separable topic estimation. Furthermore, under solely the simplicial condition, we present a practical quadratic-complexity algorithm based on random projections which consistently detects all novel words of all topics using only up to second-order empirical word moments. This algorithm is amenable to distributed implementation making it attractive for 'big-data' scenarios involving a network of large distributed databases.