octis
Improving Contextualized Topic Models with Negative Sampling
Adhya, Suman, Lahiri, Avishek, Sanyal, Debarshi Kumar, Das, Partha Pratim
Topic modeling has emerged as a dominant method for exploring large document collections. Recent approaches to topic modeling use large contextualized language models and variational autoencoders. In this paper, we propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector. Experiments for different topic counts on three publicly available benchmark datasets show that in most cases, our approach leads to an increase in topic coherence over that of the baselines. Our model also achieves very high topic diversity.
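The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not say how the document-topic vector is perturbed, so the sketch assumes one simple choice: zero out the k largest topic weights and renormalise. The decoder (a softmax over `theta @ beta`), the Euclidean distance, the margin value, and all function names are likewise assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def perturb(theta, k=1):
    # ASSUMED perturbation: zero the k largest topic weights, then renormalise.
    # Requires k to be smaller than the number of topics.
    neg = theta.copy()
    top = np.argsort(-theta, axis=1)[:, :k]
    np.put_along_axis(neg, top, 0.0, axis=1)
    return neg / np.clip(neg.sum(axis=1, keepdims=True), 1e-12, None)

def triplet_loss(x, theta, beta, margin=1.0, k=1):
    # Anchor: input document x (rows are word distributions).
    # Positive: reconstruction from the correct document-topic vector.
    # Negative: reconstruction from the perturbed vector.
    x_pos = softmax(theta @ beta)
    x_neg = softmax(perturb(theta, k) @ beta)
    d_pos = np.linalg.norm(x - x_pos, axis=1)
    d_neg = np.linalg.norm(x - x_neg, axis=1)
    # Hinge: penalise cases where the positive is not closer by at least `margin`.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Toy data: 3 documents, 4 topics, 10 vocabulary words (all hypothetical).
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(4), size=3)   # document-topic vectors
beta = rng.normal(size=(4, 10))             # unnormalised topic-word weights
x = rng.dirichlet(np.ones(10), size=3)      # normalised bag-of-words inputs
print(triplet_loss(x, theta, beta))
```

In a full model this term would be added to the usual variational autoencoder objective and minimised by gradient descent; the sketch only evaluates the loss for fixed inputs.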
A beginner's guide to OCTIS: Optimizing and Comparing Topic Models Is Simple
Topic models are promising generative statistical methods that aim to extract the hidden topics underlying a collection of documents. Typically, a topic model outputs two matrices: a topic-word matrix and a document-topic matrix. The top-n words with the highest probability in the topic-word matrix are then used to represent each topic. The most popular topic modeling method is Latent Dirichlet Allocation (LDA), and many articles have been written about its workings and implementations. However, focusing only on LDA is restrictive and might be suboptimal for a given corpus.
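As a minimal sketch of the "top-n words" step, the snippet below takes a matrix whose rows are topics and whose columns are word probabilities (often called the topic-word matrix) and returns the n most probable words per topic. The vocabulary, the probabilities, and the `top_words` helper are made up for illustration and do not come from OCTIS.

```python
import numpy as np

def top_words(topic_word, vocab, n=3):
    # Sort each row by descending probability and keep the first n column indices.
    idx = np.argsort(-topic_word, axis=1)[:, :n]
    return [[vocab[j] for j in row] for row in idx]

# Hypothetical vocabulary and topic-word probabilities (each row sums to 1).
vocab = ["goal", "match", "team", "vote", "party", "law"]
topic_word = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.20, 0.30, 0.40],
])

print(top_words(topic_word, vocab, n=3))
# [['goal', 'match', 'team'], ['law', 'party', 'vote']]
```

The same indexing works for the output of any topic model that exposes a topic-word matrix, which is what makes comparing models other than LDA straightforward.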