Graph2topic: an opensource topic modeling framework based on sentence embedding and community detection
Zhang, Leihang, Liu, Jiapeng, Yan, Qiang
arXiv.org Artificial Intelligence
Topic modelling is a text-mining method for discovering hidden semantic structures in a collection of documents. It has been widely used in fields outside computer science, including social and cultural studies[1], bioinformatics[2], and political science[3, 4]. The most popular and classic topic modelling method is latent Dirichlet allocation (LDA)[5], which provides a mathematically rigorous probabilistic model for topic modelling. This probabilistic model gives a quantitative expression of the correlation between words and topics and between topics and documents, which makes it applicable to various quantitative analyses. However, LDA suffers from several conceptual and practical flaws: (1) it represents text as a bag of words, which ignores the contextual and sequential correlation between words; (2) there is no justification for modelling the distributions of topics in text and words in topics with the Dirichlet prior besides mathematical convenience[6]; (3) it cannot choose the appropriate number of topics; and (4) the quality of its topics, in terms of measures such as coherence and diversity, leaves much to be desired. Fortunately, contextual embedding techniques provide a new paradigm for representing text and help alleviate the flaws of conventional topic models such as LDA. Bidirectional encoder representations from transformers (BERT)[7] and its variants (e.g., RoBERTa[8], Sentence-BERT[9], SimCSE[10]) can generate high-quality contextual word and sentence vector representations, which encode the meaning of texts so that similar texts are located close to each other in vector space. Researchers have made many fruitful attempts and significant progress in adopting these contextual representations for topic modelling. BERTopic[11] and CETopic[12] are the state-of-the-art topic models.
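To make the first of the LDA flaws listed above concrete, the following is a minimal sketch of why a bag-of-words representation discards word order. It is an illustration only, not code from Graph2topic or LDA; the helper name `bag_of_words` and the two example sentences are assumptions introduced here, and tokenization is reduced to lowercasing and whitespace splitting.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding their order.

    Hypothetical helper for illustration; real pipelines use richer tokenizers.
    """
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same multiset of words.
a = "the dog bites the man"
b = "the man bites the dog"

# The bag-of-words representations are identical ...
same_bag = bag_of_words(a) == bag_of_words(b)
# ... even though the token sequences differ.
same_sequence = a.split() == b.split()

print(same_bag, same_sequence)
```

Both sentences map to the same word counts, so any model that sees only those counts cannot tell them apart. Contextual sentence embeddings are computed from the full token sequence, which is what allows them to retain this kind of information.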
Jun-6-2023