Analysing the Impact of Removing Infrequent Words on Topic Quality in LDA Models

Bystrov, Victor, Naboka-Krell, Viktoriia, Staszewska-Bystrova, Anna, Winker, Peter

arXiv.org Artificial Intelligence 

The use of topic modelling techniques, especially Latent Dirichlet Allocation (LDA) introduced by Blei et al. (2003), is growing fast, and the methods find application in a broad variety of domains. In text-as-data applications, LDA enables the analysis of large collections of texts in an unsupervised manner by uncovering latent structures behind the data. Given this increasing use of LDA as a standard tool for empirical analysis, interest in the details of the method, and in particular in parameter settings for its implementation, is also rising. Since its introduction, different methodological components of LDA have been studied in more detail, for example the choice of the number of topics (Cao et al., 2009; Mimno et al., 2011; Lewis and Grossetti, 2022; Bystrov et al., 2022a), hyper-parameter settings (Wallach et al., 2009), and model design (e.g.
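A common preprocessing choice studied in this context is the removal of infrequent words before estimating an LDA model. As a minimal sketch (the function name and the document-frequency threshold rule are illustrative assumptions, not the paper's specific procedure), such pruning can be implemented as:

```python
from collections import Counter

def prune_infrequent(docs, min_count=2):
    """Drop words that occur in fewer than `min_count` documents.

    `docs` is a list of tokenized documents (lists of words).
    Document frequency, rather than raw term frequency, is used
    here as one common pruning criterion (an assumption).
    """
    # Count in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vocab = {w for w, c in doc_freq.items() if c >= min_count}
    # Filter each document down to the retained vocabulary.
    return [[w for w in doc if w in vocab] for doc in docs]

docs = [
    ["topic", "model", "lda"],
    ["topic", "model", "rare"],
    ["topic", "words", "lda"],
]
pruned = prune_infrequent(docs, min_count=2)
# "rare" and "words" each appear in only one document and are removed.
```

The pruned token lists would then be converted to a bag-of-words corpus and passed to an LDA implementation; the threshold `min_count` is exactly the kind of parameter whose effect on topic quality the paper investigates.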