Learning Topic Models: Identifiability and Finite-Sample Analysis
Yinyin Chen, Shishuang He, Yun Yang, Feng Liang
Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation is lacking in the literature. In this paper, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood, which is naturally connected to the concept of volume minimization in computational geometry. Theoretically, we introduce a new set of geometric conditions for topic model identifiability, which are weaker than conventional separability conditions relying on the existence of anchor words or pure topic documents. We conduct a finite-sample error analysis for the proposed estimator and discuss the connection of our results with existing ones. We conclude with empirical studies on both simulated and real datasets.
Oct-8-2021