A New Geometric Approach to Latent Topic Modeling and Discovery

Ding, Weicong, Rohban, Mohammad H., Ishwar, Prakash, Saligrama, Venkatesh

arXiv.org Machine Learning 

ABSTRACT

A new geometrically-motivated algorithm for nonnegative matrix factorization is developed and applied to the discovery of latent "topics" for text and image "document" corpora. The algorithm is based on robustly finding and clustering extreme points of empirical cross-document word frequencies that correspond to novel "words" unique to each topic. In contrast to related approaches that are based on solving non-convex optimization problems using suboptimal approximations, locally-optimal methods, or heuristics, the new algorithm is convex, has polynomial complexity, and has competitive qualitative and quantitative performance compared to the current state-of-the-art approaches on synthetic and real-world datasets.

Index Terms: Topic modeling, nonnegative matrix factorization (NMF), extreme points, subspace clustering.

1. INTRODUCTION

Topic modeling is a statistical tool for the automatic discovery and comprehension of latent thematic structure, or topics, assumed to pervade a corpus of documents. Suppose that we have a corpus of M documents composed of words from a vocabulary of W distinct words indexed by w = 1, ..., W.
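The extreme-point idea mentioned in the abstract can be illustrated with a small, noise-free toy example. This is only a sketch of the underlying geometry, not the paper's algorithm, which this excerpt does not describe. It assumes a W x K word-topic matrix `beta` and a K x M topic-document weight matrix `theta` (both names and all numbers are hypothetical), and it assumes each topic has at least one novel word, i.e. a word with nonzero probability in that topic only. Under these assumptions, the row-normalized word frequencies of novel words coincide with the row-normalized rows of `theta`, and every other word's row is a convex combination of them.

```python
# Toy, noise-free illustration; every number here is made up for the sketch.

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def row_normalize(m):
    """Scale each row so that it sums to 1."""
    return [[v / sum(row) for v in row] for row in m]

def close(u, v, tol=1e-12):
    """Element-wise comparison of two vectors."""
    return all(abs(a - b) < tol for a, b in zip(u, v))

# Hypothetical word-topic matrix (W = 4 words, K = 2 topics); columns sum to 1.
beta = [
    [0.5, 0.0],  # word 0: novel word for topic 0
    [0.0, 0.4],  # word 1: novel word for topic 1
    [0.3, 0.2],  # word 2: shared by both topics
    [0.2, 0.4],  # word 3: shared by both topics
]
# Hypothetical topic-document weights (K = 2 topics, M = 3 documents).
theta = [
    [0.9, 0.5, 0.1],
    [0.1, 0.5, 0.9],
]

# Expected cross-document word frequencies, normalized per word (per row).
x_bar = row_normalize(matmul(beta, theta))
# Candidate extreme points: the row-normalized rows of theta.
vertices = row_normalize(theta)
row_sums = [sum(r) for r in theta]

def convex_weights(beta_row):
    """Weights expressing a word's normalized row in terms of the vertices."""
    raw = [b * s for b, s in zip(beta_row, row_sums)]
    total = sum(raw)
    return [r / total for r in raw]

for w, beta_row in enumerate(beta):
    weights = convex_weights(beta_row)
    mix = [sum(weights[k] * vertices[k][j] for k in range(len(vertices)))
           for j in range(len(vertices[0]))]
    print(w, [round(c, 3) for c in weights], close(x_bar[w], mix))
```

In this toy setup, words 0 and 1 get weight vectors that put all mass on a single vertex, while the shared words 2 and 3 get strictly mixed weights, so their rows lie inside the segment spanned by the two novel-word rows. With real, finite documents the frequencies are noisy, which is presumably why the abstract stresses finding the extreme points robustly.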
