Topic Discovery through Data Dependent and Random Projections
Ding, Weicong, Rohban, Mohammad H., Ishwar, Prakash, Saligrama, Venkatesh
We present algorithms for topic modeling based on the geometry of cross-document word-frequency patterns. This perspective gains significance under the so-called separability condition, which requires the existence of novel words that are unique to each topic. We present a suite of highly efficient algorithms based on data-dependent and random projections of word-frequency patterns to identify novel words and associated topics. We also discuss the statistical guarantees of the data-dependent projections method, which rest on two mild assumptions on the prior density of the topic-document matrix. Our key insight is that the maximum and minimum values of cross-document frequency patterns projected along any direction are associated with novel words. While our sample complexity bounds for topic recovery are similar to the state of the art, the computational complexity of our random projection scheme scales linearly with the number of documents and the number of words per document. We present several experiments on synthetic and real-world datasets to demonstrate the qualitative and quantitative merits of our scheme.
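The key insight can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an idealized, noise-free word-document matrix, Gaussian random directions, and one novel word per topic, and the names make_separable_corpus and find_extreme_words are hypothetical. Under separability, each row-normalized frequency pattern is a convex combination of the topics' patterns, so only novel words can attain the maximum or minimum of a projection.

```python
import numpy as np


def make_separable_corpus(num_words=20, num_topics=3, num_docs=50, seed=0):
    """Build a noise-free word-document matrix from a separable topic matrix.

    Assumption for illustration: word k (for k < num_topics) is the single
    novel word of topic k; all remaining words are shared by every topic.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros((num_words, num_topics))
    beta[:num_topics, :] = np.eye(num_topics)  # novel words: one topic each
    beta[num_topics:, :] = rng.uniform(
        0.1, 1.0, size=(num_words - num_topics, num_topics)
    )  # non-novel words: strictly positive weight in every topic
    beta /= beta.sum(axis=0, keepdims=True)  # each topic is a distribution
    theta = rng.dirichlet(np.ones(num_topics), size=num_docs).T  # topics x docs
    return beta @ theta  # expected word frequencies, words x docs


def find_extreme_words(A, num_projections=200, seed=0):
    """Return the words that maximize or minimize some random projection."""
    rng = np.random.default_rng(seed)
    # Row-normalize so each word's cross-document pattern sums to one.
    A_norm = A / A.sum(axis=1, keepdims=True)
    # Gaussian directions are an assumption; any generic directions would do.
    directions = rng.standard_normal((A.shape[1], num_projections))
    proj = A_norm @ directions  # words x projections
    extremes = set(proj.argmax(axis=0).tolist())
    extremes |= set(proj.argmin(axis=0).tolist())
    return sorted(extremes)


A = make_separable_corpus()
print(find_extreme_words(A))  # indices of candidate novel words
```

In this idealized setting every index returned is a novel word. With real, finite-length documents the frequency patterns are noisy, which is the case that the paper's statistical guarantees address.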
Mar-18-2013
- Country:
- Asia
- Afghanistan (0.04)
- China (0.04)
- India (0.04)
- Japan (0.04)
- Middle East
- Pakistan (0.04)
- Russia (0.04)
- Taiwan (0.04)
- Europe
- France (0.04)
- Germany (0.04)
- Russia (0.04)
- United Kingdom > England (0.04)
- North America
- Central America (0.04)
- Mexico (0.04)
- United States
- California
- Los Angeles County > Los Angeles (0.04)
- San Francisco County > San Francisco (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Kansas (0.04)
- Louisiana (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- New York > New York County
- New York City (0.04)
- Texas (0.04)
- South America
- Genre:
- Research Report (1.00)
- Industry:
- Energy (0.67)
- Government > Regional Government
- Law (0.67)
- Law Enforcement & Public Safety (0.67)
- Leisure & Entertainment > Sports (1.00)
- Media (1.00)
- Technology: