Computing Web-scale Topic Models using an Asynchronous Parameter Server
Jagerman, Rolf, Eickhoff, Carsten, de Rijke, Maarten
Topic models such as Latent Dirichlet Allocation (LDA) have been widely used in information retrieval for tasks ranging from smoothing and feedback methods to tools for exploratory search and discovery. However, classical methods for inferring topic models do not scale up to the massive size of today's publicly available Web-scale data sets. The state-of-the-art approaches rely on custom strategies, implementations and hardware to facilitate their asynchronous, communication-intensive workloads. We present APS-LDA, which integrates state-of-the-art topic modeling with cluster computing frameworks such as Spark using a novel asynchronous parameter server. Advantages of this integration include the convenient use of existing data processing pipelines and the elimination of disk writes, as data can be kept in memory from start to finish. Our goal is not to outperform highly customized implementations, but to propose a general high-performance topic modeling framework that can easily be used in today's data processing pipelines. We compare APS-LDA to the existing Spark LDA implementations and show that our system can, on a 480-core cluster, process up to 135 times more data and 10 times more topics without sacrificing model quality.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.29)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
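To make the parameter-server pattern behind APS-LDA concrete, here is a minimal single-process sketch of an asynchronous push/pull interface in Scala. All names (`ParameterServer`, `pull`, `push`) are hypothetical stand-ins and do not reproduce the actual APS-LDA or Glint API; a real deployment shards the table across server nodes and ships deltas over the network.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical single-node stand-in for an asynchronous parameter server.
// A real server partitions this table across machines and applies deltas
// arriving over the network.
class ParameterServer(numTopics: Int) {
  // word id -> topic-count row; TrieMap allows lock-free concurrent reads
  private val table = TrieMap.empty[Int, Array[Long]]

  private def row(word: Int): Array[Long] =
    table.getOrElseUpdate(word, Array.fill(numTopics)(0L))

  // pull: asynchronously fetch a snapshot of the topic counts for a word
  def pull(word: Int): Future[Array[Long]] = Future { row(word).clone() }

  // push: asynchronously add a delta to the topic counts for a word
  def push(word: Int, delta: Array[Long]): Future[Unit] = Future {
    val r = row(word)
    r.synchronized {
      var k = 0
      while (k < numTopics) { r(k) += delta(k); k += 1 }
    }
  }
}
```

Workers pull the rows they need, resample topics locally, and push count deltas back; because pushes return futures that need not be awaited, communication overlaps with computation, which is the asynchrony the abstract refers to.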
WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation
Chen, Jianfei, Li, Kaiwei, Zhu, Jun, Chen, Wenguang
Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, its performance is far from optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we first carefully analyze the memory access efficiency of existing LDA algorithms through their scope of random access, i.e., the size of the memory region in which random accesses fall within a short period of time. We then develop WarpLDA, an LDA sampler that achieves both the best O(1) time complexity per token and the best O(K) scope of random access. Our empirical results in a wide range of testing conditions demonstrate that WarpLDA is consistently 5-15x faster than the state-of-the-art Metropolis-Hastings-based LightLDA, and is comparable to or faster than the sparsity-aware F+LDA. With WarpLDA, users can learn up to one million topics from hundreds of millions of documents in a few hours, at an unprecedented throughput of 11G tokens per second.
- North America > United States (0.14)
- South America > Paraguay > Asunción > Asunción (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.84)
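The "scope of random access" idea is easy to see in a sketch: if the tokens are visited once in document-major order and once in word-major order, each pass only ever needs one K-length count vector at a time, instead of jumping around a K x V matrix. The Scala sketch below shows only that access pattern; the names are illustrative and the Metropolis-Hastings steps themselves are elided, so this is not WarpLDA's actual sampler.

```scala
// Illustrative access pattern only: two passes over the token stream,
// each confining random accesses to a single K-length vector.
final case class Token(doc: Int, word: Int, topic: Int)

def twoPassSweep(tokens: Array[Token], numTopics: Int): Unit = {
  // Pass 1: document-major order. Only the current document's K-length
  // count vector is randomly accessed, so it stays cache-resident.
  val cd = new Array[Int](numTopics)
  var cur = -1
  for (t <- tokens.sortBy(_.doc)) {
    if (t.doc != cur) { java.util.Arrays.fill(cd, 0); cur = t.doc }
    cd(t.topic) += 1
    // ... document-proposal MH step would read/update cd here ...
  }

  // Pass 2: word-major order. Only the current word's K-length count
  // vector is randomly accessed.
  val cw = new Array[Int](numTopics)
  cur = -1
  for (t <- tokens.sortBy(_.word)) {
    if (t.word != cur) { java.util.Arrays.fill(cw, 0); cur = t.word }
    cw(t.topic) += 1
    // ... word-proposal MH step would read/update cw here ...
  }
}
```

Keeping each pass's working set at O(K) is what keeps the counts in cache; roughly speaking, WarpLDA combines this layout with delayed count updates so that both proposal types can be evaluated within these two sweeps.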
LightLDA: Big Topic Models on Modest Compute Clusters
Yuan, Jinhui, Gao, Fei, Ho, Qirong, Dai, Wei, Wei, Jinliang, Zheng, Xun, Xing, Eric P., Liu, Tie-Yan, Ma, Wei-Ying
When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted on industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.
- Asia > Middle East > Jordan (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Singapore (0.04)
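The O(1) Metropolis-Hastings sampler in contribution 1) rests on the alias method: after an O(K) preprocessing step, each draw from a K-outcome discrete distribution takes constant time, so a table built once per word can serve many cheap proposals before being rebuilt. Below is a generic Scala implementation of Vose's alias construction, shown to illustrate the primitive; it is not taken from the LightLDA codebase.

```scala
import scala.collection.mutable
import scala.util.Random

// Generic alias table (Vose's method): O(n) construction, O(1) sampling.
final class AliasTable(weights: Array[Double]) {
  private val n = weights.length
  private val prob  = new Array[Double](n)
  private val alias = new Array[Int](n)

  // Scale weights so they average to 1, then pair under-full and over-full
  // columns until every column is exactly full.
  private val scaled = { val total = weights.sum; weights.map(_ * n / total) }
  private val small = mutable.Stack[Int]()
  private val large = mutable.Stack[Int]()
  for (i <- 0 until n) if (scaled(i) < 1.0) small.push(i) else large.push(i)
  while (small.nonEmpty && large.nonEmpty) {
    val s = small.pop(); val l = large.pop()
    prob(s) = scaled(s); alias(s) = l   // column s: mass prob(s) of s, rest l
    scaled(l) -= 1.0 - scaled(s)        // l gave away some of its mass
    if (scaled(l) < 1.0) small.push(l) else large.push(l)
  }
  while (large.nonEmpty) prob(large.pop()) = 1.0 // numerical leftovers
  while (small.nonEmpty) prob(small.pop()) = 1.0

  /** Draw one index in O(1): uniform column, then a biased coin. */
  def sample(rng: Random): Int = {
    val i = rng.nextInt(n)
    if (rng.nextDouble() < prob(i)) i else alias(i)
  }
}
```

In an LDA sampler, one such table per word, built from the current word-topic counts, amortizes its O(K) construction over many O(1) proposals; proposals drawn from a slightly stale table are corrected by the Metropolis-Hastings acceptance step, which is why the table can be reused between rebuilds.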