bidmach
Parallelizing Word2Vec in Shared and Distributed Memory
Ji, Shihao, Satish, Nadathur, Li, Sheng, Dubey, Pradeep
Word2Vec is a widely used algorithm for extracting low-dimensional vector representations of words. It generated considerable excitement in the machine learning and natural language processing (NLP) communities recently due to its exceptional performance in many NLP applications such as named entity recognition, sentiment analysis, machine translation and question answering. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures but are based on vector-vector operations that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we improve reuse of various data structures in the algorithm through the use of minibatching, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. In combination, these techniques allow us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge.
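To make the minibatching idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a hypothetical batch of input vectors that all score against the same small set of target vectors, a detail the abstract does not state, and the sizes and the names `X`, `T`, `S_pair` and `S_mat` are illustrative only. It shows that the per-pair dot products are the entries of a single matrix product S = X T^T, in which each target vector is read once and reused for the whole batch.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def scores_one_pair_at_a_time(inputs, targets):
    """Baseline: an independent vector-vector product for every pair."""
    return [[dot(x, t) for t in targets] for x in inputs]

def scores_as_matrix_product(inputs, targets):
    """Minibatched form: S = X * T^T, filled column by column so that
    each target vector is read once and reused for the whole batch."""
    batch, k = len(inputs), len(targets)
    scores = [[0.0] * k for _ in range(batch)]
    for j, t in enumerate(targets):       # read one target vector
        for i, x in enumerate(inputs):    # reuse it for every input in the batch
            scores[i][j] = dot(x, t)
    return scores

rng = random.Random(0)
dim, batch, k = 8, 4, 3                   # hypothetical sizes, chosen for illustration
X = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(batch)]
T = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(k)]

S_pair = scores_one_pair_at_a_time(X, T)
S_mat = scores_as_matrix_product(X, T)
```

Both functions return the same `batch` x `k` score matrix. In an optimized setting the second form would typically be handed to a tuned matrix-multiply routine rather than written as Python loops.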
BIDData/BIDMach
We recently ran some fresh benchmarks for Spark v1.1 and v1.2 and GraphLab clusters, and included some updated numbers from other recently published benchmarks. The RCV1-v2 (Reuters news data, LYR2004 distribution) benchmarks are for OAA (One Against All) classification, since RCV1-v2 has 103 independent topic labels. RCV1-v2 is a small dataset (0.5 GB). BIDMach was run on a single Amazon g2.xlarge instance, while Spark was run on a cluster of m3.2xlarge high-memory instances. The other systems were run on an 8-core Intel E-2660 system.
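As a rough illustration of what One Against All means for a multi-label dataset, the sketch below turns each document's label set into one binary target vector per topic, so that each topic can be trained as an independent binary problem. The function name, the topic names and the three toy documents are hypothetical and are not taken from RCV1-v2 or from BIDMach.

```python
def one_against_all_targets(label_sets, all_labels):
    """Return {label: [0 or 1 per document]}, one binary problem per label."""
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }

# Hypothetical toy data: three documents, three topics.
docs = [{"topic_a", "topic_b"}, {"topic_c"}, {"topic_b", "topic_c"}]
topics = ["topic_a", "topic_b", "topic_c"]

targets = one_against_all_targets(docs, topics)
```

With 103 topic labels, the same construction would yield 103 binary target vectors over the same documents.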
Fast Parallel SAME Gibbs Sampling on General Discrete Bayesian Networks
Seita, Daniel, Chen, Haoyu, Canny, John
A fundamental task in machine learning and related fields is to perform inference on Bayesian networks. Since exact inference takes exponential time in general, a variety of approximate methods are used. Gibbs sampling is one of the most accurate approaches and provides unbiased samples from the posterior, but it has historically been too expensive for large models. In this paper, we present an optimized, parallel Gibbs sampler augmented with state replication (SAME, or State Augmented Marginal Estimation) to decrease convergence time. We find that SAME can improve the quality of parameter estimates while accelerating convergence. Experiments on both synthetic and real data show that our Gibbs sampler is substantially faster than the state-of-the-art sampler, JAGS, without sacrificing accuracy. Our ultimate objective is to introduce the Gibbs sampler to researchers in many fields to expand their range of feasible inference problems.
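For readers unfamiliar with Gibbs sampling on a discrete Bayesian network, here is a minimal, sequential sketch. It is plain Gibbs sampling only: it includes neither the SAME state replication nor the parallelism described in the abstract. The three-node chain A -> B -> C, its probability tables and all function names are hypothetical, chosen so that the estimate can be checked against exact enumeration.

```python
import random

# Hypothetical binary chain A -> B -> C (illustrative numbers only).
P_A = 0.3                        # P(A = 1)
P_B_GIVEN_A = {0: 0.2, 1: 0.8}   # P(B = 1 | A)
P_C_GIVEN_B = {0: 0.1, 1: 0.7}   # P(C = 1 | B)

def bern(p, value):
    """Probability of `value` under a Bernoulli with P(1) = p."""
    return p if value == 1 else 1.0 - p

def exact_posterior_a(c):
    """P(A = 1 | C = c) by summing the joint over B."""
    weights = {0: 0.0, 1: 0.0}
    for a in (0, 1):
        for b in (0, 1):
            weights[a] += (bern(P_A, a)
                           * bern(P_B_GIVEN_A[a], b)
                           * bern(P_C_GIVEN_B[b], c))
    return weights[1] / (weights[0] + weights[1])

def gibbs_posterior_a(c, sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(A = 1 | C = c) by resampling A and B in turn."""
    rng = random.Random(seed)
    a, b = 0, 0
    hits = 0
    for sweep in range(sweeps + burn_in):
        # Resample A given its Markov blanket {B}.
        w1 = bern(P_A, 1) * bern(P_B_GIVEN_A[1], b)
        w0 = bern(P_A, 0) * bern(P_B_GIVEN_A[0], b)
        a = 1 if rng.random() < w1 / (w0 + w1) else 0
        # Resample B given its Markov blanket {A, C}.
        w1 = bern(P_B_GIVEN_A[a], 1) * bern(P_C_GIVEN_B[1], c)
        w0 = bern(P_B_GIVEN_A[a], 0) * bern(P_C_GIVEN_B[0], c)
        b = 1 if rng.random() < w1 / (w0 + w1) else 0
        if sweep >= burn_in:
            hits += a
    return hits / sweeps

exact = exact_posterior_a(c=1)
estimate = gibbs_posterior_a(c=1)
```

On this tiny network, exact enumeration is trivial and serves only as a check. The sampler matters when the sum over unobserved variables grows too large to enumerate.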