Over the past few years the CES trade show has become a familiar post-holidays pilgrimage for many of the country's biggest marketers. They see the event as a way to get a sneak peek at the latest gadgets and technologies that can help them engage with their customers. This year marketing executives from companies such as Coca-Cola, Unilever, Johnson & Johnson, Campbell Soup and PepsiCo Inc. made their way to Las Vegas for the gathering. The convention was jam-packed with everything from self-driving cars to robots that play chess to Procter & Gamble's air-freshener spray that can connect with Alphabet Inc.'s Nest smart-home devices to automatically release pleasant scents. But there was one category that seemed to especially win over marketers: virtual assistants.
Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods to extract a topic hierarchy from a given text corpus, where the hierarchical structure is automatically determined by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms either are not scalable or sacrifice inference quality with mean-field assumptions. Moreover, an efficient distributed implementation of the data structures, such as dynamically growing count matrices and trees, is challenging. In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of collapsed and instantiated weight algorithms to achieve good scalability as well as high model quality. An initialization strategy is presented to further improve the model quality. Finally, we propose an efficient distributed implementation of PCGS through vectorization, pre-processing, and a careful design of the concurrent data structures and communication strategy. Empirical studies show that our algorithm is 111 times more efficient than the previous open-source implementation for hLDA, with comparable or even better model quality. Our distributed implementation can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than the previous largest corpus, with 50 machines in 7 hours.
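To make the nCRP prior concrete, the following is a minimal sketch (not the paper's PCGS algorithm, and with hypothetical function names) of how a document might draw a root-to-leaf path through the topic tree: at each level it either follows an existing child in proportion to how many documents chose it, or opens a new child with probability proportional to a concentration parameter gamma, which is how the hierarchy grows from the data.

```python
import random
from collections import defaultdict

def crp_choose(child_counts, gamma):
    """Pick an existing child with probability proportional to its count,
    or create a new child with probability proportional to gamma."""
    total = sum(child_counts.values()) + gamma
    r = random.uniform(0, total)
    for child, count in child_counts.items():
        r -= count
        if r <= 0:
            return child
    return max(child_counts, default=-1) + 1  # open a new child

def sample_ncrp_path(tree, depth, gamma):
    """Sample one document's path of length `depth` from the root.
    `tree` maps a node (a tuple path from the root) to its child counts."""
    path = [()]  # () is the root
    for _ in range(depth):
        node = path[-1]
        child = crp_choose(tree[node], gamma)
        tree[node][child] += 1  # record this document's choice
        path.append(node + (child,))
    return path

# A tree that starts empty and grows as documents pass through it.
tree = defaultdict(lambda: defaultdict(int))
paths = [sample_ncrp_path(tree, 3, 1.0) for _ in range(10)]
```

Each sampled path assigns the document to one topic per level; the count matrices and the tree itself grow dynamically, which is exactly the distributed data-structure challenge the abstract refers to.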
The term "machine learning" covers a grab bag of algorithms, techniques, and technologies that are by now pretty much everywhere in modern life. However, machine intelligence has recently started to be used not just to identify problems but to build better products. Among the first are what are billed as the world's only beers brewed with the help of machine intelligence, which went on sale a few weeks ago. The algorithms use a combination of reinforcement learning and Bayesian optimisation to assist the brewer in deciding how to change the recipe of the beer, learning from experience and customer feedback. Perhaps the most obvious intrusion of machine learning into the physical world is the voice recognition that drives Apple's Siri or Amazon's Alexa.
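The brewery's actual system is not described here, but the learn-from-customer-feedback loop can be sketched in miniature as a Bayesian bandit (a hypothetical, heavily simplified stand-in): each candidate recipe tweak gets a Beta posterior over its chance of pleasing a customer, Thompson sampling proposes the next tweak to try, and binary feedback updates the posterior.

```python
import random

class RecipeSelector:
    """Hypothetical sketch: choose among candidate recipe tweaks via
    Thompson sampling, learning from liked/disliked customer feedback."""

    def __init__(self, tweaks):
        # Beta(1, 1) prior on each tweak's chance of pleasing a customer.
        self.posteriors = {t: [1, 1] for t in tweaks}

    def suggest(self):
        # Draw a plausible success rate from each posterior and
        # propose the tweak with the highest draw.
        draws = {t: random.betavariate(a, b)
                 for t, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def feedback(self, tweak, liked):
        # Update the Beta posterior with the customer's response.
        self.posteriors[tweak][0 if liked else 1] += 1
```

Over many rounds the selector concentrates on tweaks that customers actually like, which is the essential shape of learning a recipe from experience and feedback.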