Because of the exponential growth of text data, enterprises need to shift their focus from numeric to textual information. Making sense of text is becoming a key capability for businesses. Take an insurance company, for instance: its entire business depends on text data, since all of its products are defined in prose and all customer interactions happen in natural language. At the moment, the only way to deal with this mass of textual information is human understanding of language.
Jim Simons is the greatest moneymaker in modern financial history, and no other investor – Warren Buffett, Peter Lynch, Ray Dalio, Steve Cohen, or George Soros – can touch his record. His firm has earned profits of more than $100 billion, and between 1994 and 2004 its signature fund, the Medallion fund, averaged a 70 per cent annual return. Medallion's returns don't seem to correlate with known factors, and about the only thing most people get to know is that the strategy is "statistical arbitrage". People are confounded by the fact that the proliferation of other quantitative hedge funds in recent years hasn't caused Medallion's performance to deteriorate. Last year, a very readable book about Jim Simons was published: Gregory Zuckerman's The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution (Penguin, 2019).
In modern text data analysis, NLP tools and libraries are indispensable. Researchers and businesses use natural language processing tools to extract information from text: analyzing customer feedback, automating support systems, improving search and recommendation algorithms, and monitoring social media. A wide array of NLP tools and services is available, and knowing their features is key to good results. Some tools are a perfect fit for small projects, while others are better suited to experts working on big data.
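As a concrete illustration of what such tools offer out of the box, here is a minimal sketch using spaCy (one library among many; it assumes the en_core_web_sm model has been downloaded):

```python
import spacy

# Load a small English pipeline (install first with:
#   python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Acme Insurance rejected the claim filed in Berlin on 3 May.")

# Tokenization, part-of-speech tags, and named entities come for free
for token in doc:
    print(token.text, token.pos_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```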
Distributed learning is a problem of fundamental interest in machine learning and cognitive science. In this paper, we present asynchronous distributed learning algorithms for two well-known unsupervised learning frameworks: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP). In the proposed approach, the data are distributed across P processors; each processor independently performs Gibbs sampling on its local data and communicates asynchronously with other processors. We demonstrate that our asynchronous algorithms learn global topic models that are statistically as accurate as those learned by the standard LDA and HDP samplers, but with significant improvements in computation time and memory. We show speedup results on a 730-million-word text corpus using 32 processors, and we provide perplexity results for up to 1500 virtual processors.
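To make the local step concrete: each processor runs standard collapsed Gibbs sampling on its own shard, interleaved with asynchronous exchanges of topic-word counts with other processors. A minimal single-processor sketch of the sweep itself (NumPy; the count arrays and variable names are our illustration, not the authors' code):

```python
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One collapsed Gibbs sweep over every token.

    docs[d] is a list of word ids; z[d][i] is the topic of token i in doc d.
    n_dk: (D, K) document-topic counts; n_kw: (K, V) topic-word counts;
    n_k:  (K,) per-topic totals.
    """
    K, V = n_kw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # remove the token's current assignment from all counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # full conditional p(z = k | everything else), up to a constant
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
```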
In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a "bag of words". It is also critical to properly design "words" and "documents" when using a language model to solve vision problems. In this paper, we propose a topic model, Spatial Latent Dirichlet Allocation (SLDA), which better encodes the spatial structure among visual words that is essential for solving many vision problems. The spatial information is encoded not in the values of the visual words but in the design of the documents.
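The idea of putting spatial information into the design of documents can be illustrated with a deliberately simplified variant: instead of SLDA's probabilistic assignment of words to documents, assign each visual word to a "document" given by a fixed spatial grid cell (a hypothetical helper, for intuition only):

```python
import numpy as np

def words_to_documents(positions, word_ids, cell):
    """Group visual words into spatial 'documents' by grid cell.

    positions: (N, 2) array of (x, y) locations of detected visual words
    word_ids:  length-N sequence of codebook indices
    cell:      side length of a grid cell in pixels
    """
    docs = {}
    for (x, y), wid in zip(positions, word_ids):
        key = (int(x // cell), int(y // cell))
        docs.setdefault(key, []).append(wid)
    return docs

positions = np.array([[10, 12], [14, 9], [200, 310]])
print(words_to_documents(positions, [3, 7, 3], cell=64))
# {(0, 0): [3, 7], (3, 4): [3]}
```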
The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods for Latent Dirichlet Allocation (LDA) models on GPUs: collapsed Gibbs sampling (CGS) and collapsed variational Bayesian inference (CVB). To address the limited memory on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. The partitioning scheme also balances the computational cost across multiprocessors and lets us easily avoid memory access conflicts. We additionally use data streaming to handle extremely large datasets.
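Our reading of the partitioning idea, as a toy sketch (the diagonal schedule below illustrates how update conflicts can be avoided; it is not the authors' exact scheme): split the documents and the vocabulary into P shards each, and in sub-step j let processor p work on document shard p and vocabulary shard (p + j) mod P, so no two processors ever touch the same document-topic row or topic-word column at the same time.

```python
def conflict_free_schedule(P):
    """Yield, for each of P sub-steps, the (doc_shard, vocab_shard)
    pair assigned to every processor. Within a sub-step, all document
    shards and all vocabulary shards are distinct, so updates to the
    document-topic and topic-word count matrices never collide."""
    for j in range(P):
        yield [(p, (p + j) % P) for p in range(P)]

for step in conflict_free_schedule(3):
    print(step)
# [(0, 0), (1, 1), (2, 2)]
# [(0, 1), (1, 2), (2, 0)]
# [(0, 2), (1, 0), (2, 1)]
```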
Semantic hashing seeks compact binary codes for datapoints such that the Hamming distance between codewords correlates with semantic similarity. Hinton et al. used a clever implementation of autoencoders to find such codes. In this paper, we show that the problem of finding the best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint.
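A toy sketch of the relaxed problem on a small, dense affinity matrix (this shows only the thresholded-eigenvector step; the paper's out-of-sample extension via Laplace-Beltrami eigenfunctions is omitted):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def spectral_codes(W, n_bits):
    """Binary codes from thresholded eigenvectors of the graph Laplacian.

    W: (N, N) symmetric affinity matrix over the training points
    """
    L = laplacian(W, normed=True)
    _, vecs = eigh(L)  # eigenvectors sorted by ascending eigenvalue
    # Skip the trivial bottom eigenvector; threshold the next n_bits at 0
    return (vecs[:, 1:n_bits + 1] > 0).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
W = np.exp(-np.square(X[:, None] - X[None, :]).sum(-1))  # Gaussian affinities
print(spectral_codes(W, n_bits=8).shape)  # (50, 8)
```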
We show how to model documents as bags of words using a family of two-layer, undirected graphical models. Each member of the family has the same number of binary hidden units but a different number of softmax visible units. All of the softmax units in all of the models in the family share the same weights to the binary hidden units. We describe efficient inference and learning procedures for such a family. Each member of the family models the probability distribution of documents of a specific length as a product of topic-specific distributions rather than as a mixture, and this gives much better generalization than Latent Dirichlet Allocation for modeling the log probabilities of held-out documents.
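A sketch of inference in such a model, as we understand the family described (the weights are shared across all softmax units, so a document enters only through its word counts, and the hidden biases are effectively replicated by the document length D; variable names are ours):

```python
import numpy as np

def hidden_posterior(counts, W, a):
    """Factorial posterior over binary hidden units given one document.

    counts: (V,) word-count vector (the 'bag of words')
    W:      (V, H) weights shared by every softmax visible unit
    a:      (H,) hidden biases, replicated D times for a length-D document
    """
    D = counts.sum()
    return 1.0 / (1.0 + np.exp(-(counts @ W + D * a)))

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=100)        # toy document, 100-word vocab
W = rng.normal(scale=0.01, size=(100, 20))   # 20 binary hidden units
a = np.zeros(20)
print(hidden_posterior(counts, W, a))
```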
We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model.
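One simple way to encode such side information (our illustration; the paper's exact parameterization may differ) is to make the Dirichlet prior over words log-linear in word features, so that dictionary-linked words across languages share prior mass:

```python
import numpy as np

def word_prior(features, lam):
    """Log-linear Dirichlet parameters over the vocabulary.

    features: (V, F) side-information matrix, e.g. one column per
              dictionary entry, shared by a word and its translation
    lam:      (F,) feature weights fitted alongside the topic model
    """
    return np.exp(features @ lam)

# Toy example: words 0 and 2 are dictionary translations of each other,
# so they share a feature and hence receive the same prior weight
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 0.0]])
print(word_prior(features, lam=np.array([0.5, -0.2])))
```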
Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research on designing better approximations, but there is as yet little theoretical understanding of which of the available techniques is appropriate, and in which data analysis settings. In this paper we provide the beginnings of such an understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean-field variational inference (VB) in latent Dirichlet allocation.
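For readers who have not seen the two updates side by side, here is a schematic per-token comparison (this shows CVB0, the zeroth-order variant, rather than the full second-order CVB the paper analyzes; variable names are ours):

```python
import numpy as np
from scipy.special import digamma

def vb_responsibilities(gamma_d, lam, w):
    """Mean-field VB: topic responsibilities for word w in a document use
    exp(digamma(.)) of the variational Dirichlet parameters."""
    log_p = digamma(gamma_d) + digamma(lam[:, w]) - digamma(lam.sum(axis=1))
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

def cvb0_responsibilities(n_dk, n_kw, n_k, w, alpha, beta, V):
    """CVB0: responsibilities use expected counts directly (with the
    current token's own contribution subtracted out beforehand)."""
    p = (n_dk + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    return p / p.sum()
```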