Word embedding -- the mapping of words into numerical vector spaces -- has proved to be an incredibly important method for natural language processing (NLP) tasks in recent years, enabling various machine learning models that rely on vector representation as input to enjoy richer representations of text input. These representations preserve more semantic and syntactic information on words, leading to improved performance in almost every imaginable NLP task. Both the novel idea itself and its tremendous impact have led researchers to consider the problem of how to provide this boon of richer vector representations to larger units of texts -- from sentences to books. This effort has resulted in a slew of new methods to produce these mappings, with various innovative solutions to the problem and some notable breakthroughs. This post is meant to present the different ways practitioners have come up with to produce document embeddings. Note: I use the word document here to refer to any sequence of words, ranging from sentences and paragraphs through social media posts all way up to articles, books and more complexly structured text documents (e.g. In this post, I will touch upon not only approaches that are direct extensions of word embedding techniques (e.g., in the way doc2vec extends word2vec), but also other notable techniques that produce -- sometimes among other outputs -- a mapping of documents to vectors in ℝⁿ. I will also try to provide links and references to both the original papers and code implementations of the reviewed methods whenever possible. Note: This topic is somewhat related, but not equivalent, to the problem of learning structured text representations (e.g., Liu & Lapata, 2018). The ability to map documents to informative vector representations has a wide range of applications.
This tutorial introduces a new and powerful set of techniques variously called "neural machine translation" or "neural sequence-to-sequence models". These techniques have been used in a number of tasks regarding the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culiminates with a suggestion for an implementation exercise, where readers can test that they understood the content in practice.
Feature extraction has gained increasing attention in the field of machine learning, as in order to detect patterns, extract information, or predict future observations from big data, the urge of informative features is crucial. The process of extracting features is highly linked to dimensionality reduction as it implies the transformation of the data from a sparse high-dimensional space, to higher level meaningful abstractions. This dissertation employs Neural Networks for distributed paragraph representations, and Latent Dirichlet Allocation to capture higher level features of paragraph vectors. Although Neural Networks for distributed paragraph representations are considered the state of the art for extracting paragraph vectors, we show that a quick topic analysis model such as Latent Dirichlet Allocation can provide meaningful features too. We evaluate the two methods on the CMU Movie Summary Corpus, a collection of 25,203 movie plot summaries extracted from Wikipedia. Finally, for both approaches, we use K-Nearest Neighbors to discover similar movies, and plot the projected representations using T-Distributed Stochastic Neighbor Embedding to depict the context similarities. These similarities, expressed as movie distances, can be used for movies recommendation. The recommended movies of this approach are compared with the recommended movies from IMDB, which use a collaborative filtering recommendation approach, to show that our two models could constitute either an alternative or a supplementary recommendation approach.
In this post, I walk through how to build and train an neural translation model with attention. This model will be used to translate French to English. This post will focus on the conceptual explanation, while a detailed walk through of the project code can be found in the associated Jupyter notebook. This notebook can be viewed here or cloned from the project Github repository, here. This project closely follows the PyTorch Sequence to Sequence tutorial, while attempting to go more in depth with both the model implementation and the explanation. We are trying to build a translation model. One model that has been successful in this task is an Encoder-Decoder network.
Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose a bidirectional recurrent neural network based approach to extract parallel sentences from collections of multilingual texts. Our experiments with noisy parallel corpora show that we can achieve promising results against a competitive baseline by removing the need of specific feature engineering or additional external resources. To justify the utility of our approach, we extract sentence pairs from Wikipedia articles to train machine translation systems and show significant improvements in translation performance.