Continuing the tour of older papers that started with our ResNet blog post, we now take on Skip-Thought Vectors by Kiros et al. Their goal was to come up with a useful embedding for sentences that was not tuned for a single task and did not require labeled data to train. They took inspiration from the Word2Vec skip-gram model (you can find my explanation of that algorithm here) and attempted to extend it to sentences. Skip-thought vectors are created using an encoder-decoder model. The encoder takes in the training sentence and outputs a vector.
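To make the skip-gram analogy concrete, here is a minimal sketch of how the training examples for such a model could be assembled from ordered text: each interior sentence is paired with its previous and next sentence, which the decoder side tries to reconstruct. The function name `make_triples` and the sample passage are illustrative assumptions, not code from the paper, and no neural network is involved here.

```python
from typing import List, Tuple


def make_triples(sentences: List[str]) -> List[Tuple[str, str, str]]:
    """Return (previous, current, next) triples for every interior sentence.

    Hypothetical helper: the first and last sentences have only one
    neighbor, so this sketch simply skips them as encoder inputs.
    """
    return [
        (sentences[i - 1], sentences[i], sentences[i + 1])
        for i in range(1, len(sentences) - 1)
    ]


# Illustrative passage; any ordered list of sentences from a book would do.
passage = [
    "I got back home.",
    "I could see the cat on the steps.",
    "This was strange.",
    "It never sat there before.",
]

triples = make_triples(passage)
for prev_s, cur_s, next_s in triples:
    # The encoder would read cur_s; the decoders would target prev_s and next_s.
    print(f"encode: {cur_s!r} -> decode: {prev_s!r}, {next_s!r}")
```

No labels are needed at any point: the ordering of sentences in the text is the only supervision signal.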
Storytelling is at the heart of human nature. We were storytellers long before we were able to write; we shared our values and created our societies mostly through oral storytelling. Then we found ways to record our stories, and later more advanced ways to share them broadly, from Gutenberg's printing press to television and the internet. Writing stories is not easy, especially when one must write a story in different literary genres just by looking at a picture. Natural Language Processing (NLP) is a field that is driving a revolution in human-computer interaction.
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
The Skip-Thoughts model is a sentence encoder. It learns to encode input sentences into a fixed-dimensional vector representation that is useful for many tasks, for example, to detect paraphrases or to classify whether a product review is positive or negative. See the Skip-Thought Vectors paper for details of the model architecture and more example applications. A trained Skip-Thoughts model will encode similar sentences near each other in the embedding vector space. The following examples show the nearest neighbor by cosine similarity of some sentences from the movie review dataset.
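For readers unfamiliar with the lookup itself, the following is a minimal sketch of a nearest-neighbor search by cosine similarity. The three-dimensional vectors and the sentences are made up for illustration and are not output of a trained Skip-Thoughts model (real sentence vectors are far higher-dimensional); the helper names `cosine_similarity` and `nearest_neighbor` are likewise assumptions, not part of any released API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor(
    query: str, embeddings: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the sentence (other than the query) most similar to the query."""
    query_vec = embeddings[query]
    return max(
        (
            (sentence, cosine_similarity(query_vec, vec))
            for sentence, vec in embeddings.items()
            if sentence != query
        ),
        key=lambda pair: pair[1],
    )


# Toy stand-ins for encoder output (hypothetical values).
toy_embeddings = {
    "a gripping, well-acted thriller": [0.9, 0.1, 0.0],
    "a tense and superbly performed drama": [0.8, 0.2, 0.1],
    "dull, lifeless and far too long": [-0.7, 0.6, 0.2],
}

neighbor, score = nearest_neighbor("a gripping, well-acted thriller", toy_embeddings)
print(neighbor, round(score, 3))
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing embeddings whose norms vary from sentence to sentence.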
Word embedding -- the mapping of words into numerical vector spaces -- has proved to be an incredibly important method for natural language processing (NLP) tasks in recent years, enabling various machine learning models that rely on vector representations as input to enjoy richer representations of text input. These representations preserve more semantic and syntactic information about words, leading to improved performance in almost every imaginable NLP task. Both the novel idea itself and its tremendous impact have led researchers to consider the problem of how to provide this boon of richer vector representations to larger units of text -- from sentences to books. This effort has resulted in a slew of new methods to produce these mappings, with various innovative solutions to the problem and some notable breakthroughs. This post is meant to present the different ways practitioners have come up with to produce document embeddings. Note: I use the word document here to refer to any sequence of words, ranging from sentences and paragraphs through social media posts all the way up to articles, books and more complexly structured text documents (e.g. In this post, I will touch upon not only approaches that are direct extensions of word embedding techniques (e.g., in the way doc2vec extends word2vec), but also other notable techniques that produce -- sometimes among other outputs -- a mapping of documents to vectors in ℝⁿ. I will also try to provide links and references to both the original papers and code implementations of the reviewed methods whenever possible. Note: This topic is somewhat related, but not equivalent, to the problem of learning structured text representations (e.g., Liu & Lapata, 2018). The ability to map documents to informative vector representations has a wide range of applications.