In the first post, we learned how to use term frequency to represent textual information in the vector space. However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms, which are empirically more informative than the high-frequency terms. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and this really makes sense (at least in many experimental tests). The important question here is: why would you, in a classification problem for instance, emphasize a term that is present in almost the entire corpus of your documents? The tf-idf weight comes to solve this problem. What tf-idf gives is a measure of how important a word is to a document in a collection, and that is why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection.
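To make the intuition concrete, here is a minimal sketch of one common tf-idf variant, assuming raw counts for the term frequency and log(N / df) for the inverse document frequency. The function name `tf_idf` and the three toy documents are made up for illustration; libraries often use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # local parameter: counts inside this document
        # Global parameter: log(N / df) shrinks terms found in many documents.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus, invented for this example.
docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky is bright".split(),
]
weights = tf_idf(docs)
```

With this weighting, a term such as "the" that appears in every document gets a weight of zero no matter how often it occurs, while a term such as "blue" that appears in only one document gets the largest weight in that document.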
In the previous post, Word Embeddings and Document Vectors: Part 1. Similarity, we laid the groundwork for using bag-of-words based document vectors in conjunction with word embeddings (pre-trained or custom-trained) for computing document similarity, as a precursor to classification. It seemed that document word vectors were better at picking up on similarities (or the lack thereof) in the toy documents we looked at. We want to carry this through and apply the approach to actual document repositories to see how the document word vectors do for classification. This post focuses on the approach, the mechanics, and the code snippets to get there. The results will be covered in the next post in this series.
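As a reminder of the general idea, the sketch below combines bag-of-words counts with word embeddings by taking a count-weighted sum of word vectors and then comparing documents with cosine similarity. This is only one plausible reading of the approach, not necessarily the exact scheme used in this series; the two-dimensional embeddings and the names `doc_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-d embeddings, invented for illustration only.
embeddings = {
    "cat": [1.0, 0.1],
    "kitten": [0.9, 0.2],
    "car": [0.1, 1.0],
}

def doc_vector(tokens):
    """Count-weighted sum of the embeddings of the known words in a document."""
    counts = Counter(t for t in tokens if t in embeddings)
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for term, count in counts.items():
        for i, x in enumerate(embeddings[term]):
            vec[i] += count * x
    return vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_close = cosine(doc_vector(["cat"]), doc_vector(["kitten"]))
sim_far = cosine(doc_vector(["cat"]), doc_vector(["car"]))
```

A plain bag-of-words comparison would treat "cat" and "kitten" as unrelated because they share no terms; with these toy embeddings, `sim_close` comes out higher than `sim_far`.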
Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models. The Keras deep learning library provides some basic tools to help you prepare your text data. In this tutorial, you will discover how you can use Keras to prepare your text data. A good first step when working with text is to split it into words.
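To show what splitting into words involves, here is a minimal plain-Python sketch that lowercases the text, replaces punctuation with spaces, and splits on whitespace. It is not the Keras API; the helper name `split_into_words` is made up for illustration. Keras has shipped a similar helper, `text_to_word_sequence` in `keras.preprocessing.text`, though its default filters and its availability may differ between versions.

```python
import string

def split_into_words(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

words = split_into_words("The quick brown fox jumped over the lazy dog.")
# words == ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```

Note that this simple rule also splits contractions such as "don't" into two tokens, which may or may not be what you want.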