Word embeddings are a type of word representation that allows words with similar meanings to have a similar representation. A word is characterized by the company it keeps -- J.R.Firth (1957) All the NLP applications we build today have a single purpose and that is to make the computers understand human language but the biggest challenge to do that makes the machines understand how we understand human language in the form of reading, writing, or speaking. To start with we first train our machine learning or deep learning algorithms to understand textual data. As machines do not understand the text we need to make the input to a machine-readable format. For example, Imagine I'm trying to describe my dog -- With all of this dog's features and precise description, anyone could draw it, even though we have never seen it.
This is the third in a series of blog posts sharing my experiences working with algorithms and data structures for machine learning. These experiences were gained whilst building out the nlp project for LSA (Latent Semantic Analysis) of text documents. In Part 2 of this series, I explored sparse matrix formats as a set of data structures for more efficiently storing and manipulating sparsely populated matrices (matrices where most elements contain zero values). We tested the impact of using sparse formats, over the originally implemented dense matrix formats, using Go's inbuilt benchmark functionality and found that our optimisations led to a reduction in memory consumption and processing time from 1 GB to 150 MB and 3.3 seconds to 1.2 seconds respectively. The Golang sparse matrix format implementations used in the article are available on Github along with all the benchmarks and sample code used in this series.
In the field of social media data analytics, one popular area of research is the sentiment analysis of twitter data. Twitter is one of the most popular social media platforms in the world, with 330 million monthly active users and 500 million tweets sent each day. By carefully analyzing the sentiment of these tweets--whether they are positive, negative, or neutral, for example--we can learn a lot about how people feel about certain topics. Understanding the sentiment of tweets is important for a variety of reasons: business marketing, politics, public behavior analysis, and information gathering are just a few examples. Sentiment analysis of twitter data can help marketers understand the customer response to product launches and marketing campaigns, and it can also help political parties understand the public response to policy changes or announcements.
Text data requires special preparation before you can start using it for predictive modeling. The text must be parsed to remove words, called tokenization. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data. In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.