
Evolution of Word to Vector


Word embeddings are a type of word representation that allows words with similar meanings to have a similar representation. "A word is characterized by the company it keeps" (J.R. Firth, 1957). All the NLP applications we build today have a single purpose: to make computers understand human language. The biggest challenge is getting machines to understand language the way we do, whether in reading, writing, or speaking. To start, we train our machine learning or deep learning algorithms to understand textual data. Because machines do not understand raw text, we need to convert the input into a machine-readable format. For example, imagine I'm trying to describe my dog. With all of this dog's features and a precise description, anyone could draw it, even if they have never seen it.

Optimising algorithms in Go for machine learning - Part 3: The hashing trick · James Bowman


This is the third in a series of blog posts sharing my experiences working with algorithms and data structures for machine learning. These experiences were gained whilst building out the nlp project for LSA (Latent Semantic Analysis) of text documents. In Part 2 of this series, I explored sparse matrix formats as a set of data structures for more efficiently storing and manipulating sparsely populated matrices (matrices where most elements contain zero values). We tested the impact of using sparse formats, over the originally implemented dense matrix formats, using Go's built-in benchmark functionality and found that our optimisations reduced memory consumption from 1 GB to 150 MB and processing time from 3.3 seconds to 1.2 seconds. The Go sparse matrix format implementations used in the article are available on GitHub along with all the benchmarks and sample code used in this series.
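As a rough illustration of the hashing trick named in the title, here is a minimal sketch in Python rather than the article's Go. The function name `hash_vectorize`, the choice of `zlib.crc32` as the hash, and the `n_features` size are all assumptions made for this example and are not taken from the article or the nlp project.

```python
import zlib


def hash_vectorize(tokens, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary.

    Hypothetical helper for illustration only.
    """
    vector = [0] * n_features
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in
        # hash() for strings, which is randomised per process.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


doc = "the cat sat on the mat".split()
vec = hash_vectorize(doc)
print(vec)
```

The vector length is fixed up front, so no term-to-index dictionary has to be built or kept in memory. The trade-off is that two different words can collide in the same slot.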

A Gentle Introduction to the Bag-of-Words Model - Machine Learning Mastery


The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data. It has been used with great success on prediction problems like language modeling and document classification.
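A bag-of-words encoding can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (lowercasing and whitespace tokenization), and the function name `bag_of_words` is made up for this example rather than taken from the tutorial.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    # Assumption: lowercase and split on whitespace; real pipelines
    # usually also strip punctuation and stop words.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Word order is discarded; only per-term counts are kept.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the dog sat on the cat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```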

How to Build a Twitter Sentiment Analysis System


In the field of social media data analytics, one popular area of research is the sentiment analysis of Twitter data. Twitter is one of the most popular social media platforms in the world, with 330 million monthly active users and 500 million tweets sent each day. By carefully analyzing the sentiment of these tweets (whether they are positive, negative, or neutral, for example), we can learn a lot about how people feel about certain topics. Understanding the sentiment of tweets is important for a variety of reasons: business marketing, politics, public behavior analysis, and information gathering are just a few examples. Sentiment analysis of Twitter data can help marketers understand the customer response to product launches and marketing campaigns, and it can also help political parties understand the public response to policy changes or announcements.

How to Prepare Text Data for Machine Learning with scikit-learn - Machine Learning Mastery


Text data requires special preparation before you can start using it for predictive modeling. The text must first be parsed to split it into words, a step called tokenization. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data. In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.
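For a sense of what tokenization and feature extraction look like in practice, here is a small sketch using scikit-learn's `CountVectorizer` with its default settings. The two sample sentences are invented for illustration, and this is not necessarily the example used in the tutorial itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents for illustration.
docs = ["The quick brown fox", "The lazy dog"]

# With default settings, CountVectorizer lowercases the text, tokenizes it
# into words of two or more characters, and builds a vocabulary.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.vocabulary_)  # mapping from term to column index
print(counts.toarray())        # one row of word counts per document
```

`fit_transform` learns the vocabulary and encodes the documents in one call. New documents can then be encoded against the same vocabulary with `vectorizer.transform`.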