In this post we will show how Prevision.io handles textual data in just a few minutes. It is well known that textual data is usually trickier and harder to process than numerical or categorical features. Numerical features sometimes only need to be scaled, and categorical features can be encoded straightforwardly, but transforming text into a machine-readable format requires a lot of pre-processing and feature engineering. Moreover, there are many other challenges to address: how do we cover different languages? How can we preserve the semantic relationships between the words in the vocabulary?
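To make the contrast concrete, here is a minimal pure-Python sketch (the data and stop-word list are illustrative assumptions, not from the post): a numerical column needs one scaling step, a categorical column one encoding step, while even the simplest text pipeline already chains several operations.

```python
import re
from collections import Counter

# Numerical feature: a single min-max scaling step is often enough.
ages = [22, 35, 58]
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]

# Categorical feature: a direct one-hot encoding.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))
one_hot = [[1 if c == v else 0 for v in categories] for c in colors]

# Text: even a minimal pipeline needs several chained steps
# (lowercasing, tokenising, stop-word removal, counting)
# before the result is machine-readable.
STOP_WORDS = {"the", "is", "a", "and"}

def text_features(doc):
    tokens = re.findall(r"[a-z']+", doc.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(text_features("The model is trained and the model predicts"))
```

Real text pipelines add further steps on top of this (stemming or lemmatisation, n-grams, embeddings), which is exactly why text is the more labour-intensive feature type.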
One of the hot topics in Machine Learning is, without a doubt, feature engineering. In fact, it predates the current buzz around the subject: it was already central back when we simply talked about Data Mining. Recalling the CRISP-DM process, feature engineering (and, consequently, feature selection) is the core of a great data mining project. It comes to life in the Data Preparation phase, whose task includes constructive data preparation operations such as producing derived attributes, entire new records, or transformed values for existing attributes. A very good definition, elegant in its simplicity, is that feature engineering is the process of creating features that make machine learning algorithms work. And what makes it so important?
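As a concrete illustration of the "derived attributes" mentioned in the Data Preparation phase, here is a small sketch; the record and field names are hypothetical examples, not taken from any real dataset:

```python
from datetime import date

# A hypothetical raw record from a transactions table.
order = {"order_date": date(2021, 3, 14), "price": 120.0, "quantity": 3}

# Derived attributes: values that appear nowhere in the raw data
# but often carry much of the predictive signal.
features = {
    "day_of_week": order["order_date"].weekday(),    # 0 = Monday, 6 = Sunday
    "is_weekend": order["order_date"].weekday() >= 5,
    "total_amount": order["price"] * order["quantity"],
}
print(features)  # {'day_of_week': 6, 'is_weekend': True, 'total_amount': 360.0}
```

None of these three features exist in the source record, yet a model can usually exploit them far more easily than the raw columns they were derived from.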
Many unexploited opportunities have been evident to quants for decades, but although the solutions may be clear statistically at some level, the limiting factor has been computation. A simple example is incorporating unstructured data like online content, or semi-structured data like company reports and transaction data, into predictive models. Feature engineering is the process machine learning folks use to generate inputs to statistical models from raw input data. There are approaches for automated feature learning with techniques like deep learning -- recently, this has allowed us to unlock the potential of understanding and labeling images. Then there are approaches that require collaboration with subject matter experts.
In the world of Natural Language Processing (NLP), the most basic models are based on Bag of Words. But such models fail to capture the syntactic relations between words. For example, suppose we build a sentiment analyser based only on Bag of Words. Such a model will not be able to capture the difference between "I like you", where "like" is a verb with a positive sentiment, and "I am like you", where "like" is a preposition with a neutral sentiment. So this leaves us with a question -- how do we improve on the Bag of Words technique?
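To see the limitation in code, here is a minimal pure-Python Bag of Words sketch over those two sentences. The "like" dimension gets the same count in both vectors, so its verb vs. preposition role, and hence its sentiment, is invisible to any model built on top:

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Map a sentence to a vector of word counts over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocabulary]

docs = ["I like you", "I am like you"]
# Shared vocabulary built from both documents.
vocabulary = sorted({w for d in docs for w in d.lower().split()})

print(vocabulary)  # ['am', 'i', 'like', 'you']
for d in docs:
    print(d, "->", bag_of_words(d, vocabulary))
# I like you    -> [0, 1, 1, 1]
# I am like you -> [1, 1, 1, 1]
```

The two vectors differ only in the "am" dimension; word order and grammatical function are discarded entirely, which is exactly the weakness the rest of this post sets out to address.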