Good features are the backbone of any machine learning model, and creating them often takes domain knowledge, creativity, and a lot of time. TL;DR: this post covers the feature engineering methods and tricks that I have learned and end up using often, along with some other ways to think about feature creation. Have you read about featuretools yet? If not, you are going to be delighted.
This post is by Gal Oshri, a Program Manager in the Data Group at Microsoft. RFM (recency, frequency, monetary value) is a simple and intuitive technique for segmenting customers and has been used by marketers for decades. Despite its simplicity, RFM also has surprising value in machine learning applications. This blog post describes how a generic technique has allowed us to come within 1% accuracy of winning solutions in various ML competitions, such as placing in the top 30 entries of the KDD Cup 2015 and getting a boost of 502 positions on the leaderboard of an AirBnB Kaggle competition. RFM has been widely used in direct marketing and database marketing to identify the customers who are most likely to respond or make a purchase.
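To make the idea concrete, here is a minimal sketch of how RFM features might be computed from a transaction log. It is not the method from the post above: the `transactions` data, the `rfm_features` helper, and the choice of reference date are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount).
transactions = [
    ("alice", date(2017, 1, 5), 40.0),
    ("alice", date(2017, 2, 20), 25.0),
    ("bob", date(2016, 11, 2), 300.0),
    ("carol", date(2017, 2, 27), 10.0),
    ("carol", date(2017, 2, 28), 15.0),
    ("carol", date(2017, 3, 1), 5.0),
]


def rfm_features(rows, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    recency_days: days between the customer's last purchase and `as_of`
    frequency:    number of purchases on or before `as_of`
    monetary:     total amount spent on or before `as_of`
    """
    last_seen = {}
    count = defaultdict(int)
    total = defaultdict(float)
    for customer, day, amount in rows:
        if day > as_of:
            continue  # skip events after the reference date to avoid leakage
        if customer not in last_seen or day > last_seen[customer]:
            last_seen[customer] = day
        count[customer] += 1
        total[customer] += amount
    return {
        c: ((as_of - last_seen[c]).days, count[c], total[c])
        for c in last_seen
    }


features = rfm_features(transactions, as_of=date(2017, 3, 1))
for customer, (recency, frequency, monetary) in sorted(features.items()):
    print(customer, recency, frequency, monetary)
```

In a machine learning setting, the three numbers per customer (or per any other entity with timestamped events) can be fed to a model as ordinary numeric columns. Fixing the reference date before the prediction window is one way to keep future information out of the features.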
Malware recognition modules decide whether an object is a threat based on the data they have collected about it. This data may be collected at different phases. Pre-execution phase data is anything you can tell about a file without executing it. This may include executable file format descriptions, code descriptions, binary data statistics, text strings, information extracted via code emulation, and other similar data. In the early days of the cyber era, the number of malware threats was relatively low, and simple handcrafted pre-execution rules were often enough to detect threats. About a decade ago, however, the tremendous growth of the malware stream meant that anti-malware solutions could no longer rely solely on the expensive manual creation of detection rules.
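As a rough illustration of what "binary data statistics" and "text strings" can look like as features, the sketch below computes byte entropy and extracts printable strings from raw bytes without executing anything. It is an assumption-laden toy, not how any particular anti-malware product works: the helper names, the minimum string length of 4, and the chosen feature set are all hypothetical.

```python
import math
import re
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def printable_strings(data: bytes, min_len: int = 4):
    """Runs of printable ASCII at least `min_len` bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def pre_execution_features(data: bytes) -> dict:
    """Hypothetical static feature vector for one file's contents."""
    strings = printable_strings(data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "num_strings": len(strings),
        # Windows PE executables begin with the two bytes "MZ".
        "has_mz_header": data[:2] == b"MZ",
    }


# Made-up byte string standing in for a file's contents.
sample = b"MZ\x00\x00hello world\x00\x01\x02kernel32.dll\x00"
sample_features = pre_execution_features(sample)
print(sample_features)
```

Features like these are cheap to compute at scale, which is part of why they suit learned detectors better than hand-written rules. High entropy, for example, is often associated with packed or encrypted content, though on its own it is far from conclusive.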
It is midnight on January 18, 2017, and the Outbrain Click Prediction machine learning competition has just finished. It has been three and a half months of working late. As I scroll through the leaderboard page, I find my name in 19th position, which puts me in the top 2% of nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put real effort into! One of the reasons I managed to score well was that Google Cloud Platform (GCP) made my life easier, so I could focus on the data.