Anyone who has participated in machine learning hackathons and competitions can attest to how crucial feature engineering can be. It is often the difference between getting into the top 10 of the leaderboard and finishing outside the top 50! I have been a huge advocate of feature engineering ever since I realized it's immense potential. But it can be a slow and arduous process when done manually. I have to spend time brainstorming over what features to come up, and analyze their usability them from different angles.
In this paper, we present a data science automation system called Prediction Factory. The system uses several key automation algorithms to enable data scientists to rapidly develop predictive models and share them with domain experts. To assess the system's impact, we implemented 3 different interfaces for creating predictive modeling projects: baseline automation, full automation, and optional automation. With a dataset of online grocery shopper behaviors, we divided data scientists among the interfaces to specify prediction problems, learn and evaluate models, and write a report for domain experts to judge whether or not to fund to continue working on. In total, 22 data scientists created 94 reports that were judged 296 times by 26 experts. In a head-to-head trial, reports generated utilizing full data science automation interface reports were funded 57.5% of the time, while the ones that used baseline automation were only funded 42.5% of the time. An intermediate interface which supports optional automation generated reports were funded 58.6% more often compared to the baseline. Full automation and optional automation reports were funded about equally when put head-to-head. These results demonstrate that Prediction Factory has implemented a critical amount of automation to augment the role of data scientists and improve business outcomes.
Most automation in machine learning focuses on model selection and hyper parameter tuning, and many overlook the challenge of automatically defining predictive tasks. We still heavily rely on human experts to define prediction tasks, and generate labels by aggregating raw data. In this paper, we tackle the challenge of defining useful prediction problems on event-driven time-series data. We introduce MLFriend to address this challenge. MLFriend first generates all possible prediction tasks under a predefined space, then interacts with a data scientist to learn the context of the data and recommend good prediction tasks from all the tasks in the space. We evaluate our system on three different datasets and generate a total of 2885 prediction tasks and solve them. Out of these 722 were deemed useful by expert data scientists. We also show that an automatic prediction task discovery system is able to identify top 10 tasks that a user may like within a batch of 100 tasks.
The artificial intelligence market is fueled by the potential to use data to change the world. While many organizations have already successfully adapted to this paradigm, applying machine learning to new problems is still challenging. The single biggest technical hurdle that machine learning algorithms must overcome is their need for processed data in order to work -- they can only make predictions from numeric data. This data is composed of relevant variables, known as "features." The process for extracting these numeric features is called "feature engineering."