A predictive model is, in essence, a formula that transforms a list of input fields (variables) into some output of interest. Feature engineering is the thoughtful creation of new input fields from existing ones, either manually or in an automated fashion, guided by domain expertise, logical reasoning, or intuition. The new input fields can yield better inferences and insights from the data and can substantially improve the performance of predictive models. Feature engineering is one of the most important parts of the data preparation process: it is where new and meaningful variables are derived, enhancing and enriching the ingredients needed to build a robust model.
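As a minimal sketch of what "creating new input fields from existing input fields" can look like, the snippet below derives three features from a toy customer table. The field names (`total_spend`, `n_orders`, `signup`, `last_order`) and the helper `engineer_features` are hypothetical and chosen only for illustration; real features would come from the domain knowledge described above.

```python
from datetime import date

# Hypothetical raw records; the field names are illustrative assumptions.
raw_rows = [
    {"total_spend": 240.0, "n_orders": 4,
     "signup": date(2020, 1, 1), "last_order": date(2020, 3, 1)},
    {"total_spend": 90.0, "n_orders": 3,
     "signup": date(2021, 5, 10), "last_order": date(2021, 5, 20)},
]


def engineer_features(row):
    """Return a copy of `row` with derived input fields added."""
    out = dict(row)
    # Ratio feature: combines two raw fields into one comparable quantity.
    out["avg_order_value"] = row["total_spend"] / row["n_orders"]
    # Date-difference feature: days between signup and the latest order.
    out["tenure_days"] = (row["last_order"] - row["signup"]).days
    # Date-part feature: month of the latest order, which may capture seasonality.
    out["last_order_month"] = row["last_order"].month
    return out


engineered_rows = [engineer_features(r) for r in raw_rows]
```

The raw fields are kept alongside the derived ones, so a later feature-selection step can decide which of them actually help the model.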
Lots of people have different definitions for feature engineering and preprocessing, so how does HyperparameterHunter define it? A fair question, since feature engineering is rarely a topic in hyperparameter optimization. We're working with a very broad definition of "feature engineering", hence the blurred line between it and "preprocessing". We consider "feature engineering" to be any modification applied to data before model fitting, whether performed once on Experiment start or repeated for every fold in cross-validation. Technically, though, HyperparameterHunter lets you define the particulars of "feature engineering" for yourself, which we'll see soon. Here are a few things that fall under our umbrella of "feature engineering":
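To make the difference between a step performed once and a step repeated for every cross-validation fold concrete, here is a rough, library-agnostic sketch in plain Python. It does not use HyperparameterHunter's API; the helpers `log_transform`, `fit_standardizer`, and `k_fold_indices` are hypothetical names invented for this illustration, and the data is a toy example.

```python
import math
from statistics import mean, pstdev

# Toy single-feature dataset; the values are illustrative only.
values = [1.0, 10.0, 100.0, 1000.0, 10.0, 100.0]


def log_transform(xs):
    """Stateless step: learns nothing from the data, so it can run once up front."""
    return [math.log10(x) for x in xs]


def fit_standardizer(train):
    """Stateful step: learns the mean and standard deviation from training rows only."""
    mu, sigma = mean(train), pstdev(train)
    return lambda xs: [(x - mu) / sigma for x in xs]


def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_size = n // k
    for f in range(k):
        valid = list(range(f * fold_size, (f + 1) * fold_size))
        train = [i for i in range(n) if i not in valid]
        yield train, valid


# Performed once, before cross-validation starts.
logged = log_transform(values)

# Repeated for every fold: refit the scaler on that fold's training rows,
# then apply the same fitted scaler to the validation rows.
folds = []
for train_idx, valid_idx in k_fold_indices(len(logged), k=3):
    scale = fit_standardizer([logged[i] for i in train_idx])
    train_x = scale([logged[i] for i in train_idx])
    valid_x = scale([logged[i] for i in valid_idx])
    folds.append((train_x, valid_x))
```

The design point is that a step which learns statistics from the data is refit inside each fold so that validation rows never influence it, while a purely element-wise step can safely run once on the full dataset.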
I think you are putting the cart before the horse a little bit... That is, before stating the problem mathematically you need to have an idea of what data is there. This means both talking to the people who already use it on a day-to-day basis in order to understand their process, and seeing where and how it is stored. Then you figure out whether it is suitable for the task at hand and what the business wants. Only then can you bother with stating it as a math problem, feature engineering, etc. IMO most of your time will go into figuring out the problem, understanding the data, and cleaning the data, so you should pay more attention to that.
"Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren't." This is the first sentence in a Google-internal document I read about how to apply ML. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than from using some sort of complex algorithm. Even the CRISP-DM data mining process has not one but two stages dedicated solely to data understanding and preparation.
In the two previous Kaggle tutorials, you learned how to get your data into a form suitable for building your first machine learning model, using Exploratory Data Analysis and baseline machine learning models. Next, you built your first machine learning model, a decision tree classifier. You submitted all these models to Kaggle and interpreted their accuracy.