"Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren't." This is the first sentence in a Google-internal document I read about how to apply ML. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than using some sort of complex algorithm. Even the CRISP data mining process has not one, but two stages dedicated solely to data understanding and preparation.
This paper describes the autofeat Python library, which provides a scikit-learn style linear regression model with automatic feature engineering and selection capabilities. Complex non-linear machine learning models such as neural networks are in practice often difficult to train and even harder to explain to non-statisticians, who require transparent analysis results as a basis for important business decisions. While linear models are efficient and intuitive, they generally provide lower prediction accuracies. Our library provides a multi-step feature engineering and selection process, where first a large pool of non-linear features is generated, from which then a small and robust set of meaningful features is selected, which improve the prediction accuracy of a linear model while retaining its interpretability.
I was browsing twitter yesterday (follow me!) when I came across this tweet by Data Science Renee linking to this Medium article called "Top 6 Errors Novice Machine Learning Engineers Make" by Christopher Dossman. This drew my attention because I'm somewhat new to the field (and even if I weren't, it's always worth reviewing the fundamentals).
As you are no doubt aware, simple date fields are potential treasure troves of data. While, at first glance, a date gives us nothing more than a specific point on a timeline, knowing where this point on the line is relative to other points can generate all sort of insights into a dataset. What you want out of a date is dependent on what it is you are doing. Having external resources containing the answer to some of the less-intrinsic questions above ("Were the Olympics taking place on that date?" -- perhaps a perfectly valid question given your project) would certainly be necessary, but even sussing out the more elementary questions could prove immensely useful. Simple feature engineering on dates can mindlessly take care of the latter.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this article, we model a sequence of credit card transactions from three different perspectives, namely (i) does the sequence contain a Fraud? (ii) Is the sequence obtained by fixing the card-holder or the payment terminal? (iii) Is it a sequence of spent amount or of elapsed time between the current and previous transactions? Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sets is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood to a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. This multiple perspectives HMM-based approach enables an automatic feature engineering in order to model the sequential properties of the dataset with respect to the classification task. This strategy allows for a 15% increase in the precision-recall AUC compared to the state of the art feature engineering strategy for credit card fraud detection.