Machine learning (ML) methods have been around for decades, but the Big Data revolution and the plummeting cost of computing power are now making them truly practical analytical tools in banking across a variety of use cases, including credit risk. ML algorithms may sound complex and futuristic, but the way they work is quite simple. Essentially, many of them combine a massive set of decision trees (decision-making models that break out individual decisions and their possible consequences, also known as "learners") to create an accurate model. By churning through these learners at high speed, ML models are able to find "hidden" patterns, particularly in unstructured data, that common statistical tools miss. Overfitting (modeling random errors instead of the underlying relationships) is a typical concern raised with regard to ML. It can be mitigated by carefully choosing the input variables and the specific algorithm used.
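To make the idea of combining many simple learners concrete, here is a minimal sketch in plain Python. It is an illustration only, not the method of any particular banking model: it assumes binary labels (0 or 1), uses one-split trees ("stumps") as the learners, trains each on a random resample of the data, and lets them vote. The names `fit_stump` and `fit_ensemble` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split decision tree to rows of (features, label) pairs.

    Tries every observed value of every feature as a threshold and keeps
    the split with the fewest misclassified training rows.
    """
    best = None
    for j in range(len(rows[0][0])):
        for features, _ in rows:
            threshold = features[j]
            left = [y for f, y in rows if f[j] <= threshold]
            right = [y for f, y in rows if f[j] > threshold]
            # Each side predicts its majority label; left is never empty
            # because the row that supplied the threshold falls into it.
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    _, j, threshold, left_label, right_label = best
    return lambda x: left_label if x[j] <= threshold else right_label


def fit_ensemble(rows, n_learners=25, seed=0):
    """Train n_learners stumps on bootstrap resamples and combine them by majority vote."""
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_learners)]

    def predict(x):
        votes = Counter(stump(x) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict


# Toy data (made up for illustration): the label is 1 when the first feature is at least 5.
rows = [((float(i), 0.0), int(i >= 5)) for i in range(10)]
predict = fit_ensemble(rows)
print(predict((0.0, 0.0)), predict((9.0, 0.0)))
```

Training each learner on a different resample is one common way to keep the combined model from memorizing noise in any single sample, which ties in with the overfitting concern above.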
To tackle this issue and make the correlation matrix much more insightful, let's transform it into a correlation plot. A correlation plot, also referred to as a correlogram, highlights the variables that are most strongly (positively or negatively) correlated. The correlogram represents the correlations for all pairs of variables. Positive correlations are displayed in blue and negative correlations in red. The intensity of the color is proportional to the absolute value of the correlation coefficient, so the stronger the correlation (i.e., the closer to -1 or 1), the darker the box.
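The color encoding described above can be sketched in a few lines of plain Python. This is not the plotting function used in the post, only an illustration of the mapping: `pearson`, `correlation_color` and `correlogram` are hypothetical helper names, the colors are RGB tuples that fade to white at zero correlation, and the sample columns are made up.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_color(r):
    """Map r in [-1, 1] to an (R, G, B) tuple.

    Positive values are blue, negative values are red, and the color
    gets stronger as |r| approaches 1 (white means no correlation).
    """
    r = max(-1.0, min(1.0, r))  # guard against floating-point overshoot
    fade = round(255 * (1 - abs(r)))
    return (fade, fade, 255) if r >= 0 else (255, fade, fade)


def correlogram(columns):
    """Return a color for every pair of variables in a dict of columns."""
    names = list(columns)
    return {
        (a, b): correlation_color(pearson(columns[a], columns[b]))
        for a in names
        for b in names
    }


# Made-up columns: y falls steadily as x rises, so the pair is perfectly negatively correlated.
grid = correlogram({"x": [1, 2, 3, 4], "y": [8, 6, 4, 2]})
print(grid[("x", "y")], grid[("x", "x")])
```

A plotting library would draw one colored box per pair; the dictionary here simply holds the color each box would receive.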
So far, we have investigated whether Father Age and Mother Age affect Gestation Week, and we know that both influence it: as Father Age increases, Gestation Week gets shorter, and as Mother Age increases, Gestation Week also gets shorter. But since we ran the two investigations separately, one for Father Age's influence on Gestation Week and another for Mother Age's, we still don't know which of the two is the direct cause. In this post, I'm going to investigate further to find out.
Over-fitting. If you perform a regression with 200 predictors (with strong cross-correlations among predictors), use meta regression coefficients: that is, use coefficients of the form f[Corr(Var, Response), a, b, c], where a, b, c are three meta-parameters (e.g. ...). This will reduce your number of parameters from 200 to 3 and eliminate most of the over-fitting. Perform the right type of cross-validation. If your training set has 400,000 observations distributed across 50 clients, and your test data set (used for cross-validation) has 200,000 observations but only 3 clients or 5 days' worth of historical data, then your cross-validation methodology is badly flawed. Better still, split your cross-validation data set into 5 subsets to compute confidence intervals. Make sure you've eliminated outliers and cleaned your data set.
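The advice to split the cross-validation data into 5 subsets can be sketched as follows. The text does not say how the confidence interval should be computed, so this sketch makes its own assumptions: it takes a list of per-observation errors, shuffles them into 5 subsets, and builds a 95% Student t interval from the 5 subset means. The function name `subset_confidence_interval` and the sample errors are hypothetical.

```python
import random
import statistics

N_SUBSETS = 5
T_CRITICAL_95 = 2.776  # two-sided 95% Student t quantile for N_SUBSETS - 1 = 4 degrees of freedom


def subset_confidence_interval(errors, seed=0):
    """Return (mean, (low, high)) for the mean error across 5 random subsets.

    `errors` is a sequence of per-observation errors measured on the
    cross-validation data; the interval reflects how much the mean error
    varies from one subset to the next.
    """
    if len(errors) < N_SUBSETS:
        raise ValueError("need at least one observation per subset")
    shuffled = list(errors)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled observations round-robin into 5 subsets.
    subsets = [shuffled[i::N_SUBSETS] for i in range(N_SUBSETS)]
    subset_means = [statistics.fmean(subset) for subset in subsets]
    mean = statistics.fmean(subset_means)
    std_err = statistics.stdev(subset_means) / N_SUBSETS ** 0.5
    return mean, (mean - T_CRITICAL_95 * std_err, mean + T_CRITICAL_95 * std_err)


# Made-up errors, for illustration only.
sample_errors = [float(i % 7) for i in range(100)]
mean, (low, high) = subset_confidence_interval(sample_errors)
print(f"mean error {mean:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

A random shuffle only addresses the interval itself. For the client example in the text, each subset would also need to cover a comparable mix of clients and dates, which this sketch does not check.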