Goto

Collaborating Authors

 vtreat


When Cross-Validation is More Powerful than Regularization

#artificialintelligence

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensitive to idiosyncracies in the training data, and hence, less likely to overfit. Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let's call them base learners) that are then used as variables in downstream models, as shown in Figure 1.


Encoding categorical variables: one-hot and beyond

#artificialintelligence

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can't point to it as it is everywhere. Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix() Note: do not use base::data.matrix() The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many R users are not familiar with the above issue as encoding is hidden in model training, and how to encode new data is stored as part of the model.


Encoding categorical variables: one-hot and beyond

#artificialintelligence

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can't point to it as it is everywhere. Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix() Note: do not use base::data.matrix() The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data).


vtreat: prepare data

#artificialintelligence

This article is on preparing data for modeling in R using vtreat. Suppose we wish to work with some data. Our example task is to train a classification model for credit approval using the ranger implementation of the random forests method. We will take our data from John Ross Quinlan's re-processed "credit approval" dataset hosted at Lichman, M. (2013). For convenience we have copied the data to our working directory here.