chemmodlab: A Cheminformatics Modeling Laboratory for Fitting and Assessing Machine Learning Models
Jeremy R. Ash and Jacqueline M. Hughes-Oliver
It is now commonplace for researchers across a variety of fields to fit machine learning models to complex data in order to make predictions. The complexity of these data (e.g., a large number of features or nonlinear relationships with the response) often makes it difficult to determine a priori which machine learning modeling routine and which descriptors (also known as features, predictors, or covariates) will result in the best performance. A common approach to this problem is to fit many descriptor set and modeling routine (DM) combinations, compute measures of prediction performance on held-out data, and choose a DM combination by comparing relative performance. Often in a particular domain, only a few modeling routines are widely accepted, and researchers tend to use these methods exclusively. Unfortunately, these methods will not work well for every data set, and researchers might learn from other fields where different modeling methods tend to be more successful. A myriad of modeling methods are implemented in R that may be worthwhile for researchers to try (see Hastie et al. (2009) and Kuhn and Johnson (2013) for an overview of these methods). However, a lack of familiarity with the syntactic minutiae and statistical methodology required to fit and compare different modeling routines in R often deters users from attempting them.
Jul-11-2018