The correct use of model evaluation, model selection, and algorithm selection techniques is vital in academic machine learning research as well as in many industrial settings. This article reviews different techniques that can be used for each of these three subtasks and discusses the main advantages and disadvantages of each technique with references to theoretical and empirical studies. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning. Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and k-fold cross-validation are reviewed, the bias-variance trade-off for choosing k is discussed, and practical tips for the optimal choice of k are given based on empirical evidence. Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed. Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.
Deep learning models have the capacity to fundamentally revolutionize medical imaging analysis, and they have particularly interesting applications in computer-aided diagnosis. We attempt to use deep learning neural networks to diagnose funduscopic images, visual representations of the interior of the eye. Recently, a few robust deep learning approaches have performed binary classification to infer the presence of a specific ocular disease, such as glaucoma or diabetic retinopathy. In an effort to broaden the applications of computer-aided ocular disease diagnosis, we propose a unifying model for disease classification: low-cost inference of a fundus image to determine whether it is healthy or diseased. To achieve this, we use transfer learning techniques, which retain the more overarching capabilities of a pre-trained base architecture but can adapt to another dataset. For comparisons, we then develop a custom heuristic equation and evaluation metric ranking system to determine the optimal base architecture and hyperparameters. The Xception base architecture, Adam optimizer, and mean squared error loss function perform best, achieving 90% accuracy, 94% sensitivity, and 86% specificity. For additional ease of use, we contain the model in a web interface whose file chooser can access the local filesystem, allowing for use on any internet-connected device: mobile, PC, or otherwise.
I've spent the majority of the last two years working almost exclusively in the deep learning space. It's been quite an experience – worked on multiple projects including image and video data related ones. Before that, I was on the fringes – I skirted around deep learning concepts like object detection and face recognition – but didn't take a deep dive until late 2017. I've come across a variety of challenges during this time. And I want to talk about four very common ones that most deep learning practitioners and enthusiasts face in their journey.
In contrast to k-nearest neighbors, a simple example of a parametric method would be logistic regression, a generalized linear model with a fixed number of model parameters: a weight coefficient for each feature variable in the dataset plus a bias (or intercept) unit. While the learning algorithm optimizes an objective function on the training set (with exception to lazy learners), hyperparameter optimization is yet another task on top of it; here, we typically want to optimize a performance metric such as classification accuracy or the area under a Receiver Operating Characteristic curve. Thinking back of our discussion about learning curves and pessimistic biases in Part II, we noted that a machine learning algorithm often benefits from more labeled data; the smaller the dataset, the higher the pessimistic bias and the variance -- the sensitivity of our model towards the way we partition the data. We start by splitting our dataset into three parts, a training set for model fitting, a validation set for model selection, and a test set for the final evaluation of the selected model.
Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates. In this third part, we will focus on different methods of cross-validation for model evaluation and model selection. We will use these cross-validation techniques to rank models from several hyperparameter configurations and estimate how well they generalize to independent datasets. Previously, we used the holdout method or different flavors of bootstrapping to estimate the generalization performance of our predictive models.