Data preprocessing for deep learning with nuts-ml

#artificialintelligence 

Data preprocessing is a fundamental part of any machine learning project and often more time is spent on the data preparation than on the actual machine learning. While some preprocessing tasks are problem specific many others such as partitioning data into training and test folds, stratifying samples or building mini-batches are generic. The following Canonical Pipeline shows the processing steps common for deep-learning in vision. A Reader reads sample data stored in text files, Excel or Pandas tables. The Splitter then partitions data into training, validation and test folds and performs stratification if needed. Usually not all image data can be loaded into memory and a Loader loads images on demand. These images are often processed by a Transformer, for resizing, cropping or other adjustments. Furthermore, to increase the training set additional images are synthesized by randomly augmenting (flipping, rotating, …) images using an Augmenter. Efficient, GPU-based machine learning demands that image and label data are grouped in mini-batches via a Batcher before passed on to the Network for training or inference. Finally, to keep track of the training progress, usually a Logger is employed to write training losses or accuracies to a log file. Some machine learning frameworks such as Keras provide (some of) these preprocessing components hidden behind an API that considerably simplify network training if it fits the task at hand. See the following excerpt of a Keras example to train a model with augmentation. However, what if an image format, augmentation or other preprocessing capability is needed that is not provided by the API? Extending a library such as Keras or others is not trivial and the common approach is to (re)implement the required functionality – often in a quick-and-dirty fashion. But implementing a robust data pipeline that loads, transforms, augments and processes images on demand is challenging and time consuming. The following excerpt from a nuts-ml example shows a pipeline for network training, where the operator defines the flow of data. In the example above training images are augmented, pixel values re-ranged, and the samples shuffled before building batches for network training. Finally, the mean over the batch-wise training losses is computed and printed. The resulting code is more readable and can readily be modified to experiment with different preprocessing schemes. Task-specific functions can easily be implement as nuts and added to the data flow. Any machine learning library that accepts mini-batches of Numpy arrays for training or inference is compatible. For more information about nuts-ml see the Introduction and have a look at the Tutorial. Bio: Stefan Maetschke (PhD) is a research scientist at IBM Research Australia where he develops machine learning infrastructure and models for medical image analysis.