Collaborating Authors

Kaggle Image Competitions! How to Deal with Large Datasets


When I have to deal with huge image datasets, this is what I do. Working with image datasets in Kaggle competitions can be quite problematic: your computer can simply freeze and stop responding. To keep that from happening, I'm going to share the five major steps I follow when working with image datasets.
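The single most important of these steps is to avoid loading the whole dataset into memory at once. As a minimal sketch (the file names and batch size here are hypothetical, not from any specific competition), a plain Python generator can hand images to your pipeline one batch at a time:

```python
from typing import Iterator, List

def batch_paths(paths: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of file paths so images can be
    loaded and processed a chunk at a time instead of all at once."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]

# Example: 10 hypothetical image paths processed in batches of 4.
paths = [f"img_{i}.png" for i in range(10)]
batches = list(batch_paths(paths, 4))
print([len(b) for b in batches])  # batch sizes: [4, 4, 2]
```

Inside the loop you would open and preprocess only the images in the current batch, so peak memory stays bounded by the batch size rather than the dataset size.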

Correcting Improperly White-Balanced Images


We have generated a dataset of 65,416 sRGB images rendered using different in-camera white-balance presets (e.g., Fluorescent, Incandescent, Daylight) with different camera picture styles (e.g., Vivid, Standard, Neutral, Landscape). For each sRGB rendered image, we provide a target white-balanced image. To produce the correct target image, we manually select the "ground-truth" white from the middle gray patches in the color rendition chart, then apply a camera-independent rendering style (namely, Adobe Standard). The dataset is divided into two sets: an intrinsic set (Set 1) and an extrinsic set (Set 2). In addition to our main dataset, the Rendered WB dataset, we rendered the Cube dataset in the same manner.
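The idea of selecting a gray patch as the white reference can be sketched in a few lines: scale each color channel so the patch becomes neutral. This is a simplified illustration, not the dataset's actual rendering pipeline; the image values and patch region below are made up.

```python
import numpy as np

def white_balance_from_gray_patch(img: np.ndarray, patch: np.ndarray) -> np.ndarray:
    """Scale each channel so the selected gray patch becomes neutral.
    img, patch: float arrays in [0, 1] with a trailing RGB axis."""
    patch_mean = patch.reshape(-1, 3).mean(axis=0)  # per-channel mean of the patch
    gains = patch_mean.mean() / patch_mean          # gains that equalize the channels
    return np.clip(img * gains, 0.0, 1.0)

# A warm (reddish) color cast: the gray patch reads brighter in red.
img = np.ones((4, 4, 3)) * np.array([0.6, 0.5, 0.4])
patch = img[:2, :2]  # pretend this region is the chart's gray patch
balanced = white_balance_from_gray_patch(img, patch)
print(balanced[0, 0])  # all three channels now equal 0.5
```

Real white-balance correction in the camera pipeline is more involved (it happens before the nonlinear rendering to sRGB), which is exactly why a dataset of properly rendered targets is useful.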

A Poisson-Gaussian Denoising Dataset with Real Fluorescence Microscopy Images Machine Learning

Fluorescence microscopy has enabled a dramatic development in modern biology. Due to its inherently weak signal, fluorescence microscopy is not only much noisier than photography, but is also corrupted by Poisson-Gaussian noise, in which Poisson noise, or shot noise, is the dominant noise source, rather than the Gaussian noise that dominates in photography. To obtain clean fluorescence microscopy images, it is highly desirable to have effective denoising algorithms and datasets specifically designed for denoising fluorescence microscopy images. While such algorithms exist, no such datasets are available. In this paper, we fill this gap by constructing a dataset - the Fluorescence Microscopy Denoising (FMD) dataset - that is dedicated to Poisson-Gaussian denoising. The dataset consists of 12,000 real fluorescence microscopy images obtained with commercial confocal, two-photon, and wide-field microscopes and representative biological samples such as cells, zebrafish, and mouse brain tissues. We use image averaging to effectively obtain ground truth images, along with 60,000 noisy images at different noise levels. We use this dataset to benchmark 10 representative denoising algorithms and find that deep learning methods perform best. To our knowledge, this is the first microscopy image dataset for Poisson-Gaussian denoising, and it could be an important tool for high-quality, real-time denoising applications in biomedical research.
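The image-averaging idea rests on a simple fact: averaging repeated captures suppresses both the Poisson and Gaussian components. A minimal simulation (the photon count and read-noise level here are illustrative assumptions, not the FMD acquisition parameters) shows the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_gaussian(clean: np.ndarray, photons: float, read_sigma: float) -> np.ndarray:
    """Simulate one noisy capture: Poisson shot noise (dominant at low
    signal) plus additive Gaussian read noise."""
    shot = rng.poisson(clean * photons) / photons
    return shot + rng.normal(0.0, read_sigma, clean.shape)

clean = np.full((32, 32), 0.5)
captures = [poisson_gaussian(clean, photons=30, read_sigma=0.02) for _ in range(50)]
estimate = np.mean(captures, axis=0)  # averaging suppresses both noise sources

one_err = np.abs(captures[0] - clean).mean()
avg_err = np.abs(estimate - clean).mean()
print(avg_err < one_err)  # averaging 50 captures gives a far closer estimate
```

Because the per-capture noise is zero-mean (after the Poisson normalization), the error of the average shrinks roughly as 1/sqrt(N) in the number of captures, which is what makes averaged images usable as ground truth.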

Synthetic vs Real: Deep Learning on Controlled Noise Machine Learning

Abstract: Performing controlled experiments on noisy data is essential to thoroughly understanding deep learning across a spectrum of noise levels. Due to the lack of suitable datasets, previous research has only examined deep learning on controlled synthetic noise, and real-world noise has never been systematically studied in a controlled setting. To this end, this paper establishes a benchmark of real-world noisy labels at 10 controlled noise levels. As real-world noise possesses unique properties, to understand the difference, we conduct a large-scale study across a variety of noise levels and types, architectures, methods, and training settings. Our study shows that: (1) Deep Neural Networks (DNNs) generalize much better on real-world noise. We hope our benchmark, as well as our findings, will facilitate deep learning research on noisy data.

1 Introduction. "You take the blue pill, you wake up in your bed and believe whatever you want to believe. You take the red pill and I show you how deep the rabbit hole goes. Remember, all I'm offering is the truth." - Morpheus (The Matrix, 1999). Deep Neural Networks (DNNs) trained on noisy data demonstrate intriguing properties. For example, DNNs are capable of memorizing completely random training labels but generalize poorly on clean test data (Zhang et al., 2017). When trained with stochastic gradient descent, DNNs learn patterns first before memorizing the label noise (Arpit et al., 2017). These findings inspired recent research on noisy data. As training data are usually noisy, the fact that DNNs are able to memorize noisy labels highlights the importance of deep learning research on noisy data. To study DNNs on noisy data, previous work often performs controlled experiments by injecting a series of synthetic noises into a well-annotated dataset. The noise level p may vary in the range 0%-100%, where p = 0% is the clean dataset and p = 100% represents a dataset with zero correct labels.
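The standard synthetic-noise injection described above can be sketched concretely: with probability p, replace each label by a uniformly chosen different class (so p = 1.0 leaves no correct labels). This is a generic symmetric-noise sketch, not the benchmark's exact protocol:

```python
import random

def inject_symmetric_noise(labels, num_classes, p, seed=0):
    """Flip each label to a different uniformly chosen class with
    probability p (p=0.0 keeps the set clean; p=1.0 leaves no correct labels)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            wrong = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(wrong))
        else:
            noisy.append(y)
    return noisy

labels = [0, 1, 2] * 100
noisy = inject_symmetric_noise(labels, num_classes=3, p=0.4)
flipped = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
print(0.3 < flipped < 0.5)  # observed flip rate is close to the requested 40%
```

Real-world label noise differs from this scheme precisely because its errors are not uniform over classes; confusions concentrate on visually similar categories, which is the property the benchmark is built to capture.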

A Committee of Convolutional Neural Networks for Image Classification in the Concurrent Presence of Feature and Label Noise Machine Learning

Image classification has become a ubiquitous task. Models trained on good-quality data achieve accuracy that in some application domains is already above human-level performance. Unfortunately, real-world data are quite often degraded by noise in features and/or labels. Many papers handle the problem of either feature noise or label noise separately; however, to the best of our knowledge, this piece of research is the first attempt to address the concurrent occurrence of both types of noise. Based on the MNIST, CIFAR-10, and CIFAR-100 datasets, we show experimentally that the margin by which committees beat single models increases with the noise level, regardless of whether the disruption affects attributes or labels. This makes ensembles a legitimate choice for noisy images with noisy labels. The committees' advantage over single models is also positively correlated with dataset difficulty. We propose three committee selection algorithms that outperform a strong baseline algorithm which relies on an ensemble of individual (non-associated) best models.
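The core mechanism behind a committee is simple: combine the members' predictions, typically by majority vote, so that uncorrelated individual errors cancel. A minimal sketch (the three "models" below are hypothetical label lists, not trained networks, and the paper's selection algorithms are more elaborate than plain voting):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions by majority vote; `predictions` is a
    list of equal-length label lists, one per committee member."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical members; each is wrong on a different sample,
# so the committee recovers the correct label everywhere.
model_a = [0, 1, 1, 2]
model_b = [0, 0, 1, 2]
model_c = [1, 1, 1, 2]
print(majority_vote([model_a, model_b, model_c]))  # [0, 1, 1, 2]
```

Under noise, each member tends to memorize different corrupted samples, so their errors are only weakly correlated; that is what lets the vote beat any single model, and why the gap widens as noise grows.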