This is what I do when I have to deal with huge image datasets. Working with image datasets in Kaggle competitions can be quite problematic: your computer can simply freeze and stop responding. To keep that from happening, I'm going to share the 5 major steps I follow to work with image datasets.
We have generated a dataset of 65,416 sRGB images rendered using different white-balance presets in the camera (e.g., Fluorescent, Incandescent, Daylight) with different camera picture styles (e.g., Vivid, Standard, Neutral, Landscape). For each sRGB rendered image, we provide a target white-balanced image. To produce the correct target image, we manually select the "ground-truth" white from the middle gray patches in the color rendition chart, followed by applying a camera-independent rendering style (namely, Adobe Standard). The dataset is divided into two sets: intrinsic set (Set 1) and extrinsic set (Set 2). In addition to our main dataset, the Rendered WB dataset, we rendered the Cube dataset in the same manner.
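The gray-patch step described above can be illustrated with a simple diagonal (per-channel gain) correction. This is only a minimal sketch under stated assumptions, not the procedure used to build the dataset: it assumes a linear RGB image with values in [0, 1], the function name `white_balance_from_gray_patch` and the `patch_box` coordinates are hypothetical, and the camera-independent rendering step (Adobe Standard) is not modeled.

```python
import numpy as np

def white_balance_from_gray_patch(linear_rgb, patch_box):
    """Scale each channel so that a chosen gray patch becomes neutral.

    linear_rgb: H x W x 3 array, assumed to be linear RGB in [0, 1].
    patch_box:  (top, bottom, left, right) pixel bounds of the gray patch
                (hypothetical; in practice this is picked by hand).
    """
    img = np.asarray(linear_rgb, dtype=np.float64)
    top, bottom, left, right = patch_box
    # Mean color of the gray patch, taken as the scene's "white" estimate.
    patch_mean = img[top:bottom, left:right].reshape(-1, 3).mean(axis=0)
    # Per-channel gains, normalized so the green channel is left unchanged.
    gains = patch_mean[1] / patch_mean
    balanced = np.clip(img * gains, 0.0, 1.0)
    return balanced, gains

# Usage: a uniformly tinted image whose "gray patch" has a color cast.
tinted = np.ones((8, 8, 3)) * np.array([0.4, 0.5, 0.25])
balanced, gains = white_balance_from_gray_patch(tinted, (2, 6, 2, 6))
```

After the correction, the three channels of the patch are equal, which is what "neutral" means for a gray reference.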
Fluorescence microscopy has enabled a dramatic development in modern biology. Due to its inherently weak signal, fluorescence microscopy is not only much noisier than photography, but is also affected by Poisson-Gaussian noise, where Poisson noise, or shot noise, is the dominating noise source, instead of the Gaussian noise that dominates in photography. To get clean fluorescence microscopy images, it is highly desirable to have effective denoising algorithms and datasets that are specifically designed to denoise fluorescence microscopy images. While such algorithms exist, there are no such datasets available. In this paper, we fill this gap by constructing a dataset - the Fluorescence Microscopy Denoising (FMD) dataset - that is dedicated to Poisson-Gaussian denoising. The dataset consists of 12,000 real fluorescence microscopy images obtained with commercial confocal, two-photon, and wide-field microscopes and representative biological samples such as cells, zebrafish, and mouse brain tissues. We use image averaging to effectively obtain ground truth images and 60,000 noisy images with different noise levels. We use this dataset to benchmark 10 representative denoising algorithms and find that deep learning methods have the best performance. To our knowledge, this is the first microscopy image dataset for Poisson-Gaussian denoising purposes and it could be an important tool for high-quality, real-time denoising applications in biomedical research.
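To make the noise model and the averaging idea concrete, the following is a minimal synthetic sketch, not the acquisition procedure of the dataset. It assumes a clean image with values in [0, 1]; the function name `simulate_noisy_frames` and the parameters `gain` (expected photon count per unit intensity) and `read_sigma` (standard deviation of the Gaussian component) are illustrative choices.

```python
import numpy as np

def simulate_noisy_frames(clean, num_frames, gain=30.0, read_sigma=0.02, seed=0):
    """Draw repeated captures of one scene under Poisson-Gaussian noise.

    clean:      H x W array of non-negative intensities (assumed in [0, 1]).
    gain:       assumed photons per unit intensity; lower means noisier.
    read_sigma: assumed std of the additive, signal-independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    clean = np.asarray(clean, dtype=np.float64)
    shape = (num_frames,) + clean.shape
    # Shot noise: photon counts are Poisson with mean proportional to signal.
    shot = rng.poisson(clean * gain, size=shape) / gain
    # Gaussian noise: independent of the signal.
    read = rng.normal(0.0, read_sigma, size=shape)
    return shot + read

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Usage: averaging many noisy frames approaches the clean image.
clean = np.full((32, 32), 0.5)
frames = simulate_noisy_frames(clean, num_frames=50)
averaged = frames.mean(axis=0)
single_error = rmse(frames[0], clean)
averaged_error = rmse(averaged, clean)
```

Because the noise in each frame is independent and zero-mean, averaging N frames reduces the noise standard deviation by roughly a factor of sqrt(N), which is why an average over many captures can stand in for a ground-truth image, and averages over fewer captures give intermediate noise levels.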
ABSTRACT

Performing controlled experiments on noisy data is essential in thoroughly understanding deep learning across a spectrum of noise levels. Due to the lack of suitable datasets, previous research has only examined deep learning on controlled synthetic noise, and real-world noise has never been systematically studied in a controlled setting. To this end, this paper establishes a benchmark of real-world noisy labels at 10 controlled noise levels. As real-world noise possesses unique properties, to understand the difference, we conduct a large-scale study across a variety of noise levels and types, architectures, methods, and training settings. Our study shows that: (1) Deep Neural Networks (DNNs) generalize much better on real-world noise. We hope our benchmark, as well as our findings, will facilitate deep learning research on noisy data.

1 INTRODUCTION

"You take the blue pill you wake up in your bed and believe whatever you want to believe. You take the red pill and I show you how deep the rabbit hole goes. Remember, all I'm offering is the truth." Morpheus (The Matrix 1999)

Deep Neural Networks (DNNs) trained on noisy data demonstrate intriguing properties. For example, DNNs are capable of memorizing completely random training labels but generalize poorly on clean test data Zhang et al. (2017). When trained with stochastic gradient descent, DNNs learn patterns first before memorizing the label noise Arpit et al. (2017). These findings inspired recent research on noisy data. As training data are usually noisy, the fact that DNNs are able to memorize the noisy labels highlights the importance of deep learning research on noisy data.

To study DNNs on noisy data, previous work often performs controlled experiments by injecting a series of synthetic noises into a well-annotated dataset. The noise level p may vary in the range of 0%-100%, where p = 0% is the clean dataset whereas p = 100% represents the dataset of zero correct labels.
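As an illustration of what injecting synthetic noise at a controlled level p can look like, here is a minimal sketch of one common scheme: symmetric (uniform) label flipping. It is an assumption for illustration, not necessarily the exact protocol of the prior work referred to above; the function name `inject_symmetric_noise` is hypothetical. Every selected label is moved to a different class, so p = 100% yields zero correct labels, matching the definition in the text.

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, p, seed=0):
    """Replace a fraction p of labels with a different, uniformly chosen class.

    labels:      1-D integer array of clean labels in [0, num_classes).
    num_classes: number of classes (assumed >= 2).
    p:           noise level in [0, 1]; 0 keeps the clean dataset.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n = labels.shape[0]
    num_flip = int(round(p * n))
    # Pick which examples get a wrong label, without repetition.
    idx = rng.choice(n, size=num_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=num_flip)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy

# Usage: sweep a few controlled noise levels on a toy label vector.
clean_labels = np.arange(1000) % 10
noisy_0 = inject_symmetric_noise(clean_labels, 10, 0.0)
noisy_40 = inject_symmetric_noise(clean_labels, 10, 0.4)
noisy_100 = inject_symmetric_noise(clean_labels, 10, 1.0)
```

Fixing the seed keeps each noise level reproducible, which matters when comparing methods across the same corrupted training sets.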
Using standardized datasets is great for benchmarking new models/pipelines or for competitions. But for me, at least, a lot of the fun of data science comes when you get to apply things to a project of your own choosing. One of the key parts of this process is building a dataset, and there are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots, like when I was sick and built a facial recognition dataset from season 4 of The Flash and annotated it with labelimg.