We have generated a dataset of 65,416 sRGB images rendered using different white-balance presets in the camera (e.g., Fluorescent, Incandescent, Daylight) with different camera picture styles (e.g., Vivid, Standard, Neutral, Landscape). For each sRGB rendered image, we provide a target white-balanced image. To produce the correct target image, we manually select the "ground-truth" white from the middle gray patches in the color rendition chart and then apply a camera-independent rendering style (namely, Adobe Standard). The dataset is divided into two sets: an intrinsic set (Set 1) and an extrinsic set (Set 2). In addition to our main dataset, the Rendered WB dataset, we rendered the Cube dataset in the same manner.
I just got done scraping over 20,000 images from Instagram, and I noticed that they were pretty much all the same size. I got an idea: would it be possible to develop a classifier that determines how strongly an attribute is present in an image, as a percentage? Let me explain: on Instagram, generally, one account will have one topic: guns, EDM, cars, LGBT, etc. If I took 2,000 images from festivals, clubs, raves and whatnot, could I develop a system that would detect, for example, how "EDM" an image is?
There are a lot of techniques to deal with unbalanced data. One of them is oversampling, which consists of re-sampling the less frequent classes so that their number of samples is closer to that of the predominant classes. Although the idea is simple, implementing it is not so easy. imbalanced-learn is a Python package offering several re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
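To make the idea of oversampling concrete, here is a minimal sketch of the simplest variant, random oversampling, written with only the Python standard library. It is not the imbalanced-learn implementation (that package exposes the same idea through `RandomOverSampler` and its `fit_resample` method); the function name `random_oversample`, its `seed` parameter, and the toy data are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is grown to the size of the largest one.
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: five samples of class "a" and a single sample of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = ["a", "a", "a", "a", "a", "b"]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 5 samples
```

Because the extra samples are exact copies, this variant adds no new information and can encourage overfitting on the minority class, so it should be applied only to the training split, never to validation or test data.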
Using standardized datasets is great for benchmarking new models/pipelines or for competitions. But for me, at least, a lot of the fun of data science comes when you get to apply things to a project of your own choosing. One of the key parts of this process is building a dataset. There are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots, like when I was sick and built a facial recognition dataset using season 4 of The Flash and annotated it with labelImg.