When I have to deal with huge image datasets, this is what I do. Working with image datasets in Kaggle competitions can be quite problematic: your computer can freeze and stop responding entirely. To keep that from happening, I'm going to share the 5 major steps I follow when working with image datasets.
I just got done scraping over 20,000 images from Instagram, and I noticed that they were pretty much all the same size. That gave me an idea: would it be possible to build a classifier that estimates how strongly an image expresses a given attribute? Let me explain: on Instagram, one account will generally stick to one topic: guns, EDM, cars, LGBT, etc. If I took 2,000 images from festivals, clubs, raves, and the like, could I build a system that detects, for example, how "EDM" an image is?
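One way to read "how EDM an image is" is as the predicted probability of a binary classifier trained on EDM vs. non-EDM accounts. Here is a minimal sketch of that idea; the feature vectors, class means, and dimensionality are all made up for illustration (in practice you'd use embeddings from a pretrained CNN rather than random numbers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: each image is reduced to an 8-dim feature vector.
# Labels: 1 = scraped from EDM/festival accounts, 0 = everything else.
rng = np.random.default_rng(0)
X_edm = rng.normal(loc=1.0, size=(100, 8))
X_other = rng.normal(loc=-1.0, size=(100, 8))
X = np.vstack([X_edm, X_other])
y = np.array([1] * 100 + [0] * 100)

clf = LogisticRegression().fit(X, y)

# The "how EDM is this image?" score is simply the classifier's
# predicted probability for the positive class.
new_image = rng.normal(loc=0.8, size=(1, 8))
score = clf.predict_proba(new_image)[0, 1]
print(f"EDM-ness: {score:.0%}")
```

The same trick extends to multiple topics: a softmax classifier over {guns, EDM, cars, ...} gives a percentage per topic for every image.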
There are many techniques for dealing with imbalanced data. One of them is oversampling, which consists of re-sampling the less frequent classes to bring their counts closer to the predominant ones. Although the idea is simple, implementing it well is not so easy. imbalanced-learn is a Python package offering several re-sampling techniques commonly used on datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
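To make the idea concrete, here is a minimal NumPy sketch of random oversampling: minority classes are re-sampled with replacement until every class matches the majority count. This is the same basic strategy that imbalanced-learn's `RandomOverSampler` implements (that class also offers a one-line `fit_resample(X, y)` API); the toy data below is invented for illustration:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Re-sample minority classes with replacement until every
    class matches the majority class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        idx.append(cls_idx)  # keep all original samples
        if count < target:
            # draw extra copies of minority samples at random
            idx.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# Toy imbalanced dataset: 9 samples of class 0, only 2 of class 1.
X = np.arange(22).reshape(11, 2)
y = np.array([0] * 9 + [1] * 2)
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # both classes now have 9 samples
```

Note that oversampling should be applied only to the training split, never before the train/test split, or the duplicated samples leak into evaluation.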
Using standardized datasets is great for benchmarking new models/pipelines or for competitions. But for me, at least, a lot of the fun of data science comes when you get to apply things to a project of your own choosing, and one of the key parts of that process is building a dataset. There are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots, like when I was sick and built a facial recognition dataset using season 4 of The Flash, annotating it with labelImg.