
Kaggle Image Competitions! How to Deal with Large Datasets


This is what I do when I have to deal with huge image datasets. Working with image datasets in Kaggle competitions can be quite problematic: your computer may simply freeze and stop responding. To keep that from happening, I'm going to share the 5 major steps for working with image datasets.

Could I use an image dataset to define a concept, and then rank images based on how close they are to that concept? • /r/MachineLearning


I just got done scraping over 20,000 images from Instagram, and I noticed that they were pretty much all the same size. I got an idea: would it be possible to develop a classifier that determines how strongly an attribute is present in an image? Let me explain: on Instagram, one account will generally have one topic: guns, EDM, cars, LGBT, etc. If I took 2,000 images from festivals, clubs, raves and whatnot, could I develop a system that would detect, for example, how "EDM" an image is?
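One possible way to frame the question, as a rough sketch rather than a tested answer: represent each image by a feature vector, average the vectors of the concept images into a centroid, and rank new images by cosine similarity to that centroid. Everything named here is an assumption for illustration: the feature vectors are made-up numbers (in practice they would come from some feature extractor), and `centroid`, `cosine_similarity`, `rank_by_concept` and the file names are hypothetical.

```python
import math

def centroid(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_concept(concept_vectors, candidates):
    """Sort candidate images by similarity to the concept centroid, best first."""
    concept = centroid(concept_vectors)
    scored = [(name, cosine_similarity(vec, concept))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional features standing in for real image embeddings.
edm_features = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
candidates = {
    "festival.jpg": [0.85, 0.15, 0.05],
    "car.jpg": [0.1, 0.9, 0.3],
}
ranking = rank_by_concept(edm_features, candidates)
```

The score is only a similarity, not a calibrated percentage, and how meaningful it is depends entirely on the quality of the features.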

How to deal with Unbalanced Image Datasets in less than 20 lines of code


There are many techniques for dealing with unbalanced data. One of them is oversampling, which consists of re-sampling the less frequent classes so that their counts are closer to those of the predominant classes. Although the idea is simple, implementing it is not so easy. imbalanced-learn is a Python package offering several re-sampling techniques commonly used on datasets that show strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
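To make the idea of random oversampling concrete, here is a minimal sketch that uses only the Python standard library. It is not the article's code and not imbalanced-learn itself (that package offers `RandomOverSampler` with a `fit_resample` method for array data); the function `random_oversample` and the file names are hypothetical. It works on image paths rather than pixel arrays, assuming the images are loaded later.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical file names: 6 "cat" images and only 2 "dog" images.
paths = [f"cat_{i}.jpg" for i in range(6)] + [f"dog_{i}.jpg" for i in range(2)]
labels = ["cat"] * 6 + ["dog"] * 2

new_paths, new_labels = random_oversample(paths, labels)
balanced_counts = Counter(new_labels)  # both classes now have 6 entries
```

Because the duplicates are exact copies, this is usually applied to the training split only, so that repeated images do not leak into validation data.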

Building and Labeling Image Datasets for Data Science Projects


Using standardized datasets is great for benchmarking new models and pipelines, or for competitions. But for me at least, a lot of the fun of data science comes when you get to apply things to a project of your own choosing. One of the key parts of this process is building a dataset, and there are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots, like when I was sick and built a facial recognition dataset from season 4 of The Flash and annotated it with labelImg.