Collaborating Authors

Enrich your train fold with a custom sampler inside an imblearn pipeline


Once you have a set of augmented data to enrich your original data set, you will ask yourself how and at which point to merge them. Typically you are using sklearn and its modules to evaluate your estimator or search for optimal hyper-parameters. By utilizing a cross validation method to measure the performance of your estimator, your data is split in a train and a test set. This is done dynamically under the hood of the sklearn methods. This is usually fine and it means that you don't have to bother with it more than necessary. There is just one problem when you want to use augmented data with a cross validation method -- you don't want to have augmented data in your test fold.

Foundations of data imbalance and solutions for a data democracy Artificial Intelligence

Dealing with imbalanced data is a prevalent problem while performing classification on the datasets. Many times, this problem contributes to bias while making decisions or implementing policies. Thus, it is vital to understand the factors which causes imbalance in the data (or class imbalance). Such hidden biases and imbalances can lead to data tyranny, and a major challenge to a data democracy. In this chapter, two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept, solving such issues helps in building the foundations of a data democracy. Further, statistical measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset (car insurance claims). In the end, popular data-level methods such as Random Oversampling, Random Undersampling, SMOTE, Tomek Link, and others are implemented in Python, and their performance is compared. Keywords - Imbalanced Data, Degree of Class Imbalance, Complexity of the Concept, Statistical Assessment Metrics, Undersampling and Oversampling 1. Motivation & Introduction In the real-world, data are collected from various sources like social networks, websites, logs, and databases. Whilst dealing with data from different sources, it is very crucial to check the quality of the data [1]. Data with questionable quality can introduce different types of biases in various stages of the data science lifecycle. These biases sometime can affect the association between variables, and in many cases could represent the opposite of the actual behavior [2].

Machine Learning Classification with Python for Direct Marketing


How to make business more time-efficient, slash costs and drive up sales? The question is timeless but not rhetorical. In the next few minutes of your reading time, I will apply a few classification algorithms to demonstrate how the use of the data analytic approach can contribute to that end. Together we'll create a predictive model that will help us customise the client databases we hand over to the telemarketing team so that they could concentrate resources on more promising clients first. On course to that, we'll perform a number of actions on the dataset.

Stop using SMOTE to handle all your Imbalanced Data


In classification tasks, one may encounter a situation where the target class label is not equally distributed. Such a dataset can be termed Imbalanced data. Imbalance in data can be a blocker to train a data science model. In case of imbalance class problems, the model is trained mainly on the majority class and the model becomes biased towards the majority class prediction. Hence handling of imbalance class is essential before proceeding to the modeling pipeline.

IMBENS: Ensemble Class-imbalanced Learning in Python Artificial Intelligence

It provides access to multiple state-of-art ensemble imbalanced learning (EIL) methods, visualizer, and utility functions for dealing with the class imbalance problem. These ensemble methods include resampling-based, e.g., under/over-sampling, and reweighting-based ones, e.g., cost-sensitive learning. Beyond the implementation, we also extend conventional binary EIL algorithms with new functionalities like multi-class support and resampling scheduler, thereby enabling them to handle more complex tasks. The package was developed under a simple, well-documented API design follows that of scikit-learn for increased ease of use.