Decision Tree Learning
A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set
Besse, Philippe, del Barrio, Eustasio, Gordaliza, Paula, Loubes, Jean-Michel, Risser, Laurent
Applications based on Machine Learning models have now become an indispensable part of everyday life and the professional world. A critical question has recently arisen among the population: Do algorithmic decisions convey any type of discrimination against specific groups of the population or minorities? In this paper, we show the importance of understanding how a bias can be introduced into automatic decisions. We first present a mathematical framework for the fair learning problem, specifically in the binary classification setting. We then propose to quantify the presence of bias by using the standard Disparate Impact index on the real and well-known Adult income data set. Finally, we check the performance of different approaches aiming to reduce the bias in binary classification outcomes. Importantly, we show that some intuitive methods are ineffective. This sheds light on the fact that trying to make fair machine learning models may be a particularly challenging task, in particular when the training observations contain a bias.
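As a rough illustration of the Disparate Impact index mentioned in the abstract above, the sketch below uses one common definition: the positive-outcome rate of the protected group divided by that of the remaining group. The function name, the 0/1 encoding, and the toy data are assumptions made for this example and are not taken from the paper.

```python
def disparate_impact(predictions, protected):
    """Positive-outcome rate of the protected group (flag 1) divided by
    the rate of the remaining group (flag 0). Hypothetical helper."""
    in_group = [p for p, s in zip(predictions, protected) if s == 1]
    out_group = [p for p, s in zip(predictions, protected) if s == 0]
    if not in_group or not out_group:
        raise ValueError("both groups must contain at least one sample")
    rate_in = sum(in_group) / len(in_group)
    rate_out = sum(out_group) / len(out_group)
    if rate_out == 0:
        raise ValueError("the unprotected group has no positive outcomes")
    return rate_in / rate_out


# Toy binary predictions (1 = favourable outcome); not real Adult data.
preds = [1, 0, 0, 0, 1, 1, 1, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = protected group (assumed encoding)
di = disparate_impact(preds, group)
print(f"disparate impact = {di:.3f}")
```

A value close to 1 suggests similar positive rates across the two groups; values well below 1 indicate that the protected group receives the favourable outcome less often.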
Probabilistic Diagnostic Tests for Degradation Problems in Supervised Learning
Valencia-Zapata, Gustavo A., Ersoy, Okan, Gonzalez-Canas, Carolina, Zentner, Michael G., Klimeck, Gerhard
Several studies point out different causes of performance degradation in supervised machine learning. Problems such as class imbalance, overlapping, small-disjuncts, noisy labels, and sparseness limit accuracy in classification algorithms. Even though a number of approaches either in the form of a methodology or an algorithm try to minimize performance degradation, they have been isolated efforts with limited scope. Most of these approaches focus on remediation of one among many problems, with experimental results coming from few datasets and classification algorithms, insufficient measures of prediction power, and lack of statistical validation for testing the real benefit of the proposed approach. This paper consists of two main parts: In the first part, a novel probabilistic diagnostic model based on identifying signs and symptoms of each problem is presented. Thereby, early and correct diagnosis of these problems is to be achieved in order to select not only the most convenient remediation treatment but also unbiased performance metrics. Secondly, the behavior and performance of several supervised algorithms are studied when training sets have such problems. Therefore, prediction of success for treatments can be estimated across classifiers.
XtracTree for Regulator Validation of Bagging Methods Used in Retail Banking
Charlier, Jeremy, Makarenkov, Vladimir
Bootstrap aggregation, known as bagging, is one of the most popular ensemble methods used in machine learning (ML). An ensemble method is a supervised ML method that combines multiple hypotheses to form a single hypothesis used for prediction. A bagging algorithm combines multiple classifiers modelled on different sub-samples of the same data set to build one large classifier. Large retail banks are nowadays using the power of ML algorithms, including decision trees and random forests, to optimize the retail banking activities. However, AI bank researchers face a strong challenge from their own model validation department as well as from national financial regulators. Each proposed ML model has to be validated and clear rules for every algorithm-based decision have to be established. In this context, we propose XtracTree, an algorithm that is capable of effectively converting an ML bagging classifier, such as a decision tree or a random forest, into simple "if-then" rules satisfying the requirements of model validation. Our algorithm is also capable of highlighting the decision path for each individual sample or a group of samples, addressing any concern from the regulators regarding ML "black-box". We use a public loan data set from Kaggle to illustrate the usefulness of our approach. Our experiments indicate that, using XtracTree, we are able to ensure a better understanding for our model, leading to an easier model validation by national financial regulators and the internal model validation department.
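The general idea of turning a tree into "if-then" rules can be sketched as follows. This is not the XtracTree algorithm itself, whose details the abstract does not give; it is a minimal generic walk over a hand-written toy tree. The nested-dict tree format and the feature names are hypothetical.

```python
def tree_to_rules(node, conditions=()):
    """Return one 'if ... then ...' rule per leaf of a nested-dict tree."""
    if "label" in node:  # leaf: emit the conditions accumulated on the path
        premise = " and ".join(conditions) if conditions else "always"
        return [f"if {premise} then {node['label']}"]
    feat, thr = node["feature"], node["threshold"]
    left = tree_to_rules(node["left"], conditions + (f"{feat} <= {thr}",))
    right = tree_to_rules(node["right"], conditions + (f"{feat} > {thr}",))
    return left + right


# Hypothetical loan-approval tree, written by hand for illustration only.
toy_tree = {
    "feature": "income", "threshold": 30000,
    "left": {"label": "reject"},
    "right": {
        "feature": "debt_ratio", "threshold": 0.4,
        "left": {"label": "approve"},
        "right": {"label": "reject"},
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

For a bagged ensemble, the same walk could be applied to each member tree, yielding one rule set per tree.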
Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods
Kerwin, Kathleen, Bastian, Nathaniel D.
This study uses stacked generalization, which is a two-step process of combining machine learning methods, called meta or super learners, for improving the performance of algorithms in step one (by minimizing the error rate of each individual algorithm to reduce its bias in the learning set) and then in step two inputting the results into the meta learner with its stacked blended output (demonstrating improved performance with the weakest algorithms learning better). The method is essentially an enhanced cross-validation strategy. Although the process uses great computational resources, the resulting performance metrics on resampled fraud data show that increased system cost can be justified. A fundamental key to fraud data is that it is inherently not systematic and, as of yet, the optimal resampling methodology has not been identified. Building a test harness that accounts for all permutations of algorithm sample set pairs demonstrates that the complex, intrinsic data structures are all thoroughly tested. Using a comparative analysis on fraud data that applies stacked generalizations provides useful insight needed to find the optimal mathematical formula to be used for imbalanced fraud data sets.
Unpack Local Model Interpretation for GBDT
Fang, Wenjing, Zhou, Jun, Li, Xiaolong, Zhu, Kenny Q.
Because GBDT inherits good performance from its ensemble nature, much attention has been drawn to the optimization of this model. With its popularization, an increasing need for model interpretation has arisen. Besides the commonly used feature importance as a global interpretation, feature contribution is a local measure that reveals the relationship between a specific instance and the related output. This work focuses on local interpretation and proposes a unified computation mechanism to get the instance-level feature contributions for GBDT in any version. The practicality of this mechanism is validated by the listed experiments as well as by applications in real industry scenarios.
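One widely used way to obtain instance-level feature contributions from trees is to follow the decision path and credit each split feature with the change in node value from parent to child. The sketch below shows that generic path-based scheme. It is not necessarily the mechanism proposed in the paper. The tree format, the node values, and the assumption that leaf values already include any learning-rate scaling are all hypothetical.

```python
def path_contributions(tree, x):
    """Walk one tree for sample x; credit each split feature with the
    change in node value between parent and chosen child."""
    contributions = {}
    node = tree
    while "left" in node:  # internal nodes carry 'left' and 'right'
        feat = node["feature"]
        child = node["left"] if x[feat] <= node["threshold"] else node["right"]
        contributions[feat] = contributions.get(feat, 0.0) + child["value"] - node["value"]
        node = child
    return tree["value"], contributions  # root value acts as the bias term


def ensemble_contributions(trees, x):
    """Sum the bias terms and per-feature contributions over all trees."""
    bias, total = 0.0, {}
    for tree in trees:
        tree_bias, contrib = path_contributions(tree, x)
        bias += tree_bias
        for feat, value in contrib.items():
            total[feat] = total.get(feat, 0.0) + value
    return bias, total


# Two hand-written toy trees; every node stores its expected output.
tree1 = {"value": 0.5, "feature": "age", "threshold": 40,
         "left": {"value": 0.25},
         "right": {"value": 0.75, "feature": "income", "threshold": 50,
                   "left": {"value": 0.5}, "right": {"value": 1.0}}}
tree2 = {"value": 0.0, "feature": "income", "threshold": 30,
         "left": {"value": -0.25}, "right": {"value": 0.25}}

sample = {"age": 45, "income": 60}
bias, contrib = ensemble_contributions([tree1, tree2], sample)
prediction = bias + sum(contrib.values())
print(bias, contrib, prediction)
```

By construction, the bias plus the summed contributions equals the ensemble output for the sample, which makes the decomposition easy to sanity-check.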
Machine Learning in GIS: Understand the Theory and Practice
This course is designed to equip you with the theoretical and practical knowledge of Machine Learning as applied to geospatial analysis, namely Geographic Information Systems (GIS) and Remote Sensing. By the end of the course, you will feel confident in your understanding of Machine Learning applications in GIS technology and of how to use Machine Learning algorithms for various geospatial tasks, such as land use and land cover mapping (classification) and object-based image analysis (segmentation). This course will also prepare you for using GIS with free and open-source software tools. In the course, you will be able to apply Machine Learning algorithms such as Random Forest, Support Vector Machines and Decision Trees (among others) to the classification of satellite imagery. On top of that, you will practice GIS by completing an entire GIS project that explores the power of Machine Learning, cloud computing and Big Data analysis using Google Earth Engine for any geographic area in the world.
Sequential Feature Classification in the Context of Redundancies
Pfannschmidt, Lukas, Hammer, Barbara
The problem of all-relevant feature selection is concerned with finding a relevant feature set with preserved redundancies. There exist several approximations to solve this problem but only one could give a distinction between strong and weak relevance. This approach was limited to the case of linear problems. In this work, we present a new solution for this distinction in the non-linear case through the use of random forest models and statistical methods.
SAS and R Integration for Machine Learning
R first appeared in 1993 and has gained a steady and fiercely loyal fan base. But as data sets become both longer and wider, storage and processing speeds become an issue. Having spent weeks whipping an extremely wide and messy data set into shape using only R, I am so grateful for SAS Viya and not having to go through that again. SAS Viya is a cloud-enabled, in-memory analytics engine which allows for rapid analytics insights. SAS Viya utilizes the SAS Cloud Analytics Services (CAS) to perform various actions and tasks.
Machine Learning Advanced: Decision Trees in Python
Free course (2020): Machine Learning Advanced: Decision Trees in Python. Learn how to use decision trees to solve business problems and build high-accuracy prediction models in Python. About this course: the course is built on three pillars of learning: Know (study), Do (practice), and Review (self-feedback). Know: We have created a set of concise and comprehensive videos to teach you all the Excel-related skills you will need in your professional career. Do: With each lecture, we have provided a practice sheet to complement the learning in the lecture video. These sheets are carefully designed to further clarify the concepts and help you implement them on practical problems faced on the job.
From unbiased MDI Feature Importance to Explainable AI for Trees
We attempt to give a unifying view of the various recent attempts to (i) improve the interpretability of tree-based models and (ii) debias the default variable-importance measure in random forests, Gini importance. In particular, we demonstrate a common thread among the out-of-bag based bias correction methods and their connection to local explanation for trees. In addition, we point out a bias caused by the inclusion of inbag data in the newly developed explainable AI for trees algorithms.
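Gini importance (mean decrease in impurity, MDI) is built from per-split impurity decreases. The sketch below computes the Gini impurity of a label set and the decrease produced by a single split. It is a minimal illustration using hypothetical helper names and toy labels, not the debiased estimator discussed in the abstract above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 1, 1]
perfect_split = impurity_decrease(parent, [0, 0], [1, 1])  # children are pure
useless_split = impurity_decrease(parent, [0, 1], [0, 1])  # children mirror parent
print(perfect_split, useless_split)
```

In a full MDI computation, each decrease would additionally be weighted by the fraction of training samples reaching the node, summed per feature, and averaged over the trees of the forest.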