 Decision Tree Learning


Guide To Ensemble Methods: Bagging vs Boosting

#artificialintelligence

Building a highly accurate prediction model is a difficult task. Noise is the irreducible error, i.e. the part of the target value that the model cannot predict or explain. Because noise cannot be reduced (hence the term irreducible error), we shift our focus to reducing bias and variance. Ensemble learning methods reduce the bias and variance of a model by using multiple models together (hence the term ensemble), rather than a single model, to achieve better predictive performance. There are a number of ensemble methods; in this article I will discuss two widely used ones, Bagging and Boosting. An ensemble may use different learning algorithms, or a single learning algorithm multiple times, for prediction.
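
As a rough illustration of the bagging idea mentioned in this excerpt, the sketch below trains one simple learner many times on bootstrap resamples and combines the results by majority vote. It is a minimal toy under stated assumptions, not the article's own code: the one-dimensional decision stump, the toy data, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are all hypothetical.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a 1-D decision stump: choose the threshold and polarity
    with the fewest training errors on the given (x, label) pairs."""
    best = None
    for threshold, _ in points:
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def bagging_fit(points, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) examples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: label is 1 when x > 5, plus one noisy label at x = 2.
data = [(x, int(x > 5)) for x in range(11)]
data[2] = (2, 1)

models = bagging_fit(data)
print([bagging_predict(models, x) for x in (0, 4, 7, 10)])
```

Each stump sees a slightly different resample, so averaging their votes tends to dampen the influence of the single noisy point. Boosting differs in that models are trained sequentially, with later models concentrating on the examples earlier ones got wrong; that variant is not shown here.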


Human vs. supervised machine learning: Who learns patterns faster?

arXiv.org Artificial Intelligence

The capabilities of supervised machine learning (SML), especially compared to human abilities, are being discussed in scientific research and in the usage of SML. This study provides an answer to how learning performance differs between humans and machines when there is limited training data. We have designed an experiment in which 44 humans and three different machine learning algorithms identify patterns in labeled training data and have to label instances according to the patterns they find. The results show a high dependency between performance and the underlying patterns of the task. Whereas humans perform relatively similarly across all patterns, machines show large performance differences for the various patterns in our experiment. After seeing 20 instances in the experiment, human performance does not improve anymore, which we relate to theories of cognitive overload. Machines learn more slowly but can reach the same level as humans, or may even outperform them, in 2 of the 4 patterns used. However, machines need more instances than humans to achieve the same results. The performance of machines is comparably lower for the other 2 patterns due to the difficulty of combining input features.


Data Preprocessing to Mitigate Bias with Boosted Fair Mollifiers

arXiv.org Machine Learning

In a recent paper, Celis et al. (2020) introduced a new approach to fairness that corrects the data distribution itself. The approach is computationally appealing, but its approximation guarantees with respect to the target distribution can be quite loose as they need to rely on a (typically limited) number of constraints on data-based aggregated statistics, which also results in a fairness guarantee that can be data dependent. Our paper makes use of a mathematical object recently introduced in privacy -- mollifiers of distributions -- and a popular approach to machine learning -- boosting -- to get an approach in the same lineage as Celis et al. but without those impediments, including, in particular, better guarantees in terms of accuracy and finer guarantees in terms of fairness. The approach involves learning the sufficient statistics of an exponential family. When training data is tabular, it is defined by decision trees whose interpretability can provide clues on the source of (un)fairness. Experiments display the quality of the results obtained for simulated and real-world data.


Python Decision Tree Guide: Make a Decision Tree Using Python

#artificialintelligence

Creating a decision tree in Python is a topic that raises a lot of questions for a beginner. What exactly is it, and what do we use it for? Where do we start building one, and what first steps do we take? Why do we use Python? Let's begin at the top. Simply put, a Python decision tree is a machine-learning method that we use for classification.


Decision Trees in Machine Learning (ML) with Python Tutorial

#artificialintelligence

This tutorial's code is available on GitHub, and its full implementation is also available on Google Colab. A decision tree is a vital and popular tool for classification and prediction problems in machine learning, statistics, and data mining [4]. It describes rules that can be interpreted by humans and applied in a knowledge system such as a database. It classifies a case by starting at the tree's root and passing through the tree until it reaches a leaf node. A decision tree uses nodes and leaves to make a decision.
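
The root-to-leaf classification described in this excerpt can be sketched in a few lines. This is a hand-built toy, not the tutorial's implementation: the `Node` and `Leaf` classes, the feature names, the thresholds, and the labels are all hypothetical, and a real tree would be learned from data rather than written by hand.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # the class assigned to any case that reaches this leaf


@dataclass
class Node:
    feature: str                 # which feature this node tests
    threshold: float             # split point for the test
    left: Union["Node", Leaf]    # followed when value <= threshold
    right: Union["Node", Leaf]   # followed when value > threshold


def classify(tree, case):
    """Start at the root and follow one branch per test until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.left if case[node.feature] <= node.threshold else node.right
    return node.label


# Hypothetical hand-written tree with two tests.
toy_tree = Node(
    "humidity", 70.0,
    left=Leaf("play"),
    right=Node("wind_speed", 20.0, left=Leaf("play"), right=Leaf("stay home")),
)

print(classify(toy_tree, {"humidity": 85.0, "wind_speed": 10.0}))  # play
```

Because every prediction is just the sequence of tests along one path, the rule behind it can be read off directly, which is the interpretability the excerpt refers to.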


Conditional canonical correlation estimation based on covariates with random forests

arXiv.org Machine Learning

Investigating the relationships between two sets of variables helps to understand their interactions and can be done with canonical correlation analysis (CCA). However, the correlation between the two sets can sometimes depend on a third set of covariates, often subject-related ones such as age, gender, or other clinical measures. In this case, applying CCA to the whole population is not optimal and methods to estimate conditional CCA, given the covariates, can be useful. We propose a new method called Random Forest with Canonical Correlation Analysis (RFCCA) to estimate the conditional canonical correlations between two sets of variables given subject-related covariates. The individual trees in the forest are built with a splitting rule specifically designed to partition the data to maximize the canonical correlation heterogeneity between child nodes. We also propose a significance test to detect the global effect of the covariates on the relationship between two sets of variables. The performance of the proposed method and the global significance test is evaluated through simulation studies that show it provides accurate canonical correlation estimations and well-controlled Type-1 error. We also show an application of the proposed method with EEG data.


Meta-Learning for Time Series Forecasting Ensemble

arXiv.org Machine Learning

The amount of historical data collected is increasing together with the applicability of business intelligence and the demand for automatic forecasting of time series. While no single time series modeling method is universal to all types of dynamics, forecasting using an ensemble of several methods is often seen as a compromise. Instead of fixing ensemble diversity and size, we propose to adaptively predict these aspects using meta-learning. Meta-learning here considers two separate random forest regression models, built on 390 time series features, to rank 22 univariate forecasting methods and to recommend ensemble size. The forecasting ensemble is consequently formed from the methods ranked as the best, and forecasts are pooled using either a simple or a weighted average (with weight corresponding to reciprocal rank). The proposed approach was tested on 12561 micro-economic time series (expanded to 38633 for various forecasting horizons) of the M4 competition, where meta-learning outperformed the Theta and Comb benchmarks by relative forecasting errors for all data types and horizons. The best overall results were achieved by weighted pooling, with a symmetric mean absolute percentage error of 9.21% versus 11.05% obtained using the Theta method.
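
The pooling step in this abstract (simple average versus weights equal to reciprocal rank) can be illustrated as follows. This is a sketch under assumptions, not the authors' code: the function names and the example numbers are hypothetical, the weights are normalized to sum to one, and the sMAPE formula used is the common M4-style definition, which the abstract itself does not spell out.

```python
def pool_forecasts(forecasts_by_rank, weighted=True):
    """Pool forecasts from several methods into one ensemble forecast.

    forecasts_by_rank: list of forecasts ordered best-ranked first,
    each a list of values over the forecasting horizon.
    weighted=True uses weights 1/rank; weighted=False is a simple average.
    """
    n = len(forecasts_by_rank)
    if weighted:
        weights = [1.0 / rank for rank in range(1, n + 1)]
    else:
        weights = [1.0] * n
    total = sum(weights)  # normalize so the weights sum to one
    horizon = len(forecasts_by_rank[0])
    return [
        sum(w * f[h] for w, f in zip(weights, forecasts_by_rank)) / total
        for h in range(horizon)
    ]


def smape(actual, forecast):
    """Symmetric MAPE in percent (assumed M4-style definition).
    Undefined when an actual value and its forecast are both zero."""
    terms = [2.0 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical two-step forecasts from the methods ranked 1st and 2nd.
ranked = [[10.0, 12.0], [13.0, 15.0]]
print(pool_forecasts(ranked, weighted=True))
print(pool_forecasts(ranked, weighted=False))
print(smape([100.0], [110.0]))
```

With reciprocal-rank weights, the top-ranked method counts twice as much as the second and three times as much as the third, so the pooled forecast leans toward the methods the meta-learner trusts most.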


From Decision Trees and Random Forests to Gradient Boosting

#artificialintelligence

Suppose we wish to perform supervised learning on a classification problem to determine whether an incoming email is spam or not. The spam dataset consists of 4601 emails, each labelled as real, i.e. not spam (0), or spam (1). The data also contains a large number of predictors (57), each of which is either a character count or the frequency of occurrence of a certain word or symbol. In this short article, we will briefly cover the main concepts in tree-based classification and compare and contrast the most popular methods. This dataset and several worked examples are covered in detail in The Elements of Statistical Learning, 2nd edition.


Interpretability, Explainability, and Machine Learning

#artificialintelligence

Susan will present "Understanding and Addressing Bias in Analytics" at CONVERGE, December 1-2. This article was originally published on KDnuggets. I use one of those credit monitoring services that regularly emails me about my credit score: "Congratulations, your score has gone up!" "Uh oh, your score has gone down!" I shrug and delete the emails. Credit scores are just one example of the many automated decisions made about us as individuals on the basis of complex models.


How to Future-Proof Your Data Science Project - KDnuggets

#artificialintelligence

Nontechnical stakeholders struggle to define business requirements. Cross-functional teams face an uphill battle to set up robust pipelines for replicable data delivery. Machine learning models can take on a life of their own. If you've been ignoring these critical elements in the past, you may find your deployment rate skyrockets. Your data products may depend on correctly deploying the tips from this article.