 Decision Tree Learning


Giuliano Liguori on Twitter

#artificialintelligence

“🔝 #MachineLearning Prediction Algorithms {#infographic} by @DatumGuy: Random regression, Logistic regression, Decision Tree, Random forest, Gradient Boosting. @antgrasso @Ronald_vanLoon @KirkDBorne @SpirosMargaris @mvollmer1 @machinelearnflx @AISOMA_AG @andy_fitze @SwissCognitive”


A Comprehensive Guide to Ensemble Learning - What Exactly Do You Need to Know - neptune.ai

#artificialintelligence

Ensemble learning techniques often yield better performance on machine learning problems than a single model, and they can be used for regression as well as classification. Ensemble learning combines several machine learning models on one problem: the final prediction is obtained by combining the results of several base models. Averaging, voting, and stacking are some of the ways the results are combined. In this article, we will explore how ensemble learning can be used to build better-performing machine learning models.
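As a minimal sketch of two of the combination rules named above (averaging for regression, majority voting for classification), assuming each base model has already produced one prediction per sample. The function names and the sample outputs are hypothetical and do not come from the article.

```python
from collections import Counter
from statistics import mean


def average_predictions(predictions_per_model):
    """Average numeric predictions across models, one value per sample."""
    return [mean(sample_preds) for sample_preds in zip(*predictions_per_model)]


def majority_vote(predictions_per_model):
    """Return the most common label across models for each sample."""
    return [
        Counter(sample_preds).most_common(1)[0][0]
        for sample_preds in zip(*predictions_per_model)
    ]


# Hypothetical outputs of three base models on the same three samples.
regression_outputs = [[2.0, 3.0, 5.0], [4.0, 3.0, 6.0], [3.0, 3.0, 7.0]]
class_outputs = [["cat", "dog", "dog"], ["cat", "cat", "dog"], ["dog", "cat", "dog"]]

averaged = average_predictions(regression_outputs)
voted = majority_vote(class_outputs)
```

Stacking is not shown here; it would train an additional model on the base models' predictions instead of applying a fixed rule.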


Random Forest

#artificialintelligence

Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems. It builds decision trees on different samples of the data and takes their majority vote for classification and their average for regression. One of the most important features of the random forest algorithm is that it can handle data sets containing continuous variables, as in regression, and categorical variables, as in classification. It tends to give better results on classification problems. Let's dive into a real-life analogy to understand this concept further.
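The "trees on different samples, then a majority vote" idea can be sketched in plain Python. This is an illustration under simplifying assumptions, not the article's implementation: each tree is a one-split stump on a single numeric feature, the samples are bootstrap draws (with replacement), and the random feature selection that full random forests also use is omitted. The names `fit_stump`, `fit_forest`, and `predict` and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split tree on (x, label) rows by minimizing training errors."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def fit_forest(rows, n_trees, seed=0):
    """Train each stump on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # draw with replacement
        forest.append(fit_stump(sample))
    return forest


def predict(forest, x):
    """Classify x by majority vote over all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data: small x values are "low", large ones are "high".
data = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
        (7.0, "high"), (8.0, "high"), (9.0, "high")]
forest = fit_forest(data, n_trees=25)
```

For regression, the same structure would apply with numeric leaf values and `predict` returning the mean of the trees' outputs instead of the most common label.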


Chefboost -- an alternative Python library for tree-based models

#artificialintelligence

I randomly encountered chefboost in my Twitter feed and, given that I had never heard of it before, I decided to have a quick look and test it out. In this article, I will briefly present the library, mention the key differences from the go-to library, scikit-learn, and show a quick example of chefboost in practice. I think the best description is provided in the library's GitHub repo: "chefboost is a lightweight decision tree framework for Python with categorical feature support". Following the last point, chefboost provides three algorithms for classification trees (ID3, C4.5, and CART) and one algorithm for regression trees. To be honest, I was not entirely sure which one is currently implemented in scikit-learn, so I checked the documentation (which also provides a nice and concise summary of the algorithms).
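The three classification algorithms differ mainly in how they score a candidate split: ID3 is usually described in terms of information gain, C4.5 in terms of gain ratio, and CART in terms of Gini impurity. The sketch below computes these three scores for one split. It is a standalone illustration with hypothetical function names and toy labels; it is not chefboost's or scikit-learn's code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """ID3-style score: parent entropy minus weighted child entropy."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """C4.5-style score: information gain normalized by split information."""
    total = len(parent)
    split_info = -sum(
        len(ch) / total * log2(len(ch) / total) for ch in children if ch
    )
    return information_gain(parent, children) / split_info if split_info else 0.0


def gini_decrease(parent, children):
    """CART-style score: parent Gini impurity minus weighted child impurity."""
    total = len(parent)
    weighted = sum(len(ch) / total * gini(ch) for ch in children)
    return gini(parent) - weighted


# Toy split that separates the two classes perfectly.
parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 4, ["no"] * 4]

ig = information_gain(parent, children)
gr = gain_ratio(parent, children)
gd = gini_decrease(parent, children)
```

A tree builder would evaluate one of these scores for every candidate split at a node and keep the split with the highest value.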


Importance measures derived from random forests: characterisation and extension

arXiv.org Machine Learning

Nowadays, new technologies, and especially artificial intelligence, are increasingly established in our society. Big data analysis and machine learning, two sub-fields of artificial intelligence, are at the core of many recent breakthroughs in many application fields (e.g., medicine, communication, finance, ...), including some that are strongly related to our day-to-day life (e.g., social networks, computers, smartphones, ...). In machine learning, significant improvements are usually achieved at the price of increasing computational complexity and thanks to bigger datasets. Currently, cutting-edge models built by the most advanced machine learning algorithms have typically become very efficient and profitable but also extremely complex. Their complexity is such that these models are commonly seen as black boxes providing a prediction or a decision that cannot be interpreted or justified. Nevertheless, whether these models are used autonomously or as a simple decision-making support tool, they are already being used in machine learning applications where health and human life are at stake. Therefore, it appears to be an obvious necessity not to blindly believe everything coming out of those models without a detailed understanding of their predictions or decisions. Accordingly, this thesis aims at improving the interpretability of models built by a specific family of machine learning algorithms, the so-called tree-based methods. Several mechanisms have been proposed to interpret these models, and throughout this thesis we aim to improve their understanding, study their properties, and define their limitations.


Random Forest Algorithm in Python from Scratch

#artificialintelligence

The intuition behind the random forest algorithm can be split into two big parts: the random part and the forest part. Let us start with the latter. A forest in real life is made up of a bunch of trees. A random forest classifier is made up of a bunch of decision tree classifiers (abbreviated DT here and throughout the text). The exact number of DTs that make up the whole forest is defined with the n_estimators variable mentioned earlier.


RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests

arXiv.org Machine Learning

Like many predictive models, random forests provide a point prediction for a new observation. Besides the point prediction, it is important to quantify the uncertainty in the prediction. Prediction intervals provide information about the reliability of the point predictions. We have developed a comprehensive R package, RFpredInterval, that integrates 16 methods to build prediction intervals with random forests and boosted forests. The methods implemented in the package are a new method to build prediction intervals with boosted forests (PIBF) and 15 different variants to produce prediction intervals with random forests proposed by Roy and Larocque (2020). We perform an extensive simulation study and apply real data analyses to compare the performance of the proposed method to ten existing methods to build prediction intervals with random forests. The results show that the proposed method is very competitive and, globally, it outperforms the competing methods.


Automated Machine Learning Techniques for Data Streams

arXiv.org Artificial Intelligence

Automated machine learning techniques have recently benefited from tremendous research progress. These developments and the continuously growing demand for machine learning experts led to the development of numerous AutoML tools. However, these tools assume that the entire training dataset is available upfront and that the underlying distribution does not change over time. These assumptions do not hold in a data stream mining setting where an unbounded stream of data cannot be stored and is likely to manifest concept drift. Industry applications of machine learning on streaming data are becoming more popular due to the increasing adoption of real-time streaming patterns in IoT, microservices architectures, web analytics, and other fields. The research summarized in this paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time. For comparative purposes, batch, batch incremental and instance incremental estimators are applied and compared. Moreover, a meta-learning technique for online algorithm selection based on meta-feature extraction is proposed and compared, while model replacement and continual AutoML techniques are discussed. The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.


Tree-Values: selective inference for regression trees

arXiv.org Machine Learning

We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.


Analysis of the Evolution of Parametric Drivers of High-End Sea-Level Hazards

arXiv.org Artificial Intelligence

Climate models are critical tools for developing strategies to manage the risks posed by sea-level rise to coastal communities. While these models are necessary for understanding climate risks, there is a level of uncertainty inherent in each parameter in the models. This model parametric uncertainty leads to uncertainty in future climate risks. Consequently, there is a need to understand how those parameter uncertainties impact our assessment of future climate risks and the efficacy of strategies to manage them. Here, we use random forests to examine the parametric drivers of future climate risk and how the relative importances of those drivers change over time. We find that the equilibrium climate sensitivity and a factor that scales the effect of aerosols on radiative forcing are consistently the most important climate model parametric uncertainties throughout the 2020 to 2150 interval for both low and high radiative forcing scenarios. The near-term hazards of high-end sea-level rise are driven primarily by thermal expansion, while the longer-term hazards are associated with mass loss from the Antarctic and Greenland ice sheets. Our results highlight the practical importance of considering time-evolving parametric uncertainties when developing strategies to manage future climate risks.