Decision Tree Learning
Classification and Regression Analysis with Decision Trees
A decision tree is a supervised machine learning model used to predict a target by learning decision rules from features. As the name suggests, we can think of this model as breaking down our data by asking a series of questions and making a decision at each one. Let's consider the following example in which we use a decision tree to decide upon an activity on a particular day. Based on the features in our training set, the decision tree model learns a series of questions to infer the class labels of the samples. As we can see, decision trees are attractive models if we care about interpretability. Although the preceding figure illustrates the concept of a decision tree based on categorical targets (classification), the same concept applies if our targets are real numbers (regression).
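To make the "series of questions" idea concrete, here is a minimal sketch in plain Python of how such questions could be learned. It assumes categorical features, equality questions, and Gini impurity as the split criterion. The toy data set (features `work` and `outlook`, and the activity labels) is made up for illustration and is not taken from the original example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_question(rows, labels):
    """Return the (feature, value) question with the lowest weighted impurity,
    or None if no question improves on the unsplit node."""
    best, best_score = None, gini(labels)
    for feature in rows[0]:
        for value in {r[feature] for r in rows}:
            yes = [y for r, y in zip(rows, labels) if r[feature] == value]
            no = [y for r, y in zip(rows, labels) if r[feature] != value]
            if not yes or not no:
                continue  # the question does not separate anything
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score:
                best, best_score = (feature, value), score
    return best

def build_tree(rows, labels):
    """Recursively split until a node is pure or no question helps."""
    question = best_question(rows, labels)
    if question is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    feature, value = question
    yes_idx = [i for i, r in enumerate(rows) if r[feature] == value]
    no_idx = [i for i, r in enumerate(rows) if r[feature] != value]
    return {
        "question": question,
        "yes": build_tree([rows[i] for i in yes_idx], [labels[i] for i in yes_idx]),
        "no": build_tree([rows[i] for i in no_idx], [labels[i] for i in no_idx]),
    }

def predict(tree, row):
    """Follow the learned questions from the root down to a leaf label."""
    while isinstance(tree, dict):
        feature, value = tree["question"]
        tree = tree["yes"] if row[feature] == value else tree["no"]
    return tree

# Hypothetical toy data: which activity to choose on a given day.
rows = [
    {"work": "yes", "outlook": "sunny"},
    {"work": "yes", "outlook": "rainy"},
    {"work": "no", "outlook": "sunny"},
    {"work": "no", "outlook": "rainy"},
]
labels = ["stay in", "stay in", "go to beach", "go to movies"]

tree = build_tree(rows, labels)
print(predict(tree, {"work": "no", "outlook": "sunny"}))  # go to beach
```

The learned tree first asks about `work`, because that question reduces impurity the most, and only then about `outlook`. For regression, the same structure would apply with a variance-based criterion and mean-valued leaves.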
Asymptotic Distributions and Rates of Convergence for Random Forests and other Resampled Ensemble Learners
Peng, Wei, Coleman, Tim, Mentch, Lucas
Random forests remain among the most popular off-the-shelf supervised learning algorithms. Despite their well-documented empirical success, however, until recently, few theoretical results were available to describe their performance and behavior. In this work we push beyond recent work on consistency and asymptotic normality by establishing rates of convergence for random forests and other supervised learning ensembles. We develop the notion of generalized U-statistics and show that within this framework, random forest predictions remain asymptotically normal for larger subsample sizes than previously established. We also provide Berry-Esseen bounds in order to quantify the rate at which this convergence occurs, making explicit the roles of the subsample size and the number of trees in determining the distribution of random forest predictions.
HDI-Forest: Highest Density Interval Regression Forest
Zhu, Lin, Lu, Jiaxin, Chen, Yihong
By seeking the narrowest prediction intervals (PIs) that satisfy the specified coverage probability requirements, the recently proposed quality-based PI learning principle can extract high-quality PIs that better summarize the predictive certainty in regression tasks, and has been widely applied to solve many practical problems. Currently, the state-of-the-art quality-based PI estimation methods are based on deep neural networks or linear models. In this paper, we propose Highest Density Interval Regression Forest (HDI-Forest), a novel quality-based PI estimation method that is instead based on Random Forest. HDI-Forest does not require additional model training, and directly reuses the trees learned in a standard Random Forest model. By utilizing the special properties of Random Forest, HDI-Forest can efficiently and more directly optimize the PI quality metrics. Extensive experiments on benchmark datasets show that HDI-Forest significantly outperforms previous approaches, reducing the average PI width by over 30% while achieving the same or better coverage probability.
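The quality criterion described above, the narrowest interval that still meets a coverage requirement, can be illustrated on a plain list of numbers. The sketch below is not the HDI-Forest algorithm: how the method derives and weights candidate values from the trees is not described in the abstract and is not reproduced here. The function name `narrowest_interval` and the sample values are hypothetical.

```python
import math

def narrowest_interval(values, coverage):
    """Shortest interval [lo, hi] containing at least `coverage` of the values."""
    if not values or not 0 < coverage <= 1:
        raise ValueError("need at least one value and 0 < coverage <= 1")
    xs = sorted(values)
    k = math.ceil(coverage * len(xs))  # number of points the interval must hold
    # Slide a window of k consecutive sorted points and keep the narrowest one.
    start = min(range(len(xs) - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Hypothetical target values associated with one query point.
sample = [1.0, 4.0, 4.5, 5.0, 5.2, 5.5, 6.0, 12.0]
print(narrowest_interval(sample, 0.75))  # (4.0, 6.0)
```

On this sample the interval (4.0, 6.0) covers six of the eight values with width 2.0, whereas any window that includes the outliers 1.0 or 12.0 is wider. This is why a narrowest-interval criterion can give tighter intervals than equal-tailed quantiles on skewed data.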
Federated Forest
Liu, Yang, Liu, Yingting, Liu, Zhijie, Zhang, Junbo, Meng, Chuishi, Zheng, Yu
Most real-world data are scattered across different companies or government organizations, and cannot be easily integrated under data privacy and related regulations such as the European Union's General Data Protection Regulation (GDPR) and China's Cyber Security Law. This data-island situation and the need for data privacy and security are two major challenges for applications of artificial intelligence. In this paper, we tackle these challenges and propose a privacy-preserving machine learning model, called Federated Forest, which is a lossless learning model of the traditional random forest method, i.e., achieving the same level of accuracy as the non-privacy-preserving approach. Based on it, we developed a secure cross-regional machine learning system that allows a model to be jointly trained over different regions' clients with the same user samples but different attribute sets, processing the data stored in each of them without exchanging their raw data. We also propose a novel prediction algorithm that can largely reduce the communication overhead. Experiments on both real-world and UCI data sets demonstrate that Federated Forest is as accurate as the non-federated version. The efficiency and robustness of our proposed system have been verified. Overall, our model is practical, scalable and extensible for real-life tasks.
sabiha90/Random-Forest-Explainability-Pipeline
This toolkit executes the RFEX 2.0 "pipeline", i.e., a set of steps that produce the RFEX 2.0 summary, namely information to enhance the explainability of a Random Forest classifier. It comes with a synthetically generated test database that helps demonstrate how RFEX 2.0 works. With this toolkit, users can also use their own data to generate an RFEX 2.0 summary. Background on the RFEX 2.0 method, as well as a description of and access to the synthetic test database, can be found in TR 18.01 at cs.sfsu.edu. Users are strongly advised to read that report before using this toolkit.
Enterprise AI: Diving into Machine Learning
Data in the real world, of course, isn't as simple as it is in the previous example. There are always complexities and nuances to data. To stick with our housing market example, the value of houses might also be influenced by dwelling type, lot size, recent upgrades, proximity to a neighborhood park and intangible variables like curbside appeal. And, in the real world, houses wouldn't all be in the same neighborhood, so your machine learning model must also consider the ZIP code for the property. To consider this wider range of variables, we need to dig deeper into the data scientist's toolbox and pull out some more sophisticated machine learning methods, including random forests and gradient boosting.
Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees
Devlin, Summer, Singh, Chandan, Murdoch, W. James, Yu, Bin
Tree ensembles, such as random forests and AdaBoost, are ubiquitous machine learning models known for achieving strong predictive performance across a wide variety of domains. However, this strong performance comes at the cost of interpretability (i.e. users are unable to understand the relationships a trained random forest has learned and why it is making its predictions). In particular, it is challenging to understand how the contribution of a particular feature, or group of features, varies as their value changes. To address this, we introduce Disentangled Attribution Curves (DAC), a method to provide interpretations of tree ensemble methods in the form of (multivariate) feature importance curves. For a given variable, or group of variables, DAC plots the importance of a variable(s) as their value changes. We validate DAC on real data by showing that the curves can be used to increase the accuracy of logistic regression while maintaining interpretability, by including DAC as an additional feature. In simulation studies, DAC is shown to out-perform competing methods in the recovery of conditional expectations. Finally, through a case-study on the bike-sharing dataset, we demonstrate the use of DAC to uncover novel insights into a dataset.
Gradient tree boosting with random output projections for multi-label classification and multi-output regression
Joly, Arnaud, Wehenkel, Louis, Geurts, Pierre
Multi-output supervised learning aims to model input-output relationships from observations of input-output pairs whenever the output space is a vector of random variables. Multi-output classification and regression tasks have numerous applications in domains ranging from biology to multimedia, and recent applications in this area correspond to very high dimensional output spaces (Agrawal et al, 2013; Dekel and Shamir, 2010). Classification and regression trees (Breiman et al, 1984) are popular supervised learning methods that provide state-of-the-art performance when exploited in the context of ensemble methods, namely Random forests (Breiman, 2001; Geurts et al, 2006) and Boosting (Freund and Schapire, 1997; Friedman, 2001). Classification and regression trees can obviously be exploited to handle multi-output problems. The most straightforward way to address multi-output tasks is to apply standard single output methods separately and independently on each output. Although simple, this method, called binary relevance (Tsoumakas et al, 2009) in multi-label classification or single target (Spyromitros-Xioufis et al, 2012) in multi-output regression, is often suboptimal as it does not exploit potential correlations that might exist between the outputs. Tree ensemble methods have however been explicitly extended by several authors to the joint prediction of multiple outputs (e.g., Segal, 1992; Blockeel et al, 2000). These extensions build a single tree to predict all outputs at once. They adapt the score measure used to assess splits during the tree growth to take into account all outputs and label each tree leaf with a vector of values, one for each output.
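The last two sentences can be sketched in a few lines of Python. The example assumes one particular multi-output score, the reduction in per-output variance summed over all outputs, which is a common choice but is not stated in the text above. The function names and the toy data are hypothetical.

```python
def mean_vector(targets):
    """Column-wise mean of a list of equal-length output vectors (leaf label)."""
    n = len(targets)
    return [sum(col) / n for col in zip(*targets)]

def total_variance(targets):
    """Sum over outputs of the per-output variance (0.0 for an empty node)."""
    if not targets:
        return 0.0
    means = mean_vector(targets)
    n = len(targets)
    return sum(
        sum((row[j] - means[j]) ** 2 for row in targets) / n
        for j in range(len(means))
    )

def split_score(xs, targets, threshold):
    """Variance reduction, summed over all outputs, of the split x <= threshold."""
    left = [t for x, t in zip(xs, targets) if x <= threshold]
    right = [t for x, t in zip(xs, targets) if x > threshold]
    n = len(targets)
    weighted = (len(left) * total_variance(left)
                + len(right) * total_variance(right)) / n
    return total_variance(targets) - weighted

# Toy data: one input feature, two outputs per sample.
xs = [1.0, 2.0, 3.0, 4.0]
targets = [[0.0, 10.0], [0.0, 12.0], [1.0, 20.0], [1.0, 22.0]]

# Candidate thresholds exclude the largest x, which would leave the right side empty.
best = max(xs[:-1], key=lambda t: split_score(xs, targets, t))
print(best)                       # 2.0
print(mean_vector(targets[:2]))   # [0.0, 11.0], vector label of the left leaf
```

A single split at 2.0 separates both outputs at once, and each resulting leaf is labeled with a vector holding one mean per output. Training one tree per output would instead search for that split twice and could not share it.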
Using EEG Features and Machine Learning to Predict Gifted Children
Ghali, Ramla (Université de Montréal) | Tato, Ange (Université de Montréal) | Nkambou, Roger (Université de Montréal)
Gifted students have higher capabilities of understanding and learning. They are characterized by a high level of attention and a high performance in the classroom. Gifted children are defined in this paper as children who have a performance higher than the group average (59.64%). In order to distinguish gifted students from normal students, we conducted an experiment in which 17 pupils voluntarily participated. We collected different types of data (gender, age, performance, initial average in math and EEG mental states) in a web platform to learn mathematics called NetMath. Participants were invited to respond to top-level exercises on the four basic operations in decimals. We trained different machine learning algorithms to predict gifted students. Our first results show that the decision tree could predict gifted students with an accuracy of 76.88%. Using J48 trees, we also noticed that two relevant features could determine gifted children: the relaxation extracted from the EEG headset and the characteristic of strong student. A strong student is defined as a student who obtained a mean higher than the group’s mean in the first step evaluation in class.