Decision Tree Learning
Nonparametric Variable Screening with Optimal Decision Stumps
Klusowski, Jason M., Tian, Peter M.
Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening input variables in a predictive model. One of the most commonly used in practice is the Mean Decrease in Impurity (MDI), which calculates an importance score for a variable by summing the weighted impurity reductions over all non-terminal nodes split with that variable. Despite the widespread use of tree-based variable importance measures such as MDI, their theoretical properties have been challenging to pin down and therefore remain largely unexplored. To address this gap between theory and practice, we derive rigorous finite sample performance guarantees for variable ranking and selection in nonparametric models with MDI for a single-level CART decision tree (decision stump). We find that the marginal signal strength of each variable and ambient dimensionality can be considerably weaker and higher, respectively, than for state-of-the-art nonparametric variable selection methods. Furthermore, unlike previous marginal screening methods that attempt to directly estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, thereby eliminating the need for tuning the number of basis terms. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection.
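As a rough illustration of the screening idea described above, the sketch below scores each variable by the largest impurity (variance) decrease that a single split on it can achieve, then keeps the top-ranked variables. The function names, the synthetic data, and the fixed `top_k` cutoff are assumptions made for illustration and are not taken from the paper.

```python
import random

def stump_impurity_decrease(x, y):
    """Largest decrease in mean squared error from one split on x (a decision stump)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    total = sum(ys)
    best = 0.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += ys[k - 1]
        if xs[k - 1] == xs[k]:
            continue  # cannot place a threshold between equal values
        right_sum = total - left_sum
        # Parent SSE minus children SSE, divided by n (weighted impurity decrease).
        gain = (left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n) / n
        best = max(best, gain)
    return best

def screen_variables(X, y, top_k):
    """Rank columns of X by their stump score and return the top_k column indices."""
    p = len(X[0])
    scores = [stump_impurity_decrease([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores

# Synthetic data (assumed): only columns 0 and 3 influence the response.
rng = random.Random(0)
n, p = 400, 20
X = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] + (1.0 if row[3] > 0 else -1.0) + rng.gauss(0, 0.5) for row in X]

selected, scores = screen_variables(X, y, top_k=2)
print(sorted(selected))
```

Each variable is scored independently, so no basis size needs tuning; choosing how many variables to keep is a separate decision that this sketch fixes by hand.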
Oblique Predictive Clustering Trees
Stepišnik, Tomaž, Kocev, Dragi
Predictive clustering trees (PCTs) are a well-established generalization of standard decision trees, which can be used to solve a variety of predictive modeling tasks, including structured output prediction. Combining them into ensembles yields state-of-the-art performance. Furthermore, the ensembles of PCTs can be interpreted by calculating feature importance scores from the learned models. However, their learning time scales poorly with the dimensionality of the output space. This is often problematic, especially in (hierarchical) multi-label classification, where the output can consist of hundreds of potential labels. Also, PCT learning cannot exploit data sparsity to improve computational efficiency, even though sparsity is common in both input spaces (molecular fingerprints, bag-of-words representations) and output spaces (in multi-label classification, examples are often labeled with only a fraction of possible labels). In this paper, we propose oblique predictive clustering trees, capable of addressing these limitations. We design and implement two methods for learning oblique splits that contain linear combinations of features in the tests, hence a split corresponds to an arbitrary hyperplane in the input space. The methods are efficient for high-dimensional data and capable of exploiting sparse data. We experimentally evaluate the proposed methods on 60 benchmark datasets for 6 predictive modeling tasks. The results of the experiments show that oblique predictive clustering trees achieve performance on par with state-of-the-art methods and are orders of magnitude faster than standard PCTs. We also show that meaningful feature importance scores can be extracted from the models learned with the proposed methods.
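To make the notion of an oblique split concrete, the following minimal sketch evaluates a hyperplane test on a sparse example. The dictionary-based sparse layout, the function name, and the hand-picked weights are assumptions for illustration; the paper's methods learn the weights, which this sketch does not attempt.

```python
def oblique_split(x_sparse, weights, threshold):
    """Return True if the example is routed to the left child.

    x_sparse: {feature_index: value}, storing only non-zero features (assumed layout).
    weights:  {feature_index: weight}, the normal vector of the splitting hyperplane.
    """
    # Only the non-zero features of the example contribute to the dot product.
    score = sum(weights.get(j, 0.0) * v for j, v in x_sparse.items())
    return score <= threshold

# An axis-parallel test is the special case with a single non-zero weight.
axis_parallel = {2: 1.0}
# An oblique test combines several features linearly.
oblique = {0: 0.5, 2: -1.0, 7: 2.0}

example = {2: 3.0, 7: 1.0}  # all other features are zero

goes_left_axis = oblique_split(example, axis_parallel, threshold=2.5)
goes_left_oblique = oblique_split(example, oblique, threshold=0.0)
print(goes_left_axis, goes_left_oblique)
```

The cost of one test grows with the number of non-zero features in the example, not with the full input dimensionality, which is one way sparsity can be exploited.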
Interpretability, Explainability, and Machine Learning – What Data Scientists Need to Know - KDnuggets
I use one of those credit monitoring services that regularly emails me about my credit score: "Congratulations, your score has gone up!" "Uh oh, your score has gone down!" I shrug and delete the emails. Credit scores are just one example of the many automated decisions made about us as individuals on the basis of complex models. I don't know exactly what causes those little changes in my score. Some machine learning models are "black boxes," a term often used to describe models whose inner workings -- the ways different variables ended up related to one another by an algorithm -- may be impossible for even their designers to completely interpret and explain.
AI Clarified: Is AI More Biased Than Humans or Less?
Exploring bias in AI systems, and what we can do to prevent it. For business and non-profit leaders trying to understand AI, it can be surprisingly difficult to find good information in the sweet spot between high-level overview and technical jargon. The AI Clarified series attempts to fill this void and answer some of the most commonly asked AI questions with practical, easy-to-follow explanations. Question: Is AI more biased than humans, or less? I've heard both and am not sure which side to believe. Indeed, it's hard to know what to believe about bias in Artificial Intelligence (AI) systems when just reading articles online -- there is plenty of support in both directions. With the growth of AI and the widespread adoption of AI models, there is a lot of noise on both sides, especially for high-stakes use cases like those affecting humans. Let's take hiring as an example.
Ensemble Methods for Survival Data with Time-Varying Covariates
Yao, Weichi, Frydman, Halina, Larocque, Denis, Simonoff, Jeffrey S.
Survival data with time-varying covariates are common in practice. However, the traditional survival forests (conditional inference forest, relative risk forest and random survival forest) have accommodated only time-invariant covariates. Similarly, the recently proposed transformation forest, which incorporates split statistics suitable for non-proportional hazard settings, has employed only time-invariant covariates. We generalize the conditional inference and relative risk forests to allow time-varying covariates. We compare their performance with that of the Cox model and transformation forest, adapted to accommodate time-varying covariates, through a comprehensive simulation study in which the Kaplan-Meier estimate serves as a benchmark. In general, the performance of the two proposed forests substantially improves over the Kaplan-Meier estimate when the estimation conditions become more favorable. Taking into account all other factors, under the PH setting, the best method is always one of the two proposed forests, while under the non-PH setting, it is the adapted transformation forest. K-fold cross-validation can be an effective tool to choose between the methods in practice. Finally, the performance of the proposed forest methods for time-invariant covariate data is broadly similar to that found for time-varying covariate data. We also propose a general framework for estimation of a survival function in the presence of time-varying covariates, which can be applied to any method that uses the counting process (pseudo-subject) approach to handling time-varying covariates. This novel estimate of a single survival function takes multiple survival estimation outputs corresponding to each pseudo-subject, and combines them in a theoretically justified way to form a proper monotone-decreasing survival function estimate.
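The counting process (pseudo-subject) representation mentioned above can be sketched as follows: one subject whose covariate changes over time is split into several rows, each covering an interval on which the covariate is constant. The function name, the row layout, and the example values are assumptions for illustration; how the per-row survival estimates are later combined is the paper's contribution and is not reproduced here.

```python
def to_pseudo_subjects(subject_id, change_times, values, end_time, event):
    """Split one subject into (start, stop] rows with a constant covariate value.

    change_times: increasing times at which the covariate takes a new value,
                  starting at 0 (assumed convention).
    values:       values[k] holds from change_times[k] until the next change.
    end_time:     time of the event or of censoring.
    event:        True if the subject had the event at end_time, False if censored.
    """
    rows = []
    for k, start in enumerate(change_times):
        if start >= end_time:
            break  # changes recorded at or after the end of follow-up are ignored
        next_change = change_times[k + 1] if k + 1 < len(change_times) else end_time
        stop = min(next_change, end_time)
        rows.append({
            "id": subject_id,
            "start": start,
            "stop": stop,
            # Only the final interval can carry the event indicator.
            "event": int(bool(event) and stop == end_time),
            "x": values[k],
        })
    return rows

rows = to_pseudo_subjects("s1", change_times=[0, 3, 7], values=[1.0, 2.5, 0.4],
                          end_time=9, event=True)
for row in rows:
    print(row)
```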
How to Develop a Random Subspace Ensemble With Python
Random Subspace Ensemble is a machine learning algorithm that combines the predictions from multiple decision trees trained on different subsets of columns in the training dataset. Randomly varying the columns used to train each contributing member of the ensemble has the effect of introducing diversity into the ensemble and, in turn, can lift performance over using a single decision tree. It is related to other ensembles of decision trees such as bootstrap aggregation (bagging) that creates trees using different samples of rows from the training dataset, and random forest that combines ideas from bagging and the random subspace ensemble. Although decision trees are often used, the general random subspace method can be used with any machine learning model whose performance varies meaningfully with the choice of input features. In this tutorial, you will discover how to develop random subspace ensembles for classification and regression.
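A minimal, dependency-free sketch of the idea follows: every ensemble member sees all rows but only a random subset of columns, and predictions are combined by majority vote. To keep it short, the base learner is a single-split decision stump instead of a full decision tree; that simplification, the function names, and the synthetic data are assumptions for illustration and not the tutorial's code.

```python
import random
from collections import Counter

def fit_stump(X, y, columns):
    """Best single-threshold rule restricted to the given columns."""
    best = None  # (errors, column, threshold, left_label, right_label)
    for j in columns:
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(l != left_label for l in left) + sum(r != right_label for r in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    return best[1:]

def fit_random_subspace(X, y, n_members, n_columns, seed=0):
    """Train each member on all rows but a random subset of columns."""
    rng = random.Random(seed)
    p = len(X[0])
    return [fit_stump(X, y, rng.sample(range(p), n_columns)) for _ in range(n_members)]

def predict(ensemble, row):
    """Majority vote over the members' predictions."""
    votes = [left if row[j] <= t else right for j, t, left, right in ensemble]
    return Counter(votes).most_common(1)[0][0]

def make_data(n, rng, p=6):
    # Synthetic data (assumed): the label depends on the first three columns only.
    X = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n)]
    y = [1 if row[0] + row[1] + row[2] > 0 else 0 for row in X]
    return X, y

rng = random.Random(1)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(1000, rng)

ensemble = fit_random_subspace(X_train, y_train, n_members=11, n_columns=3)
accuracy = sum(predict(ensemble, row) == label for row, label in zip(X_test, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Using an odd number of members avoids tied votes for binary labels. The number of columns per member is the main knob: fewer columns give more diverse but individually weaker members.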
Decision Trees Explained With a Practical Example
A decision tree is a supervised machine learning algorithm. It can be used for both regression and classification problems, but it is mostly used for classification. A decision tree follows a set of if-else conditions to visualize the data and classify it according to those conditions. Before we dive deep into the working principle of the decision tree algorithm, you need to know a few keywords related to it. Attribute Subset Selection Measure is a technique used in the data mining process for data reduction.
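The "set of if-else conditions" can be written out directly. The tiny tree below is hand-built, not learned from data, and its feature names, thresholds, and labels are hypothetical, chosen only to show how an example is routed from the root to a leaf.

```python
def should_play_outside(day):
    """Classify one example by following if-else tests from the root to a leaf."""
    if day["outlook"] == "sunny":
        # Internal node: a second test is needed on this branch.
        if day["humidity"] > 70:
            return "no"
        return "yes"
    elif day["outlook"] == "rainy":
        if day["windy"]:
            return "no"
        return "yes"
    else:
        # Leaf reached directly: every other outlook is classified the same way.
        return "yes"

prediction = should_play_outside({"outlook": "sunny", "humidity": 85, "windy": False})
print(prediction)
```

A learning algorithm chooses these tests automatically by picking, at each node, the attribute and threshold that best separate the classes.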
Measure Inducing Classification and Regression Trees for Functional Data
Belli, Edoardo, Vantini, Simone
We propose a tree-based algorithm for classification and regression problems in the context of functional data analysis, which makes it possible to leverage representation learning and multiple splitting rules at the node level, reducing generalization error while retaining the interpretability of a tree. This is achieved by learning a weighted functional $L^{2}$ space by means of constrained convex optimization, which is then used to extract multiple weighted integral features from the input functions, in order to determine the binary split for each internal node of the tree. The approach is designed to manage multiple functional inputs and/or outputs, by defining suitable splitting rules and loss functions that can depend on the specific problem and can also be combined with scalar and categorical data, as the tree is grown with the original greedy CART algorithm. We focus on the case of scalar-valued functional inputs defined on unidimensional domains and illustrate the effectiveness of our method in both classification and regression tasks, through a simulation study and four real-world applications.
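As a hedged illustration of a weighted integral feature, the sketch below approximates $\int w(t)\,x(t)\,dt$ on a grid with the trapezoidal rule and uses the result in a threshold split. The weight function here is fixed by hand, whereas the paper learns it by constrained convex optimization; the function names and example curve are likewise assumptions.

```python
def weighted_integral(grid, x_values, w_values):
    """Trapezoidal approximation of the integral of w(t) * x(t) over the grid."""
    total = 0.0
    for k in range(1, len(grid)):
        left = w_values[k - 1] * x_values[k - 1]
        right = w_values[k] * x_values[k]
        total += 0.5 * (left + right) * (grid[k] - grid[k - 1])
    return total

def goes_left(grid, x_values, w_values, threshold):
    """Binary split of a functional input based on its weighted integral feature."""
    return weighted_integral(grid, x_values, w_values) <= threshold

# One functional observation x(t) = t^2 sampled on [0, 1] (assumed example).
grid = [k / 100 for k in range(101)]
curve = [t * t for t in grid]
# A hand-picked weight that only looks at the first half of the domain.
early_weight = [1.0 if t <= 0.5 else 0.0 for t in grid]

feature = weighted_integral(grid, curve, early_weight)
print(feature, goes_left(grid, curve, early_weight, threshold=0.1))
```

Because the weight is zero on the second half of the domain, the split depends only on the early part of the curve, which is what keeps such a rule readable.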
Adapting Neural Networks for Uplift Models
Mouloud, Belbahri, Olivier, Gandouet, Ghaith, Kazma
Uplift is a particular case of individual treatment effect modeling. Such models deal with cause-and-effect inference for a specific factor, such as a marketing intervention. In practice, these models are built on data from customers who purchased products or services, with the aim of improving product marketing. Uplift is estimated using either i) conditional mean regression or ii) transformed outcome regression. Most existing approaches are adaptations of classification and regression trees for the uplift case. However, in practice, these conventional approaches are prone to overfitting. Here we propose a new method using neural networks. This representation makes it possible to jointly optimize the difference in conditional means and the transformed outcome losses. As a consequence, the model not only estimates the uplift, but also ensures consistency in predicting the outcome. We focus on fully randomized experiments, which is the case for our data. We show that our proposed method improves on the state of the art on synthetic and real data.
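The two estimation routes named above can be compared on simulated data. The sketch assumes a fully randomized experiment with known treatment probability `e` and uses the standard transformed outcome $Z = Y\,(T - e) / (e\,(1 - e))$, whose mean equals the difference in conditional means under randomization. The simulated response rates and all names are assumptions for illustration; the paper's neural network is not reproduced.

```python
import random

def transformed_outcome(y, t, e):
    """Z = y/e for treated units (t=1) and -y/(1-e) for controls (t=0)."""
    return y * (t - e) / (e * (1 - e))

rng = random.Random(42)
e = 0.5                                 # treatment probability (randomized experiment)
p_treated, p_control = 0.30, 0.20       # assumed response rates, true uplift = 0.10

treated, control, z_values = [], [], []
for _ in range(20000):
    t = 1 if rng.random() < e else 0
    y = 1 if rng.random() < (p_treated if t else p_control) else 0
    (treated if t else control).append(y)
    z_values.append(transformed_outcome(y, t, e))

# i) difference in conditional means
uplift_means = sum(treated) / len(treated) - sum(control) / len(control)
# ii) mean of the transformed outcome
uplift_transformed = sum(z_values) / len(z_values)

print(f"difference in means: {uplift_means:.3f}")
print(f"transformed outcome: {uplift_transformed:.3f}")
```

Both quantities target the same uplift, which is why a single model can be trained against both losses at once.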
Inherent Trade-offs in the Fair Allocation of Treatments
He, Yuzi, Burghardt, Keith, Guo, Siyi, Lerman, Kristina
Explicit and implicit biases cloud human judgement, leading to discriminatory treatment of minority groups. A fundamental goal of algorithmic fairness is to avoid the pitfalls in human judgement by learning policies that improve the overall outcomes while providing fair treatment to protected classes. In this paper, we propose a causal framework that learns optimal intervention policies from data subject to fairness constraints. We define two measures of treatment bias and infer the best treatment assignment that minimizes the bias while optimizing the overall outcome. We demonstrate that there is a dilemma of balancing fairness and overall benefit; however, allowing preferential treatment to protected classes in certain circumstances (affirmative action) can dramatically improve the overall benefit while also preserving fairness. We apply our framework to data containing student outcomes on standardized tests and show how it can be used to design real-world policies that fairly improve student test scores. Our framework provides a principled way to learn fair treatment policies in real-world settings.