 Decision Tree Learning


Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods

arXiv.org Machine Learning

Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors. The amount of shrinkage is controlled by a single regularization parameter and the number of data points in each ancestor. Since HS is a post-hoc method, it is extremely fast, compatible with any tree growing algorithm, and can be used synergistically with other regularization techniques. Extensive experiments over a wide variety of real-world datasets show that HS substantially increases the predictive performance of decision trees, even when used in conjunction with other regularization techniques. Moreover, we find that applying HS to each tree in an RF often improves accuracy, as well as its interpretability by simplifying and stabilizing its decision boundaries and SHAP values. We further explain the success of HS in improving prediction performance by showing its equivalence to ridge regression on a (supervised) basis constructed of decision stumps associated with the internal nodes of a tree. All code and models are released in a full-fledged package available on GitHub (github.com/csinva/imodels).
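The shrinkage step described above can be sketched in a few lines on top of a fitted scikit-learn regression tree. This is an illustrative sketch, not the authors' released implementation (that lives in the imodels package). It assumes the update takes the form "child prediction = shrunk parent prediction + (child mean - parent mean) / (1 + lam / n_parent)", which is one reading of "shrinking towards the sample means of its ancestors" with a single parameter and the ancestor sample counts. The names `hs_node_values`, `hs_predict`, and `lam` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor


def hs_node_values(fitted_tree, lam):
    """Return one shrunk prediction per node; the tree structure is untouched."""
    t = fitted_tree.tree_
    means = t.value[:, 0, 0]      # sample mean of y in each node (single-output regressor)
    counts = t.n_node_samples     # number of training points in each node
    shrunk = np.empty_like(means)
    shrunk[0] = means[0]          # the root keeps its own mean
    stack = [0]
    while stack:
        parent = stack.pop()
        for child in (t.children_left[parent], t.children_right[parent]):
            if child != -1:       # -1 marks "no child" in scikit-learn trees
                # Assumed update: damp each parent-to-child jump by 1 + lam / n_parent.
                step = (means[child] - means[parent]) / (1.0 + lam / counts[parent])
                shrunk[child] = shrunk[parent] + step
                stack.append(child)
    return shrunk


def hs_predict(fitted_tree, X, lam):
    """Predict with the shrunk value of the leaf each row falls into."""
    leaves = fitted_tree.apply(X)
    return hs_node_values(fitted_tree, lam)[leaves]


# Synthetic data, used only to exercise the sketch.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

unshrunk = hs_predict(tree, X, lam=0.0)   # lam = 0 reproduces the original tree
shrunk = hs_predict(tree, X, lam=50.0)    # larger lam pulls leaves towards ancestor means
```

Because only the stored node values change, the same routine could be applied to each tree of a random forest after training, in line with the post-hoc framing of the abstract.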


Building a Random Forest Classifier to Predict Neural Spikes

#artificialintelligence

A step-by-step guide to building a Random Forest classifier in Python to predict subtypes of neural extracellular spikes using a real dataset recorded from human brain organoids. Given the heterogeneity of neurons within the human brain itself, classification tools are commonly utilised to correlate electrical activity with different cell types and/or morphologies. How to do this is a long-standing question in neuroscience circles, and the answer can vary considerably between different species, pathologies, brain regions and layers. Fortunately, with steadily increasing computational power allowing improvements in machine-learning and deep-learning algorithms, neuroscientists now have the tools to dive further into these important questions. However, as stated by Juavinett et al., programming skills are for the most part underrepresented in the community, and new resources to teach them are crucial to solving the complexity of the human brain.


Geometry- and Accuracy-Preserving Random Forest Proximities

arXiv.org Machine Learning

Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities, which measure the similarity between data points relative to the supervised task, can be computed from a trained random forest. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry. Random forests [1] are well-known, powerful predictors comprised of an ensemble of binary recursive decision trees. Random forests are easily adapted for both classification and regression, are trivially parallelizable, can [...] The random forest proximity was first defined by Leo Breiman as the proportion of trees in which the observations reside in the same terminal node [16].
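For orientation, the traditional proximity attributed to Breiman in the entry above (the proportion of trees in which two observations land in the same terminal node) can be computed directly from the leaf indices of a fitted scikit-learn forest. This sketch shows only that original definition, not RF-GAP, whose out-of-bag weighting is not spelled out in the abstract. The data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to exercise the sketch.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# leaves[i, t] is the index of the terminal node that sample i reaches in tree t.
leaves = forest.apply(X)

# same_leaf[i, j, t] is True when samples i and j share a terminal node in tree t.
same_leaf = leaves[:, None, :] == leaves[None, :, :]

# Original proximity: fraction of trees in which i and j share a terminal node.
proximity = same_leaf.mean(axis=2)
```

The resulting matrix is symmetric with ones on the diagonal. The broadcasted comparison uses memory proportional to n_samples squared times n_trees, so larger datasets would need a tree-by-tree accumulation instead.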


How to know when AI is the right solution

#artificialintelligence

Artificial intelligence (AI) adoption is on the rise. According to a recent McKinsey survey, 55 per cent of companies use artificial intelligence in at least one function, and 27 per cent attribute at least 5 per cent of earnings before interest and taxes to AI, much of that in the form of cost savings. As AI will dramatically transform nearly every industry it touches, it's no surprise that vendors and enterprises are looking for opportunities to deploy AI everywhere they can. But not every project can benefit from AI and attempting to apply AI inappropriately can not only cost time and money but also sour employees, customers, and corporate leaders on future AI projects. The key factors for determining whether a project is suitable for AI are business value, availability of training data, and cultural readiness for change.


Fairness implications of encoding protected categorical attributes

arXiv.org Machine Learning

Protected attributes are often presented as categorical features that need to be encoded before feeding them into a machine learning algorithm. Encoding these attributes is paramount as it determines the way the algorithm will learn from the data. Categorical feature encoding has a direct impact on the model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise while using these encodings and can lead to unfair models. The first type, irreducible bias, is due to direct group category discrimination, and the second type, reducible bias, is due to large variance in less statistically represented groups. We take a deeper look into how regularization methods for target encoding can improve the induced bias while encoding categorical features. Furthermore, we tackle the problem of intersectional fairness that arises when mixing two protected categorical features, leading to higher cardinality. This practice is a powerful feature engineering technique used for boosting model performance. We study its implications on fairness as it can increase both types of induced bias.
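A small pandas sketch may help contrast the two encoders named above. The toy table, the column names, and the helper `smoothed_target_encoding` are hypothetical. The additive smoothing used here, which pulls each category mean towards the global mean with weight `m`, is one common way to regularize target encoding and is assumed for illustration; the abstract does not say which regularizers the paper studies.

```python
import pandas as pd

# Hypothetical toy data: group "c" has a single observation, so its mean is noisy.
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 3 + ["c"],
    "target": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["group"], prefix="group")


def smoothed_target_encoding(categories, target, m):
    """Replace each category by its target mean, shrunk towards the global mean."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return categories.map(smoothed)


raw = smoothed_target_encoding(df["group"], df["target"], m=0.0)          # plain target encoding
regularized = smoothed_target_encoding(df["group"], df["target"], m=5.0)  # smoothed variant
```

With `m=0.0` the single member of group "c" is encoded as 1.0, its own label. With `m=5.0` the value moves to 4/6, close to the global mean of 0.6, which illustrates how regularization damps the variance of sparsely represented groups.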


Learn To Predict Breast Cancer Using Machine Learning

#artificialintelligence

Learn to build three machine learning models (logistic regression, decision tree, random forest) from scratch in this free course. You will use Scikit-learn to build a logistic regression model, a decision tree model, and a random forest classifier that classify breast cancer as either malignant or benign, using the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle. You should be familiar with the Python programming language and have a theoretical understanding of the three algorithms.
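As a rough idea of what such a workflow looks like, the sketch below fits the three model types with Scikit-learn. It is not the course's code. It assumes the copy of the Wisconsin diagnostic data bundled with Scikit-learn (`load_breast_cancer`) rather than the Kaggle download, and the split ratio, scaling step, and hyperparameters are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the diagnostic dataset (assumed stand-in for the Kaggle file).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Scaling helps the logistic regression solver converge; trees do not need it.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Held-out accuracy for each model.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Accuracy on a single split is only a first check. For a medical task, cross-validation and class-specific metrics such as recall on malignant cases would usually be reported as well.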


Image Classification using Machine Learning - Analytics Vidhya

#artificialintelligence

This article was published as a part of the Data Science Blogathon. In this blog, we will discuss how to perform image classification using four popular machine learning algorithms, namely the Random Forest classifier, KNN, the Decision Tree classifier, and the Naive Bayes classifier. We will jump directly into a step-by-step implementation. At the end of the article, you will understand why Deep Learning is preferred for image classification. However, the work demonstrated here will serve research purposes if one wishes to compare a CNN image classifier with some machine learning algorithms.


Model Generalization in Arrival Runway Occupancy Time Prediction by Feature Equivalences

arXiv.org Artificial Intelligence

General real-time runway occupancy time prediction modelling for multiple airports is a current research gap. An attempt to generalize a real-time prediction model for Arrival Runway Occupancy Time (AROT) is presented in this paper by substituting categorical features with their numerical equivalences. Three days of data, collected from Saab Sensis' Aerobahn system at three US airports, were used for this work. Three tree-based machine learning algorithms (Decision Tree, Random Forest and Gradient Boosting) are used to assess the generalizability of the model using numerical equivalent features. We show that models trained on numerical equivalent features not only perform at least on par with models trained on categorical features but can also make predictions on unseen data from other airports.


Learning Optimal Fair Classification Trees

arXiv.org Artificial Intelligence

The increasing use of machine learning in high-stakes domains -- where people's livelihoods are impacted -- creates an urgent need for interpretable and fair algorithms. In these settings it is also critical for such algorithms to be accurate. With these needs in mind, we propose a mixed integer optimization (MIO) framework for learning optimal classification trees of fixed depth that can be conveniently augmented with arbitrary domain specific fairness constraints. We benchmark our method against the state-of-the-art approach for building fair trees on popular datasets; given a fixed discrimination threshold, our approach improves out-of-sample (OOS) accuracy by 2.3 percentage points on average and obtains a higher OOS accuracy on 88.9% of the experiments. We also incorporate various algorithmic fairness notions into our method, showcasing its versatile modeling power that allows decision makers to fine-tune the trade-off between accuracy and fairness.


Marginal Effects for Non-Linear Prediction Functions

arXiv.org Machine Learning

Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models and especially generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either in the shape of derivatives of the prediction function or forward differences in prediction due to a change in a feature value. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a model-agnostic interpretation method for machine learning models. This may stem from their inflexibility as a univariate feature effect and their inability to deal with the non-linearities found in black box models. We introduce a new class of marginal effects termed forward marginal effects. We argue to abandon derivatives in favor of better-interpretable forward differences. Furthermore, we generalize marginal effects based on forward differences to multivariate changes in feature values. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for marginal effects. We argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to partition the feature space to compute conditional average marginal effects on feature subspaces, which serve as conditional feature effect estimates.
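The forward-difference idea above can be illustrated with a short NumPy sketch: the effect of changing a feature by a step is the prediction after the change minus the prediction before it. The prediction function `predict`, the helper `forward_marginal_effect`, and the step size are hypothetical stand-ins chosen for illustration; the paper's non-linearity measure and feature-space partitioning are not reproduced here.

```python
import numpy as np


def predict(X):
    """Hypothetical non-linear prediction function standing in for a black box model."""
    return X[:, 0] ** 2 + X[:, 1]


def forward_marginal_effect(predict_fn, X, feature, step):
    """Forward difference: prediction after shifting one feature minus the original prediction."""
    X_shifted = X.copy()
    X_shifted[:, feature] += step
    return predict_fn(X_shifted) - predict_fn(X)


X = np.array([[-2.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])

fme_x0 = forward_marginal_effect(predict, X, feature=0, step=1.0)  # → [-3., 1., 5.]
fme_x1 = forward_marginal_effect(predict, X, feature=1, step=1.0)  # → [1., 1., 1.]

average_fme_x0 = fme_x0.mean()  # → 1.0, although individual effects range from -3 to 5
```

The linear feature has the same effect for every observation, while the quadratic feature's effect changes sign across the data. The average of 1.0 hides that spread, which is the kind of situation where a single summary metric can mislead and conditional averages over feature subspaces are more informative.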