Decision Tree Learning

Trump administration updates AI strategy, with emphasis on transparency, data integrity


In its update to the National Artificial Intelligence Research and Development Strategic Plan, the White House's Office of Science and Technology Policy has set new objectives for federal AI research.

WHY IT MATTERS

The strategic plan boils down to eight strategies for how government can better enable development of safe and effective AI and machine learning technologies for healthcare and other industries. The 50-page document takes special interest in ensuring that the data used to power AI is trustworthy and that the algorithms used to process it are understandable – not least in healthcare. "A key research challenge is increasing the 'explainability' or 'transparency' of AI," according to the report. "Many algorithms, including those based on deep learning, are opaque to users, with few existing mechanisms for explaining their results. This is especially problematic for domains such as healthcare, where doctors need explanations to justify a particular diagnosis or a course of treatment. AI techniques such as decision-tree induction provide built-in explanations but are generally less accurate. Thus, researchers must develop systems that are transparent, and intrinsically capable of explaining the reasons for their results to users."
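The report's point about decision-tree induction having "built-in explanations" is easy to see in practice: a fitted tree can be printed as a list of human-readable threshold rules. A minimal sketch with scikit-learn (the dataset and depth limit are illustrative choices, not from the report):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A small clinical-style dataset; a shallow tree keeps the rule list readable.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the decision path thresholds -- the "built-in explanation"
# the report contrasts with opaque deep models.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

Each printed branch is a literal justification for a prediction ("feature <= threshold, therefore class X"), which is exactly the property deep models lack without post-hoc tooling.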

Explaining Predictions: Random Forest Post-hoc Analysis (randomForestExplainer package)


We can further evaluate the variable interactions by plotting the probability of a prediction against the variables making up the interaction. However, randomForestExplainer raises an error when the supplied model was created with parsnip; there is no error when the model is created directly with the randomForest package. In this case, we can place it side by side with the ggplot of the distribution of heart disease in the test set.
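The quantity being plotted here (predicted probability over the joint grid of two interacting variables) is a two-way partial dependence. The original workflow uses R's randomForestExplainer; the same idea can be sketched in Python with scikit-learn, using synthetic stand-in data rather than the heart-disease set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
# Synthetic stand-in for the heart-disease data: two interacting predictors.
X = rng.normal(size=(500, 4))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)  # outcome driven by an interaction

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Predicted probability averaged over the joint grid of features 0 and 1 --
# the same quantity randomForestExplainer plots for a variable interaction.
pd_result = partial_dependence(rf, X, features=[(0, 1)], kind="average")
grid_probs = pd_result["average"][0]
print(grid_probs.shape)
```

The 2-D grid of probabilities is what gets rendered as the interaction heatmap; in the R workflow it would sit alongside the ggplot of the test-set outcome distribution.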

Churn prediction


Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. Companies are interested in identifying segments of at-risk customers because acquiring a new customer usually costs more than retaining an existing one. For example, if Netflix knew which segment of customers was at risk of churning, it could proactively engage them with special offers instead of simply losing them. In this post, we will create a simple customer churn prediction model using the Telco Customer Churn dataset. We chose a decision tree to model churned customers, pandas for data crunching and matplotlib for visualizations.
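The pipeline described (pandas for the data, a decision tree for the model) can be sketched end to end. The columns and churn mechanism below are made up to mimic the Telco data's shape; the real post would load the actual CSV instead:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000
# Hypothetical stand-in for a few Telco Customer Churn columns.
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "Contract_month_to_month": rng.integers(0, 2, n),
})
# In this toy data, short-tenure month-to-month customers churn more often.
churn_rate = 0.1 + 0.4 * df["Contract_month_to_month"] * (df["tenure"] < 12)
df["Churn"] = (rng.uniform(size=n) < churn_rate).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="Churn"), df["Churn"], random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
acc = tree.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```

A shallow tree like this doubles as the segmentation tool: its top splits name the customer segments (contract type, tenure) that a retention team could target.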

VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data


The demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for "wide" datasets, current implementations such as Google's PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forests. This paper introduces CursedForest, a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. CursedForest is 9 times faster than Google's PLANET and up to 89 times faster than Yggdrasil, and is the first method capable of scaling to millions of features.

Comparing Decision Tree Algorithms: Random Forest vs. XGBoost


This tutorial walks you through a comparison of XGBoost and Random Forest, two popular decision tree algorithms, and helps you identify the best use cases for ensemble techniques like bagging and boosting. By following the tutorial, you'll learn the benefits of bagging and boosting, when to use each technique, and how the right choice leads to less variance, lower bias, and more stability in your machine learning models.
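The bagging-vs-boosting contrast can be made concrete with a small benchmark. This sketch uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same boosting idea, no extra dependency); the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: independent deep trees averaged together, which cuts variance.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting (stand-in for XGBoost): shallow trees fit sequentially to the
# previous rounds' errors, which cuts bias.
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
gb_acc = cross_val_score(gb, X, y, cv=5).mean()
print(f"random forest: {rf_acc:.3f}  gradient boosting: {gb_acc:.3f}")
```

Which wins depends on the data: bagging tends to be more robust out of the box, while boosting usually needs tuning (depth, learning rate) but can reach lower bias.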

Random Forests for Store Forecasting at Walmart Scale


The SMART Forecasting team at Walmart Labs is tasked with providing demand forecasts for over 70 million store-item combinations every week! For example, just how much of every type of ginger needs to go to every Walmart store in the U.S., every week for the next 52 weeks, with the goal of improving in-stock rates and reducing food waste. Our algorithm strategy was to build a suite of machine learning models and deploy them at scale to generate bespoke solutions for (oh so many!) store-item-week combinations. Random Forests would be part of this suite. We went through the traditional model development workflow of data discovery, identifying demand drivers, feature engineering, training, cross-validation and testing.
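For forecasting, the "training, cross-validation and testing" step has a twist: folds must respect time order so the model is always validated on weeks after the ones it trained on. A minimal sketch for one hypothetical store-item series (the demand drivers and numbers are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
weeks = np.arange(156)  # three years of weekly demand for one store-item
# Hypothetical demand drivers: yearly seasonality, trend, a promo flag.
promo = rng.integers(0, 2, weeks.size)
demand = (50 + 10 * np.sin(2 * np.pi * weeks / 52) + 0.1 * weeks
          + 15 * promo + rng.normal(0, 3, weeks.size))

X = np.column_stack([np.sin(2 * np.pi * weeks / 52),
                     np.cos(2 * np.pi * weeks / 52), weeks, promo])

# Walk-forward CV: each fold trains on the past and validates on later weeks.
errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[train_idx], demand[train_idx])
    pred = rf.predict(X[test_idx])
    errors.append(np.mean(np.abs(pred - demand[test_idx])))
print(f"mean absolute error per fold: {np.round(errors, 1)}")
```

One known caveat at this scale: tree ensembles cannot extrapolate a trend beyond the training range, which is why trend features are often detrended or handled by a separate model component.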

SIRUS: making random forests interpretable Machine Learning

State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such critical contexts, models have to be interpretable, i.e., simple, stable, and predictive. To address this issue, we design SIRUS (Stable and Interpretable RUle Set), a new classification algorithm based on random forests, which takes the form of a short list of rules. While simple models are usually unstable with respect to data perturbation, SIRUS achieves a remarkable stability improvement over cutting-edge methods. Furthermore, SIRUS inherits a predictive accuracy close to random forests, combined with the simplicity of decision trees. These properties are assessed both from a theoretical and empirical point of view, through extensive numerical experiments based on our R/C++ software implementation sirus.
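The core SIRUS idea (restrict splits to empirical quantiles so identical rules recur across trees, then keep only the frequent ones) can be approximated in a few lines. This is a rough sketch of that mechanism with scikit-learn, not the paper's actual R/C++ implementation, and it snaps thresholds to deciles after fitting rather than constraining the forest itself:

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
# Decile grid per feature; snapping thresholds here makes rules comparable.
quantiles = np.quantile(X, np.linspace(0.1, 0.9, 9), axis=0)

rf = RandomForestClassifier(n_estimators=300, max_depth=2, random_state=0)
rf.fit(X, y)

rules = Counter()
for est in rf.estimators_:
    t = est.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:  # internal node -> a split rule
            f = t.feature[node]
            idx = np.abs(quantiles[:, f] - t.threshold[node]).argmin()
            rules[(int(f), round(float(quantiles[idx, f]), 3))] += 1

# Keep only rules appearing in at least 5% of trees: the "stable" rule set.
stable = [r for r, c in rules.most_common() if c / len(rf.estimators_) >= 0.05]
print(stable[:5])
```

The surviving (feature, threshold) pairs form a short, stable rule list; the actual algorithm then fits the rules' weights and proves stability guarantees the sketch makes no attempt at.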

r/MachineLearning - [P] Updates to Incredicat, my attempt at a 20 questions style game powered by Cat AI


I posted this a few months ago and had some great feedback. I've put some work into the model and have just released the latest update. It uses a modified version of C4.5 decision trees and a load of other adjustments. Think it is working better now after some changes around the classification process.
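For anyone curious how a 20-questions engine picks its next question: C4.5-style tree induction chooses the attribute that best reduces uncertainty about the answer. A toy sketch of that selection step, with a made-up knowledge base and prior weights (nothing here is from Incredicat's actual model):

```python
import math

# Toy knowledge base: candidate answers, a prior weight (how often players
# think of each one), and yes/no traits. All values are invented.
animals = {
    "cat":    (5.0, {"has_fur": 1, "can_fly": 0, "is_pet": 1}),
    "dog":    (5.0, {"has_fur": 1, "can_fly": 0, "is_pet": 1}),
    "eagle":  (1.0, {"has_fur": 0, "can_fly": 1, "is_pet": 0}),
    "parrot": (1.0, {"has_fur": 0, "can_fly": 1, "is_pet": 1}),
    "shark":  (1.0, {"has_fur": 0, "can_fly": 0, "is_pet": 0}),
}

def entropy(names):
    total = sum(animals[a][0] for a in names)
    return -sum((animals[a][0] / total) * math.log2(animals[a][0] / total)
                for a in names)

def info_gain(candidates, trait):
    """Expected entropy reduction from asking a yes/no question about trait."""
    yes = [a for a in candidates if animals[a][1][trait]]
    no = [a for a in candidates if not animals[a][1][trait]]
    if not yes or not no:
        return 0.0
    total = sum(animals[a][0] for a in candidates)
    p_yes = sum(animals[a][0] for a in yes) / total
    return entropy(candidates) - (p_yes * entropy(yes)
                                  + (1 - p_yes) * entropy(no))

candidates = list(animals)
best = max(["has_fur", "can_fly", "is_pet"],
           key=lambda t: info_gain(candidates, t))
print(f"ask about: {best}")
```

With the weights above, "has_fur" splits the probability mass most evenly, so it is the highest-gain first question; a real engine would also handle noisy answers, which is where the "adjustments" around classification come in.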

Detecting Heterogeneous Treatment Effect with Instrumental Variables Machine Learning

Under the usual IV assumptions, our method discovers and tests heterogeneity in H-CATEs by using matching, CART, and closed testing, all without the need for sample splitting. The latter is achieved by taking the absolute value of the adjusted pairwise differences to conceal the instrument assignment. Our method was shown to strongly control the familywise error rate. We conducted a simulation study to examine the power of our method under varying degrees of compliance and effect heterogeneity and showed that our method can detect a wide variety of heterogeneity. Our method was used to study the effect of Medicaid on the number of days an individual's usual activities were not impeded by their physical or mental health, using the lottery selection as an instrument. It was found that Medicaid has a larger impact on improving this number of days for complying, older, non-Asian men who selected English materials at lottery sign-up and for complying, younger, less educated individuals who selected English materials at lottery sign-up.
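The key mechanical trick (take absolute values of matched-pair differences so CART can search for heterogeneity without revealing instrument assignment, avoiding sample splitting) can be illustrated on simulated data. This is a loose sketch of the idea, not the paper's full matching-plus-closed-testing procedure:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n_pairs = 2000
# Hypothetical matched pairs: within each pair, one unit won the lottery.
age = rng.uniform(20, 65, n_pairs)
# True encouragement effect is larger for older individuals (heterogeneity).
effect = np.where(age > 45, 4.0, 1.0)
# The pairwise difference's sign depends on which unit drew Z=1; taking the
# absolute value conceals the instrument assignment, as in the paper.
signs = rng.choice([-1, 1], n_pairs)
abs_diff = np.abs(signs * (effect + rng.normal(0, 1, n_pairs)))

# CART on the matched covariates finds where |difference| is systematically
# larger, i.e. candidate subgroups with heterogeneous effects.
cart = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200, random_state=0)
cart.fit(age.reshape(-1, 1), abs_diff)
print(export_text(cart, feature_names=["age"]))
```

Here the tree's first split recovers the age-45 breakpoint in the simulated effect; the paper then tests the discovered subgroups with closed testing to control the familywise error rate.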

The Use of Binary Choice Forests to Model and Estimate Discrete Choice Models Machine Learning

We show the equivalence of discrete choice models and the class of binary choice forests, which are random forests based on binary choice trees. This suggests that standard machine learning techniques based on random forests can serve to estimate discrete choice models with an interpretable output. This is confirmed by our data-driven result that states that random forests can accurately predict the choice probability of any discrete choice model. Our framework has unique advantages: it can capture behavioral patterns such as irrationality or sequential searches; it handles nonstandard formats of training data that result from aggregation; it can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product; it can also incorporate price information. Our numerical results show that binary choice forests can outperform the best parametric models with much better computational times.