Decision Tree Learning
Contextual Care Protocol using Neural Networks and Decision Trees
Sinha, Yash Pratyush, Malviya, Pranshu, Panda, Minerva, Ali, Syed Mohd
A contextual care protocol is used by a medical practitioner for patient healthcare, given the context or situation that the specified patient is in. This paper proposes a method to build an automated self-adapting protocol which can help make relevant, early decisions for effective healthcare delivery. The hybrid model leverages neural networks and decision trees. The neural network estimates the chances of each disease, and each decision tree represents the care protocol for one disease. These trees are subject to change when diagnosticians find aberrations. The resulting corrections or prediction errors are clustered into similar groups for scalability and review by the experts. The corrections suggested by the experts are then incorporated into the model.
Machine Learning Analysis of Heterogeneity in the Effect of Student Mindset Interventions
We study heterogeneity in the effect of a mindset intervention on student-level performance through an observational dataset from the National Study of Learning Mindsets (NSLM). Our analysis uses machine learning (ML) to address the following associated problems: assessing treatment group overlap and covariate balance, imputing conditional average treatment effects, and interpreting imputed effects. By comparing several different model families we illustrate the flexibility of both off-the-shelf and purpose-built estimators. We find that the mindset intervention has a positive average effect of 0.26, 95%-CI [0.22, 0.30], and that heterogeneity in the range of [0.1, 0.4] is moderated by school-level achievement level, poverty concentration, urbanicity, and student prior expectations.
THORS: An Efficient Approach for Making Classifiers Cost-sensitive
In this paper, we propose an effective THresholding method based on ORder Statistic, called THORS, to convert an arbitrary scoring-type classifier, which can induce a continuous cumulative distribution function of the score, into a cost-sensitive one. The procedure uses order statistics to find an optimal threshold for classification, requiring almost no knowledge of the classifier itself. Unlike common data-driven methods, we analytically show that THORS has theoretically guaranteed performance, theoretical bounds for the costs, and lower time complexity. Coupled with empirical results on several real-world data sets, we argue that THORS is the preferred cost-sensitive technique. Key words: Classification; Cost-sensitive learning; Imbalanced data set; Statistical learning; Threshold adjusting.
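For contrast, the kind of data-driven threshold search that the abstract compares against can be sketched as follows. This is not THORS itself, whose procedure is not described here; the function name `best_threshold`, the 0/1 label encoding, and the two cost parameters are assumptions made for illustration.

```python
import math

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, cost) minimising total misclassification cost.

    A sample is predicted positive when score >= threshold.
    Labels are assumed to be 0 (negative) or 1 (positive).
    """
    pairs = sorted(zip(scores, labels))
    # Threshold at the smallest score: every sample is predicted positive,
    # so only the negatives contribute cost (as false positives).
    cost = cost_fp * sum(1 for _, y in pairs if y == 0)
    best_cost, best_t = cost, pairs[0][0]
    for i, (s, y) in enumerate(pairs):
        # Raising the threshold past s turns this sample into a negative prediction.
        cost += cost_fn if y == 1 else -cost_fp
        next_t = pairs[i + 1][0] if i + 1 < len(pairs) else math.inf
        # Only evaluate between distinct score values so ties stay together.
        if next_t > s and cost < best_cost:
            best_cost, best_t = cost, next_t
    return best_t, best_cost
```

The search sorts the scores once and then sweeps the candidate thresholds, so its cost is dominated by the sort.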
FairMod - Making Predictive Models Discrimination Aware
Liu, Jixue, Li, Jiuyong, Liu, Lin, Le, Thuc Duy, Ye, Feiyue, Li, Gefei
Predictive models such as decision trees and neural networks may produce discrimination in their predictions. This paper proposes a method to post-process the predictions of a predictive model to make the processed predictions non-discriminatory. The method considers multiple protected variables together, which makes the problem more challenging than handling a single protected variable. The method uses a well-cited discrimination metric and adapts it to allow the specification of explanatory variables, such as position, profession, and education, that describe the contexts of the applications. It models the post-processing of predictions as a nonlinear optimization problem to find the best adjustments to the predictions so that the discrimination constraints of all protected variables are met at the same time. The proposed method is independent of classification methods. It can handle cases that existing methods cannot: satisfying multiple protected attributes at the same time, allowing multiple explanatory attributes, and being independent of classification model types. An evaluation using four real-world data sets shows that the proposed method is as effective as existing methods, in addition to its extra power.
Modeling Stated Preference for Mobility-on-Demand Transit: A Comparison of Machine Learning and Logit Models
Zhao, Xilei, Yan, Xiang, Yu, Alan, Van Hentenryck, Pascal
Logit models are usually applied when studying individual travel behavior, i.e., to predict travel mode choice and to gain behavioral insights on traveler preferences. Recently, some studies have applied machine learning to model travel mode choice and reported higher out-of-sample prediction accuracy than conventional logit models (e.g., multinomial logit). However, there has not been a comprehensive comparison between logit models and machine learning that covers both prediction and behavioral analysis. This paper aims at addressing this gap by examining the key differences in model development, evaluation, and behavioral interpretation between logit and machine-learning models for travel-mode choice modeling. To complement the theoretical discussions, we also empirically evaluated the two approaches on stated-preference survey data for a new type of transit system integrating high-frequency fixed routes and micro-transit. The results show that machine-learning models can produce significantly higher predictive accuracy than logit models and are better at capturing the nonlinear relationships between trip attributes and mode-choice outcomes. On the other hand, compared to the multinomial logit model, the best-performing machine-learning model, the random forest model, produces less reasonable behavioral outputs (i.e., marginal effects and elasticities) when these are computed using a standard approach. By introducing some behavioral constraints into the computation of behavioral outputs from a random forest model, however, we obtained better results that are somewhat comparable with the multinomial logit model. We believe that there is great potential in merging ideas from machine learning and conventional statistical methods to develop refined models for travel-behavior research and suggest some possible research directions.
Risk-Stratify: Confident Stratification Of Patients Based On Risk
Ahuja, Kartik, van der Schaar, Mihaela
A clinician desires to use a risk-stratification method that achieves confident risk-stratification - the risk estimates of the different patients reflect the true risks with a high probability. This allows him/her to use these risks to make accurate predictions about prognosis and decisions about screening and treatments for the current patient. We develop Risk-stratify - a two-phase algorithm that is designed to achieve confident risk-stratification. In the first phase, we grow a tree to partition the covariate space. Each node in the tree is split using statistical tests that determine whether the risks of the child nodes are different or not. The choice of the statistical test depends on whether the data is censored (Log-rank test) or not (U-test). The set of the leaves of the tree forms a partition. The risk distribution of patients that belong to a leaf is different from that of the sibling leaf, but not necessarily from that of the rest of the leaves. Therefore, some of the leaves that have similar underlying risks are incorrectly specified to have different risks. In the second phase, we develop a novel recursive graph decomposition approach to address this problem. We merge the leaves of the tree that have similar risks to form new leaves that form the final output. We apply Risk-stratify on a cohort of patients (with no history of cardiovascular disease) from UK Biobank and assess their risk for cardiovascular disease. Risk-stratify significantly improves risk-stratification, i.e., a lower fraction of the groups have over- or under-estimated risks (measured in terms of false discovery rate; 33% reduction) in comparison to state-of-the-art methods for cardiovascular prediction (Random forests, Cox model, etc.). We find that the Cox model significantly overestimates the risk of 21,621 patients out of 216,211 patients. Risk-stratify can accurately categorize 2,987 of these 21,621 patients as low-risk individuals.
Machine Learning Basics - Random Forest
Random forest (RF) is based on decision trees. In machine learning, decision trees are a technique for creating predictive models. They are called decision trees because the prediction follows several branches of "if… then…" decision splits, similar to the branches of a tree. If we imagine that we start with a sample for which we want to predict a class, we would start at the bottom of a tree and travel up the trunk until we come to the first split-off branch. This split can be thought of as a feature in machine learning; let's say it is "age". We would now make a decision about which branch to follow: "if our sample has an age bigger than 30, continue along the left branch, else continue along the right branch".
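The branch-following idea can be sketched in a few lines of Python. This is a minimal illustration only: the nested-dictionary layout, the `predict` function, the second "income" split, and the class labels are all invented for the example; only the "age bigger than 30" split comes from the text.

```python
def predict(sample, node):
    """Follow "if... then..." splits from the first split until a leaf is reached."""
    while "label" not in node:
        # Values above the threshold go left, everything else goes right,
        # mirroring the "age bigger than 30" rule in the text.
        branch = "left" if sample[node["feature"]] > node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# Hypothetical tree: the first split is on "age"; the rest is made up.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "class A"},
    "right": {
        "feature": "income", "threshold": 50000,
        "left": {"label": "class B"},
        "right": {"label": "class C"},
    },
}

print(predict({"age": 45, "income": 20000}, tree))
```

A random forest would hold many such trees, each trained on a different resample of the data, and combine their individual predictions.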
Computational Intelligence in Sports: A Systematic Literature Review
Bonidia, Robson P., Rodrigues, Luiz A. L., Avila-Santos, Anderson P., Sanches, Danilo S., Brancher, Jacques D.
Recently, data mining studies have been successfully conducted to estimate several parameters in a variety of domains. Data mining techniques have attracted the attention of the information industry and society as a whole, due to the large amount of data and the imminent need to turn it into useful knowledge. However, the effective use of data in some areas is still under development, as is the case in sports, which in recent years has shown slight growth; consequently, many sports organizations have begun to see that there is a wealth of unexplored knowledge in the data they extract. Therefore, this article presents a systematic review of sports data mining. For the years 2010 to 2018, 31 types of research were found on this topic. Based on these studies, we present the current panorama, themes, the database used, proposals, algorithms, and research opportunities. Our findings provide a better understanding of the potential of sports data mining, besides motivating the scientific community to explore this timely and interesting topic.
How to Interpret a Random Forest Model (Machine Learning with Python)
Machine Learning is a fast-evolving field – but a few things remain as they were years ago. One such thing is the ability to interpret and explain your machine learning models. If you build a model and cannot explain it to your business users, it is very unlikely that it will see the light of day. Can you imagine integrating a model into your product without understanding how it works? Or which features are impacting your final result? In addition to backing from stakeholders, we as data scientists benefit from interpreting our work and improving upon it. I'm delighted to share part 2 of this fast.ai series, which primarily deals with how you can interpret a random forest model. We will understand the theory and also implement it in Python to solidify our grasp on this critical concept.
Dealing with Uncertain Inputs in Regression Trees
Tami, Myriam, Clausel, Marianne, Devijver, Emilie, Dulac, Adrien, Gaussier, Eric, Janaqi, Stefan, Chebre, Meriam
Tree-based ensemble methods, such as Random Forests and Gradient Boosted Trees, have been successfully used for regression in many applications and research studies. Furthermore, these methods have been extended in order to deal with uncertainty in the output variable, using for example a quantile loss in Random Forests (Meinshausen, 2006). To the best of our knowledge, no extension has been provided yet for dealing with uncertainties in the input variables, even though such uncertainties are common in practical situations. We propose here such an extension by showing how standard regression trees optimizing a quadratic loss can be adapted and learned while taking into account the uncertainties in the input. By doing so, one no longer assumes that an observation lies in a single region of the regression tree, but rather that it belongs to each region with a certain probability. Experiments conducted on several data sets illustrate the good behavior of the proposed extension.
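The idea that an observation belongs to each region with a certain probability can be illustrated with a small sketch. It is not the authors' method: it assumes a single input feature, Gaussian noise with known standard deviation `sigma` around the observed value, and a tree whose regions are intervals cut by fixed thresholds. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def region_probabilities(x_obs, sigma, thresholds):
    """Probability that the true input lies in each interval cut by the thresholds.

    Assumes the true input is distributed as Normal(x_obs, sigma**2).
    """
    edges = [-math.inf] + sorted(thresholds) + [math.inf]
    cdf = [normal_cdf((e - x_obs) / sigma) for e in edges]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]

def soft_predict(x_obs, sigma, thresholds, leaf_values):
    """Average the leaf values, weighted by the probability of each region."""
    probs = region_probabilities(x_obs, sigma, thresholds)
    return sum(p * v for p, v in zip(probs, leaf_values))

# An observation sitting exactly on the split is shared equally by both regions.
print(soft_predict(0.0, 1.0, [0.0], [10.0, 20.0]))
```

As `sigma` shrinks, the weights concentrate on the single region containing the observation, and the prediction approaches that of an ordinary regression tree.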