Decision Tree Learning
VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees
Chatzimparmpas, Angelos, Martins, Rafael M., Kerren, Andreas
Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forests and adaptive boosting, reduces as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. Finally, we evaluate the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study.
Improve Random Forest with Linear Models
Random Forest is considered by many to be the silver bullet for supervised prediction tasks. Any data scientist involved in standard machine learning applications is used to fitting and benchmarking a Random Forest. Random Forest is a well-known algorithm in the literature and has been shown to reach satisfactory results in both regression and classification contexts. It can learn complex data relationships with little effort. Many efficient open-source implementations are available (the one provided by scikit-learn is probably the most famous).
Top 5 techniques for Explainable AI
As you can see, all these explainable AI techniques are not "nice-to-have" but mandatory. Using these techniques will help you communicate better with the people affected by AI decisions. In some cases, as seen in the stroke prediction example, understanding these techniques can help improve or save lives. You can experience some of the techniques in this article on my website -- https://experiencedatascience.com
Random Forests Algorithm explained with a real-life example and some Python code
Random Forests is a Machine Learning algorithm that tackles one of the biggest problems with Decision Trees: variance. Even though a Decision Tree is simple and flexible, it is a greedy algorithm. It focuses on optimizing the node split at hand, rather than taking into account how that split impacts the entire tree. A greedy approach makes Decision Trees run faster, but makes them prone to overfitting. An overfit tree is highly optimized for predicting the values in the training dataset, resulting in a learning model with high variance.
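The greedy behaviour described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the article's own code: the helper names `gini` and `best_split`, the choice of Gini impurity as the split criterion, and the toy data are all hypothetical. The search scores only the node at hand and never looks at how the chosen split affects later splits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Greedy search for the (feature, threshold) with the lowest weighted impurity.

    Only the current node is scored; the effect on later splits is ignored.
    Returns (feature index, threshold, weighted impurity), or
    (None, None, impurity of the unsplit node) if no split improves on it.
    """
    n = len(rows)
    best = (None, None, gini(labels))
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rows = [(1.0, 7.0), (2.0, 3.0), (3.0, 9.0), (8.0, 4.0), (9.0, 8.0), (10.0, 2.0)]
labels = ["a", "a", "a", "b", "b", "b"]
feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full tree would apply `best_split` recursively to each child node. A Random Forest, in turn, would train many such trees on bootstrap samples with random feature subsets and combine their votes, which is how it reduces the variance of a single tree.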
Factor-augmented tree ensembles
This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. First, this makes it possible to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way to use domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied through the lens of empirical macro-finance.
Keywords: Ensemble learning, Factor models, State-space models, Time series, Unobserved components.
Introduction
In time series, the simplicity of regression trees (Morgan and Sonquist, 1963; Breiman et al., 1984; Quinlan, 1986) comes at a cost: irregularities, complicated periodic patterns and non-stationary trends cannot be explicitly modelled, and this is unfortunate given that many real-world examples are subject to them. Following, in spirit, Harvey et al. (1998), this paper proposes to pre-process problematic predictors using state-space representations general enough to deal with all these complexities at once. This operation can be thought of as an automated feature engineering process that extracts stationary patterns hidden across multiple predictors, while handling problematic data characteristics. Besides, when the state-space representation is compatible with domain-specific theory, this becomes a transparent way of extracting signals with structural interpretation. The resulting stationary common components, referred to below as stationary dynamic factors, are then employed as regular predictors for standard time-series regression trees. This manuscript calls them factor-augmented regression trees to stress their dependence on latent components.
I thank Matteo Barigozzi and Kostas Kalogeropoulos for their valuable suggestions and supervision; Serena Lariccia and Qiwei Yao for their helpful comments on a preliminary draft of this article.
Statistical Tests for Comparing Classification Algorithms
Comparing prediction methods to decide which one should be used for the task at hand is a daily activity for most data scientists. Usually, one will have a pool of classification models and will validate them using cross-validation to determine which one is best. Another goal, however, is not to compare classifiers, but the learning algorithms themselves. The idea is: given this task (data), which learning algorithm (KNN, SVM, Random Forests, etc.) will generate more accurate classifiers on a dataset of size D? As we will see, every method presented here has some pros and cons. However, the first intuition of using a two proportions test can lead to some really bad results.
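For reference, the two proportions test mentioned above can be sketched as a pooled z-test on the accuracies of two classifiers. This is a minimal sketch, not the article's code: the function name `two_proportion_z_test` and the counts in the usage example are hypothetical. The test assumes the two accuracies come from independent samples. Classifiers are usually scored on the same test set, so that assumption is often violated, which is one commonly cited reason for the caution expressed above.

```python
import math


def two_proportion_z_test(correct_a, correct_b, n_a, n_b):
    """Return (z, two-sided p-value) for H0: both classifiers have equal accuracy.

    correct_a / n_a and correct_b / n_b are the observed accuracies.
    The test treats the two samples as independent.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: classifier A gets 850 of 1000 right, classifier B 820 of 1000.
z, p_value = two_proportion_z_test(850, 820, 1000, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```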
MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data
Gerasimiuk, Michal, Shung, Dennis, Tong, Alexander, Stanley, Adrian, Schultz, Michael, Ngu, Jeffrey, Laine, Loren, Wolf, Guy, Krishnaswamy, Smita
A major challenge in embedding or visualizing clinical patient data is the heterogeneity of variable types including continuous lab values, categorical diagnostic codes, as well as missing or incomplete data. In particular, in EHR data, some variables are missing not at random (MNAR), that is, deliberately not collected, and thus are a source of information. For example, lab tests may be deemed necessary for some patients on the basis of suspected diagnosis, but not for others. Here we present the MURAL forest -- an unsupervised random forest for representing data with disparate variable types (e.g., categorical, continuous, MNAR). MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random, such that the marginal entropy of all other variables is minimized by the split. This allows us to also split on MNAR variables and discrete variables in a way that is consistent with the continuous variables. The end goal is to learn the MURAL embedding of patients using average tree distances between those patients. These distances can be fed to a nonlinear dimensionality reduction method like PHATE to derive visualizable embeddings. While such methods are ubiquitous in continuous-valued datasets (like single-cell RNA sequencing), they have not been used extensively in mixed variable data. We showcase the use of our method on one artificial and two clinical datasets. We show that using our approach, we can visualize and classify data more accurately than competing approaches. Finally, we show that MURAL can also be used to compare cohorts of patients via the recently proposed tree-sliced Wasserstein distances.
A Hybrid Approach for an Interpretable and Explainable Intrusion Detection System
Dias, Tiago, Oliveira, Nuno, Sousa, Norberto, Praça, Isabel, Sousa, Orlando
Cybersecurity has been a concern for quite a while now. In recent years, cyberattacks have been increasing in size and complexity, fueled by significant advances in technology. Nowadays, there is an unavoidable need to protect systems and data crucial for business continuity. Hence, many intrusion detection systems have been created in an attempt to mitigate these threats and contribute to timelier detection. This work proposes an interpretable and explainable hybrid intrusion detection system, which makes use of artificial intelligence methods to achieve better and more long-lasting security. The system combines experts' written rules and dynamic knowledge continuously generated by a decision tree algorithm as new shreds of evidence emerge from network activity.
A Large Scale Benchmark for Individual Treatment Effect Prediction and Uplift Modeling
Diemert, Eustache, Betlei, Artem, Renaudin, Christophe, Amini, Massih-Reza, Gregoir, Théophane, Rahier, Thibaud
Individual Treatment Effect (ITE) prediction is an important area of research in machine learning which aims at explaining and estimating the causal impact of an action at the granular level. It represents a problem of growing interest in multiple sectors of application such as healthcare, online advertising or socioeconomics. To foster research on this topic we release a publicly available collection of 13.9 million samples collected from several randomized control trials, scaling up previously available datasets by a healthy 210x factor. We provide details on the data collection and perform sanity checks to validate the use of this data for causal inference tasks. First, we formalize the task of uplift modeling (UM) that can be performed with this data, along with the relevant evaluation metrics. Then, we propose synthetic response surfaces and heterogeneous treatment assignment providing a general set-up for ITE prediction. Finally, we report experiments to validate key characteristics of the dataset leveraging its size to evaluate and compare - with high statistical significance - a selection of baseline UM and ITE prediction methods.
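The abstract does not spell out its formalization of uplift, so the following is only a sketch of the quantity usually meant by the term: the difference in mean outcome between treated and control units, which estimates the average treatment effect when assignment is randomized. The function name `average_uplift`, the record layout, and the toy records are hypothetical and are not taken from the released dataset.

```python
def average_uplift(records):
    """Difference in mean outcome between treated and control records.

    Each record is a (treated, outcome) pair, where treated is True or False
    and outcome is 0 or 1. Under randomized treatment assignment, this
    difference estimates the average effect of the treatment.
    """
    treated = [outcome for is_treated, outcome in records if is_treated]
    control = [outcome for is_treated, outcome in records if not is_treated]
    if not treated or not control:
        raise ValueError("need at least one treated and one control record")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical toy records: 3 of 4 treated units respond, 1 of 4 control units respond.
records = [
    (True, 1), (True, 0), (True, 1), (True, 1),
    (False, 0), (False, 1), (False, 0), (False, 0),
]
print(average_uplift(records))
```

An individual-level (ITE or uplift) model would estimate this difference conditionally on each unit's features rather than over the whole population. Applying the same function to subsets of records grouped by a feature value gives a crude per-segment version.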