Decision Tree Learning
3 decision tree-based algorithms for Machine Learning
Decision trees are tree-structured models that split the data based on a sequence of decisions. Look at the image below of a very simple decision tree. We want to decide whether an animal is a cat or a dog based on 2 questions. We answer each question and, depending on the answers, classify the animal as either a dog or a cat. The red lines represent the answer "NO" and the green line, "YES".
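The two-question idea can be sketched in a few lines of Python. The original does not say which two questions the pictured tree asks, so the questions used here ("does it bark?" and "does it have retractable claws?") are hypothetical stand-ins, as are the names `Node`, `Leaf` and `classify`. This is a hand-built illustration of the structure, not a learned model.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    """Terminal node: holds the predicted class."""
    label: str


@dataclass
class Node:
    """Internal node: asks a yes/no question and routes to a child."""
    question: Callable[[dict], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def classify(tree, animal):
    """Follow YES/NO branches from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(animal) else tree.no
    return tree.label


# Hypothetical questions; the source does not specify the ones in its image.
tree = Node(
    question=lambda a: a["barks"],
    yes=Leaf("dog"),
    no=Node(
        question=lambda a: a["retractable_claws"],
        yes=Leaf("cat"),
        no=Leaf("dog"),
    ),
)

print(classify(tree, {"barks": False, "retractable_claws": True}))
```

A learning algorithm such as CART would choose the questions and their order from data rather than having them written by hand.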
Decision Trees, Random Forests, AdaBoost & XGBoost in Python
You're looking for a complete decision tree course that teaches you everything you need to create a decision tree, random forest, or XGBoost model in Python, right? You've found the right course on decision trees and advanced tree-based techniques! How will this course help you? A Verifiable Certificate of Completion is presented to all students who undertake this advanced machine learning course. If you are a business manager, an executive, or a student who wants to learn and apply machine learning to real-world business problems, this course will give you a solid base by teaching you some of the advanced techniques of machine learning: decision trees, random forests, bagging, AdaBoost and XGBoost.
Regression Trees for Cumulative Incidence Functions
The use of cumulative incidence functions for characterizing the risk of one type of event in the presence of others has become increasingly popular over the past decade. The problems of modeling, estimation and inference have been treated using parametric, nonparametric and semi-parametric methods. Efforts to develop suitable extensions of machine learning methods, such as regression trees and related ensemble methods, have begun only recently. In this paper, we develop a novel approach to building regression trees for estimating cumulative incidence curves in a competing risks setting. The proposed methods employ augmented estimators of the Brier score risk as the primary basis for building and pruning trees.
A Survey on the Explainability of Supervised Machine Learning
Burkart, Nadia, Huber, Marco F.
Predictions obtained by, e.g., artificial neural networks have a high accuracy, but humans often perceive the models as black boxes. Insights about the decision making are mostly opaque for humans. Understanding the decision making is of paramount importance, particularly in highly sensitive areas such as healthcare or finance. The decision making behind the black boxes needs to be more transparent, accountable, and understandable for humans. This survey paper provides essential definitions and an overview of the different principles and methodologies of explainable Supervised Machine Learning (SML). We conduct a state-of-the-art survey that reviews past and recent explainable SML approaches and classifies them according to the introduced definitions. Finally, we illustrate principles by means of an explanatory case study and discuss important future directions.
Precision-Recall Curve (PRC) Classification Trees
The classification of imbalanced data has presented a significant challenge for most well-known classification algorithms, which were often designed for data with relatively balanced class distributions. Nevertheless, a skewed class distribution is a common feature in real-world problems. It is especially prevalent in certain application domains with great need for machine learning and better predictive analysis, such as disease diagnosis, fraud detection, bankruptcy prediction, and suspect identification. In this paper, we propose a novel tree-based algorithm based on the area under the precision-recall curve (AUPRC) for variable selection in the classification context. Our algorithm, named the "Precision-Recall Curve classification tree", or simply the "PRC classification tree", modifies two crucial stages in tree building. The first stage is to maximize the area under the precision-recall curve in node variable selection. The second stage is to maximize the harmonic mean of recall and precision (F-measure) for threshold selection. We found that the proposed PRC classification tree, and its subsequent extension, the PRC random forest, work well, especially for class-imbalanced data sets. We have demonstrated that our methods outperform their classic counterparts, the usual CART and random forest, for both synthetic and real data. Furthermore, the ROC classification tree proposed by our group previously has shown good performance on imbalanced data. The combination of them, the PRC-ROC tree, also shows great promise in identifying the minority class.
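The second stage described above, choosing a split threshold by maximizing the F-measure, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of midpoints between distinct values as candidate thresholds, and the toy data are all assumptions made for illustration, and the AUPRC-based variable selection stage is not shown.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f_threshold(values, labels):
    """Return (threshold, F) for the rule 'predict positive when value > threshold'.

    Candidate thresholds are midpoints between consecutive distinct values
    (an assumption of this sketch).
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_t, best_f = None, -1.0
    for t in candidates:
        tp = sum(1 for v, y in zip(values, labels) if v > t and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v > t and y == 0)
        fn = sum(1 for v, y in zip(values, labels) if v <= t and y == 1)
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Toy imbalanced data: 3 positives out of 10 samples.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

threshold, f = best_f_threshold(values, labels)
print(threshold, round(f, 3))
```

On imbalanced data the F-measure ignores true negatives, so a threshold chosen this way is driven by how well the minority class is recovered rather than by overall accuracy.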
Regression Trees for Cumulative Incidence Functions
Cho, Youngjoo, Molinaro, Annette M., Hu, Chen, Strawderman, Robert L.
A subject being followed over time may experience several types of events related, for example, to disease morbidity and mortality. For example, in a Phase III trial of concomitant versus sequential chemotherapy and thoracic radiotherapy for patients with inoperable non-small cell lung cancer (NSCLC) conducted by the Radiation Therapy Oncology Group (RTOG), patients were followed up to 5 years, the occurrence of either disease progression or death being of particular interest. Such "competing risks" data are commonly encountered in cancer and other biomedical followup studies, in addition to the potential complication of right-censoring on the event time(s) of interest. Two quantities are often used when analyzing competing risks data: the cause-specific hazard function (CSH) and the cumulative incidence function (CIF). For a given event, the former describes the instantaneous risk of this event at time t, given that no events have yet occurred; the latter describes the probability of occurrence, or absolute risk, of that event across time and can be derived directly from the subdistribution hazard function (Fine and Gray, 1999).
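For reference, the two quantities named above can be written out as follows. The notation is an assumption of this note rather than taken from the abstract: T denotes the time to the first event and K the type (cause) of that event. The last line gives the standard relation between the CIF and the subdistribution hazard in the sense of Fine and Gray (1999).

```latex
% Assumed notation: T = time to first event, K = cause of that event.

% Cause-specific hazard (CSH) for cause k
\lambda_k(t) = \lim_{\Delta t \to 0}
  \frac{P(t \le T < t + \Delta t,\; K = k \mid T \ge t)}{\Delta t}

% Cumulative incidence function (CIF) for cause k
F_k(t) = P(T \le t,\; K = k)

% Subdistribution hazard and its one-to-one relation to the CIF
\tilde{\lambda}_k(t) = -\frac{d}{dt} \log\{1 - F_k(t)\}
\quad\Longleftrightarrow\quad
F_k(t) = 1 - \exp\left\{ -\int_0^t \tilde{\lambda}_k(s)\, ds \right\}
```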
Interpretable Machine Learning for COVID-19: An Empirical Study on Severity Prediction Task
Wu, Han, Ruan, Wenjie, Wang, Jiangtao, Zheng, Dingchang, Li, Shaolin, Chen, Jian, Li, Kunwei, Chai, Xiangfei, Helal, Sumi
The black-box nature of many high-accuracy models hinders their deployment in medical diagnosis. It is risky to put one's life in the hands of models that medical researchers do not trust. However, to understand the mechanism of a new virus, such as COVID-19, machine learning models may catch important symptoms that medical practitioners do not notice due to the surge of infected patients during a pandemic. In this work, the interpretation of machine learning models reveals that a high C-reactive protein (CRP) level corresponds to severe infection, and severe patients usually go through a cardiac injury, which is consistent with well-established medical knowledge. Additionally, through the interpretation of machine learning models, we find that phlegm and diarrhea are two important symptoms, the absence of which indicates a high risk of turning severe. These two symptoms were not recognized at the early stage of the outbreak, whereas our findings are corroborated by later autopsies of COVID-19 patients. We find that patients with a high N-terminal pro B-type natriuretic peptide (NTproBNP) level have a significantly increased risk of death, which did not receive much attention initially but was confirmed by the follow-up study. Thus, we suggest that interpreting machine learning models can help diagnosis at the early stage of an outbreak.
An Embedded Model Estimator for Non-Stationary Random Functions using Multiple Secondary Variables
An algorithm for non-stationary spatial modelling using multiple secondary variables is developed. It combines Geostatistics with Quantile Random Forests to give a new interpolation and stochastic simulation algorithm. This paper introduces the method and shows that it has consistency results that are similar in nature to those applying to geostatistical modelling and to Quantile Random Forests. The method allows for embedding of simpler interpolation techniques, such as Kriging, to further condition the model. The algorithm works by estimating a conditional distribution for the target variable at each target location. The family of such distributions is called the envelope of the target variable. From this, it is possible to obtain spatial estimates, quantiles and uncertainty. An algorithm to produce conditional simulations from the envelope is also developed. As they sample from the envelope, realizations are therefore locally influenced by relative changes of importance of secondary variables, trends and variability.
The Macroeconomy as a Random Forest
I develop Macroeconomic Random Forest (MRF), an algorithm adapting the canonical Machine Learning (ML) tool to flexibly model evolving parameters in a linear macro equation. Its main output, Generalized Time-Varying Parameters (GTVPs), is a versatile device nesting many popular nonlinearities (threshold/switching, smooth transition, structural breaks/change) and allowing for sophisticated new ones. The approach delivers clear forecasting gains over numerous alternatives, predicts the 2008 drastic rise in unemployment, and performs well for inflation. Unlike most ML-based methods, MRF is directly interpretable -- via its GTVPs. For instance, the successful unemployment forecast is due to the influence of forward-looking variables (e.g., term spreads, housing starts) nearly doubling before every recession. Interestingly, the Phillips curve has indeed flattened, and its might is highly cyclical.
Enhash: A Fast Streaming Algorithm For Concept Drift Detection
Jindal, Aashi, Gupta, Prashant, Sengupta, Debarka, Jayadeva
We propose Enhash, a fast ensemble learner that detects concept drift in a data stream. A stream may consist of abrupt, gradual, virtual, or recurring events, or a mixture of various types of drift. Enhash employs projection hash to insert an incoming sample. We show empirically that the proposed method performs competitively with existing ensemble learners in much less time. Also, Enhash has moderate resource requirements. Experiments relevant to performance comparison were performed on 6 artificial and 4 real data sets consisting of various types of drifts.