Decision Tree Learning
Automatic Induction of Neural Network Decision Tree Algorithms
This work presents an approach to the automatic induction of non-greedy decision trees constructed from a neural network architecture. This construction can be used to transfer weights when growing or pruning a decision tree, allowing non-greedy decision tree algorithms to automatically learn and adapt to the ideal architecture. In this work, we examine the underpinning ideas within ensemble modelling and Bayesian model averaging which allow our neural network to asymptotically approach the ideal architecture through weight transfer. Experimental results demonstrate that this approach improves on models with a fixed set of hyperparameters, for both decision tree and decision forest models.
Classification when Learning is not Feasible
Consider classification problems, where attributes do not give any information about the class label. I do not know what kind of behavior to expect when running a classification algorithm in this setting (let's assume ID3 decision trees for simplicity). The decision tree constructed should be some kind of "empty" model, because it's even less than a decision stump (i.e. In practice, the model is likely to fit the noise, and find some kind of pattern that does not exist. The algorithm could still manage to come up with some decision tree on data that is in actual fact random.
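The situation described above can be made concrete with a minimal sketch. It assumes binary attributes and binary labels drawn independently at random (so the attributes carry no information), and it computes the ID3 splitting criterion, information gain, for each attribute. The sample size, attribute count, and seed are arbitrary illustrative choices. On a finite sample the gain is typically small but not exactly zero, which is why ID3 without a stopping rule or pruning would still pick a split.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the attribute at index `attr`."""
    n = len(labels)
    partitions = {}
    for x, y in zip(rows, labels):
        partitions.setdefault(x[attr], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Assumption: attributes and labels are independent fair coin flips.
random.seed(0)
n_samples, n_attrs = 200, 5
rows = [[random.randint(0, 1) for _ in range(n_attrs)] for _ in range(n_samples)]
labels = [random.randint(0, 1) for _ in range(n_samples)]

gains = [information_gain(rows, labels, a) for a in range(n_attrs)]
best_gain = max(gains)
best_attr = gains.index(best_gain)
print(f"best attribute: {best_attr}, gain: {best_gain:.4f} bits")
```

A significance test on the gain, a minimum-gain threshold, or post-pruning against held-out data are the usual ways to make the learner return a single leaf in this setting.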
Intersectionality: Multiple Group Fairness in Expectation Constraints
Fitzsimons, Jack, Osborne, Michael, Roberts, Stephen
Group fairness is an important concern for machine learning researchers, developers, and regulators. However, the strictness to which models must be constrained to be considered fair is still under debate. The focus of this work is on constraining the expected outcome of subpopulations in kernel regression and, in particular, decision tree regression, with application to random forests, boosted trees and other ensemble models. While individual constraints were previously addressed, this work addresses concerns about incorporating multiple constraints simultaneously. The proposed solution does not affect the order of computational or memory complexity of the decision trees and is easily integrated into models post training.
PSICA: decision trees for probabilistic subgroup identification with categorical treatments
Sysoev, Oleg, Bartoszek, Krzysztof, Ekstrom, Eva-Charlotte, Selling, Katarina Ekholm
Personalized medicine aims at identifying the best treatments for a patient with given characteristics. It has been shown in the literature that these methods can lead to great improvements in medicine compared to traditional methods prescribing the same treatment to all patients. Subgroup identification is a branch of personalized medicine which aims at finding subgroups of patients with similar characteristics for which some of the investigated treatments have a better effect than the other treatments. A number of approaches based on decision trees have been proposed to identify such subgroups, but most of them focus on two-arm trials (control/treatment), while a few methods consider quantitative treatments (defined by the dose). However, no subgroup identification method exists that can predict the best treatments in a scenario with a categorical set of treatments. We propose a novel method for subgroup identification in categorical treatment scenarios. This method outputs a decision tree showing the probabilities of a given treatment being the best for a given group of patients as well as labels showing the possible best treatments. The method is implemented in an R package, psica, available on CRAN. In addition to numerical simulations based on artificial data, we present an analysis of a community-based nutrition intervention trial that justifies the validity of our method.
Privacy-Preserving Collaborative Prediction using Random Forests
Giacomelli, Irene, Jha, Somesh, Kleiman, Ross, Page, David, Yoon, Kyonghwan
We study the problem of privacy-preserving machine learning (PPML) for ensemble methods, focusing our effort on random forests. In collaborative analysis, PPML attempts to solve the conflict between the need for data sharing and privacy. This is especially important in privacy-sensitive applications such as learning predictive models for clinical decision support from EHR data from different clinics, where each clinic has a responsibility for its patients' privacy. We propose a new approach for ensemble methods: each entity learns a model from its own data, and then, when a client asks for the prediction for a new private instance, the answers from all the locally trained models are used to compute the prediction in such a way that no extra information is revealed. We implement this approach for random forests and we demonstrate its high efficiency and potential accuracy benefit via experiments on real-world datasets, including actual EHR data.
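The abstract does not say which cryptographic construction is used, so the following is only an illustrative sketch of the general idea of combining local answers without revealing any single one. It uses textbook additive secret sharing, which is an assumption and not necessarily the authors' protocol. The names `share` and `secure_sum`, the modulus, and the example vote counts are all hypothetical, and the sketch does not protect the client's query instance.

```python
import secrets

# Assumption: all arithmetic is done modulo a fixed large prime.
MODULUS = 2**61 - 1

def share(value, n_parties):
    """Split `value` into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(values):
    """Simulate the sum of private values via additive secret sharing.

    all_shares[i][j] is the share that party i sends to party j. Each party
    only sees one random-looking share from every other party and publishes
    the sum of the shares it received.
    """
    n = len(values)
    all_shares = [share(v, n) for v in values]
    partials = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS

# Hypothetical example: three clinics, each with a local forest of 10 trees,
# report how many of their trees voted for the positive class.
positive_votes = [3, 5, 1]
total_positive = secure_sum(positive_votes)
prediction = 1 if total_positive > 15 else 0  # majority over 30 trees
print(total_positive, prediction)
```

Only the aggregate vote count is reconstructed; the individual counts stay hidden as long as the parties do not collude.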
Simultaneous 12-Lead Electrocardiogram Synthesis using a Single-Lead ECG Signal: Application to Handheld ECG Devices
Afrin, Kahkashan, Verma, Parikshit, Srivatsa, Sanjay S., Bukkapatnam, Satish T. S.
The recent introduction of wearable single-lead ECG devices of diverse configurations has caught the interest of the medical community. While these devices provide a highly affordable support tool for caregivers for continuous monitoring and for detecting acute conditions, such as arrhythmia, their utility for cardiac diagnostics remains limited. This is because clinical diagnosis of many cardiac pathologies is rooted in gleaning patterns from synchronous 12-lead ECG. If synchronous 12-lead signals of clinical quality can be synthesized from these single-lead devices, it can transform cardiac care by substantially reducing costs and enhancing access to cardiac diagnostics. However, prior attempts to synthesize synchronous 12-lead ECG have not been successful. Vectorcardiography (VCG) analysis suggests that the cardiac axis synthesized in earlier attempts deviates significantly from that estimated from 12-lead and/or Frank lead measurements. This work is perhaps the first successful attempt to synthesize clinically equivalent synchronous 12-lead ECG from single-lead ECG. Our method employs a random forest machine learning model that uses a subject's historical 12-lead recordings to estimate the morphology, including the actual timing of various ECG events (relative to the measured single-lead ECG), for all 11 missing leads of the subject. Our method was validated on two benchmark datasets as well as paper ECG and AliveCor-Kardia data obtained from the Heart, Artery, and Vein Center of Fresno, California. Results suggest that this approach can synthesize synchronous ECG with accuracies (R²) exceeding 90%. Accurate synthesis of 12-lead ECG from a single-lead device can ultimately enable its wider application and improved point-of-care (POC) diagnostics.
CLUB-DRF: A Clustering Approach to Extreme Pruning of Random Forests
Random Forest (RF) is an ensemble supervised machine learning technique that was developed by Breiman over a decade ago. Compared with other ensemble techniques, it has proved its superiority. Many researchers, however, believe that there is still room for enhancing and improving its performance accuracy. This explains why, over the past decade, there have been many extensions of RF where each extension employed a variety of techniques and strategies to improve certain aspect(s) of RF. Since it has been proven empirically that ensembles tend to yield better results when there is a significant diversity among the constituent models, the objective of this paper is twofold.
Decision Tree in Machine Learning – Towards Data Science
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g. The paths from root to leaf represent classification rules. The diagram below illustrates the basic flow of a decision tree for decision making with the labels Rain (Yes) and No Rain (No). The decision tree is one of the predictive modelling approaches used in statistics, data mining and machine learning. Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions.
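As a rough illustration of the flowchart view, the sketch below hand-builds a small tree with the Rain / No Rain labels and walks it. The feature names `cloudy` and `humid` and the tree shape are assumptions made up for the example, not taken from the article. Each root-to-leaf path yielded by `rules` corresponds to one classification rule.

```python
# Hypothetical hand-built tree: internal nodes test a feature, leaves hold a label.
tree = {
    "feature": "cloudy",
    "branches": {
        False: {"label": "No Rain"},
        True: {
            "feature": "humid",
            "branches": {
                True: {"label": "Rain"},
                False: {"label": "No Rain"},
            },
        },
    },
}

def predict(node, sample):
    """Follow the branch matching each tested feature until a leaf is reached."""
    while "label" not in node:
        node = node["branches"][sample[node["feature"]]]
    return node["label"]

def rules(node, path=()):
    """Yield (conditions, label) for every root-to-leaf path."""
    if "label" in node:
        yield path, node["label"]
        return
    for value, child in node["branches"].items():
        yield from rules(child, path + ((node["feature"], value),))

print(predict(tree, {"cloudy": True, "humid": True}))
for conditions, label in rules(tree):
    print(" AND ".join(f"{f}={v}" for f, v in conditions), "=>", label)
```

A learned tree has the same structure; the difference is that an algorithm chooses the features and split conditions from data instead of a person writing them down.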
A case study: Influence of Dimension Reduction on regression trees-based Algorithms - Predicting Aeronautics Loads of a Derivative Aircraft
Fournier, Edouard, Grihon, Stéphane, Klein, Thierry
In the aircraft industry, market needs evolve quickly in a highly competitive context. This requires adapting a given aircraft model in minimum time, considering for example an increase of range or of the number of passengers, such as the A330 family in [1]. In our case study, variants concern the maximum takeoff weight of a given aircraft model. Depending on the configuration, the computation of loads and stress, as defined in [13, 12], to resize the airframe is on the critical path of this aircraft variant definition: this is a time-consuming (approximately a year for a new aircraft variant) and costly process, one of the reasons being the high dimensionality and the large amount of data. Big Data approaches such as those defined by [11] are mandatory to improve the speed, the data value extraction and the responsiveness of the overall process. This study was carried out during a proof of value sprint project within Airbus to demonstrate the usefulness of statistics and machine learning approaches in the Engineering field. In a previous internal project, it was shown that the family of regression trees [5] works well to predict loads for different aircraft missions in an interpolation context. Thus, we can formulate our problem in this way: is it possible to use dimension reduction and regression tree-based algorithms to predict loads in an extrapolation context (i.e. outside the design space of a certain weight variant) to improve the current process?
Response to Comment on "Predicting reaction performance in C-N cross-coupling using machine learning"
We demonstrate that the chemical-feature model described in our original paper is distinguishable from the nongeneralizable models introduced by Chuang and Keiser. Furthermore, the chemical-feature model significantly outperforms these models in out-of-sample predictions, justifying the use of chemical featurization from which machine learning models can extract meaningful patterns in the dataset, as originally described. In Ahneman et al. (1), we showed that a random forest (RF) algorithm built using computationally derived chemical descriptors for the components of a Pd-catalyzed C–N cross-coupling reaction (aryl halide, ligand, base, and potentially inhibitory isoxazole additive) could identify predictive and meaningful relationships in a multidimensional chemical dataset comprising 4608 reactions. Chuang and Keiser (2) built alternative models using random barcode features ("straw" models), wherein the chemical descriptors are replaced with random numbers selected from a standard normal distribution. One-hot encoded features, wherein each reagent acts as a categorical descriptor and is marked as absent or present, were also evaluated.
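The two alternative featurizations described above can be sketched as follows. This is a minimal illustration under stated assumptions: the reagent names are placeholders rather than the compounds from the study, the barcode length of 4 is arbitrary, and each reagent is assumed to keep one fixed random vector across all reactions in which it appears. The helper names `one_hot_tables`, `barcode_tables`, and `featurize` are hypothetical.

```python
import random

# Hypothetical reagent lists; placeholder names, not the compounds from the paper.
COMPONENTS = {
    "aryl_halide": ["halide_A", "halide_B", "halide_C"],
    "ligand": ["ligand_A", "ligand_B"],
    "base": ["base_A", "base_B"],
    "additive": ["additive_A", "additive_B"],
}

def one_hot_tables(components):
    """Map each reagent to an absent/present indicator vector within its component."""
    tables = {}
    for component, names in components.items():
        tables[component] = {
            name: [1.0 if i == j else 0.0 for j in range(len(names))]
            for i, name in enumerate(names)
        }
    return tables

def barcode_tables(components, n_features, rng):
    """Map each reagent to a fixed vector of standard normal draws ("random barcode")."""
    return {
        component: {
            name: [rng.gauss(0.0, 1.0) for _ in range(n_features)] for name in names
        }
        for component, names in components.items()
    }

def featurize(reaction, tables):
    """Concatenate the per-component vectors of one reaction into a feature row."""
    row = []
    for component in COMPONENTS:
        row.extend(tables[component][reaction[component]])
    return row

rng = random.Random(0)
one_hot = one_hot_tables(COMPONENTS)
barcode = barcode_tables(COMPONENTS, n_features=4, rng=rng)

reaction = {
    "aryl_halide": "halide_B",
    "ligand": "ligand_A",
    "base": "base_B",
    "additive": "additive_A",
}
one_hot_row = featurize(reaction, one_hot)
barcode_row = featurize(reaction, barcode)
print(one_hot_row)
print(len(barcode_row))
```

Both encodings identify which reagents are present but carry no chemical information, which is why they can fit reactions built from reagents seen in training yet offer no basis for predicting reactions with unseen reagents.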