Decision Tree Learning
Deep Mining Generation of Lung Cancer Malignancy Models from Chest X-ray Images
Lung cancer is the leading cause of cancer death and morbidity worldwide. Many studies have shown machine learning models to be effective in detecting lung nodules from chest X-ray images. However, these techniques have yet to be embraced by the medical community due to several practical, ethical, and regulatory constraints stemming from the “black-box” nature of deep learning models. Additionally, most lung nodules visible on chest X-rays are benign; therefore, the narrow task of computer vision-based lung nodule detection cannot be equated to automated lung cancer detection. Addressing both concerns, this study introduces a novel hybrid deep learning and decision tree-based computer vision model, which presents lung cancer malignancy predictions as interpretable decision trees. The deep learning component of this process is trained using a large publicly available dataset on pathological biomarkers associated with lung cancer. These models are then used to infer biomarker scores for chest X-ray images from two independent datasets for which malignancy metadata is available. Next, multivariate predictive models are mined by fitting shallow decision trees to the malignancy-stratified datasets and interrogating a range of metrics to determine the best model. The best decision tree model achieved sensitivity and specificity of 86.7% and 80.0%, respectively, with a positive predictive value of 92.9%. Decision trees mined using this method may be considered as a starting point for refinement into clinically useful multivariate lung cancer malignancy models for implementation as a workflow augmentation tool to improve the efficiency of human radiologists.
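For readers unfamiliar with the three reported metrics, the following minimal sketch shows how sensitivity, specificity, and positive predictive value follow from a confusion matrix. The counts are hypothetical: they were chosen only because they give percentages of the same size as those quoted, and the study's actual case counts are not stated here. The function name diagnostic_metrics is likewise an illustration.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, positive predictive value)."""
    sensitivity = tp / (tp + fn)  # share of malignant cases that were flagged
    specificity = tn / (tn + fp)  # share of benign cases that were cleared
    ppv = tp / (tp + fp)          # share of positive calls that were malignant
    return sensitivity, specificity, ppv


# Hypothetical counts for illustration only, not taken from the study.
sens, spec, ppv = diagnostic_metrics(tp=13, fn=2, tn=4, fp=1)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} ppv={ppv:.1%}")
```

Note that positive predictive value depends on how common malignancy is in the evaluated sample, so it does not transfer directly to a population with a different prevalence.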
Learning Data Science: Predictive Maintenance with Decision Trees
Predictive maintenance is one of the big revolutions happening across all major industries right now. Instead of replacing parts on a fixed schedule, or only after they have failed, it uses machine learning methods to predict when a part is going to fail. If you want to get an introduction to this fascinating developing area, read on! Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to estimate when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
Utility Assessment of Synthetic Data Generation Methods
Khan, Md Sakib Nizam, Reje, Niklas, Buchegger, Sonja
Big data analysis poses the dual problem of privacy preservation and utility, i.e., how accurate data analyses remain after transforming original data in order to protect the privacy of the individuals that the data is about - and whether they are accurate enough to be meaningful. In this paper, we thus investigate across several datasets whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not known yet), how closely their results conform to analyses on original data a posteriori, and whether these two effects are correlated. We find some methods (decision-tree based) to perform better than others across the board, sizeable effects of some choices of imputation parameters (notably the number of released datasets), no correlation between broad utility metrics and analysis accuracy, and varying correlations for narrow metrics. We did get promising findings for classification tasks when using synthetic data for training machine learning models, which we consider worth exploring further also in terms of mitigating privacy attacks against ML models such as membership inference and model inversion.
SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems
Iosipoi, Leonid, Vakhrushev, Anton
Gradient Boosted Decision Tree (GBDT) is a widely-used machine learning algorithm that has been shown to achieve state-of-the-art results on many standard data science problems. We are interested in its application to multioutput problems when the output is highly multidimensional. Although there are highly effective GBDT implementations, their scalability to such problems is still unsatisfactory. In this paper, we propose novel methods aiming to accelerate the training process of GBDT in the multioutput scenario. The idea behind these methods lies in the approximate computation of a scoring function used to find the best split of decision trees. These methods are implemented in SketchBoost, which itself is integrated into our easily customizable Python-based GPU implementation of GBDT called Py-Boost. Our numerical study demonstrates that SketchBoost speeds up the training process of GBDT by up to over 40 times while achieving comparable or even better performance.
High-Order Optimization of Gradient Boosted Decision Trees
Pachebat, Jean, Ivanov, Sergei
Gradient Boosted Decision Trees (GBDTs) are dominant machine learning algorithms for modeling discrete or tabular data. Unlike neural networks with millions of trainable parameters, GBDTs optimize a loss function in an additive manner and have a single trainable parameter per leaf, which makes it easy to apply high-order optimization of the loss function. In this paper, we introduce high-order optimization for GBDTs based on numerical optimization theory, which allows us to construct trees based on high-order derivatives of a given loss function. In the experiments, we show that high-order optimization has faster per-iteration convergence, which leads to reduced running time. Our solution can be easily parallelized and run on GPUs with little overhead on the code. Finally, we discuss potential future improvements such as automatic differentiation of arbitrary loss functions and the combination of GBDTs with neural networks.
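Because each leaf carries a single trainable value, a derivative-based update reduces to a scalar formula. The sketch below shows the familiar second-order (Newton) leaf value for logistic loss, which is the lowest-order case of the idea; it is not the paper's higher-order method, whose details are not given here. The function names and the reg_lambda regularization term are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def newton_leaf_value(labels, scores, reg_lambda=1.0):
    """One Newton step for the single value of a leaf under logistic loss.

    labels: 0/1 targets of the samples that fall in the leaf
    scores: current raw model outputs for those samples
    reg_lambda: L2 regularization on the leaf value (an assumption)
    """
    # First derivative of the logistic loss with respect to the raw score.
    grads = [sigmoid(f) - y for y, f in zip(labels, scores)]
    # Second derivative of the logistic loss with respect to the raw score.
    hess = [sigmoid(f) * (1.0 - sigmoid(f)) for f in scores]
    return -sum(grads) / (sum(hess) + reg_lambda)


# Three samples in one leaf, all starting from a raw score of 0.
leaf_value = newton_leaf_value(labels=[1, 1, 0], scores=[0.0, 0.0, 0.0])
print(leaf_value)
```

With two positives and one negative in the leaf, the update is positive, pushing the raw score toward the majority class.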
Explaining Random Forests using Bipolar Argumentation and Markov Networks (Technical Report)
Potyka, Nico, Yin, Xiang, Toni, Francesca
Random forests (RFs) [Bre01] are machine learning models with various applications in areas like e-commerce, finance and medicine. They consist of multiple decision trees that use different subsets of the available features. Given an input, every tree makes an individual decision and the output of the random forest is obtained by a majority vote. They have a low risk of overfitting, support both classification and regression tasks, and come equipped with some feature importance measures [Bre01]. However, feature importance measures can be too simplistic, as they can represent neither joint effects of features (e.g., multi-drug interactions) nor non-monotonicity (e.g., increasing the weight may be healthy for an underweight person, but not for an overweight person). In recent years, a variety of other explanation methods has been proposed. Model-agnostic feature importance measures like LIME [RSG16], SHAP [LL17] and MAPLE [PMT18] have limitations similar to those of the feature importance measures defined for random forests.
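The majority vote described above can be sketched in a few lines. This is a toy illustration, not the report's implementation: the three hand-written "trees", their thresholds, and the feature names are all hypothetical, and each tree is reduced to a single test on a different feature.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class predicted by most trees for input x."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-split trees, each looking at a different feature.
trees = [
    lambda x: "approve" if x["income"] > 50 else "reject",
    lambda x: "approve" if x["debt"] < 20 else "reject",
    lambda x: "approve" if x["age"] >= 25 else "reject",
]

applicant = {"income": 60, "debt": 30, "age": 30}
print(forest_predict(trees, applicant))  # two of three trees vote "approve"
```

Using an odd number of trees for a two-class problem avoids ties; with ties, this sketch returns whichever tied class was voted first.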
Diabetes Prediction using Machine Learning, Java, and GridDB
This article covers diabetes, a health care concern that shapes the lifestyle of many people worldwide, and the use of machine learning models to create a predictive system. The model will use a random forest to predict whether patients have diabetes. The article will outline the requirements needed to set up our database, GridDB. Following that, we will briefly describe our dataset and model.
A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System
This is the first in a series of posts that illustrate what our data team is up to, experimenting with, and building 'under the hood' at CitizenNet. He has been involved in web-scale machine learning and information retrieval for over 10 years. One of the first posts we published spoke at a high level of the technical problem CitizenNet is trying to solve. In essence, we are trying to predict what combinations of demographic and interest targets will be interested in some piece of content. On the CitizenNet platform, a user would create a project that would define (broadly) the target audience, the pieces of Facebook content they are looking to promote, and other campaign and financial information. Behind the scenes, a robust prediction system builds the targets for the project.
Concept-based Explanations using Non-negative Concept Activation Vectors and Decision Tree for CNN Models
This paper evaluates whether training a decision tree based on concepts extracted from a concept-based explainer can increase interpretability for Convolutional Neural Network (CNN) models and boost the fidelity and performance of the explainer used. CNNs for computer vision have shown exceptional performance in critical industries. However, their complexity and lack of interpretability are a significant barrier to deploying them. Recent studies to explain computer vision models have shifted from extracting low-level features (pixel-based explanations) to mid- or high-level features (concept-based explanations). The current research direction tends to use extracted features in developing approximation algorithms such as linear or decision tree models to interpret an original model. In this work, we modify one of the state-of-the-art concept-based explanations and propose an alternative framework named TreeICE. We design a systematic evaluation based on the requirements of fidelity (agreement of the approximate model with the original model's labels), performance (agreement of the approximate model with ground-truth labels), and interpretability (how meaningful the approximate model is to humans). We conduct a computational evaluation (for fidelity and performance) and human subject experiments (for interpretability). We find that TreeICE outperforms the baseline in interpretability and generates more human-readable explanations in the form of a semantic tree structure. This work highlights how important it is to have more understandable explanations when interpretability is crucial.
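The difference between fidelity and performance is only the reference that the approximate model is compared against. A minimal sketch, assuming both are measured as a simple agreement rate (the paper may use other measures) and using made-up labels:

```python
def agreement(predictions, reference):
    """Fraction of positions where predictions match the reference labels."""
    if len(predictions) != len(reference):
        raise ValueError("label lists must have the same length")
    return sum(p == r for p, r in zip(predictions, reference)) / len(reference)


# Made-up labels for five images, for illustration only.
cnn_labels = ["cat", "dog", "dog", "cat", "dog"]        # original model's output
true_labels = ["cat", "dog", "cat", "cat", "dog"]       # ground truth
surrogate_labels = ["cat", "dog", "dog", "dog", "dog"]  # decision tree's output

fidelity = agreement(surrogate_labels, cnn_labels)      # surrogate vs. original model
performance = agreement(surrogate_labels, true_labels)  # surrogate vs. ground truth
print(fidelity, performance)
```

A surrogate can score higher on fidelity than on performance, as here, when it faithfully reproduces the original model's mistakes.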
Decision Trees (the upside-down trees)
Taking SpongeBob SquarePants' mood as an example, based on historical data, there are two factors that affect whether he is happy or upset. So, if the whole mood table of Mr. SquarePants is visualized as a scatter plot, it will look like the following: Decision trees are used in classification problems to find one or more lines that separate the data as cleanly as possible. The separation process is done by measuring the homogeneity, or similarity, of the data on each side of a line. To separate the data, a vertical line is drawn between happy SpongeBob and upset SpongeBob, taking into consideration one feature only, which is the number of jellyfish that were hunted. A vertical line at the number 10 on the x-axis can work as a separator: if the number of jellyfish hunted is less than 10, SpongeBob is upset; otherwise, he is happy.
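The single vertical line at 10 jellyfish can be written as a one-rule classifier, and the "homogeneity" of a split can be scored numerically. The sketch below uses Gini impurity as that score, which is an assumption (the post does not name a measure here), and the mood table is invented for illustration.

```python
def gini(moods):
    """Gini impurity of a list of 'happy'/'upset' labels (0.0 means pure)."""
    if not moods:
        return 0.0
    p_happy = sum(1 for m in moods if m == "happy") / len(moods)
    return 1.0 - p_happy ** 2 - (1.0 - p_happy) ** 2


def split_impurity(samples, threshold):
    """Size-weighted impurity of the two sides of a vertical line."""
    left = [mood for count, mood in samples if count < threshold]
    right = [mood for count, mood in samples if count >= threshold]
    n = len(samples)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def predict_mood(jellyfish_hunted, threshold=10):
    """The one-rule tree: fewer than `threshold` jellyfish means upset."""
    return "upset" if jellyfish_hunted < threshold else "happy"


# Invented (jellyfish hunted, mood) records, not SpongeBob's real table.
samples = [(3, "upset"), (6, "upset"), (8, "upset"),
           (11, "happy"), (14, "happy"), (17, "happy")]

print(split_impurity(samples, 10))  # both sides are pure
print(split_impurity(samples, 5))   # the right side mixes both moods
print(predict_mood(7), predict_mood(12))
```

A tree learner tries many candidate thresholds and keeps the one with the lowest weighted impurity; on this invented table, a line at 10 separates the two moods perfectly.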