Decision Tree Learning
A review on longitudinal data analysis with random forest in precision medicine
Hu, Jianchang, Szymczak, Silke
Precision medicine provides customized treatments to patients based on their characteristics and is a promising approach to improving treatment efficiency. Large-scale omics data are useful for patient characterization, but often their measurements change over time, leading to longitudinal data. Random forest is one of the state-of-the-art machine learning methods for building prediction models, and can play a crucial role in precision medicine. In this paper, we review extensions of the standard random forest method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate responses and further categorize the repeated measurements according to whether the time effect is relevant. Information on available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions.
Machine Learning and Bioinformatics for Diagnosis Analysis of Obesity Spectrum Disorders
Globally, the number of obese patients has doubled due to sedentary lifestyles and improper dieting. This tremendous increase has affected human genetics and health. According to the World Health Organization, life expectancy dropped from 80 to 75 years, as obese people struggle with various chronic diseases. This report addresses the problems of obesity in children and adults, using ML datasets to feature, predict, and analyze the causes of obesity. Using neural networks, we will explore neural control using diffusion tensor imaging to consider the body fat, BMI, and waist-to-hip circumference ratio of obese patients. To predict the present and future causes of obesity with ML, we will discuss ML techniques such as decision trees, SVM, RF, GBM, LASSO, BN, and ANN, and use datasets to implement the stated algorithms. Theoretical literature from experts and ML and bioinformatics experiments will be outlined in this report, along with recommendations on how to advance ML for predicting obesity and other chronic diseases.
The New Machine Learning Specialization : in-depth review
The lectures start by defining decision trees and the splitting criteria, then cover different uses of the tree, such as applying the algorithm to categorical features, splitting on continuous features, and using trees for regression problems. They then explain how to combine multiple trees and use ensemble learning to build a Random Forest. In the last lecture we get a glimpse of XGBoost and how to use it, without any more details. This is probably the most hyped part of the whole specialization; I found many people celebrating that this introductory course would discuss such topics.
Automated fault tree learning from continuous-valued sensor data: a case study on domestic heaters
Verkuil, Bart, Budde, Carlos E., Bucur, Doina
Many industrial sectors have been collecting big sensor data. With recent technologies for processing big data, companies can exploit this for automatic failure detection and prevention. We propose the first completely automated method for failure analysis, machine-learning fault trees from raw observational data with continuous variables. Our method scales well and is tested on a real-world, five-year dataset of domestic heater operations in The Netherlands, with 31 million unique heater-day readings, each containing 27 sensor and 11 failure variables. Our method builds on two previous procedures: the C4.5 decision-tree learning algorithm, and the LIFT fault tree learning algorithm from Boolean data. C4.5 pre-processes each continuous variable: it learns an optimal numerical threshold which distinguishes between faulty and normal operation of the top-level system. These thresholds discretise the variables, thus allowing LIFT to learn fault trees which model the root failure mechanisms of the system and are explainable. We obtain fault trees for the 11 failure variables, and evaluate them in two ways: quantitatively, with a significance score, and qualitatively, with domain specialists. Some of the fault trees learnt have almost maximum significance (above 0.95), while others have medium-to-low significance (around 0.30), reflecting the difficulty of learning from big, noisy, real-world sensor data. The domain specialists confirm that the fault trees model meaningful relationships among the variables.
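The thresholding step described above can be illustrated with a minimal sketch. It assumes plain information gain as the split criterion (C4.5 itself adds refinements such as gain ratio), and the function names and the temperature readings are hypothetical, not taken from the paper or its dataset.

```python
from math import log2


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def best_threshold(values, labels):
    """Return (threshold, gain) for the midpoint split with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between identical readings
        t = (lo + hi) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain


# Made-up sensor readings and failure flags, for illustration only.
temps = [61.0, 63.5, 64.0, 70.0, 82.0, 85.5, 88.0, 90.0]
fault = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, gain = best_threshold(temps, fault)
# Discretise the continuous variable into a Boolean one, as a fault tree learner would need.
boolean_feature = [t > threshold for t in temps]
```

Each continuous sensor variable would be thresholded this way before the Boolean columns are passed to a fault tree learner such as LIFT.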
Machine Learning Pipelines
In this use case, we will be using the Titanic dataset. In this dataset, we will apply some common Transformers on certain columns and then we will use a Decision Tree Estimator to classify whether the passenger will live or die. Here is the plan outline for our use case. To make our use case easy to understand, let us see the diagram below. This will give you a fairly good understanding of the pipeline visually.
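A pipeline of this kind might look like the sketch below. The excerpt does not say which library it uses, so scikit-learn and pandas are assumptions here, and the rows are made-up values under Titanic-style column names rather than the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy rows with Titanic-style column names; the values are invented for illustration.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, 54.0],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Transformers applied to specific columns; untouched columns pass through unchanged.
preprocess = ColumnTransformer(
    [
        ("sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
        ("age", SimpleImputer(strategy="median"), ["Age"]),
    ],
    remainder="passthrough",
)

# The estimator is the final stage of the pipeline.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
])

features = passengers.drop(columns="Survived")
pipeline.fit(features, passengers["Survived"])
predictions = pipeline.predict(features)
```

Chaining the transformers and the estimator in one object means the same preprocessing is applied at training and prediction time.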
Accelerated and interpretable oblique random survival forests
Jaeger, Byron C., Welden, Sawyer, Lenoir, Kristin, Speiser, Jaime L., Segar, Matthew W., Pandey, Ambarish, Pajewski, Nicholas M.
The oblique random survival forest (RSF) is an ensemble supervised learning method for right-censored outcomes. Trees in the oblique RSF are grown using linear combinations of predictors to create branches, whereas in the standard RSF, a single predictor is used. Oblique RSF ensembles often have higher prediction accuracy than standard RSF ensembles. However, assessing all possible linear combinations of predictors induces significant computational overhead that limits applications to large-scale data sets. In addition, few methods have been developed for interpretation of oblique RSF ensembles, and they remain more difficult to interpret compared to their axis-based counterparts. We introduce a method to increase computational efficiency of the oblique RSF and a method to estimate importance of individual predictor variables with the oblique RSF. Our strategy to reduce computational overhead makes use of Newton-Raphson scoring, a classical optimization technique that we apply to the Cox partial likelihood function within each non-leaf node of decision trees. We estimate the importance of individual predictors for the oblique RSF by negating each coefficient used for the given predictor in linear combinations, and then computing the reduction in out-of-bag accuracy. In general benchmarking experiments, we find that our implementation of the oblique RSF is approximately 450 times faster with equivalent discrimination and superior Brier score compared to existing software for oblique RSFs. We find in simulation studies that 'negation importance' discriminates between relevant and irrelevant predictors more reliably than permutation importance, Shapley additive explanations, and a previously introduced technique to measure variable importance with oblique RSFs based on analysis of variance. Methods introduced in the current study are available in the aorsf R package.
Estimating a Book's Publication Date with Artificial Intelligence
You're probably aware of AI's increasing ability to analyze and synthesize human language, such as the recent controversy over whether a Google chatbot is, in fact, sentient (Google claims -- and I'm inclined to believe -- that the chatbot is just very, very good at recognizing and replicating speech patterns). Since AI is so skilled at analyzing language, I wondered whether it could detect changes in language over time. Could it differentiate between texts written in, say, the 12th century and the 18th century? As it turns out, it can! To build this model, I used natural language processing, the branch of machine learning dedicated to (you guessed it!)
ANOVA-based Automatic Attribute Selection and a Predictive Model for Heart Disease Prognosis
Chowdhury, Mohammed Nowshad Ruhani, Zhang, Wandong, Akilan, Thangarajah
Studies show that cardiovascular diseases (CVDs) are malignant for human health. Thus, it is important to have an efficient way of CVD prognosis. In response to this, the healthcare industry has adopted machine learning-based smart solutions to alleviate the manual process of CVD prognosis. Thus, this work proposes an information fusion technique that combines key attributes of a person through analysis of variance (ANOVA) and domain experts' knowledge. It also introduces a new collection of CVD data samples for emerging research. Thirty-eight experiments are conducted exhaustively to verify the performance of the proposed framework on four publicly available benchmark datasets and the newly created dataset in this work. The ablation study shows that the proposed approach can achieve a competitive mean average accuracy (mAA) of 99.2% and a mean average AUC of 97.9%.
SHAP for additively modeled features in a boosted trees model
An important technique to explore a black-box machine learning (ML) model is called SHAP (SHapley Additive exPlanation). SHAP values decompose predictions into contributions of the features in a fair way. We will show that for a boosted trees model with some or all features being additively modeled, the SHAP dependence plot of such a feature corresponds to its partial dependence plot up to a vertical shift. We illustrate the result with XGBoost.
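The vertical-shift claim can be checked numerically on a small example. The sketch below is not the XGBoost illustration from the post: it assumes a hypothetical hand-written model in which feature 0 enters additively, and it computes exact Shapley values by brute force with an interventional value function (averaging over background rows) instead of TreeSHAP. All names and data are made up.

```python
from itertools import combinations
from math import factorial


def model(x):
    # Hypothetical model: feature 0 is additive, features 1 and 2 interact.
    return x[0] ** 2 + x[1] * x[2]


def value(x, subset, background):
    """Mean prediction with features in `subset` fixed to x and the rest taken from background rows."""
    total = 0.0
    for row in background:
        z = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += model(z)
    return total / len(background)


def shapley(x, j, background):
    """Exact Shapley value of feature j at point x, enumerating all coalitions."""
    n = len(x)
    others = [i for i in range(n) if i != j]
    phi = 0.0
    for size in range(n):
        for combo in combinations(others, size):
            s = set(combo)
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi += weight * (value(x, s | {j}, background) - value(x, s, background))
    return phi


def partial_dependence(v, j, background):
    """Mean prediction when feature j is set to v in every background row."""
    total = 0.0
    for row in background:
        z = list(row)
        z[j] = v
        total += model(z)
    return total / len(background)


background = [(0.0, 1.0, 2.0), (1.0, -1.0, 0.5), (2.0, 0.0, 3.0), (-1.0, 2.0, 1.0)]

# Difference between partial dependence and SHAP value of feature 0 at each point.
shifts = [partial_dependence(x[0], 0, background) - shapley(x, 0, background)
          for x in background]
```

In this setup every entry of `shifts` equals the mean prediction over the background rows, so the two curves for feature 0 differ only by a constant.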
Classification of FIB/SEM-tomography images for highly porous multiphase materials using random forest classifiers
Osenberg, Markus, Hilger, André, Neumann, Matthias, Wagner, Amalia, Bohn, Nicole, Binder, Joachim R., Schmidt, Volker, Banhart, John, Manke, Ingo
FIB/SEM tomography represents an indispensable tool for the characterization of three-dimensional nanostructures in battery research and many other fields. However, contrast and 3D classification/reconstruction problems occur in many cases, which strongly limit the applicability of the technique, especially on porous materials like those used for electrode materials in batteries or fuel cells. Distinguishing the different components, like active Li storage particles and carbon/binder materials, is difficult and often prevents a reliable quantitative analysis of image data, or may even lead to wrong conclusions about structure-property relationships. In this contribution, we present a novel approach for data classification in three-dimensional image data obtained by FIB/SEM tomography and its applications to NMC battery electrode materials. We use two different image signals, namely the signal of the angled SE2 chamber detector and the Inlens detector signal, combine both signals and train a random forest, i.e. a particular machine learning algorithm. We demonstrate that this approach can overcome current limitations of existing techniques suitable for multi-phase measurements and that it allows for quantitative data reconstruction even where current state-of-the-art techniques fail or demand large training sets. This approach may serve as a guideline for future research using FIB/SEM tomography.