Decision Tree Learning
Self-Updating Models with Error Remediation
Doak, Justin E., Smith, Michael R., Ingram, Joey B.
Many environments currently employ machine learning models for data processing and analytics that were built using a limited number of training data points. Once deployed, the models are exposed to significant amounts of previously-unseen data, not all of which is representative of the original, limited training data. However, updating these deployed models can be difficult due to logistical, bandwidth, time, hardware, and/or data sensitivity constraints. We propose a framework, Self-Updating Models with Error Remediation (SUMER), in which a deployed model updates itself as new data becomes available. SUMER uses techniques from semi-supervised learning and noise remediation to iteratively retrain a deployed model using intelligently-chosen predictions from the model as the labels for new training iterations. A key component of SUMER is the notion of error remediation, as self-labeled data can be susceptible to the propagation of errors. We investigate the use of SUMER across various data sets and iterations. We find that self-updating models (SUMs) generally perform better than models that do not attempt to self-update when presented with additional previously-unseen data. This performance gap is accentuated in cases where there are only limited amounts of initial training data. We also find that the performance of SUMER is generally better than the performance of SUMs, demonstrating a benefit in applying error remediation. Consequently, SUMER can autonomously enhance the operational capabilities of existing data processing systems by intelligently updating models in dynamic environments.
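The self-updating idea in the abstract above can be illustrated with a minimal self-training sketch. This is not the SUMER algorithm itself: the nearest-centroid model, the margin-based confidence filter, the `min_margin` and `rounds` parameters, and the "recompute every self-label each round" step (a crude stand-in for error remediation) are all assumptions made for illustration.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Fit a 1-D nearest-centroid model: {label: mean of its points}."""
    return {c: mean(p for p, l in zip(points, labels) if l == c)
            for c in set(labels)}

def predict_with_margin(centroids, x):
    """Return (nearest label, margin); needs at least two classes.

    The margin is the distance to the runner-up centroid minus the
    distance to the nearest one, used here as a confidence score.
    """
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def self_update(seed_x, seed_y, stream, min_margin=1.0, rounds=3):
    """Retrain on trusted seed data plus confidently self-labeled points."""
    seed_x, seed_y = list(seed_x), list(seed_y)
    self_x, self_y = [], []
    for _ in range(rounds):
        centroids = fit_centroids(seed_x + self_x, seed_y + self_y)
        # Assumed remediation step: discard all earlier self-labels and
        # recompute them with the current model, so that an early mistake
        # is not locked in for later rounds.
        self_x, self_y = [], []
        for x in stream:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:  # keep only confident predictions
                self_x.append(x)
                self_y.append(label)
    return fit_centroids(seed_x + self_x, seed_y + self_y)

# Two labeled seed points, five unlabeled points seen after deployment.
model = self_update([0.0, 10.0], ["a", "b"], [1.0, 2.0, 8.0, 9.0, 5.1])
```

In this toy run the ambiguous point 5.1 never clears the margin threshold, so it is never used as a training label, while the four clear points pull the centroids toward the observed data.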
Interpretable random forest models through forward variable selection
Velthoen, Jasper, Cai, Juan-Juan, Jongbloed, Geurt
Random forest is a popular prediction approach for handling high-dimensional covariates. However, it often becomes infeasible to interpret the resulting high-dimensional, non-parametric model. Aiming to obtain an interpretable predictive model, we develop a forward variable selection method using the continuous ranked probability score (CRPS) as the loss function. Our stepwise procedure leads to the smallest set of variables that optimizes the CRPS risk by performing, at each step, a hypothesis test for a significant decrease in CRPS risk. We provide mathematical motivation for our method by proving that, in the population sense, the method attains the optimal set. Additionally, we show that the test is consistent provided that the random forest estimator of a quantile function is consistent. In a simulation study, we compare the performance of our method with an existing variable selection method for different sample sizes and different correlation strengths of the covariates. Our method is observed to have a much lower false positive rate. We also demonstrate an application of our method to statistical post-processing of daily maximum temperature forecasts in the Netherlands. Our method selects about 10% of the covariates while retaining the same predictive power.
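The two ingredients named in the abstract above, the CRPS and a stepwise forward search, can be sketched as follows. The CRPS function uses the standard ensemble form E|X - y| - 0.5 E|X - X'|. The selection loop is only a rough stand-in: it replaces the paper's hypothesis test with a fixed `min_gain` threshold, and `risk`, `toy_risk`, and the variable names are hypothetical.

```python
def crps_ensemble(samples, y):
    """Empirical CRPS of an ensemble forecast for one observation y."""
    n = len(samples)
    spread_to_obs = sum(abs(s - y) for s in samples) / n
    internal_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_obs - 0.5 * internal_spread

def forward_select(candidates, risk, min_gain=0.0):
    """Greedy forward selection.

    `risk(variables)` is a caller-supplied function returning the
    estimated risk of a model fitted on `variables`. A variable is
    added only if it lowers the risk by more than `min_gain`
    (an assumed threshold, not the paper's hypothesis test).
    """
    selected, remaining = [], list(candidates)
    best_risk = risk(selected)
    while remaining:
        new_risk, var = min((risk(selected + [v]), v) for v in remaining)
        if best_risk - new_risk <= min_gain:
            break
        selected.append(var)
        remaining.remove(var)
        best_risk = new_risk
    return selected, best_risk

def toy_risk(variables):
    """Hypothetical risk: x1 and x2 are informative, x3 is pure noise."""
    gains = {"x1": 0.5, "x2": 0.3, "x3": 0.0}
    return 1.0 - sum(gains[v] for v in variables)

selected, final_risk = forward_select(["x1", "x2", "x3"], toy_risk, min_gain=0.01)
```

In a real setting `risk` would refit the forest on the candidate set and average `crps_ensemble` over held-out observations; the toy function only makes the stopping behaviour easy to follow.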
Adversarial Robustness Toolbox v1.2 releases: crafting and analysis of attacks and defense methods for machine learning models • Penetration Testing
Adversarial Robustness 360 Toolbox (ART) is a Python library that supports developers and researchers in defending Machine Learning models (Deep Neural Networks, Gradient Boosted Decision Trees, Support Vector Machines, Random Forests, Logistic Regression, Gaussian Processes, Decision Trees, Scikit-learn Pipelines, etc.) against adversarial threats and helps make AI systems more secure and trustworthy. Machine Learning models are vulnerable to adversarial examples, which are inputs (images, texts, tabular data, etc.) deliberately modified to produce a desired response by the Machine Learning model. ART provides the tools to build and deploy defenses and test them with adversarial attacks. Defending Machine Learning models involves certifying and verifying model robustness and model hardening with approaches such as pre-processing inputs, augmenting training data with adversarial samples, and leveraging runtime detection methods to flag any inputs that might have been modified by an adversary. The attacks implemented in ART make it possible to craft adversarial examples against Machine Learning models, which is required to test defenses with state-of-the-art threat models.
Why White-Box Models in Enterprise Data Science Work More Efficiently
Data science is the current powerhouse for organizations, turning mountains of data into actionable business insights that impact every part of the business, including customer experience, revenue, operations, risk management and other functions. Data science has the potential to dramatically accelerate digital transformation initiatives, delivering greater performance and advantages over the competition. However, not all data science platforms and methodologies are created equal. The ability to use data science to make predictions and take decisions that optimize business outcomes requires transparency and accountability. There are several underlying factors, such as trust, confidence in the prediction and an understanding of how the technology works, but fundamentally it comes down to whether the platform uses a black-box or white-box model approach.
JigSaw: A tool for discovering explanatory high-order interactions from random forests
Machine learning is revolutionizing biology by facilitating the prediction of outcomes from complex patterns found in massive data sets. Large biological data sets, like those generated by transcriptome or microbiome studies, measure many relevant components that interact in vivo with one another in modular ways. Identifying the high-order interactions that machine learning models use to make predictions would facilitate the development of hypotheses linking combinations of measured components to outcome. By using the structure of random forests, a new algorithmic approach, termed JigSaw, was developed to aid in the discovery of patterns that could explain predictions made by the forest. By examining the patterns of individual decision trees, JigSaw identifies high-order interactions between measured features that are strongly associated with a particular outcome and identifies the relevant decision thresholds. JigSaw's effectiveness was tested in simulation studies where it was able to recover multiple ground truth patterns, even in the presence of significant noise. It was then used to find patterns associated with outcomes in two real world data sets. It was first used to identify patterns in clinical measurements associated with heart disease. It was then used to find patterns associated with breast cancer using metabolites measured in the blood. In heart disease, JigSaw identified several three-way interactions that combine to explain most of the heart disease records (66%) with high precision (93%). In breast cancer, three two-way interactions were recovered that can be combined to explain almost all records (92%) with good precision (79%). JigSaw is an efficient method for exploring high-dimensional feature spaces for rules that explain statistical associations with a given outcome and can inspire the generation of testable hypotheses.
Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach
Sánchez-Hernández, Fernando, Ballesteros-Herráez, Juan Carlos, Kraiem, Mohamed S., Sánchez-Barba, Mercedes, Moreno-García, María N.
Early detection of patients vulnerable to infections acquired in the hospital environment is a challenge in current health systems given the impact that such infections have on patient mortality and healthcare costs. This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units by means of machine-learning methods. The aim is to support decision making addressed at reducing the incidence rate of infections. In this field, it is necessary to deal with the problem of building reliable classifiers from imbalanced datasets. We propose a clustering-based undersampling strategy to be used in combination with ensemble classifiers. A comparative study with data from 4616 patients was conducted in order to validate our proposal. We applied several single and ensemble classifiers both to the original dataset and to data preprocessed by means of different resampling methods. The results were analyzed by means of classic and recent metrics specifically designed for imbalanced data classification. They revealed that the proposal is more efficient in comparison with other approaches.
Tree-based Machine Learning Models for Handling Imbalanced Datasets
Recently, I have been working on a binary classification problem with an imbalanced dataset, where the ratio of positive class to negative class is around 1:4. Imbalanced classification problems are so commonplace that data enthusiasts would encounter them sooner or later. In this post, I will be sharing three tree-based Machine Learning Models that can help handle imbalanced datasets. The dataset that I am going to use to illustrate the effectiveness of algorithms is the credit card fraud dataset from Kaggle. This is an extremely imbalanced dataset: out of 284,807 transactions, there are only 492 frauds. Following the convention, we label the fraud class samples as positive class and normal transactions, negative class.
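To make the imbalance figures above concrete, here is a small sketch that computes the fraud rate and one common set of class weights from the counts quoted in the post. The formula n_samples / (n_classes * class_count) is the "balanced" heuristic used by several libraries. Whether the post's own models use it is not stated in this excerpt, so treat it as an assumption.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Counts from the post: 492 frauds out of 284,807 transactions.
counts = {"fraud": 492, "normal": 284_807 - 492}

fraud_rate = counts["fraud"] / sum(counts.values())
weights = balanced_class_weights(counts)

print(f"fraud rate: {fraud_rate:.4%}")
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights a misclassified fraud sample costs several hundred times more than a misclassified normal transaction, which is one way a tree-based learner can be discouraged from simply predicting the majority class.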
LIMEtree: Interactively Customisable Explanations Based on Local Surrogate Multi-output Regression Trees
Systems based on artificial intelligence and machine learning models should be transparent, in the sense of being capable of explaining their decisions to gain humans' approval and trust. While there are a number of explainability techniques that can be used to this end, many of them are only capable of outputting a single one-size-fits-all explanation that simply cannot address all of the explainees' diverse needs. In this work we introduce a model-agnostic and post-hoc local explainability technique for black-box predictions called LIMEtree, which employs surrogate multi-output regression trees. We validate our algorithm on a deep neural network trained for object detection in images and compare it against Local Interpretable Model-agnostic Explanations (LIME). Our method comes with local fidelity guarantees and can produce a range of diverse explanation types, including contrastive and counterfactual explanations praised in the literature. Some of these explanations can be interactively personalised to create bespoke, meaningful and actionable insights into the model's behaviour. While other methods may give an illusion of customisability by wrapping, otherwise static, explanations in an interactive interface, our explanations are truly interactive, in the sense of allowing the user to "interrogate" a black-box model. LIMEtree can therefore produce consistent explanations on which an interactive exploratory process can be built.
Machine Learning: An Introduction to Decision Trees
Machine Learning for trading is the new buzzword today, and some of the tech companies are doing wonderful, unimaginable things with it. Today, we're going to show you how you can predict stock movements (that is, either up or down) with the help of 'Decision Trees', one of the most commonly used ML algorithms. Decision trees in Machine Learning are used for building classification and regression models to be used in data mining and trading. A decision tree algorithm performs a set of recursive actions before it arrives at the end result, and when you plot these actions on a screen, the visual looks like a big tree, hence the name 'Decision Tree'. Basically, a decision tree is a flowchart to help you make decisions.
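As a rough illustration of the "flowchart" idea, the sketch below learns a single decision node (a stump) that predicts up or down from one feature. Gini impurity as the split criterion and the feature "yesterday's return" are assumptions for the example; the toy numbers are invented and are not market data. A full tree would apply the same search recursively to each side of the split.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Find the threshold on one feature that minimises weighted Gini.

    Candidate thresholds are midpoints between consecutive distinct
    feature values. Returns (threshold, weighted_impurity).
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Invented toy data: yesterday's return (%) and today's movement.
returns = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
moves = ["down", "down", "down", "up", "up", "up"]

threshold, impurity = best_split(returns, moves)
print(f"if yesterday's return <= {threshold}: predict down, else: predict up")
```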
Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach
Chivers, Benedict Delahaye, Wallbank, John, Cole, Steven J., Sebek, Ondrej, Stanley, Simon, Fry, Matthew, Leontidis, Georgios
Precipitation data collected at sub-hourly resolution represents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing available features for the training of machine learning algorithms increases performance with the integration of weather data at the target site with externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex nonlinear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.
Keywords: machine learning, data imputation, gradient boosted trees, environmental sensor networks, precipitation, soil moisture
1. Introduction
Precipitation data is of critical importance across multiple lines of enquiry, informing statistical models and analysis relating to weather forecasting, extreme weather events, climate change, water-resource management, droughts, flooding, agricultural impact, and hydroelectric power.
Historical rainfall data can reveal long term trends in environmental hydrological issues with real-time data input allowing for immediate forecasting of future conditions. Distributed networks of rain gauges are typically used to provide precipitation data at the earth's surface at varying temporal resolutions and can cover large geographical areas (Kidd, 2001). As is the case in many databases, particularly those utilising physical sensors, the problem of missing data arises. Missing data can be a result of sensor failure, data storage/transmission failure, or post-collection quality control procedures resulting in removal of identified problem data (Blenkinsop et al., 2017). Missing data in precipitation databases represents a serious limitation for the effective use of the data. Given the global scale and importance of precipitation and meteorological data (Sun et al., 2018), developing solutions to missing data is of paramount importance for maximising information gain.
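The two-step structure described in the abstract, (a) classify each missing sample as rain or non-rain and (b) regress an amount only for the samples predicted as rain, can be sketched as a simple composition of two models. The feature name `neighbour_mm` and both stand-in models below are hypothetical placeholders; the paper's trained gradient boosted models are not reproduced here.

```python
def impute_precipitation(samples, is_rain, rain_amount):
    """Two-step imputation for a list of feature dictionaries.

    Step (a): `is_rain(sample)` decides rain vs non-rain.
    Step (b): `rain_amount(sample)` is called only for predicted rain;
    non-rain samples are imputed as 0.0 mm.
    """
    return [rain_amount(s) if is_rain(s) else 0.0 for s in samples]

# Hypothetical stand-ins for trained models, driven by one invented
# feature: rainfall recorded at a neighbouring gauge in the same
# 30-minute interval.
def toy_classifier(sample):
    return sample["neighbour_mm"] > 0.1

def toy_regressor(sample):
    return 0.8 * sample["neighbour_mm"]

missing = [{"neighbour_mm": 0.0}, {"neighbour_mm": 2.0}, {"neighbour_mm": 0.05}]
imputed = impute_precipitation(missing, toy_classifier, toy_regressor)
```

Separating the two steps means the regressor only ever sees rain samples, which is one way to cope with the heavy imbalance between rain and non-rain intervals that the abstract points out.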