 Decision Tree Learning


Orange: A Handy Open-Source Tool for Creating Machine Learning Models - DZone AI

#artificialintelligence

In this tutorial, I will demonstrate Orange, a tool for machine learning. Orange is an extremely easy-to-use, lightweight, drag-and-drop tool. More importantly, it is open source! If you are an Anaconda user, then you can find it in the console as shown in the following image -- a pure, fresh orange wearing sunglasses with a smile. Orange is a platform built for creating machine learning pipelines on a GUI workflow.


Early Hospital Mortality Prediction using Vital Signals

arXiv.org Machine Learning

Early hospital mortality prediction is critical as intensivists strive to make efficient medical decisions about the severely ill patients staying in intensive care units. As a result, various methods have been developed to address this problem based on clinical records. However, some of the laboratory test results are time-consuming and need to be processed. In this paper, we propose a novel method to predict mortality using features extracted from the heart signals of patients within the first hour of ICU admission. In order to predict the risk, quantitative features have been computed based on the heart rate signals of ICU patients. Each signal is described in terms of 12 statistical and signal-based features. The extracted features are fed into eight classifiers: decision tree, linear discriminant, logistic regression, support vector machine (SVM), random forest, boosted trees, Gaussian SVM, and K-nearest neighborhood (K-NN). To derive insight into the performance of the proposed method, several experiments have been conducted using the well-known clinical dataset named Medical Information Mart for Intensive Care III (MIMIC-III). The experimental results demonstrate the capability of the proposed method in terms of precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The decision tree classifier satisfies both accuracy and interpretability better than the other classifiers, producing an F1-score and AUC equal to 0.91 and 0.93, respectively. It indicates that heart rate signals can be used for predicting mortality in patients in the ICU, achieving a comparable performance with existing predictions that rely on high dimensional features from clinical records which need to be processed and may contain missing information.
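The abstract above reports precision, recall, and F1-score. As a reminder of how these are defined for a binary outcome, here is a minimal sketch in plain Python. The function name and the toy labels are illustrative assumptions and are not taken from the paper or from MIMIC-III.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (hypothetical): 1 = died in hospital, 0 = survived.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```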


Evaluating Conditional Cash Transfer Policies with Machine Learning Methods

arXiv.org Machine Learning

This paper presents an out-of-sample prediction comparison between major machine learning models and the structural econometric model. Over the past decade, machine learning has established itself as a powerful tool in many prediction applications, but this approach is still not widely adopted in empirical economic studies. To evaluate the benefits of this approach, I use the most common machine learning algorithms, CART, C4.5, LASSO, random forest, and adaboost, to construct prediction models for a cash transfer experiment conducted by the Progresa program in Mexico, and I compare the prediction results with those of a previous structural econometric study. Two prediction tasks are performed in this paper: the out-of-sample forecast and the long-term within-sample simulation. For the out-of-sample forecast, both the mean absolute error and the root mean square error of the school attendance rates found by all machine learning models are smaller than those found by the structural model. Random forest and adaboost have the highest accuracy for the individual outcomes of all subgroups. For the long-term within-sample simulation, the structural model has better performance than do all of the machine learning models. The poor within-sample fitness of the machine learning model results from the inaccuracy of the income and pregnancy prediction models. The result shows that the machine learning model performs better than does the structural model when there are many data to learn; however, when the data are limited, the structural model offers a more sensible prediction. The findings of this paper show promise for adopting machine learning in economic policy analyses in the era of big data.


Minimax optimal rates for Mondrian trees and forests

arXiv.org Machine Learning

Originally introduced by [7], Random Forests (RF) are state-of-the-art classification and regression algorithms that proceed by averaging the forecasts of a number of randomized decision trees grown in parallel. Despite their widespread use and remarkable success in practical applications, the theoretical properties of such algorithms are still not fully understood. For an overview of theoretical results on random forests, see [5]. As a result of the complexity of the procedure, which combines sampling steps and feature selection, Breiman's original algorithm has proved difficult to analyze. Consequently, most theoretical studies focus on modified and stylized versions of Random Forests. Among these methods, Purely Random Forests (PRF) [6, 4, 3, 13, 2] that grow the individual trees independently of the sample, are particularly amenable to theoretical analysis. The consistency of such estimates (as well as other idealized RF procedures) was first obtained by [4], as a byproduct of the consistency of individual tree estimates. A recent line of research [25, 28, 18, 27] has sought to obtain some theoretical guarantees for RF variants that more closely resembled the algorithm used in practice. It should be noted, however, that most of these theoretical guarantees come at the price of assumptions either on the data structure or on the Random Forest algorithm itself, being thus still far from explaining the excellent empirical performance of Random Forests.
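To make the idea of a purely random forest concrete, the following is a rough one-dimensional sketch in plain Python: each tree's partition is drawn without looking at the sample, the data are only used to average responses inside each cell, and the forest averages the trees. The uniform choice of cut points, the depth, the number of trees, and all names are illustrative assumptions. This is not the Mondrian construction studied in the paper.

```python
import random


def grow_random_partition(depth, rng, lo=0.0, hi=1.0):
    """Return sorted cut points of [lo, hi], drawn independently of the data."""
    if depth == 0:
        return []
    cut = rng.uniform(lo, hi)
    left = grow_random_partition(depth - 1, rng, lo, cut)
    right = grow_random_partition(depth - 1, rng, cut, hi)
    return left + [cut] + right


def cell_index(cuts, x):
    """Index of the partition cell that contains x."""
    return sum(1 for c in cuts if x >= c)


def fit_tree(cuts, xs, ys):
    """Average the responses in each cell; empty cells fall back to the global mean."""
    sums = [0.0] * (len(cuts) + 1)
    counts = [0] * (len(cuts) + 1)
    for x, y in zip(xs, ys):
        i = cell_index(cuts, x)
        sums[i] += y
        counts[i] += 1
    overall = sum(ys) / len(ys)
    return [s / n if n else overall for s, n in zip(sums, counts)]


def forest_predict(trees, x):
    """Average the predictions of all trees at x."""
    total = sum(leaf_means[cell_index(cuts, x)] for cuts, leaf_means in trees)
    return total / len(trees)


rng = random.Random(0)
xs = [rng.random() for _ in range(500)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # synthetic data, true f(x) = 2x

trees = []
for _ in range(50):
    cuts = grow_random_partition(4, rng)  # structure chosen before seeing (xs, ys)
    trees.append((cuts, fit_tree(cuts, xs, ys)))

pred = forest_predict(trees, 0.5)
print(pred)  # should land near the true value 2 * 0.5 = 1.0
```

Because the partitions never depend on the responses, each tree is a simple local-averaging estimate, which is what makes this family easier to analyze than Breiman's original algorithm.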


Small Moving Window Calibration Models for Soft Sensing Processes with Limited History

arXiv.org Machine Learning

Five simple soft sensor methodologies with two update conditions were compared on two experimentally-obtained datasets and one simulated dataset. The soft sensors investigated were moving window partial least squares regression (and a recursive variant), moving window random forest regression, the mean moving window of y, and a novel random forest partial least squares regression ensemble (RF-PLS), all of which can be used with small sample sizes so that they can be rapidly placed online. It was found that, on two of the datasets studied, small window sizes led to the lowest prediction errors for all of the moving window methods studied. On the majority of datasets studied, the RF-PLS calibration method offered the lowest one-step-ahead prediction errors compared to those of the other methods, and it demonstrated greater predictive stability at larger time delays than moving window PLS alone. It was found that both the random forest and RF-PLS methods most adequately modeled the datasets that did not feature purely monotonic increases in property values, but that both methods performed more poorly than moving window PLS models on one dataset with purely monotonic property values. Other data-dependent findings are presented and discussed.

Preprint submitted to arXiv, March 14, 2018

1. Introduction

Soft sensors for regression tasks have found wide utility in process engineering and process analytical chemistry [1, 2, 3]. A soft sensor is effectively a calibration used on time-series data. Here, we consider a soft sensor to be any algorithm that can be used to estimate a property value from several readily available but indirect measurements. The goal of implementing a soft sensor is typically to avoid the use of a physical sensor for variables that may require extensive time or work up to measure [3]. In the context of industrial chemical processes, these algorithms should meet several specifications.
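The simplest of the soft sensors listed in the abstract, the mean moving window of y, can be sketched in a few lines of plain Python. The function name, the window size, and the toy series are illustrative assumptions. The paper's exact update conditions are not described in this excerpt and are not reproduced here.

```python
from collections import deque


def moving_window_mean_forecasts(y, window):
    """One-step-ahead forecasts: predict y[t] as the mean of the previous `window` values."""
    recent = deque(maxlen=window)  # oldest value is dropped automatically
    forecasts = []
    for value in y:
        # No history yet for the first point, so no forecast is made.
        forecasts.append(sum(recent) / len(recent) if recent else None)
        recent.append(value)
    return forecasts


y = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy, purely increasing series
forecasts = moving_window_mean_forecasts(y, window=2)
print(forecasts)  # [None, 1.0, 1.5, 2.5, 3.5]
```

On this increasing toy series every forecast lags behind the value it targets, which is the kind of behavior a windowed average shows on trending data.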


Finding Influential Training Samples for Gradient Boosted Decision Trees

arXiv.org Machine Learning

We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model's predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
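The fixed-structure assumption mentioned above is what makes leave-one-out analysis cheap. As a simplified illustration (a single regression stump whose leaf value is the mean of its training targets, not a boosted ensemble and not the paper's algorithm), removing sample i from a leaf with n samples and mean m shifts that leaf's value by (m - y_i) / (n - 1), so no retraining is needed. All names and data below are hypothetical.

```python
def leaf_of(x, threshold):
    """Leaf index for a one-split tree: 0 if x is left of the threshold, else 1."""
    return 0 if x < threshold else 1


def loo_influence(xs, ys, threshold, x_test):
    """Change in the prediction at x_test when each training sample is left out,
    keeping the split threshold (the tree structure) fixed."""
    test_leaf = leaf_of(x_test, threshold)
    members = [i for i, x in enumerate(xs) if leaf_of(x, threshold) == test_leaf]
    n = len(members)
    influence = [0.0] * len(xs)  # samples in other leaves cannot affect this prediction
    if n > 1:
        mean = sum(ys[i] for i in members) / n
        for i in members:
            # New leaf mean is (n * mean - y_i) / (n - 1); subtract the old mean.
            influence[i] = (mean - ys[i]) / (n - 1)
    return influence


xs = [0.1, 0.2, 0.3, 0.8, 0.9]
ys = [1.0, 2.0, 6.0, 10.0, 12.0]
influence = loo_influence(xs, ys, threshold=0.5, x_test=0.25)
print(influence)  # [1.0, 0.5, -1.5, 0.0, 0.0]
```

The third sample has the largest effect on the test prediction: dropping it lowers the leaf mean from 3.0 to 1.5.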


Interpretability via Model Extraction

arXiv.org Machine Learning

The ability to interpret machine learning models has become increasingly important now that machine learning is used to inform consequential decisions. We propose an approach called model extraction for interpreting complex, blackbox models. Our approach approximates the complex model using a much more interpretable model; as long as the approximation quality is good, then statistical properties of the complex model are reflected in the interpretable model. We show how model extraction can be used to understand and debug random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for several classical reinforcement learning problems.
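A toy version of this idea, assuming a decision stump as the interpretable model: query the complex model on a set of points, fit the stump to the model's outputs rather than to ground-truth labels, and report how often the two agree (the fidelity). The `blackbox` function is a hypothetical stand-in for a complex model, and the exhaustive stump search is only a sketch. The abstract does not describe the paper's actual extraction algorithm.

```python
def blackbox(x):
    """Hypothetical stand-in for a complex model: returns class 0 or 1."""
    return 1 if (x[0] * x[0] + 0.1 * x[1]) > 0.3 else 0


def extract_stump(model, points):
    """Find the single-feature threshold rule that best mimics the model on `points`.

    Returns (fidelity, feature, threshold, left_label, right_label).
    """
    labels = [model(p) for p in points]  # the model's outputs, not true labels
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            for left, right in ((0, 1), (1, 0)):
                agree = sum(
                    (left if p[feature] < threshold else right) == y
                    for p, y in zip(points, labels)
                )
                fidelity = agree / len(points)
                if best is None or fidelity > best[0]:
                    best = (fidelity, feature, threshold, left, right)
    return best


# Query the black box on an 11 x 11 grid over the unit square.
points = [(i / 10, j / 10) for i in range(11) for j in range(11)]
fidelity, feature, threshold, left, right = extract_stump(blackbox, points)
print(f"if x[{feature}] < {threshold}: predict {left}, else {right} (fidelity {fidelity:.2f})")
```

A high fidelity suggests the stump's rule reflects how the black box behaves on these points; a low one would mean the extracted model should not be trusted as an explanation.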


Machine learning as a service? Might lose sleep over this!

@machinelearnbot

This post is 'not' intended to teach people how to use popular predictive modelling APIs for free, although, to your surprise, this isn't a far-fetched possibility. A trained machine learning model is basically a function that maps feature vectors to the output variable. Upon querying with a test instance, the model predicts an outcome, assigning probability scores to all the possible classes. Google, Amazon, etc. provide public-facing APIs to train predictive models on the subscriber's data; the model can then be used for prediction purposes.


A Bayesian and Machine Learning approach to estimating Influence Model parameters for IM-RO

arXiv.org Machine Learning

The rise of Online Social Networks (OSNs) has caused an insurmountable amount of interest from advertisers and researchers seeking to capitalize on its features. Researchers aim to develop strategies for determining how information is propagated among users within an OSN that is captured by diffusion or influence models. We consider the influence models for the IM-RO problem, a novel formulation to the Influence Maximization (IM) problem based on implementing Stochastic Dynamic Programming (SDP). In contrast to existing approaches involving influence spread and the theory of submodular functions, the SDP method focuses on optimizing clicks and ultimately revenue to advertisers in OSNs. Existing approaches to influence maximization have been actively researched over the past decade, with applications to multiple fields; however, our approach is a more practical variant to the original IM problem. In this paper, we provide an analysis on the influence models of the IM-RO problem by conducting experiments on synthetic and real-world datasets. We propose a Bayesian and Machine Learning approach for estimating the parameters of the influence models for the Influence Maximization-Revenue Optimization (IM-RO) problem. We present a Bayesian hierarchical model and implement the well-known Naive Bayes classifier (NBC), Decision Trees classifier (DTC) and Random Forest classifier (RFC) on three real-world datasets. Compared to previous approaches to estimating influence model parameters, our strategy has the great advantage of being directly implementable in standard software packages such as WinBUGS/OpenBUGS/JAGS and Apache Spark. We demonstrate the efficiency and usability of our methods in terms of spreading information and generating revenue for advertisers in the context of OSNs.


Teaching computers to guide science: Machine learning method sees forests and trees: 'Iterative Random Forests' will deliver powerful scientific insights, researchers say

#artificialintelligence

In a paper published recently in the Proceedings of the National Academy of Sciences (PNAS), the researchers describe a technique called "iterative Random Forests," which they say could have a transformative effect on any area of science or engineering with complex systems, including biology, precision medicine, materials science, environmental science, and manufacturing, to name a few. "Take a human cell, for example. There are 10^170 possible molecular interactions in a single cell. That creates considerable computing challenges in searching for relationships," said Ben Brown, head of Berkeley Lab's Molecular Ecosystems Biology Department. "Our method enables the identification of interactions of high order at the same computational cost as main effects -- even when those interactions are local with weak marginal effects."