 Decision Tree Learning



Random Forests of Interaction Trees for Estimating Individualized Treatment Effects in Randomized Trials

arXiv.org Machine Learning

Assessing heterogeneous treatment effects has become a growing interest in advancing precision medicine. Individualized treatment effects (ITE) play a critical role in such an endeavor. Concerning experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees (Su et al., 2009). To this end, we first propose a smooth sigmoid surrogate (SSS) method, as an alternative to greedy search, to speed up tree construction. RFIT outperforms the traditional `separate regression' approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT can be obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial.
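For orientation, the "separate regression" baseline mentioned in the abstract can be sketched roughly as follows: fit one outcome model per trial arm and take the difference of the two predictions as the estimated ITE. This is only an illustration on made-up data with a one-covariate least-squares fit. The helper names `fit_line` and `estimate_ite` are hypothetical and not taken from the paper, and the sketch does not implement RFIT, SSS, or the infinitesimal jackknife.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def estimate_ite(data, x_new):
    """Separate regression: one model per arm, ITE = treated minus control prediction."""
    treated = [(x, y) for x, t, y in data if t == 1]
    control = [(x, y) for x, t, y in data if t == 0]
    a1, b1 = fit_line(*zip(*treated))
    a0, b0 = fit_line(*zip(*control))
    return (a1 + b1 * x_new) - (a0 + b0 * x_new)

# Made-up trial data as (covariate, treatment, outcome):
# control outcomes follow y = x, treated outcomes follow y = 1 + 2x.
data = [(x, 0, float(x)) for x in range(5)] + \
       [(x, 1, 1.0 + 2.0 * x) for x in range(5)]

ite_at_3 = estimate_ite(data, 3.0)
print(ite_at_3)
```

In this toy setup the treatment effect grows with the covariate, so the estimate differs from one individual to the next, which is the kind of heterogeneity the abstract is concerned with.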


Understanding Boosted Trees Models

#artificialintelligence

In the previous post, we learned about tree-based learning methods: the basics of tree-based models and the use of bagging to reduce variance. We also looked at one of the most famous learning algorithms based on the idea of bagging: random forests. In this post, we will look into the details of yet another type of tree-based learning algorithm: boosted trees. Boosting, like bagging, is a general class of learning algorithms in which a set of weak learners is combined to produce a strong learner. For classification problems, a weak learner is defined as a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Recall that bagging involves creating multiple copies of the original training data set via bootstrapping, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.
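The bagging recipe recalled above (bootstrap copies, one model per copy, combined prediction) can be sketched in a few lines. This is a minimal illustration, not the post's own code: it uses one-split decision stumps on a single feature instead of full decision trees, the data are made up, and the names `fit_stump` and `bagging` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold and orientation on a 1-D feature with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label

def bagging(train, n_models=25, seed=0):
    """Fit one stump per bootstrap copy and combine the stumps by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_models):
        # Bootstrap copy: draw len(train) points with replacement.
        copy = [rng.choice(train) for _ in train]
        stumps.append(fit_stump(copy))

    def predict(x):
        votes = Counter(stump(x) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict

# Made-up data: label 0 for x in 0..9, label 1 for x in 10..19.
train = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
predict = bagging(train)
print(predict(2), predict(17))
```

Each stump sees a slightly different bootstrap copy, so individual thresholds vary; the majority vote averages that variation out, which is the variance-reduction effect attributed to bagging above.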


Intro to The Data Science Behind EEG-Based Neurobiofeedback

#artificialintelligence

The Neurobiofeedback machine gained popularity for its non-invasive and quantitative approach to behavior regulation, but its legitimacy is still questioned by pediatricians, therapists, and other professionals. In academic-sounding terms, this machine (which I'll be abbreviating as NBF from now on) is built on the concept of feedback therapy, which exploits our ability to exert and/or regain control over physiological aspects of our body. NBF is a type of Brain-Computer Interface (BCI) machine that senses your brain wave activity in different ways (usually involving hardware-software interaction) and rewards you with an auditory or visual stimulus when your brain wave's frequency matches the desired frequency. This comes from the scientific notion that brain rhythms correspond to certain cognitive states. By "mind games", the 'auditory or visual stimulus' I mentioned in the last paragraph usually comes in the form of a game.


Machine learning leveraging genomes from metagenomes identifies influential antibiotic resistance genes in the infant gut microbiome

#artificialintelligence

Antibiotic resistance in pathogens is extensively studied, yet little is known about how antibiotic resistance genes of typical gut bacteria influence microbiome dynamics. Here, we leverage genomes from metagenomes to investigate how genes of the premature infant gut resistome correspond to the ability of bacteria to survive under certain environmental and clinical conditions. We find that formula feeding impacts the resistome. Random forest models corroborated by statistical tests revealed that the gut resistome of formula-fed infants is enriched in class D beta-lactamase genes. Interestingly, Clostridium difficile strains harboring this gene are at higher abundance in formula-fed infants compared to C. difficile lacking this gene.


Crowdsourcing Predictors of Residential Electric Energy Usage

arXiv.org Machine Learning

Crowdsourcing has been successfully applied in many domains including astronomy, cryptography and biology. In order to test its potential for useful application in a Smart Grid context, this paper investigates the extent to which a crowd can contribute predictive hypotheses to a model of residential electric energy consumption. In this experiment, the crowd generated hypotheses about factors that make one home different from another in terms of monthly energy usage. To implement this concept, we deployed a web-based system within which 627 residential electricity customers posed 632 questions that they thought predictive of energy usage. While this occurred, the same group provided 110,573 answers to these questions as they accumulated. Thus users both suggested the hypotheses that drive a predictive model and provided the data upon which the model is built. We used the resulting question and answer data to build a predictive model of monthly electric energy consumption, using random forest regression. Because of the sparse nature of the answer data, careful statistical work was needed to ensure that these models are valid. The results indicate that the crowd can generate useful hypotheses, despite the sparse nature of the dataset.


Random Subspace with Trees for Feature Selection Under Memory Constraints

arXiv.org Machine Learning

Célia Châtel (Aix-Marseille University, France) and Pierre Geurts (University of Liège, Belgium). Dealing with datasets of very high dimension is a major challenge in machine learning. In this paper, we consider the problem of feature selection in applications where the memory is not large enough to contain all features. In this setting, we propose a novel tree-based feature selection approach that builds a sequence of randomized trees on small subsamples of variables, mixing both variables already identified as relevant by previous models and variables randomly selected among the other variables. As our main contribution, we provide an in-depth theoretical analysis of this method in the infinite-sample setting. In particular, we study its soundness with respect to common definitions of feature relevance and its convergence speed under various variable dependence scenarios. We also provide some preliminary empirical results highlighting the potential of the approach.


GIS and Machine Learning for Habitat Protection (GIS Lounge)

@machinelearnbot

With machine learning now commonly applied alongside GIS, one area of focus has been habitat protection. Habitat managers and conservation specialists have struggled to find ways to protect wildlife threatened by a variety of mostly human-induced factors. Machine learning and GIS have proven to be one way in which new ideas and scenarios can be tested before any plan is carried out, saving time and money and possibly avoiding crucial habitat errors in the plans that are implemented. A recent example of using GIS and machine learning for habitat protection involves the black-necked crane.[1] This type of bird is very particular about where it can breed, and relatively little is known about it.


Understanding random forests with randomForestExplainer

#artificialintelligence

Next, we pass it to the function plot_min_depth_distribution and under default settings obtain a plot of the distribution of minimal depth for the top ten variables according to mean minimal depth calculated using top trees (mean_sample = "top_trees"). We could also pass our forest directly to the plotting function, but if we want to make more than one plot of the minimal depth distribution it is more efficient to pass the min_depth_frame to the plotting function so that it will not be calculated again for each plot (this works similarly for other plotting functions of randomForestExplainer). The function plot_min_depth_distribution offers three possibilities when it comes to calculating the mean minimal depth, which differ in the way they treat missing values that appear when a variable is not used for splitting in a tree. Note that the depth of a tree is equal to the length of the longest path from root to leaf in this tree. This equals the maximum depth of a variable in this tree plus one, as leaves are by definition not split by any variable.
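To make the notions of minimal depth and tree depth concrete, here is a small sketch in Python on a hand-built tree. It does not call randomForestExplainer (an R package) and is not how that package computes its results; the tuple-based tree representation and the function names `walk`, `minimal_depths`, and `tree_depth` are assumptions made purely for illustration.

```python
def walk(node, depth=0):
    """Yield (variable, depth) for every split node; leaves are represented by None."""
    if node is None:
        return
    variable, left, right = node
    yield variable, depth
    yield from walk(left, depth + 1)
    yield from walk(right, depth + 1)

def minimal_depths(tree):
    """Minimal depth of each variable: depth of its split closest to the root (root = 0)."""
    result = {}
    for variable, depth in walk(tree):
        result[variable] = min(depth, result.get(variable, depth))
    return result

def tree_depth(node):
    """Length, in edges, of the longest path from the root to a leaf."""
    if node is None:
        return 0
    _, left, right = node
    return 1 + max(tree_depth(left), tree_depth(right))

# Hypothetical tree: each split node is (variable, left_child, right_child).
tree = ("x1",
        ("x2", None, ("x1", None, None)),
        ("x3", None, None))

depths = minimal_depths(tree)
deepest_split = max(depth for _, depth in walk(tree))
print(depths, tree_depth(tree), deepest_split + 1)
```

Here `x1` splits at depths 0 and 2, so its minimal depth is 0, and the deepest split sits at depth 2, one less than the tree depth of 3. A variable that never splits in the tree, say a hypothetical `x4`, simply has no entry in `depths`; that is the missing value which the three averaging options treat differently.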


A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)

#artificialintelligence

Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree-based methods give predictive models high accuracy, stability, and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They can be adapted to solve either kind of problem at hand (classification or regression). Methods like decision trees, random forests, and gradient boosting are widely used in all kinds of data science problems. Hence, for every analyst (freshers included), it's important to learn these algorithms and use them for modeling. This tutorial is meant to help beginners learn tree-based modeling from scratch. After successfully completing this tutorial, one is expected to become proficient at using tree-based algorithms and building predictive models. Note: This tutorial requires no prior knowledge of machine learning.