Goto

Collaborating Authors

 Decision Tree Learning


Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Neural Information Processing Systems

Deep distributed decision trees and tree ensembles have grown in importance due to the need to model increasingly large datasets. However, PLANET, the standard distributed tree learning algorithm implemented in systems such as \xgboost and Spark MLlib, scales poorly as data dimensionality and tree depths grow. We present Yggdrasil, a new distributed tree learning method that outperforms existing methods by up to 24x. Unlike PLANET, Yggdrasil is based on vertical partitioning of the data (i.e., partitioning by feature), along with a set of optimized data structures to reduce the CPU and communication costs of training. Yggdrasil (1) trains directly on compressed data for compressible features and labels; (2) introduces efficient data structures for training on uncompressed data; and (3) minimizes communication between nodes by using sparse bitvectors. Moreover, while PLANET approximates split points through feature binning, Yggdrasil does not require binning, and we analytically characterize the impact of this approximation. We evaluate Yggdrasil against the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, Yggdrasil is faster by up to an order of magnitude.


Pruning Random Forests for Prediction on a Budget

Neural Information Processing Systems

We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down acquiring features based on their utility value and is generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.


A Communication-Efficient Parallel Algorithm for Decision Tree

Neural Information Processing Systems

Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called \emph{Parallel Voting Decision Tree (PV-Tree)}, to tackle this challenge. After partitioning the training data onto a number of (e.g., $M$) machines, this algorithm performs both local voting and global voting in each iteration. For local voting, the top-$k$ attributes are selected from each machine according to its local data. Then, the indices of these top attributes are aggregated by a server, and the globally top-$2k$ attributes are determined by a majority voting among these local candidates. Finally, the full-grained histograms of the globally top-$2k$ attributes are collected from local machines in order to identify the best (most informative) attribute and its split point. PV-Tree can achieve a very low communication cost (independent of the total number of attributes) and thus can scale out very well. Furthermore, theoretical analysis shows that this algorithm can learn a near optimal decision tree, since it can find the best attribute with a large probability. Our experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the tradeoff between accuracy and efficiency.


Winning Kaggle 101: Introduction to Stacking

#artificialintelligence

Random Forest) โ€ข Used to ensemble a diverse group of strong learners โ€ข Involves training a second-level machine learning algorithm called a "metalearner" to learn the optimal combination of the base learners 5. History of Stacking โ€ข Leo Breiman, "Stacked Regressions" (1996) โ€ข Modified algorithm to use CV to generate level-one data โ€ข Blended Neural Networks and GLMs (separately) Stacked Generalization Stacked Regressions Super Learning โ€ข David H. Wolpert, "Stacked Generalization" (1992) โ€ข First formulation of stacking via a metalearner โ€ข Blended Neural Networks โ€ข Mark van der Laan et al., "Super Learner" (2007) โ€ข Provided the theory to prove that the Super Learner is the asymptotically optimal combination โ€ข First R implementation in 2010 6.


Introduction to Classification & Regression Trees (CART)

@machinelearnbot

Decision Trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent variable) based on the values of several input (or independent variables). In today's post, we discuss the CART decision tree methodology. The CART or Classification & Regression Trees methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone as an umbrella term to refer to the following types of decision trees: Classification Trees: where the target variable is categorical and the tree is used to identify the "class" within which a target variable would likely fall into. Regression Trees: where the target variable is continuous and tree is used to predict it's value. The CART algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any should be.


5 Machine Learning Research Studies To Understand & Predict Length of Stay in Hospitals

@machinelearnbot

Length of Stay (LOS) is a critical factor in managing hospital quality & economic outcomes in Healthcare. The metric is calculated by summing the total number of days for all discharges & dividing it by the total number of discharges. Insurance programs such as Medicare are moving to a model where they are compensating Hospitals the same amount for a specific surgery (e.g. Joint replacement) regardless of the number of days spent in the hospital. Therefore, hospitals & the overall healthcare ecosystem are motivated to reduce LOS.


Top Algorithms and Methods Used by Data Scientists

#artificialintelligence

Latest KDnuggets poll identifies the list of top algorithms actually used by Data Scientists, finds surprises including the most academic and most industry-oriented algorithms. Latest KDnuggets Poll asked Which methods/algorithms you used in the past 12 months for an actual Data Science-related application? . Here are the results, based on 844 voters. The top 10 algorithms (and methods) and their share of voters are: Figure 1: Top 10 algorithms & methods used by Data Scientists. See full table of all algorithms and methods at the end of the post.


ggRandomForests: Exploring Random Forest Survival

arXiv.org Machine Learning

Random forest (Leo Breiman 2001a) (RF) is a non-parametric statistical method requiring no distributional assumptions on covariate relation to the response. RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. Random survival forests (RSF) (Ishwaran and Kogalur 2007; Ishwaran et al. 2008) are an extension of Breimans RF techniques allowing efficient nonparametric analysis of time to event data. The randomForestSRC package (Ishwaran and Kogalur 2014) is a unified treatment of Breimans random forest for survival, regression and classification problems. Predictive accuracy makes RF an attractive alternative to parametric models, though complexity and interpretability of the forest hinder wider application of the method. We introduce the ggRandomForests package, tools for visually understand random forest models grown in R (R Core Team 2014) with the randomForestSRC package. The ggRandomForests package is structured to extract intermediate data objects from randomForestSRC objects and generate figures using the ggplot2 (Wickham 2009) graphics package. This document is structured as a tutorial for building random forest for survival with the randomForestSRC package and using the ggRandomForests package for investigating how the forest is constructed. We analyse the Primary Biliary Cirrhosis of the liver data from a clinical trial at the Mayo Clinic (Fleming and Harrington 1991). Our aim is to demonstrate the strength of using Random Forest methods for both prediction and information retrieval, specifically in time to event data settings.


Distributed Real-Time Sentiment Analysis for Big Data Social Streams

arXiv.org Machine Learning

Big data trend has enforced the data-centric systems to have continuous fast data streams. In recent years, real-time analytics on stream data has formed into a new research field, which aims to answer queries about what-is-happening-now with a negligible delay. The real challenge with real-time stream data processing is that it is impossible to store instances of data, and therefore online analytical algorithms are utilized. To perform real-time analytics, pre-processing of data should be performed in a way that only a short summary of stream is stored in main memory. In addition, due to high speed of arrival, average processing time for each instance of data should be in such a way that incoming instances are not lost without being captured. Lastly, the learner needs to provide high analytical accuracy measures. Sentinel is a distributed system written in Java that aims to solve this challenge by enforcing both the processing and learning process to be done in distributed form. Sentinel is built on top of Apache Storm, a distributed computing platform. Sentinels learner, Vertical Hoeffding Tree, is a parallel decision tree-learning algorithm based on the VFDT, with ability of enabling parallel classification in distributed environments. Sentinel also uses SpaceSaving to keep a summary of the data stream and stores its summary in a synopsis data structure. Application of Sentinel on Twitter Public Stream API is shown and the results are discussed.


Machine Learning and the Law

#artificialintelligence

Last week I went to the workshops at NIPS (biggest ML conference in the world) and I also attended part of the ML and the Law symposium the day before. I found out a little bit too late about the symposia but I was still able to attend two panels on which there were both lawyers and computer scientists. They were very insightful and informative -- did you know that this Spring, the European Union passed a regulation giving its citizens a "right to an explanation" for decisions made by machine-learning systems? The panel discussions were motivated by the problem of explaining ML-powered decisions which have an important impact on people's lives: We need to be able to test how systems get to their conclusions; if we can't test, we can't contest. Individuals are entitled to know which data is being processed of them, and to explanations of how predictions & decisions work, in terms they can understand.