 Ensemble Learning


CatBoost: gradient boosting with categorical features support

arXiv.org Machine Learning

In this paper we present CatBoost, a new open-sourced gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.


Why every GBDT speed benchmark is wrong

arXiv.org Machine Learning

This article provides a comprehensive study of different ways to benchmark the speed of gradient boosted decision tree (GBDT) algorithms. We show the main problems with several straightforward ways to run benchmarks, explain why a speed benchmark is a challenging task, and provide a set of reasonable requirements for a benchmark to be fair and useful.


Machine Learning and Credit Risk Analytics

#artificialintelligence

In the last few years, new statistical algorithms have become very popular. Traditional scorecards were based on a single decision tree or on logistic regression. The newer algorithms combine hundreds of decision trees instead of relying on a single tree. These algorithms also provide much more accurate predictions compared to traditional methods. The current hype around machine learning methods typically revolves around these algorithms in particular: random forests, XGBoost, and deep learning.


On PAC-Bayesian Bounds for Random Forests

arXiv.org Machine Learning

Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the effect of averaging out of errors of individual classifiers when taking the majority vote. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach based on PAC-Bayesian C-bounds takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set coming at the cost of a smaller training set gave better performance guarantees, but worse performance in most experiments.
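The relation between the majority vote and the Gibbs classifier mentioned in this abstract can be illustrated with a toy simulation. The sketch below is not the paper's method or its PAC-Bayesian bounds; it only checks, on simulated binary votes, that the empirical majority-vote error never exceeds twice the empirical Gibbs error. The function name, the `copy_prob` knob used to make trees agree with each other, and all parameter values are assumptions made for illustration.

```python
import random


def simulate_votes(n_trees=101, n_points=2000, p_correct=0.7,
                   copy_prob=0.0, seed=0):
    """Return (majority_vote_error, gibbs_error) for simulated binary votes.

    Hypothetical toy model: on each point, every tree copies a shared
    "leader" outcome with probability copy_prob and otherwise is correct
    independently with probability p_correct. copy_prob=0 gives independent
    errors; copy_prob=1 makes all trees identical.
    """
    rng = random.Random(seed)
    mv_errors = 0
    wrong_votes = 0
    for _ in range(n_points):
        leader_wrong = rng.random() > p_correct
        wrong = 0
        for _ in range(n_trees):
            if rng.random() < copy_prob:
                wrong += leader_wrong
            else:
                wrong += rng.random() > p_correct
        wrong_votes += wrong
        # n_trees is odd, so the vote can never tie.
        if wrong > n_trees / 2:
            mv_errors += 1
    mv_error = mv_errors / n_points
    # Gibbs error: expected error of one tree drawn uniformly at random.
    gibbs_error = wrong_votes / (n_points * n_trees)
    return mv_error, gibbs_error


for copy_prob in (0.0, 0.5, 1.0):
    mv, gibbs = simulate_votes(copy_prob=copy_prob)
    print(f"copy_prob={copy_prob}: majority vote {mv:.3f}, "
          f"Gibbs {gibbs:.3f}, 2 x Gibbs {2 * gibbs:.3f}")
```

With independent errors the vote averages most mistakes away, while with identical trees the vote is no better than a single tree. In both cases the factor-two relation holds, because whenever the vote is wrong at least half of the trees are wrong on that point.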


Comparative Evaluation of Tree-Based Ensemble Algorithms for Short-Term Travel Time Prediction

arXiv.org Artificial Intelligence

Disseminating accurate travel time information to road users helps achieve traffic equilibrium and reduce traffic congestion. The deployment of Connected Vehicles technology will provide unique opportunities for the implementation of travel time prediction models. The aim of this study is twofold: (1) estimate travel times in the freeway network at five-minute intervals using Basic Safety Messages (BSM); (2) develop an eXtreme Gradient Boosting (XGB) model for short-term travel time prediction on freeways. The XGB tree-based ensemble prediction model is evaluated against common tree-based ensemble algorithms, and the evaluations are performed at five-minute intervals over a 30-minute horizon. BSMs generated by the Safety Pilot Model Deployment conducted in Ann Arbor, Michigan, were used. Nearly two billion messages were processed to provide travel time estimates for the entire freeway network. A combination of grid search and five-fold cross-validation on the travel time estimates was used to develop the prediction models and tune their parameters. A freeway stretch of about 9.6 km was used to evaluate XGB together with the most common tree-based ensemble algorithms. The results show that XGB is superior to all other algorithms, followed by Gradient Boosting. XGB travel time predictions were accurate and consistent with variations during peak periods, with a mean absolute percentage error of about 5.9% and 7.8% for the 5-minute and 30-minute horizons, respectively. Additionally, when the developed models were applied to another 4.7 km stretch along the eastbound segment of M-14, XGB demonstrated considerable advantages in travel time prediction during congested and uncongested conditions.
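The error metric quoted in this abstract, mean absolute percentage error (MAPE), can be sketched in a few lines. This is a minimal illustration of the standard definition, not the study's evaluation code; the function name and the travel times in the example are made up.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero, which holds for travel times.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical travel times in seconds for four five-minute intervals.
observed = [300.0, 320.0, 410.0, 380.0]
forecast = [310.0, 300.0, 400.0, 395.0]
print(f"MAPE: {mape(observed, forecast):.2f}%")
```

Because each error is divided by the observed value, MAPE is scale-free, which makes figures for different horizons or road segments directly comparable.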


Interpretability is Harder in the Multiclass Setting: Axiomatic Interpretability for Multiclass Additive Models

arXiv.org Machine Learning

Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a state-of-the-art GAM learning algorithm based on boosted trees to the multiclass setting, and show that this multiclass algorithm outperforms existing GAM fitting algorithms and sometimes matches the performance of full complex models. In the second part, we turn our attention to the interpretability of GAMs in the multiclass setting. Surprisingly, the natural interpretability of GAMs breaks down when there are more than two classes. Drawing inspiration from binary GAMs, we identify two axioms that any additive model must satisfy to not be visually misleading. We then develop a post-processing technique (API) that provably transforms pretrained additive models to satisfy the interpretability axioms without sacrificing accuracy. The technique works not just on models trained with our algorithm, but on any multiclass additive model. We demonstrate API on a 12-class infant-mortality dataset.


MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

arXiv.org Machine Learning

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient Boosting, Model Interpretability, and other areas of modern computation. Furthermore, we present a novel system called Spark Serving that allows users to run any Apache Spark program as a distributed, sub-millisecond latency web service backed by their existing Spark Cluster. All MMLSpark contributions have the same API to enable simple composition across frameworks and usage across batch, streaming, and RESTful web serving scenarios on static, elastic, or serverless clusters. We showcase MMLSpark by creating a method for deep object detection capable of learning without human-labeled data and demonstrate its effectiveness for Snow Leopard conservation.


Gradient Boosting Decision trees: XGBoost vs LightGBM

#artificialintelligence

Gradient boosted decision trees are the state of the art for structured data problems. Two modern algorithms for building gradient boosted tree models are XGBoost and LightGBM. In this article I'll summarize each algorithm's approach as described in its introductory paper. Gradient Boosting Decision Trees (GBDT) are currently the best techniques for building predictive models from structured data. Unlike problems involving images (for those you want a deep learning model), structured data problems can be solved very well with a lot of decision trees.
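The core idea of combining many small trees can be sketched with a toy gradient boosting loop. This is not how XGBoost or LightGBM are implemented; it assumes squared-error loss (so each round fits the current residuals), a single numeric feature, and depth-one trees (stumps). All function names and the example data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split (by squared error) on one feature.

    Requires at least two distinct values in xs.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical data: a smooth curve that no single stump can fit.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(f"training MSE, mean only: {baseline_mse:.3f}")
print(f"training MSE, 50 stumps: {boosted_mse:.3f}")
```

Each stump alone is a weak model, but because every round corrects what the previous rounds left unexplained, the sum of many stumps fits the curve far better than the constant baseline.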


Refining interaction search through signed iterative Random Forests

arXiv.org Machine Learning

Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically black boxes, learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest (iRF) algorithm took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF (s-iRF), describes subsets of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. Signed interactions not only share the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics (SPIMs) to rank signed interactions. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.


Random Forests and the Bias-Variance Tradeoff – Towards Data Science

#artificialintelligence

The Random Forest is an extremely popular machine learning algorithm. Often, with not too much pre-processing, one can throw together a quick and dirty model with no hyperparameter tuning and achieve results that aren't awful. As an example, I recently put together a RandomForestRegressor in Python using scikit-learn for the New York City Taxi Fare Prediction playground competition on Kaggle, passing no arguments to the model constructor and using 1/100 of the training data (554238 of 55M rows), for a validation R² of 0.8. NOTE: This snippet assumes you split the data into training and validation sets with your features and target variable separated. You can see the full code on my GitHub profile.
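For readers unfamiliar with the validation R² quoted above, here is a minimal sketch of how the coefficient of determination is computed. It is not the author's snippet and does not use scikit-learn; the function name and the fare values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical validation fares (in dollars) and model predictions.
fares = [5.0, 7.5, 12.0, 20.0]
predictions = [6.0, 7.0, 11.0, 17.5]
print(f"validation R²: {r_squared(fares, predictions):.3f}")
```

A value of 1 means perfect predictions and 0 means no better than always predicting the mean, so 0.8 indicates that the model explains most, but not all, of the variance in the validation fares.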