AITopics | Ensemble Learning

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)

CatBoost: gradient boosting with categorical features support

Dorogush, Anna Veronika, Ershov, Vasily, Gulin, Andrey

arXiv.org Machine Learning | Oct-24-2018

In this paper we present CatBoost, a new open-source gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.

artificial intelligence, implementation, machine learning, (19 more...)

arXiv.org Machine Learning

1810.11363

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Why every GBDT speed benchmark is wrong

Dorogush, Anna Veronika, Ershov, Vasily, Kruchinin, Dmitriy

arXiv.org Machine Learning | Oct-24-2018

This article provides a comprehensive study of different ways to benchmark the speed of gradient boosted decision tree (GBDT) algorithms. We show the main problems with several straightforward ways of benchmarking, explain why a speed benchmark is a challenging task, and provide a set of reasonable requirements for a benchmark to be fair and useful.

artificial intelligence, benchmark, machine learning, (17 more...)

arXiv.org Machine Learning

1810.1038

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.70)

Machine Learning and Credit Risk Analytics

#artificialintelligence | Oct-23-2018, 11:43:18 GMT

In the last few years, new statistical algorithms have become very popular. Traditional scorecards were based on a single decision tree or on logistic regression. The newer algorithms combine hundreds of decision trees instead of relying on one single tree. These algorithms also provide much more accurate predictions compared to traditional methods. The current hype around machine learning methods typically revolves around these algorithms in particular: random forests, XGBoost, and deep learning.

artificial intelligence, learning and credit risk analytic, machine learning, (3 more...)

#artificialintelligence

Industry:

Banking & Finance > Credit (0.53)
Banking & Finance > Risk Management (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.61)

On PAC-Bayesian Bounds for Random Forests

Lorenzen, Stephan Sloth, Igel, Christian, Seldin, Yevgeny

arXiv.org Machine Learning | Oct-23-2018

Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the effect of averaging out of errors of individual classifiers when taking the majority vote. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach based on PAC-Bayesian C-bounds takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set coming at the cost of a smaller training set gave better performance guarantees, but worse performance in most experiments.
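The first approach mentioned in the abstract, bounding the majority-vote error by twice the Gibbs error, can be checked numerically on a toy example. The sketch below is only an illustration of that inequality, not the paper's method: the prediction matrix, the function names, and the uniform weighting of voters are assumptions, and ties are counted as majority-vote errors.

```python
from typing import List


def gibbs_error(votes: List[List[int]], labels: List[int]) -> float:
    """Average error of a single voter drawn uniformly at random."""
    n_voters, n_points = len(votes), len(labels)
    mistakes = sum(
        votes[v][i] != labels[i]
        for v in range(n_voters)
        for i in range(n_points)
    )
    return mistakes / (n_voters * n_points)


def majority_vote_error(votes: List[List[int]], labels: List[int]) -> float:
    """Error of the uniformly weighted majority vote (ties count as errors)."""
    n_voters = len(votes)
    errors = 0
    for i, y in enumerate(labels):
        n_correct = sum(votes[v][i] == y for v in range(n_voters))
        if 2 * n_correct <= n_voters:  # half or more of the voters are wrong
            errors += 1
    return errors / len(labels)


# Hypothetical data: 3 voters (rows) predicting 6 binary labels (columns).
labels = [1, 0, 1, 1, 0, 0]
votes = [
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 1, 0],
]

gibbs = gibbs_error(votes, labels)
mv = majority_vote_error(votes, labels)
print(f"Gibbs error: {gibbs:.3f}, majority-vote error: {mv:.3f}")
```

Whenever the vote is wrong on a point, at least half of the voters are wrong there, which is why the majority-vote error can never exceed twice the Gibbs error. In this toy data the voters err on mostly different points, so the vote does much better than that worst case.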

artificial intelligence, classifier, machine learning, (17 more...)

arXiv.org Machine Learning

1810.09746

Country: Europe (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Comparative Evaluation of Tree-Based Ensemble Algorithms for Short-Term Travel Time Prediction

Mousa, Saleh, Ishak, Sherif

arXiv.org Artificial Intelligence | Oct-23-2018

Disseminating accurate travel time information to road users helps achieve traffic equilibrium and reduce traffic congestion. The deployment of Connected Vehicles technology will provide unique opportunities for the implementation of travel time prediction models. The aim of this study is twofold: (1) estimate travel times in the freeway network at five-minute intervals using Basic Safety Messages (BSM); (2) develop an eXtreme Gradient Boosting (XGB) model for short-term travel time prediction on freeways. The XGB tree-based ensemble prediction model is evaluated against common tree-based ensemble algorithms, and the evaluations are performed at five-minute intervals over a 30-minute horizon. BSMs generated by the Safety Pilot Model Deployment conducted in Ann Arbor, Michigan, were used. Nearly two billion messages were processed to provide travel time estimates for the entire freeway network. A combination of grid search and five-fold cross-validation on the travel time estimates was used to develop the prediction models and tune their parameters. A freeway stretch of about 9.6 km was used to evaluate XGB together with the most common tree-based ensemble algorithms. The results show that XGB is superior to all other algorithms, followed by Gradient Boosting. XGB travel time predictions were accurate and consistent with variations during peak periods, with a mean absolute percentage error of about 5.9% and 7.8% for the 5-minute and 30-minute horizons, respectively. Additionally, when the developed models were applied to another 4.7 km stretch along the eastbound segment of M-14, XGB demonstrated considerable advantages in travel time prediction during congested and uncongested conditions.
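The evaluation recipe named in the abstract (grid search, five-fold cross-validation, mean absolute percentage error) can be sketched in a few lines of plain Python. Everything below is a hypothetical illustration: the travel times are made up, and the "model" is a placeholder that predicts a scaled training mean, standing in for a real learner such as XGB whose tuning parameters the abstract does not list.

```python
from typing import Callable, List, Sequence, Tuple


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error in percent (assumes no actual value is 0)."""
    total = sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


def grid_search_cv(
    candidates: Sequence[float],
    fit_predict: Callable[[float, List[float], int], List[float]],
    ys: Sequence[float],
    k: int = 5,
) -> Tuple[float, float]:
    """Return (best parameter, its mean validation MAPE) over k folds."""
    best_param, best_score = candidates[0], float("inf")
    for param in candidates:
        scores = []
        for train_idx, val_idx in k_fold_indices(len(ys), k):
            train = [ys[i] for i in train_idx]
            val = [ys[i] for i in val_idx]
            scores.append(mape(val, fit_predict(param, train, len(val))))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score


def scaled_mean_model(scale: float, train: List[float], n_out: int) -> List[float]:
    """Placeholder model: predict scale * mean(train) for every validation point."""
    return [scale * sum(train) / len(train)] * n_out


# Hypothetical travel times in seconds.
travel_times = [300.0, 310.0, 290.0, 305.0, 295.0, 300.0, 310.0, 290.0, 305.0, 295.0]
best_scale, best_mape = grid_search_cv([0.8, 1.0, 1.2], scaled_mean_model, travel_times)
print(f"best scale: {best_scale}, cross-validated MAPE: {best_mape:.2f}%")
```

For time-ordered data such as travel times, contiguous folds like these still let the model train on observations that come after the validation window, so a real study may prefer a forward-chaining split.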

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1810.10102

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.24)

Genre:

Research Report > New Finding (0.49)
Research Report > Experimental Study (0.48)

Industry: Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.69)

Interpretability is Harder in the Multiclass Setting: Axiomatic Interpretability for Multiclass Additive Models

Zhang, Xuezhou, Tan, Sarah, Koch, Paul, Lou, Yin, Chajewska, Urszula, Caruana, Rich

arXiv.org Machine Learning | Oct-22-2018

Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a state-of-the-art GAM learning algorithm based on boosted trees to the multiclass setting, and show that this multiclass algorithm outperforms existing GAM fitting algorithms and sometimes matches the performance of full complex models. In the second part, we turn our attention to the interpretability of GAMs in the multiclass setting. Surprisingly, the natural interpretability of GAMs breaks down when there are more than two classes. Drawing inspiration from binary GAMs, we identify two axioms that any additive model must satisfy to not be visually misleading. We then develop a post-processing technique (API) that provably transforms pretrained additive models to satisfy the interpretability axioms without sacrificing accuracy. The technique works not just on models trained with our algorithm, but on any multiclass additive model. We demonstrate API on a 12-class infant-mortality dataset.

artificial intelligence, machine learning, shape function, (14 more...)

arXiv.org Machine Learning

1810.09092

Genre: Research Report (0.83)

Industry:

Health & Medicine > Public Health (1.00)
Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.47)

MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

Hamilton, Mark, Raghunathan, Sudarshan, Matiach, Ilya, Schonhoffer, Andrew, Raman, Anand, Barzilay, Eli, Thigpen, Minsoo, Rajendran, Karthik, Mahajan, Janhavi Suresh, Cochrane, Courtney, Eswaran, Abhiram, Green, Ari

arXiv.org Machine Learning | Oct-19-2018

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient Boosting, Model Interpretability, and other areas of modern computation. Furthermore, we present a novel system called Spark Serving that allows users to run any Apache Spark program as a distributed, sub-millisecond latency web service backed by their existing Spark Cluster. All MMLSpark contributions have the same API to enable simple composition across frameworks and usage across batch, streaming, and RESTful web serving scenarios on static, elastic, or serverless clusters. We showcase MMLSpark by creating a method for deep object detection capable of learning without human labeled data and demonstrate its effectiveness for Snow Leopard conservation.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Machine Learning

1810.08744

Country: North America > United States (0.69)

Genre: Research Report (0.50)

Industry:

Education (0.64)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.49)

Gradient Boosting Decision trees: XGBoost vs LightGBM

#artificialintelligence | Oct-18-2018, 00:09:57 GMT

Gradient boosted decision trees are the state of the art for structured data problems. Two modern algorithms that build gradient boosted tree models are XGBoost and LightGBM. In this article I'll summarize each algorithm's approach as described in its introductory paper. Gradient Boosting Decision Trees (GBDT) are currently the best techniques for building predictive models from structured data. Unlike image analysis, where you would want a deep learning model, structured data problems can be solved very well with a large number of decision trees.
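To make the idea of "a lot of decision trees" concrete, here is a minimal sketch of textbook gradient boosting for squared error, using one-split trees (stumps) on a single feature. It is not how XGBoost or LightGBM are implemented, since both add many refinements; the function names, toy data, number of rounds, and learning rate are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Stump:
    """Find the single threshold split that minimises squared error."""
    best_sse, best_stump = float("inf"), (xs[0], 0.0, 0.0)
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted_stumps(
    xs: Sequence[float], ys: Sequence[float], n_rounds: int = 50, learning_rate: float = 0.1
) -> Tuple[float, List[Stump]]:
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps


def predict(base: float, stumps: List[Stump], learning_rate: float, x: float) -> float:
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys: Sequence[float], preds: Sequence[float]) -> float:
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: y = x^2 on ten points.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
base, stumps = fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, [predict(base, stumps, 0.1, x) for x in xs])
print(f"MSE of mean predictor: {baseline_mse:.1f}, after boosting: {boosted_mse:.1f}")
```

Each round corrects a fraction of what the previous rounds got wrong, so the training error drops gradually; the learning rate trades the number of trees against how aggressively each one is trusted.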

algorithm, artificial intelligence, machine learning, (15 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.56)

Refining interaction search through signed iterative Random Forests

Kumbier, Karl, Basu, Sumanta, Brown, James B., Celniker, Susan, Yu, Bin

arXiv.org Machine Learning | Oct-16-2018

Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically black boxes, learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest algorithm (iRF) took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF (s-iRF), describes subsets of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. Signed interactions share not only the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics (SPIMs) to rank signed interactions. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.

artificial intelligence, decision tree learning, machine learning, (16 more...)

arXiv.org Machine Learning

1810.07287

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Random Forests and the Bias-Variance Tradeoff – Towards Data Science

#artificialintelligence | Oct-11-2018, 14:19:33 GMT

The Random Forest is an extremely popular machine learning algorithm. Often, with not too much pre-processing, one can throw together a quick and dirty model with no hyperparameter tuning and achieve results that aren't awful. As an example, I recently put together a RandomForestRegressor in Python using scikit-learn for the New York City Taxi Fare Prediction playground competition on Kaggle, passing no arguments to the model constructor and using 1/100 of the training data (554,238 of 55M rows), for a validation R² of 0.8. NOTE: This snippet assumes you split the data into training and validation sets with your features and target variable separated. You can see the full code on my GitHub profile.
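The author's own snippet is not reproduced in this listing. As a rough stand-in, the sketch below fits a scikit-learn RandomForestRegressor with default constructor arguments and reports the validation R². The synthetic data from make_regression, the 80/20 split, and the variable names are assumptions; they replace the Kaggle taxi-fare data, which is not available here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the taxi-fare features and target (assumption).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Features and target are already separated; hold out 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# No arguments passed to the constructor, as described in the text.
model = RandomForestRegressor()
model.fit(X_train, y_train)

# For regressors, score() returns the coefficient of determination R².
val_r2 = model.score(X_val, y_val)
print(f"validation R²: {val_r2:.3f}")
```

Because no random_state is passed to the model, the score changes slightly from run to run, and it will not match the 0.8 reported for the taxi data.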

artificial intelligence, machine learning, random forest, (7 more...)

#artificialintelligence

Country: North America > United States > New York (0.26)

Industry:

Transportation > Passenger (0.57)
Transportation > Ground > Road (0.57)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.66)
