AITopics | Ensemble Learning

Collaborating Authors

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

GAM Coach: Towards Interactive and User-centered Algorithmic Recourse

Wang, Zijie J., Vaughan, Jennifer Wortman, Caruana, Rich, Chau, Duen Horng

arXiv.org Artificial IntelligenceFeb-28-2023

Machine learning (ML) recourse techniques are increasingly used in high-stakes domains, providing end users with actions to alter ML predictions, but they assume ML developers understand what input variables can be changed. However, a recourse plan's actionability is subjective and unlikely to match developers' expectations completely. We present GAM Coach, a novel open-source system that adapts integer linear programming to generate customizable counterfactual explanations for Generalized Additive Models (GAMs), and leverages interactive visualizations to enable end users to iteratively generate recourse plans meeting their needs. A quantitative user study with 41 participants shows our tool is usable and useful, and users prefer personalized recourse plans over generic plans. Through a log analysis, we explore how users discover satisfactory recourse plans, and provide empirical evidence that transparency can lead to more opportunities for everyday users to discover counterintuitive patterns in ML models. GAM Coach is available at: https://poloclub.github.io/gam-coach/.

machine learning, natural language, recourse plan, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3544548.3580816

2302.14165

Country:

Europe > Germany > Hamburg (0.05)
Asia > Taiwan (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Education (1.00)
Banking & Finance > Loans (0.46)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
(2 more...)

Add feedback

[2302.12360] Practical Knowledge Distillation: Using DNNs to Beat DNNs

#artificialintelligenceFeb-27-2023, 01:03:10 GMT

For tabular data sets, we explore data and model distillation, as well as data denoising. These techniques improve both gradient-boosting models and a specialized DNN architecture. While gradient boosting is known to outperform DNNs on tabular data, we close the gap for datasets with 100K+ rows and give DNNs an advantage on small data sets. We extend these results with input-data distillation and optimized ensembling to help DNN performance match or exceed that of gradient boosting. As a theoretical justification of our practical method, we prove its equivalence to classical cross-entropy knowledge distillation. We also qualitatively explain the superiority of DNN ensembles over XGBoost on small data sets. For an industry end-to-end real-time ML platform with 4M production inferences per second, we develop a model-training workflow based on data sampling that distills ensembles of models into a single gradient-boosting model favored for high-performance real-time inference, without performance loss. Empirical evaluation shows that the proposed combination of methods consistently improves model accuracy over prior best models across several production applications deployed worldwide.

beat dnn, dnn, practical knowledge distillation

#artificialintelligence

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

Random forests for binary geospatial data

Saha, Arkajyoti, Datta, Abhirup

arXiv.org Machine LearningFeb-27-2023

Binary geospatial data is commonly analyzed with generalized linear mixed models, specified with a linear fixed covariate effect and a Gaussian Process (GP)-distributed spatial random effect, relating to the response via a link function. The assumption of linear covariate effects is severely restrictive. Random Forests (RF) are increasingly being used for non-linear modeling of spatial data, but current extensions of RF for binary spatial data depart the mixed model setup, relinquishing inference on the fixed effects and other advantages of using GP. We propose RF-GP, using Random Forests for estimating the non-linear covariate effect and Gaussian Processes for modeling the spatial random effects directly within the generalized mixed model framework. We observe and exploit equivalence of Gini impurity measure and least squares loss to propose an extension of RF for binary data that accounts for the spatial dependence. We then propose a novel link inversion algorithm that leverages the properties of GP to estimate the covariate effects and offer spatial predictions. RF-GP outperforms existing RF methods for estimation and prediction in both simulated and real-world data. We establish consistency of RF-GP for a general class of $\beta$-mixing binary processes that includes common choices like spatial Mat\'ern GP and autoregressive processes.

artificial intelligence, machine learning, spatial reasoning, (18 more...)

arXiv.org Machine Learning

2302.13828

Country:

South America > French Guiana (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Netherlands (0.04)
Africa (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)

Add feedback

Book Review: Tree-based Methods for Statistical Learning in R - insideBIGDATA

#artificialintelligenceFeb-23-2023, 19:35:42 GMT

Here's a new title that is a "must have" for any data scientist who uses the R language. It's a wonderful learning resource for tree-based techniques in statistical learning, one that's become my go-to text when I find the need to do a deep dive into various ML topic areas for my work. The methods discussed represent the cornerstone for using tabular data sets for making predictions using decision trees, ensemble methods like random forest, and of course the industry's darling gradient boosting machines (GBM). Algorithms like XGBoost are king of the hill for solving problems involving tabular data. A number of timely and somewhat high-profile benchmarks show that this class of algorithm beats deep learning algorithms for many problem domains.

insidebigdata, statistical learning, tree-based method, (14 more...)

#artificialintelligence

Genre: Summary/Review (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.56)

Add feedback

A Comparison of Modeling Preprocessing Techniques

Johnson, Tosan, Liu, Alice J., Raza, Syed, McGuire, Aaron

arXiv.org Machine LearningFeb-23-2023

This paper compares the performance of various data processing methods in terms of predictive performance for structured data. This paper also seeks to identify and recommend preprocessing methodologies for tree-based binary classification models, with a focus on eXtreme Gradient Boosting (XGBoost) models. Three data sets of various structures, interactions, and complexity were constructed, which were supplemented by a real-world data set from the Lending Club. We compare several methods for feature selection, categorical handling, and null imputation. Performance is assessed using relative comparisons among the chosen methodologies, including model prediction variability. This paper is presented by the three groups of preprocessing methodologies, with each section consisting of generalized observations. Each observation is accompanied by a recommendation of one or more preferred methodologies. Among feature selection methods, permutation-based feature importance, regularization, and XGBoost's feature importance by weight are not recommended. The correlation coefficient reduction also shows inferior performance. Instead, XGBoost importance by gain shows the most consistency and highest caliber of performance. Categorical featuring encoding methods show greater discrimination in performance among data set structures. While there was no universal "best" method, frequency encoding showed the greatest performance for the most complex data sets (Lending Club), but had the poorest performance for all synthetic (i.e., simpler) data sets. Finally, missing indicator imputation dominated in terms of performance among imputation methods, whereas tree imputation showed extremely poor and highly variable model performance.

artificial intelligence, imputation, machine learning, (16 more...)

arXiv.org Machine Learning

2302.12042

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry: Banking & Finance > Loans (0.55)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

From Shapley Values to Generalized Additive Models and back

Bordt, Sebastian, von Luxburg, Ulrike

arXiv.org Artificial IntelligenceFeb-23-2023

In explainable machine learning, local post-hoc explanation algorithms and inherently interpretable models are often seen as competing approaches. This work offers a partial reconciliation between the two by establishing a correspondence between Shapley Values and Generalized Additive Models (GAMs). We introduce $n$-Shapley Values, a parametric family of local post-hoc explanation algorithms that explain individual predictions with interaction terms up to order $n$. By varying the parameter $n$, we obtain a sequence of explanations that covers the entire range from Shapley Values up to a uniquely determined decomposition of the function we want to explain. The relationship between $n$-Shapley Values and this decomposition offers a functionally-grounded characterization of Shapley Values, which highlights their limitations. We then show that $n$-Shapley Values, as well as the Shapley Taylor- and Faith-Shap interaction indices, recover GAMs with interaction terms up to order $n$. This implies that the original Shapely Values recover GAMs without variable interactions. Taken together, our results provide a precise characterization of Shapley Values as they are being used in explainable machine learning. They also offer a principled interpretation of partial dependence plots of Shapley Values in terms of the underlying functional decomposition. A package for the estimation of different interaction indices is available at \url{https://github.com/tml-tuebingen/nshap}.

alue, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2209.04012

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.34)
North America > United States > California (0.05)
Europe > Spain (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Banking & Finance (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.47)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.46)

Add feedback

Unifying local and global model explanations by functional decomposition of low dimensional structures

Hiabu, Munir, Meyer, Joseph T., Wright, Marvin N.

arXiv.org Artificial IntelligenceFeb-23-2023

We consider a global representation of a regression or classification function by decomposing it into the sum of main and interaction components of arbitrary order. We propose a new identification constraint that allows for the extraction of interventional SHAP values and partial dependence plots, thereby unifying local and global explanations. With our proposed identification, a feature's partial dependence plot corresponds to the main effect term plus the intercept. The interventional SHAP value of feature $k$ is a weighted sum of the main component and all interaction components that include $k$, with the weights given by the reciprocal of the component's dimension. This brings a new perspective to local explanations such as SHAP values which were previously motivated by game theory only. We show that the decomposition can be used to reduce direct and indirect bias by removing all components that include a protected feature. Lastly, we motivate a new measure of feature importance. In principle, our proposed functional decomposition can be applied to any machine learning model, but exact calculation is only feasible for low-dimensional structures or ensembles of those. We provide an algorithm and efficient implementation for gradient-boosted trees (xgboost) and random planted forest. Conducted experiments suggest that our method provides meaningful explanations and reveals interactions of higher orders. The proposed methods are implemented in an R package, available at \url{https://github.com/PlantedML/glex}.

artificial intelligence, decomposition, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2208.06151

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
Europe > Germany > Bremen > Bremen (0.04)

Genre: Research Report (0.84)

Industry: Leisure & Entertainment > Games (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.88)

Add feedback

Tree-Based Machine Learning Methods For Vehicle Insurance Claims Size Prediction

Terefe, Edossa Merga

arXiv.org Artificial IntelligenceFeb-21-2023

Vehicle insurance claims size prediction needs methods to efficiently handle these claims. Machine learning (ML) is one of the methods that solve this problem. Tree-based ensemble learning algorithms are highly effective and widely used ML methods. This study considers how vehicle insurance providers incorporate ML methods in their companies and explores how the models can be applied to insurance big data. We utilize various tree-based ML methods, such as bagging, random forest, and gradient boosting, to determine the relative importance of predictors in predicting claims size and to explore the relationships between claims size and predictors. Furthermore, we evaluate and compare these models' performances. The results show that tree-based ensemble methods are better than the classical least square method. Keywords: claims size prediction; machine learning; tree-based ensemble methods; vehicle insurance.

artificial intelligence, machine learning, predictor, (18 more...)

arXiv.org Artificial Intelligence

2302.10612

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Industry: Banking & Finance > Insurance (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

An Efficient Two-stage Gradient Boosting Framework for Short-term Traffic State Estimation

Lu, Yichao

arXiv.org Artificial IntelligenceFeb-20-2023

Real-time traffic state estimation is essential for intelligent transportation systems. The NeurIPS 2022 Traffic4cast challenge provides an excellent testbed for benchmarking short-term traffic state estimation approaches. This technical report describes our solution to this challenge. In particular, we present an efficient two-stage gradient boosting framework for short-term traffic state estimation. The first stage derives the month, day of the week, and time slot index based on the sparse loop counter data, and the second stage predicts the future traffic states based on the sparse loop counter data and the derived month, day of the week, and time slot index. Experimental results demonstrate that our two-stage gradient boosting framework achieves strong empirical performance, achieving third place in both the core and the extended challenges while remaining highly efficient.

artificial intelligence, machine learning, neural network, (15 more...)

arXiv.org Artificial Intelligence

2302.104

Country: North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)

Genre: Research Report (0.70)

Industry: Transportation > Infrastructure & Services (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

Reproducing Random Forest Efficacy in Detecting Port Scanning

Pittman, Jason M.

arXiv.org Artificial IntelligenceFeb-18-2023

Port scanning is the process of attempting to connect to various network ports on a computing endpoint to determine which ports are open and which services are running on them. It is a common method used by hackers to identify vulnerabilities in a network or system. By determining which ports are open, an attacker can identify which services and applications are running on a device and potentially exploit any known vulnerabilities in those services. Consequently, it is important to detect port scanning because it is often the first step in a cyber attack. By identifying port scanning attempts, cybersecurity professionals can take proactive measures to protect the systems and networks before an attacker has a chance to exploit any vulnerabilities. Against this background, researchers have worked for over a decade to develop robust methods to detect port scanning. One such method revealed by a recent systematic review is the random forest supervised machine learning algorithm. The review revealed six existing studies using random forest since 2021. Unfortunately, those studies each exhibit different results, do not all use the same training and testing dataset, and only two include source code. Accordingly, the goal of this work was to reproduce the six random forest studies while addressing the apparent shortcomings. The outcomes are significant for researchers looking to explore random forest to detect port scanning and for practitioners interested in reliable technology to detect the early stages of cyber attack.

artificial intelligence, decision tree learning, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2302.09317

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback