AITopics | Ensemble Learning

Collaborating Authors

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability

Pinheiro, João Manoel Herrera, Becker, Marcelo

arXiv.org Artificial IntelligenceMar-14-2024

Cancer is one of the diseases that kill the most women in the world, with breast cancer being responsible for the highest number of cancer cases and consequently deaths. However, it can be prevented by early detection and, consequently, early treatment. Any development for detection or perdition this kind of cancer is important for a better healthy life. Many studies focus on a model with high accuracy in cancer prediction, but sometimes accuracy alone may not always be a reliable metric. This study implies an investigative approach to studying the performance of different machine learning algorithms based on boosting to predict breast cancer focusing on the recall metric. Boosting machine learning algorithms has been proven to be an effective tool for detecting medical diseases. The dataset of the University of California, Irvine (UCI) repository has been utilized to train and test the model classifier that contains their attributes. The main objective of this study is to use state-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost and LightGBM to predict and diagnose breast cancer and to find the most effective metric regarding recall, ROC-AUC, and confusion matrix. Furthermore, our study is the first to use these four boosting algorithms with Optuna, a library for hyperparameter optimization, and the SHAP method to improve the interpretability of our model, which can be used as a support to identify and predict breast cancer. We were able to improve AUC or recall for all the models and reduce the False Negative for AdaBoost and LigthGBM the final AUC were more than 99.41\% for all models.

accuracy, algorithm, breast cancer, (16 more...)

arXiv.org Artificial Intelligence

2403.09548

Country:

North America > United States > California > Orange County > Irvine (0.24)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
(6 more...)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

iBRF: Improved Balanced Random Forest Classifier

Newaz, Asif, Mohosheu, Md. Salman, Noman, MD. Abdullah al, Jabid, Dr. Taskeed

arXiv.org Artificial IntelligenceMar-14-2024

Class imbalance poses a major challenge in different classification tasks, which is a frequently occurring scenario in many real-world applications. Data resampling is considered to be the standard approach to address this issue. The goal of the technique is to balance the class distribution by generating new samples or eliminating samples from the data. A wide variety of sampling techniques have been proposed over the years to tackle this challenging problem. Sampling techniques can also be incorporated into the ensemble learning framework to obtain more generalized prediction performance. Balanced Random Forest (BRF) and SMOTE-Bagging are some of the popular ensemble approaches. In this study, we propose a modification to the BRF classifier to enhance the prediction performance. In the original algorithm, the Random Undersampling (RUS) technique was utilized to balance the bootstrap samples. However, randomly eliminating too many samples from the data leads to significant data loss, resulting in a major decline in performance. We propose to alleviate the scenario by incorporating a novel hybrid sampling approach to balance the uneven class distribution in each bootstrap sub-sample. Our proposed hybrid sampling technique, when incorporated into the framework of the Random Forest classifier, termed as iBRF: improved Balanced Random Forest classifier, achieves better prediction performance than other sampling techniques used in imbalanced classification tasks. Experiments were carried out on 44 imbalanced datasets on which the original BRF classifier produced an average MCC score of 47.03% and an F1 score of 49.09%. Our proposed algorithm outperformed the approach by producing a far better MCC score of 53.04% and an F1 score of 55%. The results obtained signify the superiority of the iBRF algorithm and its potential to be an effective sampling technique in imbalanced learning.

algorithm, bootstrap subset, classifier, (14 more...)

arXiv.org Artificial Intelligence

2403.09867

Country: Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Consumer Health (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Multivariate Gaussian Approximation for Random Forest via Region-based Stabilization

Shi, Zhaoyang, Bhattacharjee, Chinmoy, Balasubramanian, Krishnakumar, Polonik, Wolfgang

arXiv.org Machine LearningMar-14-2024

We derive Gaussian approximation bounds for random forest predictions based on a set of training points given by a Poisson process, under fairly mild regularity assumptions on the data generating process. Our approach is based on the key observation that the random forest predictions satisfy a certain geometric property called region-based stabilization. In the process of developing our results for the random forest, we also establish a probabilistic result, which might be of independent interest, on multivariate Gaussian approximation bounds for general functionals of Poisson process that are region-based stabilizing. This general result makes use of the Malliavin-Stein method, and is potentially applicable to various related statistical problems.

approximation, random forest, rect, (15 more...)

arXiv.org Machine Learning

2403.0996

Country: North America > United States > California > Yolo County > Davis (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Mondrian Forests: Efficient Online Random Forests

Neural Information Processing SystemsMar-13-2024, 12:53:08 GMT

Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as Breiman's random forest and extremely randomized trees) operate on batches of training data. Online methods are now in greater demand. Existing online random forests, however, require more training data than their batch counterpart to achieve comparable predictive performance. In this work, we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random decision trees we call Mondrian forests. Mondrian forests can be grown in an incremental/online fashion and remarkably, the distribution of online Mondrian forests is the same as that of batch Mondrian forests. Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically retrained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff.

decision tree, random forest, training data, (17 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Online Gradient Boosting

Neural Information Processing SystemsMar-12-2024, 20:43:15 GMT

We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.

algorithm, loss function, online, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.54)

Add feedback

Pruning Random Forests for Prediction on a Budget

Neural Information Processing SystemsMar-12-2024, 10:01:34 GMT

We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down acquiring features based on their utility value and is generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.

acquisition cost, constraint, feature cost, (16 more...)

Neural Information Processing Systems

Country:

South America > Paraguay > Asunción > Asunción (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)

Add feedback

Machine Learning for Soccer Match Result Prediction

Bunker, Rory, Yeung, Calvin, Fujii, Keisuke

arXiv.org Artificial IntelligenceMar-12-2024

Machine learning has become a common approach to predicting the outcomes of soccer matches, and the body of literature in this domain has grown substantially in the past decade and a half. This chapter discusses available datasets, the types of models and features, and ways of evaluating model performance in this application domain. The aim of this chapter is to give a broad overview of the current state and potential future developments in machine learning for soccer match results prediction, as a resource for those interested in conducting future studies in the area. Our main findings are that while gradient-boosted tree models such as CatBoost, applied to soccer-specific ratings such as pi-ratings, are currently the best-performing models on datasets containing only goals as the match features, there needs to be a more thorough comparison of the performance of deep learning models and Random Forest on a range of datasets with different types of features. Furthermore, new rating systems using both player- and team-level information and incorporating additional information from, e.g., spatiotemporal tracking and event data, could be investigated further. Finally, the interpretability of match result prediction models needs to be enhanced for them to be more useful for team management.

match result prediction, prediction, result prediction, (12 more...)

arXiv.org Artificial Intelligence

2403.07669

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > Japan (0.04)
Oceania > New Zealand > North Island > Waikato (0.04)
(8 more...)

Genre: Research Report > New Finding (0.93)

Industry: Leisure & Entertainment > Sports > Soccer (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
(2 more...)

Add feedback

Grafting: Making Random Forests Consistent

Waltz, Nicholas

arXiv.org Machine LearningMar-9-2024

Despite their performance and widespread use, little is known about the theory of Random Forests. A major unanswered question is whether, or when, the Random Forest algorithm is consistent. The literature explores various variants of the classic Random Forest algorithm to address this question and known short-comings of the method. This paper is a contribution to this literature. Specifically, the suitability of grafting consistent estimators onto a shallow CART is explored. It is shown that this approach has a consistency guarantee and performs well in empirical settings.

algorithm, node, random forest, (12 more...)

arXiv.org Machine Learning

2403.06015

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

SzCORE: A Seizure Community Open-source Research Evaluation framework for the validation of EEG-based automated seizure detection algorithms

Dan, Jonathan, Pale, Una, Amirshahi, Alireza, Cappelletti, William, Ingolfsson, Thorir Mar, Wang, Xiaying, Cossettini, Andrea, Bernini, Adriano, Benini, Luca, Beniczky, Sándor, Atienza, David, Ryvlin, Philippe

arXiv.org Artificial IntelligenceMar-8-2024

The need for high-quality automated seizure detection algorithms based on electroencephalography (EEG) becomes ever more pressing with the increasing use of ambulatory and long-term EEG monitoring. Heterogeneity in validation methods of these algorithms influences the reported results and makes comprehensive evaluation and comparison challenging. This heterogeneity concerns in particular the choice of datasets, evaluation methodologies, and performance metrics. In this paper, we propose a unified framework designed to establish standardization in the validation of EEG-based seizure detection algorithms. Based on existing guidelines and recommendations, the framework introduces a set of recommendations and standards related to datasets, file formats, EEG data input content, seizure annotation input and output, cross-validation strategies, and performance metrics. We also propose the 10-20 seizure detection benchmark, a machine-learning benchmark based on public datasets converted to a standardized format. This benchmark defines the machine-learning task as well as reporting metrics. We illustrate the use of the benchmark by evaluating a set of existing seizure detection algorithms. The SzCORE (Seizure Community Open-source Research Evaluation) framework and benchmark are made publicly available along with an open-source software library to facilitate research use, while enabling rigorous evaluation of the clinical significance of the algorithms, fostering a collective effort to more optimally detect seizures to improve the lives of people with epilepsy.

algorithm, dataset, seizure detection algorithm, (14 more...)

arXiv.org Artificial Intelligence

2402.13005

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
(7 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.71)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.46)

Add feedback

Applied Causal Inference Powered by ML and AI

Chernozhukov, Victor, Hansen, Christian, Kallus, Nathan, Spindler, Martin, Syrgkanis, Vasilis

arXiv.org Machine LearningMar-4-2024

This book aims to provide a working introduction to the emerging fusion of modern statistical inference - aka machine learning (ML) or artificial intelligence (AI) - and causal inference methods. The book is aimed at upper level undergraduates and master's-level students as well as doctoral students focusing on applied empirical research. A sufficient background for the core material is one semester of introductory econometrics and one semester of machine learning. We hope the book is also useful to empirical researchers looking to apply modern methods in their work. The book provides an overview of key ideas in both predictive inference and causal inference and shows how predictive tools are key ingredients to answering many causal questions.

equation modelling and conditional exogeneity, intervention induce new counterfactual distribution, random assignment randomized controlled trial, (17 more...)

arXiv.org Machine Learning

2403.02467

Country:

North America > Canada > Ontario > Toronto (0.27)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.13)
North America > United States > New York (0.04)
(21 more...)

Genre:

Workflow (1.00)
Summary/Review (1.00)
Research Report > Strength High (1.00)
(5 more...)

Industry:

Marketing (1.00)
Law (1.00)
Information Technology (1.00)
(10 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(6 more...)

Add feedback