AITopics | Ensemble Learning

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)
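
As a minimal illustration of the definition above, the following sketch combines three weak classifiers by majority vote. The toy data and the helper names (`make_feature_classifier`, `majority_vote`) are assumptions made for illustration and do not come from any of the papers listed on this page.

```python
from collections import Counter

# Toy data: each row is ((f0, f1, f2), label). Values are illustrative only.
DATA = [
    ((1, 1, 0), 1), ((1, 0, 1), 1), ((0, 1, 1), 1),
    ((0, 0, 1), 0), ((0, 1, 0), 0), ((1, 0, 0), 0),
]

def make_feature_classifier(index):
    """Hypothetical weak learner: predict the value of one binary feature."""
    return lambda x: x[index]

def majority_vote(classifiers, x):
    """Return the label chosen by most classifiers for input x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

def accuracy(predict, data):
    """Fraction of (x, label) pairs for which predict(x) equals the label."""
    return sum(predict(x) == y for x, y in data) / len(data)

classifiers = [make_feature_classifier(i) for i in range(3)]
individual = [accuracy(clf, DATA) for clf in classifiers]
ensemble = accuracy(lambda x: majority_vote(classifiers, x), DATA)
print(individual, ensemble)
```

On this toy data each weak classifier is wrong on two of the six rows, but their mistakes do not overlap, so the vote classifies every row correctly.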

Predicting Accident Severity: An Analysis Of Factors Affecting Accident Severity Using Random Forest Model

Adefabi, Adekunle, Olisah, Somtobe, Obunadike, Callistus, Oyetubo, Oluwatosin, Taiwo, Esther, Tella, Edward

arXiv.org Artificial Intelligence, Oct-9-2023

Road accidents have significant economic and societal costs, with a small number of severe accidents accounting for a large portion of these costs. Predicting accident severity can support a proactive approach to road safety by identifying potentially unsafe road conditions and taking well-informed actions to reduce the number of severe accidents. This study investigates the effectiveness of the Random Forest machine learning algorithm for predicting the severity of an accident. The model is trained on a dataset of accident records from a large metropolitan area and evaluated using various metrics. Hyperparameters and feature selection are optimized to improve the model's performance. The results show that the Random Forest model is an effective tool for predicting accident severity with an accuracy of over 80%. The study also identifies the top six most important variables in the model, which include wind speed, pressure, humidity, visibility, clear conditions, and cloud cover. The fitted model has an Area Under the Curve of 80%, a recall of 79.2%, a precision of 97.1%, and an F1 score of 87.3%. These results suggest that the proposed model has higher performance in explaining the target variable, which is the accident severity class. Overall, the study provides evidence that the Random Forest model is a viable and reliable tool for predicting accident severity and can be used to help reduce the number of fatalities and injuries due to road accidents in the United States.
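
The precision, recall, and F1 figures quoted in this abstract can be cross-checked, since F1 is the harmonic mean of precision and recall. The short sketch below is only an arithmetic check on the rounded values from the abstract, not a reproduction of the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision = 0.971  # rounded value quoted in the abstract
recall = 0.792     # rounded value quoted in the abstract
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")
```

The result is about 0.872, which agrees with the reported 87.3% to within the rounding of the inputs.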

accident, random forest model, severity, (12 more...)

doi: 10.5121/ijci.2023.120609

2310.0584

Country:

North America > United States > Tennessee (0.04)
North America > United States > California (0.04)
North America > United States > Texas > El Paso County (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Transportation > Ground > Road (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)

Boosted Control Functions

Gnecco, Nicola, Peters, Jonas, Engelke, Sebastian, Pfister, Niklas

arXiv.org Machine Learning, Oct-9-2023

Modern machine learning methods and the availability of large-scale data have opened the door to accurately predicting target quantities from large sets of covariates. However, existing prediction methods can perform poorly when the training and testing data are different, especially in the presence of hidden confounding. While hidden confounding is well studied for causal effect estimation (e.g., instrumental variables), this is not the case for prediction tasks. This work aims to bridge this gap by addressing predictions under different training and testing distributions in the presence of unobserved confounding. In particular, we establish a novel connection between the field of distribution generalization from machine learning and simultaneous equation models and control functions from econometrics. Central to our contribution are simultaneous equation models for distribution generalization (SIMDGs), which describe the data-generating process under a set of distributional shifts. Within this framework, we propose a strong notion of invariance for a predictive model and compare it with existing (weaker) versions. Building on the control function approach from instrumental variable regression, we propose the boosted control function (BCF) as a target of inference and prove its ability to successfully predict even in intervened versions of the underlying SIMDG. We provide necessary and sufficient conditions for identifying the BCF and show that it is worst-case optimal. We introduce the ControlTwicing algorithm to estimate the BCF and analyze its predictive performance on simulated and real-world data.

artificial intelligence, machine learning, modeling & simulation, (16 more...)

2310.05805

Country:

North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (0.63)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Robust-GBDT: A Novel Gradient Boosting Model for Noise-Robust Classification

Luo, Jiaqi, Quan, Yuedong, Xu, Shixin

arXiv.org Artificial Intelligence, Oct-8-2023

Robust boosting algorithms have emerged as alternative solutions to traditional boosting techniques for addressing label noise in classification tasks. However, these methods have predominantly focused on binary classification, limiting their applicability to multi-class tasks. Furthermore, they encounter challenges with imbalanced datasets, missing values, and computational efficiency. In this paper, we establish that the loss function employed in advanced Gradient Boosting Decision Trees (GBDT), particularly Newton's method-based GBDT, need not necessarily exhibit global convexity. Instead, the loss function only requires convexity within a specific region. Consequently, these GBDT models can leverage the benefits of nonconvex robust loss functions, making them resilient to noise. Building upon this theoretical insight, we introduce a new noise-robust boosting model called Robust-GBDT, which seamlessly integrates the advanced GBDT framework with robust losses. Additionally, we enhance the existing robust loss functions and introduce a novel robust loss function, Robust Focal Loss, designed to address class imbalance. As a result, Robust-GBDT generates more accurate predictions, significantly enhancing its generalization capabilities, especially in scenarios marked by label noise and class imbalance. Furthermore, Robust-GBDT is user-friendly and can easily integrate existing open-source code, enabling it to effectively handle complex datasets while improving computational efficiency. Numerous experiments confirm the superiority of Robust-GBDT over other noise-robust methods.
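
To make the reference to Newton's method-based GBDT concrete, the sketch below computes a second-order leaf value from per-sample gradients and Hessians, in the style commonly used by second-order boosting libraries. It uses the standard logistic loss, not the Robust Focal Loss proposed in the paper, and the regularization term `lam` and the toy labels are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_hess(y, raw_score):
    """First and second derivatives of the log loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)

def newton_leaf_value(labels, raw_scores, lam=1.0):
    """Second-order leaf value: -sum(g) / (sum(h) + lam)."""
    pairs = [logistic_grad_hess(y, s) for y, s in zip(labels, raw_scores)]
    grad_sum = sum(g for g, _ in pairs)
    hess_sum = sum(h for _, h in pairs)
    return -grad_sum / (hess_sum + lam)

# Four samples falling in one leaf, all starting from a raw score of 0.
labels = [1, 1, 1, 0]
raw_scores = [0.0, 0.0, 0.0, 0.0]
leaf = newton_leaf_value(labels, raw_scores)
print(leaf)  # 0.5
```

The step depends only on the gradients and Hessians evaluated at the current scores, which is the kind of local information the abstract's convexity argument is concerned with.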

noise-robust classification, robust-gbdt

2310.05067

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.60)

Overview of AdaBoost : Reconciling its views to better understand its dynamics

Beja-Battais, Perceval

arXiv.org Machine Learning, Oct-6-2023

Boosting methods were introduced in the late 1980s. They grew out of the theoretical framework of PAC learning. The main idea of boosting methods is to combine weak learners to obtain a strong learner. The weak learners are obtained iteratively by a heuristic that tries to correct the mistakes of the previous weak learner. In 1995, Freund and Schapire [18] introduced AdaBoost, a boosting algorithm that is still widely used today. Since then, many views of the algorithm have been proposed to properly tame its dynamics. In this paper, we will try to cover all the views that one can have on AdaBoost. We will start with the original view of Freund and Schapire before covering the different views and unifying them under the same formalism. We hope this paper will help the non-expert reader to better understand the dynamics of AdaBoost and how the different views are equivalent and related to each other.
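
The iterative reweighting idea described in this abstract can be sketched as a small discrete AdaBoost with threshold stumps as weak learners. This is a minimal sketch under stated assumptions, not code from the paper: the one-dimensional toy data and all function names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Threshold stump: predict `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; return a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified points, down-weight the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (labels in {-1, +1}) that no single stump separates.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
for rounds in (1, 3):
    model = adaboost(xs, ys, rounds)
    mistakes = sum(predict(model, x) != y for x, y in zip(xs, ys))
    print(rounds, mistakes)
```

On this toy data a single stump misclassifies three points, while the weighted vote of three stumps fits all ten training points.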

adaboost, artificial intelligence, machine learning, (16 more...)

2310.18323

Country:

North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
Europe > Italy > Sardinia > Cagliari (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.67)

ML4EJ: Decoding the Role of Urban Features in Shaping Environmental Injustice Using Interpretable Machine Learning

Ho, Yu-Hsuan, Liu, Zhewei, Lee, Cheng-Chun, Mostafavi, Ali

arXiv.org Artificial Intelligence, Oct-3-2023

Understanding the key factors shaping environmental hazard exposures and their associated environmental injustice issues is vital for formulating equitable policy measures. Traditional perspectives on environmental injustice have primarily focused on the socioeconomic dimensions, often overlooking the influence of heterogeneous urban characteristics. This limited view may obstruct a comprehensive understanding of the complex nature of environmental justice and its relationship with urban design features. To address this gap, this study creates an interpretable machine learning model to examine the effects of various urban features and their non-linear interactions on the exposure disparities of three primary hazards: air pollution, urban heat, and flooding. The analysis trains and tests models with data from six metropolitan counties in the United States using Random Forest and XGBoost. The performance is used to measure the extent to which variations of urban features shape disparities in environmental hazard levels. In addition, the analysis of feature importance reveals features related to social-demographic characteristics as the most prominent urban features that shape hazard extent. Features related to infrastructure distribution and land cover are relatively important for urban heat and air pollution exposure, respectively. Moreover, we evaluate the models' transferability across different regions and hazards. The results highlight limited transferability, underscoring the intricate differences among hazards and regions and the way in which urban features shape hazard exposures. The insights gleaned from this study offer fresh perspectives on the relationship among urban features and their interplay with environmental hazard exposure disparities, informing the development of more integrated urban design policies to enhance social equity and address environmental injustice issues.

interpretable machine learning, shaping environmental injustice, urban feature, (2 more...)

2310.02476

Country: North America > United States (0.24)

Genre: Research Report (0.40)

Industry: Law > Environmental Law (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.53)

EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis

Corlatescu, Dragos Georgian, Dinu, Alexandru, Gaman, Mihaela, Sumedrea, Paul

arXiv.org Artificial Intelligence, Oct-3-2023

In recent years there has been a shift from heuristics-based malware detection towards machine learning, which proves to be more robust in the current heavily adversarial threat landscape. While we acknowledge machine learning to be better equipped to mine for patterns in the increasingly high amounts of similar-looking files, we also note a remarkable scarcity of the data available for similarity-targeted research. Moreover, we observe that the focus in the few related works falls on quantifying similarity in malware, often overlooking the clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER - one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space. Our contribution is threefold: (1) we publish EMBERSim, an augmented version of EMBER, that includes similarity-informed tags; (2) we enrich EMBERSim with automatically determined malware class tags using the open-source tool AVClass on VirusTotal data and (3) we describe and share the implementation for our class scoring technique and leaf similarity method.

detection, information, similarity, (15 more...)

2310.01835

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(3 more...)

SyMPox: An Automated Monkeypox Detection System Based on Symptoms Using XGBoost

Farzipour, Alireza, Elmi, Roya, Nasiri, Hamid

arXiv.org Artificial Intelligence, Oct-2-2023

Monkeypox is a zoonotic disease. About 87,000 cases of monkeypox had been confirmed by the World Health Organization as of 10 June 2023. The most prevalent methods for identifying this disease are image-based recognition techniques. However, they are not very fast and may be available to only a few individuals. This study presents an independent application named SyMPox, developed to diagnose Monkeypox cases based on symptoms. SyMPox utilizes the robust XGBoost algorithm to analyze symptom patterns and provide accurate assessments. Developed using the Gradio framework, SyMPox offers a user-friendly platform for individuals to assess their symptoms and obtain reliable Monkeypox diagnoses.

automated monkeypox detection system, symptom, xgboost, (1 more...)

2310.19801

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.60)

A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route

Chhabria, Vidya A., Jiang, Wenjing, Kahng, Andrew B., Sapatnekar, Sachin S.

arXiv.org Artificial Intelligence, Oct-2-2023

Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a "complete" netlist. The paper first documents that having "oracle knowledge" of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including the use of features for macro blockages for accurate predictions for designs with macros. Based on a set of experimental evaluations, it is demonstrated that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows - OpenROAD and a commercial tool flow - and results on 45nm bulk and 12nm FinFET enablements show improvements in post-DR slack metrics without increasing congestion. The models are demonstrated to be generalizable to designs generated under different clock period constraints and are robust to training data with small levels of noise.

macro, ml model, wire delay, (16 more...)

2305.06917

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
(4 more...)

Genre: Research Report (0.82)

Industry: Energy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.46)

Enhancing Mortality Prediction in Heart Failure Patients: Exploring Preprocessing Methods for Imbalanced Clinical Datasets

Kia, Hanif, Vali, Mansour, Sabahi, Hadi

arXiv.org Artificial Intelligence, Sep-30-2023

Heart failure (HF) is a critical condition in which the accurate prediction of mortality plays a vital role in guiding patient management decisions. However, clinical datasets used for mortality prediction in HF often suffer from an imbalanced distribution of classes, posing significant challenges. In this paper, we explore preprocessing methods for enhancing one-month mortality prediction in HF patients. We present a comprehensive preprocessing framework including scaling, outlier processing, and resampling as key techniques. We also employed an aware encoding approach to effectively handle missing values in clinical datasets. Our study utilizes a comprehensive dataset from the Persian Registry Of cardio Vascular disease (PROVE) with a significant class imbalance. By leveraging appropriate preprocessing techniques and Machine Learning (ML) algorithms, we aim to improve mortality prediction performance for HF patients. The results reveal an average enhancement of approximately 3.6% in F1 score and 2.7% in MCC for tree-based models, specifically Random Forest (RF) and XGBoost (XGB). This demonstrates the efficiency of our preprocessing approach in effectively handling Imbalanced Clinical Datasets (ICD). Our findings hold promise in guiding healthcare professionals to make informed decisions and improve patient outcomes in HF management.
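
The abstract reports gains in MCC (Matthews correlation coefficient) alongside F1. The sketch below shows how MCC is computed from a confusion matrix and why it can be more informative than accuracy on imbalanced data. The confusion-matrix counts are hypothetical and are not taken from the PROVE dataset.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 if a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced case: 10 positives and 90 negatives.
tp, fn, fp, tn = 5, 5, 5, 85
acc = accuracy(tp, fp, fn, tn)
score = mcc(tp, fp, fn, tn)
print(f"accuracy = {acc:.2f}, MCC = {score:.3f}")
```

With these counts accuracy is 0.90 even though half of the positive cases are missed, while MCC is about 0.444, which reflects the weak performance on the minority class.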

algorithm, dataset, prediction, (14 more...)

2310.00457

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Iran > Isfahan Province > Isfahan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Predicting Swarm Equatorial Plasma Bubbles via Machine Learning and Shapley Values

Reddy, S. A., Forsyth, C., Aruliah, A., Smith, A., Bortnik, J., Aa, E., Kataria, D. O., Lewis, G.

arXiv.org Artificial Intelligence, Sep-30-2023

In this study we present AI Prediction of Equatorial Plasma Bubbles (APE), a machine learning model that can accurately predict the Ionospheric Bubble Index (IBI) on the Swarm spacecraft. IBI is a correlation ($R^2$) between perturbations in plasma density and the magnetic field, whose source can be Equatorial Plasma Bubbles (EPBs). EPBs have been studied for a number of years, but their day-to-day variability has made predicting them a considerable challenge. We build an ensemble machine learning model to predict IBI. We use data from 2014-22 at a resolution of 1 s, and transform it from a time series into a 6-dimensional space with a corresponding EPB $R^2$ (0-1) acting as the label. APE performs well across all metrics, exhibiting skill, association, and root mean squared error scores of 0.96, 0.98, and 0.08, respectively. The model performs best post-sunset, in the American/Atlantic sector, around the equinoxes, and when solar activity is high. This is promising because EPBs are most likely to occur during these periods. Shapley values reveal that F10.7 is the most important feature in driving the predictions, whereas latitude is the least. The analysis also examines the relationship between the features, which reveals new insights into EPB climatology. Finally, the selection of the features means that APE could be expanded to forecasting EPBs following additional investigations into their onset.
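
Shapley values, used in this abstract to rank features, attribute a prediction to each feature by averaging its marginal contribution over all feature orderings. The sketch below computes them exactly for a small hypothetical three-feature model. The model, the all-zero baseline, and the function names are assumptions for illustration and have nothing to do with the APE model itself.

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            totals[feature] += value_fn(frozenset(present)) - before
    return [t / len(orderings) for t in totals]

def model(x):
    """Hypothetical model with an interaction between features 0 and 2."""
    return 2 * x[0] + 1 * x[1] + 4 * x[0] * x[2]

INSTANCE = (1, 1, 1)   # point being explained
BASELINE = (0, 0, 0)   # reference values for "absent" features

def value_fn(subset):
    """Model output when only the features in `subset` take their real value."""
    x = [INSTANCE[i] if i in subset else BASELINE[i] for i in range(3)]
    return model(x)

phi = shapley_values(value_fn, 3)
print(phi)  # [4.0, 1.0, 2.0]
```

The attributions sum to the difference between the model output at the instance and at the baseline (7 here), and the interaction term is split evenly between features 0 and 2. Exact enumeration grows factorially with the number of features, so practical tools rely on approximations.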

equatorial plasma bubble, machine learning and shapley value, swarm equatorial plasma bubble, (9 more...)

doi: 10.1029/2022JA031183

2209.13482

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Texas > Bexar County > San Antonio (0.04)
(3 more...)

Genre: Research Report (0.84)

Industry: Energy (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
