Goto

Collaborating Authors

 Ensemble Learning


Predicting Accident Severity: An Analysis Of Factors Affecting Accident Severity Using Random Forest Model

arXiv.org Artificial Intelligence

Road accidents have significant economic and societal costs, with a small number of severe accidents accounting for a large portion of these costs. Predicting accident severity can help in the proactive approach to road safety by identifying potential unsafe road conditions and taking well-informed actions to reduce the number of severe accidents. This study investigates the effectiveness of the Random Forest machine learning algorithm for predicting the severity of an accident. The model is trained on a dataset of accident records from a large metropolitan area and evaluated using various metrics. Hyperparameters and feature selection are optimized to improve the model's performance. The results show that the Random Forest model is an effective tool for predicting accident severity with an accuracy of over 80%. The study also identifies the top six most important variables in the model, which include wind speed, pressure, humidity, visibility, clear conditions, and cloud cover. The fitted model has an Area Under the Curve of 80%, a recall of 79.2%, a precision of 97.1%, and an F1 score of 87.3%. These results suggest that the proposed model has higher performance in explaining the target variable, which is the accident severity class. Overall, the study provides evidence that the Random Forest model is a viable and reliable tool for predicting accident severity and can be used to help reduce the number of fatalities and injuries due to road accidents in the United States


Boosted Control Functions

arXiv.org Machine Learning

Modern machine learning methods and the availability of large-scale data opened the door to accurately predict target quantities from large sets of covariates. However, existing prediction methods can perform poorly when the training and testing data are different, especially in the presence of hidden confounding. While hidden confounding is well studied for causal effect estimation (e.g., instrumental variables), this is not the case for prediction tasks. This work aims to bridge this gap by addressing predictions under different training and testing distributions in the presence of unobserved confounding. In particular, we establish a novel connection between the field of distribution generalization from machine learning, and simultaneous equation models and control function from econometrics. Central to our contribution are simultaneous equation models for distribution generalization (SIMDGs) which describe the data-generating process under a set of distributional shifts. Within this framework, we propose a strong notion of invariance for a predictive model and compare it with existing (weaker) versions. Building on the control function approach from instrumental variable regression, we propose the boosted control function (BCF) as a target of inference and prove its ability to successfully predict even in intervened versions of the underlying SIMDG. We provide necessary and sufficient conditions for identifying the BCF and show that it is worst-case optimal. We introduce the ControlTwicing algorithm to estimate the BCF and analyze its predictive performance on simulated and real world data.


Robust-GBDT: A Novel Gradient Boosting Model for Noise-Robust Classification

arXiv.org Artificial Intelligence

Robust boosting algorithms have emerged as alternative solutions to traditional boosting techniques for addressing label noise in classification tasks. However, these methods have predominantly focused on binary classification, limiting their applicability to multi-class tasks. Furthermore, they encounter challenges with imbalanced datasets, missing values, and computational efficiency. In this paper, we establish that the loss function employed in advanced Gradient Boosting Decision Trees (GBDT), particularly Newton's method-based GBDT, need not necessarily exhibit global convexity. Instead, the loss function only requires convexity within a specific region. Consequently, these GBDT models can leverage the benefits of nonconvex robust loss functions, making them resilient to noise. Building upon this theoretical insight, we introduce a new noise-robust boosting model called Robust-GBDT, which seamlessly integrates the advanced GBDT framework with robust losses. Additionally, we enhance the existing robust loss functions and introduce a novel robust loss function, Robust Focal Loss, designed to address class imbalance. As a result, Robust-GBDT generates more accurate predictions, significantly enhancing its generalization capabilities, especially in scenarios marked by label noise and class imbalance. Furthermore, Robust-GBDT is user-friendly and can easily integrate existing open-source code, enabling it to effectively handle complex datasets while improving computational efficiency. Numerous experiments confirm the superiority of Robust-GBDT over other noise-robust methods.


Overview of AdaBoost : Reconciling its views to better understand its dynamics

arXiv.org Machine Learning

Boosting methods have been introduced in the late 1980's. They were born following the theoritical aspect of PAC learning. The main idea of boosting methods is to combine weak learners to obtain a strong learner. The weak learners are obtained iteratively by an heuristic which tries to correct the mistakes of the previous weak learner. In 1995, Freund and Schapire [18] introduced AdaBoost, a boosting algorithm that is still widely used today. Since then, many views of the algorithm have been proposed to properly tame its dynamics. In this paper, we will try to cover all the views that one can have on AdaBoost. We will start with the original view of Freund and Schapire before covering the different views and unify them with the same formalism. We hope this paper will help the non-expert reader to better understand the dynamics of AdaBoost and how the different views are equivalent and related to each other.


ML4EJ: Decoding the Role of Urban Features in Shaping Environmental Injustice Using Interpretable Machine Learning

arXiv.org Artificial Intelligence

Understanding the key factors shaping environmental hazard exposures and their associated environmental injustice issues is vital for formulating equitable policy measures. Traditional perspectives on environmental injustice have primarily focused on the socioeconomic dimensions, often overlooking the influence of heterogeneous urban characteristics. This limited view may obstruct a comprehensive understanding of the complex nature of environmental justice and its relationship with urban design features. To address this gap, this study creates an interpretable machine learning model to examine the effects of various urban features and their non-linear interactions to the exposure disparities of three primary hazards: air pollution, urban heat, and flooding. The analysis trains and tests models with data from six metropolitan counties in the United States using Random Forest and XGBoost. The performance is used to measure the extent to which variations of urban features shape disparities in environmental hazard levels. In addition, the analysis of feature importance reveals features related to social-demographic characteristics as the most prominent urban features that shape hazard extent. Features related to infrastructure distribution and land cover are relatively important for urban heat and air pollution exposure respectively. Moreover, we evaluate the models' transferability across different regions and hazards. The results highlight limited transferability, underscoring the intricate differences among hazards and regions and the way in which urban features shape hazard exposures. The insights gleaned from this study offer fresh perspectives on the relationship among urban features and their interplay with environmental hazard exposure disparities, informing the development of more integrated urban design policies to enhance social equity and environmental injustice issues.


EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis

arXiv.org Artificial Intelligence

In recent years there has been a shift from heuristics-based malware detection towards machine learning, which proves to be more robust in the current heavily adversarial threat landscape. While we acknowledge machine learning to be better equipped to mine for patterns in the increasingly high amounts of similar-looking files, we also note a remarkable scarcity of the data available for similarity-targeted research. Moreover, we observe that the focus in the few related works falls on quantifying similarity in malware, often overlooking the clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER - one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space. Our contribution is threefold: (1) we publish EMBERSim, an augmented version of EMBER, that includes similarity-informed tags; (2) we enrich EMBERSim with automatically determined malware class tags using the open-source tool AVClass on VirusTotal data and (3) we describe and share the implementation for our class scoring technique and leaf similarity method.


SyMPox: An Automated Monkeypox Detection System Based on Symptoms Using XGBoost

arXiv.org Artificial Intelligence

Monkeypox is a zoonotic disease. About 87000 cases of monkeypox were confirmed by the World Health Organization until 10th June 2023. The most prevalent methods for identifying this disease are image-based recognition techniques. Still, they are not too fast and could only be available to a few individuals. This study presents an independent application named SyMPox, developed to diagnose Monkeypox cases based on symptoms. SyMPox utilizes the robust XGBoost algorithm to analyze symptom patterns and provide accurate assessments. Developed using the Gradio framework, SyMPox offers a user-friendly platform for individuals to assess their symptoms and obtain reliable Monkeypox diagnoses.


A Machine Learning Approach to Improving Timing Consistency between Global Route and Detailed Route

arXiv.org Artificial Intelligence

Due to the unavailability of routing information in design stages prior to detailed routing (DR), the tasks of timing prediction and optimization pose major challenges. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a "complete" netlist. The paper first documents that having "oracle knowledge" of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed, including the use of features for macro blockages for accurate predictions for designs with macros. Based on a set of experimental evaluations, it is demonstrated that these models show higher accuracy than GR-based timing estimation. When used during post-GR optimization, the ML-based models show demonstrable improvements in post-DR circuit performance. The methodology is applied to two different tool flows - OpenROAD and a commercial tool flow - and results on 45nm bulk and 12nm FinFET enablements show improvements in post-DR slack metrics without increasing congestion. The models are demonstrated to be generalizable to designs generated under different clock period constraints and are robust to training data with small levels of noise.


Enhancing Mortality Prediction in Heart Failure Patients: Exploring Preprocessing Methods for Imbalanced Clinical Datasets

arXiv.org Artificial Intelligence

Heart failure (HF) is a critical condition in which the accurate prediction of mortality plays a vital role in guiding patient management decisions. However, clinical datasets used for mortality prediction in HF often suffer from an imbalanced distribution of classes, posing significant challenges. In this paper, we explore preprocessing methods for enhancing one-month mortality prediction in HF patients. We present a comprehensive preprocessing framework including scaling, outliers processing and resampling as key techniques. We also employed an aware encoding approach to effectively handle missing values in clinical datasets. Our study utilizes a comprehensive dataset from the Persian Registry Of cardio Vascular disease (PROVE) with a significant class imbalance. By leveraging appropriate preprocessing techniques and Machine Learning (ML) algorithms, we aim to improve mortality prediction performance for HF patients. The results reveal an average enhancement of approximately 3.6% in F1 score and 2.7% in MCC for tree-based models, specifically Random Forest (RF) and XGBoost (XGB). This demonstrates the efficiency of our preprocessing approach in effectively handling Imbalanced Clinical Datasets (ICD). Our findings hold promise in guiding healthcare professionals to make informed decisions and improve patient outcomes in HF management.


Predicting Swarm Equatorial Plasma Bubbles via Machine Learning and Shapley Values

arXiv.org Artificial Intelligence

In this study we present AI Prediction of Equatorial Plasma Bubbles (APE), a machine learning model that can accurately predict the Ionospheric Bubble Index (IBI) on the Swarm spacecraft. IBI is a correlation ($R^2$) between perturbations in plasma density and the magnetic field, whose source can be Equatorial Plasma Bubbles (EPBs). EPBs have been studied for a number of years, but their day-to-day variability has made predicting them a considerable challenge. We build an ensemble machine learning model to predict IBI. We use data from 2014-22 at a resolution of 1sec, and transform it from a time-series into a 6-dimensional space with a corresponding EPB $R^2$ (0-1) acting as the label. APE performs well across all metrics, exhibiting a skill, association and root mean squared error score of 0.96, 0.98 and 0.08 respectively. The model performs best post-sunset, in the American/Atlantic sector, around the equinoxes, and when solar activity is high. This is promising because EPBs are most likely to occur during these periods. Shapley values reveal that F10.7 is the most important feature in driving the predictions, whereas latitude is the least. The analysis also examines the relationship between the features, which reveals new insights into EPB climatology. Finally, the selection of the features means that APE could be expanded to forecasting EPBs following additional investigations into their onset.