Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting

arXiv.org Machine Learning

Abstract: Predicting traffic incident duration is a major challenge for many traffic centres around the world. Most research studies focus on predicting the incident duration on motorways rather than arterial roads, due to a high network complexity and lack of data. In this paper we propose a bi-level framework for predicting the accident duration on arterial road networks in Sydney, based on operational requirements of incident clearance target which is less than 45 minutes. Using incident baseline information, we first deploy a classification method using various ensemble tree models in order to predict whether a new incident will be cleared in less than 45min or not. If the incident was classified as short-term, then various regression models are developed for predicting the actual incident duration in minutes by incorporating various traffic flow features. After outlier removal and intensive model hyper-parameter tuning through randomized search and cross-validation, we show that the extreme gradient boost approach outperformed all models, including the gradient-boosted decision-trees by almost 53%. Finally, we perform a feature importance evaluation for incident duration prediction and show that the best prediction results are obtained when leveraging the real-time traffic flow in vicinity road sections to the reported accident location. Initial methods used to predict the incident duration were 1. Introduction Bayesian classifiers [5], discrete choice models (DCM) [6], probabilistic distribution analyses [7], and the hazard-based Traffic congestion is a major concern for many cities duration models (HBDM) [8].


Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights

arXiv.org Machine Learning

Reducing traffic accidents is an important public safety challenge, therefore, accident analysis and prediction has been a topic of much research over the past few decades. Using small-scale datasets with limited coverage, being dependent on extensive set of data, and being not applicable for real-time purposes are the important shortcomings of the existing studies. To address these challenges, we propose a new solution for real-time traffic accident prediction using easy-to-obtain, but sparse data. Our solution relies on a deep-neural-network model (which we have named DAP, for Deep Accident Prediction); which utilizes a variety of data attributes such as traffic events, weather data, points-of-interest, and time. DAP incorporates multiple components including a recurrent (for time-sensitive data), a fully connected (for time-insensitive data), and a trainable embedding component (to capture spatial heterogeneity). To fill the data gap, we have - through a comprehensive process of data collection, integration, and augmentation - created a large-scale publicly available database of accident information named US-Accidents. By employing the US-Accidents dataset and through an extensive set of experiments across several large cities, we have evaluated our proposal against several baselines. Our analysis and results show significant improvements to predict rare accident events. Further, we have shown the impact of traffic information, time, and points-of-interest data for real-time accident prediction.


Credit risk prediction in an imbalanced social lending environment

arXiv.org Machine Learning

Credit risk prediction is an effective way of evaluating whether a potential borrower will repay a loan, particularly in peer-to-peer lending where class imbalance problems are prevalent. However, few credit risk prediction models for social lending consider imbalanced data and, further, the best resampling technique to use with imbalanced data is still controversial. In an attempt to address these problems, this paper presents an empirical comparison of various combinations of classifiers and resampling techniques within a novel risk assessment methodology that incorporates imbalanced data. The credit predictions from each combination are evaluated with a G-mean measure to avoid bias towards the majority class, which has not been considered in similar studies. The results reveal that combining random forest and random under-sampling may be an effective strategy for calculating the credit risk associated with loan applicants in social lending markets.


A Data Mining Approach to Flight Arrival Delay Prediction for American Airlines

arXiv.org Machine Learning

In the present scenario of domestic flights in USA, there have been numerous instances of flight delays and cancellations. In the United States, the American Airlines, Inc. have been one of the most entrusted and the world's largest airline in terms of number of destinations served. But when it comes to domestic flights, AA has not lived up to the expectations in terms of punctuality or on-time performance. Flight Delays also result in airline companies operating commercial flights to incur huge losses. So, they are trying their best to prevent or avoid Flight Delays and Cancellations by taking certain measures. This study aims at analyzing flight information of US domestic flights operated by American Airlines, covering top 5 busiest airports of US and predicting possible arrival delay of the flight using Data Mining and Machine Learning Approaches. The Gradient Boosting Classifier Model is deployed by training and hyper-parameter tuning it, achieving a maximum accuracy of 85.73%. Such an Intelligent System is very essential in foretelling flights'on-time performance.


AI Predicts Independent Construction Safety Outcomes from Universal Attributes

arXiv.org Machine Learning

This paper significantly improves on, and finishes to validate, the approach proposed in "Application of Machine Learning to Construction Injury Prediction" (Tixier et al. 2016 [1]). Like in the original study, we use NLP to extract fundamental attributes from raw incident reports and machine learning models are trained to predict safety outcomes (here, these outcomes are injury severity, injury type, bodypart impacted, and incident type). However, in this study, safety outcomes were not extracted via NLP but are independent (human annotations), eliminating any potential source of artificial correlation between predictors and predictands. Results show that attributes are still highly predictive, confirming the validity of the original study. Other improvements brought by the current study include the use of (1) a much larger dataset, (2) two new models (XGBoost andlinear SVM), (3) model stacking, (4) a more straight forward experimental setup with more appropriate performance metrics, and (5) an analysis of per-category attribute importance scores. Finally, the injury severity outcome is well predicted, which was not the case in the original study. This is a significant advancement.