 Ensemble Learning


Explainable AI Integrated Feature Selection for Landslide Susceptibility Mapping using TreeSHAP

arXiv.org Artificial Intelligence

Landslides have been a regular occurrence and an alarming threat to human life and property in the era of anthropogenic global warming. Early prediction of landslide susceptibility using a data-driven approach is a pressing need. In this study, we explored the most informative features for describing landslide susceptibility using state-of-the-art machine learning algorithms, including XGBoost, LR, KNN, SVM, and AdaBoost. To find the best hyperparameters of each classifier, we used Grid Search with 10-fold cross-validation. The optimized XGBoost outperformed all other classifiers with a cross-validation weighted F1 score of 94.62%. Following this empirical evidence, we examined the XGBoost classifier with TreeSHAP, a game-theory-based statistical algorithm used to explain machine learning models, to identify informative features such as SLOPE, ELEVATION, and TWI that contribute most to the performance of the XGBoost classifier, and features such as LANDUSE, NDVI, and SPI that have less effect on the model's performance. Based on the TreeSHAP explanation of features, we selected the 9 most significant landslide causal factors out of 15. The optimized XGBoost with feature reduction by 40% outperformed all other classifiers on popular evaluation metrics, with a cross-validation weighted F1 score of 95.01% on the training set and an AUC score of 97%.
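The weighted F1 score quoted in this abstract is commonly defined as the support-weighted mean of the per-class F1 scores. The following is a minimal sketch of that common definition, not the authors' evaluation code; the function name `weighted_f1` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1      # weight each class by its support
    return score


# Toy labels (hypothetical): 1 = landslide, 0 = no landslide.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support makes the score less sensitive to a small class than a plain macro average, which matters when landslide and non-landslide samples are unevenly represented.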


Towards an Improved Understanding of Software Vulnerability Assessment Using Data-Driven Approaches

arXiv.org Artificial Intelligence

The thesis advances the field of software security by providing knowledge and automation support for software vulnerability assessment using data-driven approaches. Software vulnerability assessment provides important and multifaceted information to prevent and mitigate dangerous cyber-attacks in the wild. The key contributions include a systematisation of knowledge, along with a suite of novel data-driven techniques and practical recommendations for researchers and practitioners in the area. The thesis results help improve the understanding and inform the practice of assessing ever-increasing vulnerabilities in real-world software systems. This in turn enables more thorough and timely fixing prioritisation and planning of these critical security issues.


Linearly-scalable learning of smooth low-dimensional patterns with permutation-aided entropic dimension reduction

arXiv.org Artificial Intelligence

The problem of efficiently re-arranging or permuting data in some desired way, for example, by deploying sorting algorithms, belongs to the most long-standing questions in computer science and applied mathematics [1]. Many impressive theoretical and practical algorithmic results in this area were established in past decades for problems of one- and low-dimensional data sorting, where one seeks a monotonic (ascending or descending) ordering in one or several data dimensions [1-3]. In their seminal work, Garrett Birkhoff and John von Neumann established a mathematical relationship between permutations of the T-component vector and the multiplication of this vector with a T × T doubly stochastic Markovian operator P, containing only one 1.0 in every row and every column, with all other matrix elements being equal to zero [4, 5]. Such a 'crisp' doubly stochastic Markov operator (containing only zeroes and ones) is referred to as a permutation matrix - as opposed to the 'fuzzy' doubly stochastic Markov operators that contain elements between zero and one. For a vector with T elements there exist T! possible ('crisp') permutation matrices P, that build the edges of the T (T 2)-dimensional polytope of all 'fuzzy' (or 'soft') doubly stochastic Markov matrices [6, 7]. NP-complexity of the original 'crisp' permutation problem - arising when extending the generic sorting and permutation problems (satisfying desired criteria) from one to several dimensions - has led to a growing popularity of methods based on 'soft' relaxations of the permutation matrix, ignited by several very successful mathematical and algorithmic approaches that dwell on the spectral decomposition of the 'soft'/'fuzzy' Markov operator [8, 9] and allow for a metastable decomposition and a reduced analysis of high-dimensional systems from various areas [10-12].
Further, the ideas of 'soft' Markovian relaxation for permutations were explored in the areas of graph-matching and graph-alignment, leading to new approaches to these problems - like the new spectral criteria for checking the matching of this 'fuzzy'/continuous relaxation to the original 'crisp' graph-matching permutation [13]. These 'soft' permutation ideas were further applied to the supervised graph-permutation and graph-alignment problems [14-16]. In the literature, it is argued that the 'soft' permutations allow reducing NP-hard to P-hard algorithmic solutions, but at the same time it is not clear how the loss of 'crispness' for the resulting 'soft' permutation matrices can avoid leading to such 'soft' permutation relaxation extremes as the stochastic matrices with all of the elements being equal to
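The relationship between permutations and doubly stochastic matrices described in this abstract can be illustrated with a small sketch. It is only a toy demonstration, not the paper's method; the helper names `matvec` and `is_doubly_stochastic` are hypothetical, and the uniform matrix is used here as one assumed example of a fully 'soft' operator.

```python
def matvec(P, x):
    """Multiply a square matrix P (list of rows) by a vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def is_doubly_stochastic(P, tol=1e-12):
    """Check non-negativity and that every row and column sums to one."""
    n = len(P)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in P)
    cols_ok = all(abs(sum(P[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    nonneg = all(v >= 0 for row in P for v in row)
    return rows_ok and cols_ok and nonneg


x = [3.0, 1.0, 2.0]

# 'Crisp' permutation matrix: exactly one 1 per row and per column.
# This particular choice reorders x into ascending order.
P_crisp = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]

# 'Soft' doubly stochastic matrix (assumed example): all entries equal.
P_soft = [[1 / 3] * 3 for _ in range(3)]

sorted_x = matvec(P_crisp, x)     # a re-ordering of x
averaged_x = matvec(P_soft, x)    # every entry collapses to the mean of x
print(sorted_x, averaged_x)
```

Both matrices pass the doubly stochastic check, but only the crisp one preserves the individual values of x; the uniform one replaces every entry with the mean, which is the kind of information loss a soft relaxation has to guard against.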


Ensemble Framework for Cardiovascular Disease Prediction

arXiv.org Artificial Intelligence

Heart disease is the major cause of non-communicable and silent death worldwide. Heart diseases, or cardiovascular diseases, are classified into four types: coronary heart disease, heart failure, congenital heart disease, and cardiomyopathy. It is vital to diagnose heart disease early and accurately in order to avoid further injury and save patients' lives. As a result, we need a system that can predict cardiovascular disease before it becomes critical. Machine learning has piqued the interest of researchers in the field of medical sciences. For heart disease prediction, researchers implement a variety of machine learning methods and approaches. In this work, we have used the dataset from IEEE Data Port, which is, to the best of our knowledge, one of the largest datasets available online for individuals with cardiovascular disease. The dataset is a combination of the Hungarian, Cleveland, Long Beach VA, Switzerland, and Statlog datasets, with important features such as Maximum Heart Rate Achieved, Serum Cholesterol, Chest Pain Type, Fasting Blood Sugar, and so on. To assess the efficacy and strength of the developed model, several performance measures are used, such as the ROC curve, AUC, specificity, F1-score, sensitivity, MCC, and accuracy. In this study, we propose a framework with a stacked ensemble classifier using several machine learning algorithms, including ExtraTrees Classifier, Random Forest, XGBoost, and so on. Our proposed framework attained an accuracy of 92.34%, which is higher than that reported in the existing literature.
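Several of the performance measures listed in this abstract follow directly from a binary confusion matrix. The sketch below uses the standard textbook formulas; the function name `binary_metrics` and the counts are hypothetical and are not results from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def safe_div(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),   # true positive rate (recall)
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting MCC alongside accuracy is useful because MCC uses all four cells of the confusion matrix and stays informative when the classes are imbalanced.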


Predicting Real-time Crash Risks during Hurricane Evacuation Using Connected Vehicle Data

arXiv.org Artificial Intelligence

Hurricane evacuation, ordered to save the lives of people in coastal regions, generates high traffic demand with increased crash risk. To mitigate such risk, transportation agencies need to anticipate highway locations with high crash risks to deploy appropriate countermeasures. With ubiquitous sensors and communication technologies, it is now possible to retrieve micro-level vehicular data containing individual vehicle trajectory and speed information. Such high-resolution vehicle data, potentially available in real time, can be used to assess prevailing traffic safety conditions. Using vehicle speed and acceleration profiles, potential crash risks can be predicted in real time. Previous studies on real-time crash risk prediction mainly used data from infrastructure-based sensors, which may not cover many road segments. In this paper, we present methods to determine potential crash risks during hurricane evacuation from an emerging alternative data source known as connected vehicle data. Such data contain vehicle location, speed, and acceleration information collected at a very high frequency (less than 30 seconds). To predict potential crash risks, we utilized a dataset collected during the evacuation period of Hurricane Ida on Interstate-10 (I-10) in the state of Louisiana. Multiple machine learning models were trained considering weather features and different traffic characteristics extracted from the connected vehicle data in 5-minute intervals. The results indicate that the Gaussian Process Boosting (GPBoost) and Extreme Gradient Boosting (XGBoost) models perform better (recall = 0.91) than other models. Real-time connected vehicle data for crash risk assessment will allow traffic managers to use resources efficiently and take safety measures proactively.


Well-Calibrated Probabilistic Predictive Maintenance using Venn-Abers

arXiv.org Artificial Intelligence

When using machine learning for fault detection, a common problem is that most data sets are very unbalanced, with the minority class (a fault) being the interesting one. In this paper, we investigate the usage of Venn-Abers predictors, looking specifically at the effect on the minority class predictions. A key property of Venn-Abers predictors is that they output well-calibrated probability intervals. In the experiments, we apply Venn-Abers calibration to decision trees, random forests and XGBoost models, showing how both overconfident and underconfident models are corrected. In addition, the benefit of using the valid probability intervals produced by Venn-Abers for decision support is demonstrated. When using techniques producing opaque underlying models, e.g., random forest and XGBoost, each prediction will consist of not only the label, but also a valid probability interval, where the width is an indication of the confidence in the estimate. Adding Venn-Abers on top of a decision tree allows inspection and analysis of the model, both to understand the underlying relationship and to find out in which parts of feature space the model is accurate and/or confident.
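The probability intervals mentioned in this abstract can be sketched with an inductive Venn-Abers predictor: isotonic regression is fitted twice on a held-out calibration set, once with the test score labelled 0 and once labelled 1, and the two fitted values at the test score form the interval. The code below is a minimal, unoptimized sketch under that assumption, not the authors' implementation; the function names and the toy calibration data are hypothetical.

```python
def isotonic_fit(points):
    """Non-decreasing least-squares fit via pool-adjacent-violators.

    points: iterable of (score, label). Returns {score: fitted value}.
    """
    # Aggregate tied scores into (sum of labels, count).
    agg = {}
    for s, y in points:
        tot, cnt = agg.get(s, (0.0, 0))
        agg[s] = (tot + y, cnt + 1)

    blocks = []  # each block: [label sum, weight, scores in the block]
    for s in sorted(agg):
        tot, cnt = agg[s]
        blocks.append([tot, cnt, [s]])
        # Merge backwards while block means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            tot2, cnt2, scores2 = blocks.pop()
            blocks[-1][0] += tot2
            blocks[-1][1] += cnt2
            blocks[-1][2].extend(scores2)
    return {s: tot / cnt for tot, cnt, scores in blocks for s in scores}


def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): lower and upper probability of the positive class."""
    cal = list(zip(cal_scores, cal_labels))
    p0 = isotonic_fit(cal + [(test_score, 0)])[test_score]
    p1 = isotonic_fit(cal + [(test_score, 1)])[test_score]
    return p0, p1


# Toy calibration set (hypothetical): model scores and true fault labels.
cal_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 0, 1, 1, 1]

print(venn_abers_interval(cal_scores, cal_labels, 0.85))  # narrow interval
print(venn_abers_interval(cal_scores, cal_labels, 0.5))   # wide interval
```

A test score that falls where the calibration data offer no evidence produces a wide interval, which is how the interval width signals low confidence in the estimate.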


Gradient boosting for convex cone predict and optimize problems

arXiv.org Artificial Intelligence

Recently there has been a growing body of research on decision-aware predictive modelling (see for example [5, 4, 15, 16, 18, 21, 25]). A traditional 'predict, then optimize' framework treats the prediction estimation and decision optimization problems independently. As such, an 'objective mismatch' [20] can occur whereby improved prediction accuracy does not result in improved decision accuracy. Conversely, the smart 'predict, then optimize' (SPO) [15] framework optimizes prediction models in order to minimize the final downstream decision regret. To date, the SPO framework has been studied in a general setting for linear and decision tree regression models [15, 16]. In this paper we present dboost, a general purpose framework that combines the strength of gradient boosting with the SPO framework. Previous work [19] considers gradient boosting for integrated prediction and optimization problems but only considers a small subset of optimization problems with linear inequality constraints.
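The 'objective mismatch' described in this abstract can be seen in a toy setting: a decision maker picks the cheapest of a few options using predicted costs, and regret is the extra true cost paid relative to the best option. This sketch is not dboost and makes no claim about its API; the function names and cost vectors are illustrative assumptions.

```python
def mse(pred, true):
    """Mean squared prediction error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def decide(costs):
    """Toy optimization step: choose the index with the lowest cost."""
    return min(range(len(costs)), key=costs.__getitem__)


def regret(pred, true):
    """True cost of the decision made from predictions, minus the optimum."""
    return true[decide(pred)] - min(true)


# Hypothetical true costs of three options and two competing predictions.
true_costs = [10.0, 11.0, 20.0]
pred_a = [12.0, 11.5, 20.0]   # small errors, but ranks option 1 as cheapest
pred_b = [7.0, 11.0, 23.0]    # larger errors, but keeps the right ranking

print("A: mse", mse(pred_a, true_costs), "regret", regret(pred_a, true_costs))
print("B: mse", mse(pred_b, true_costs), "regret", regret(pred_b, true_costs))
```

Prediction A has the lower squared error yet leads to the worse decision, which is the situation a regret-minimizing training objective is meant to avoid.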


Arbitrarily Large Labelled Random Satisfiability Formulas for Machine Learning Training

arXiv.org Artificial Intelligence

Applying deep learning to solve real-life instances of hard combinatorial problems has tremendous potential. Research in this direction has focused on the Boolean satisfiability (SAT) problem, both because of its theoretical centrality and practical importance. A major roadblock faced, though, is that training sets are restricted to random formulas of size several orders of magnitude smaller than formulas of practical interest, raising serious concerns about generalization. This is because labeling random formulas of increasing size rapidly becomes intractable. By exploiting the probabilistic method in a fundamental way, we remove this roadblock entirely: we show how to generate correctly labeled random formulas of any desired size, without having to solve the underlying decision problem. Moreover, the difficulty of the classification task for the formulas produced by our generator is tunable by varying a simple scalar parameter. This opens up an entirely new level of sophistication for the machine learning methods that can be brought to bear on Satisfiability. Using our generator, we train existing state-of-the-art models for the task of predicting satisfiability on formulas with 10,000 variables. We find that they do no better than random guessing. As a first indication of what can be achieved with the new generator, we present a novel classifier that performs significantly better than random guessing 99% on the same datasets, for most difficulty levels. Crucially, unlike past approaches that learn based on syntactic features of a formula, our classifier performs its learning on a short prefix of a solver's computation, an approach that we expect to be of independent interest.


Extrapolation to complete basis-set limit in density-functional theory by quantile random-forest models

arXiv.org Machine Learning

The numerical precision of density-functional-theory (DFT) calculations depends on a variety of computational parameters, one of the most critical being the basis-set size. The ultimate precision is reached with an infinitely large basis set, i.e., in the limit of a complete basis set (CBS). Our aim in this work is to find a machine-learning model that extrapolates finite basis-size calculations to the CBS limit. We start with a data set of 63 binary solids investigated with two all-electron DFT codes, exciting and FHI-aims, which employ very different types of basis sets. A quantile-random-forest model is used to estimate the total-energy correction with respect to a fully converged calculation as a function of the basis-set size. The random-forest model achieves a symmetric mean absolute percentage error of lower than 25% for both codes and outperforms previous approaches in the literature. Our approach also provides prediction intervals, which quantify the uncertainty of the models' predictions.
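The symmetric mean absolute percentage error quoted in this abstract has several variants in the literature, and the abstract does not say which one is used. The sketch below implements one common form, in which each absolute error is divided by the mean of the absolute actual and predicted values; the function name and the sample numbers are hypothetical.

```python
def smape(actual, predicted):
    """Symmetric MAPE in percent (one common variant, assumed here).

    Each term is |p - a| / ((|a| + |p|) / 2); terms with a zero
    denominator are counted as zero error.
    """
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        if denom:
            total += abs(p - a) / denom
    return 100.0 * total / len(actual)


# Hypothetical energy corrections (arbitrary units) and model predictions.
actual = [1.0, 2.0, 4.0]
predicted = [1.0, 3.0, 4.0]
print(f"{smape(actual, predicted):.2f}%")
```

Because the denominator includes both the actual and the predicted value, this variant is bounded above by 200%, which makes it more stable than plain MAPE when the true corrections are close to zero.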


Improve State-Level Wheat Yield Forecasts in Kazakhstan on GEOGLAM's EO Data by Leveraging A Simple Spatial-Aware Technique

arXiv.org Artificial Intelligence

Accurate yield forecasting is essential for making informed policies and long-term decisions for food security. Earth Observation (EO) data and machine learning algorithms play a key role in providing a comprehensive and timely view of crop conditions from field to national scales. However, machine learning algorithms' prediction accuracy is often harmed by spatial heterogeneity caused by exogenous factors not reflected in remote sensing data, such as differences in crop management strategies. In this paper, we propose and investigate a simple technique called state-wise additive bias to explicitly address the cross-region yield heterogeneity in Kazakhstan. Compared to baseline machine learning models (Random Forest, CatBoost, XGBoost), our method reduces the overall RMSE by 8.9% and the highest state-wise RMSE by 28.37%. The effectiveness of state-wise additive bias indicates machine learning's performance can be significantly improved by explicitly addressing the spatial heterogeneity, motivating future work on spatial-aware machine learning algorithms for yield forecasts as well as for general geospatial forecasting problems.
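One way to read the state-wise additive bias idea is as a per-state offset added to a baseline model's predictions. The sketch below estimates each offset as the mean residual of the baseline within that state; this post-hoc formulation is an assumption and may differ from how the paper fits the bias. The function names, state labels, and yield values are hypothetical.

```python
import math
from collections import defaultdict


def fit_state_bias(states, y_true, y_base):
    """Mean residual (true minus baseline prediction) per state."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, t, b in zip(states, y_true, y_base):
        sums[s] += t - b
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}


def apply_state_bias(states, y_base, bias):
    """Add each state's offset; unseen states get no correction."""
    return [b + bias.get(s, 0.0) for s, b in zip(states, y_base)]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))


# Toy data (hypothetical): two states with opposite systematic errors.
states = ["A", "A", "B", "B"]
y_true = [1.0, 1.2, 2.0, 2.2]   # observed yields
y_base = [1.5, 1.7, 1.5, 1.7]   # baseline model predictions

bias = fit_state_bias(states, y_true, y_base)
y_corrected = apply_state_bias(states, y_base, bias)
rmse_before = rmse(y_true, y_base)
rmse_after = rmse(y_true, y_corrected)
print(bias, rmse_before, rmse_after)
```

The toy example fits and evaluates on the same rows purely to show the mechanics; in a real forecast the offsets would be estimated on training seasons and applied to held-out ones.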