Ensemble Learning
Reducing Air Pollution through Machine Learning
Bertsimas, Dimitris, Boussioux, Leonard, Zeng, Cynthia
This paper presents a data-driven approach to mitigate the effects of air pollution from industrial plants on nearby cities by linking operational decisions with weather conditions. Our method combines predictive and prescriptive machine learning models to forecast short-term wind speed and direction and recommend operational decisions to reduce or pause the industrial plant's production. We exhibit several trade-offs between reducing environmental impact and maintaining production activities. The predictive component of our framework employs various machine learning models, such as gradient-boosted tree-based models and ensemble methods, for time series forecasting. The prescriptive component utilizes interpretable optimal policy trees to propose multiple trade-offs, such as reducing dangerous emissions by 33-47% and unnecessary costs by 40-63%. Compared to official weather forecasts, our deployed models reduced forecasting errors by 38-52% for lead times under 12 hours and by 14-46% for lead times of 12 to 48 hours. We have successfully implemented the predictive component at the OCP Safi site, which is Morocco's largest chemical industrial plant, and are currently in the process of deploying the prescriptive component. Our framework enables sustainable industrial development by eliminating the pollution-industrial activity trade-off through data-driven weather-based operational decisions, significantly enhancing factory optimization and sustainability. This modernizes factory planning and resource allocation while maintaining environmental compliance. The predictive component has boosted production efficiency, leading to cost savings and reduced environmental impact by minimizing air pollution.
Automatic pain recognition from Blood Volume Pulse (BVP) signal using machine learning techniques
Pouromran, Fatemeh, Lin, Yingzi, Kamarthi, Sagar
Physiological responses to pain have received increasing attention among researchers for developing an automated pain recognition sensing system. Though less explored, Blood Volume Pulse (BVP) is one of the candidate physiological measures that could help objective pain assessment. In this study, we applied machine learning techniques to BVP signals to devise a non-invasive modality for pain sensing. Thirty-two healthy subjects participated in this study. First, we investigated a novel set of time-domain, frequency-domain and nonlinear dynamics features that could potentially be sensitive to pain. These include 24 features from BVP signals and 20 additional features from Inter-beat Intervals (IBIs) derived from the same BVP signals. Utilizing these features, we built machine learning models for detecting the presence of pain and its intensity. We explored different machine learning models, including Logistic Regression, Random Forest, Support Vector Machines, Adaptive Boosting (AdaBoost) and Extreme Gradient Boosting (XGBoost). Among them, we found that XGBoost offered the best model performance for both pain classification and pain intensity estimation tasks. The ROC-AUC of the XGBoost model to detect low pain, medium pain and high pain with no pain as the baseline were 80.06%, 85.81%, and 90.05%, respectively. Moreover, the XGBoost classifier distinguished medium pain from high pain with a ROC-AUC of 91%. For the multi-class classification among three pain levels, XGBoost offered the best performance with an average F1-score of 80.03%. Our results suggest that the BVP signal together with machine learning algorithms is a promising physiological measurement for automated pain assessment. This work will have a national impact on accurate pain assessment, effective pain management, reducing drug-seeking behavior among patients, and addressing the national opioid crisis.
Finding Minimum-Cost Explanations for Predictions made by Tree Ensembles
Törnblom, John, Karlsson, Emil, Nadjm-Tehrani, Simin
The ability to explain why a machine learning model arrives at a particular prediction is crucial when used as decision support by human operators of critical systems. The provided explanations must be provably correct, and preferably without redundant information, called minimal explanations. In this paper, we aim at finding explanations for predictions made by tree ensembles that are not only minimal, but also minimum with respect to a cost function. To this end, we first present a highly efficient oracle that can determine the correctness of explanations, surpassing the runtime performance of current state-of-the-art alternatives by several orders of magnitude when computing minimal explanations. Secondly, we adapt an algorithm called MARCO from related work (calling it m-MARCO) for the purpose of computing a single minimum explanation per prediction, and demonstrate an overall speedup factor of two compared to the MARCO algorithm, which enumerates all minimal explanations. Finally, we study the obtained explanations from a range of use cases, leading to further insights into their characteristics. In particular, we observe that in several cases, there are more than 100,000 minimal explanations to choose from for a single prediction. In these cases, we see that only a small portion of the minimal explanations are also minimum, and that the minimum explanations are significantly less verbose, hence motivating the aim of this work.
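The distinction between minimal and minimum explanations can be illustrated with a brute-force sketch. Everything here is a hypothetical toy: the three-rule majority-vote `ensemble`, the binary features, and the `cost` vector are assumptions made for illustration, and the exhaustive check stands in for the paper's efficient oracle and m-MARCO, which are not reproduced.

```python
from itertools import combinations, product

# Hypothetical toy ensemble: three rules over four binary features, majority vote.
def ensemble(x):
    votes = [x[0] and x[1], x[0] and x[2], x[3]]
    return sum(votes) >= 2

def is_explanation(subset, instance):
    """True if fixing the features in `subset` to their values in `instance`
    forces the same prediction for every assignment of the remaining features."""
    target = ensemble(instance)
    free = [i for i in range(len(instance)) if i not in subset]
    for values in product([False, True], repeat=len(free)):
        x = list(instance)
        for i, v in zip(free, values):
            x[i] = v
        if ensemble(x) != target:
            return False
    return True

def minimal_explanations(instance):
    """Enumerate subset-minimal explanations by increasing size (exponential time)."""
    found = []
    n = len(instance)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            s = set(subset)
            # A superset of a known explanation is correct but redundant.
            if any(f <= s for f in found):
                continue
            if is_explanation(s, instance):
                found.append(s)
    return found

instance = (True, True, True, True)
cost = [1, 1, 1, 5]  # assumed per-feature cost of including a feature

minimal = minimal_explanations(instance)
minimum = min(minimal, key=lambda s: sum(cost[i] for i in s))
print(minimal)
print(minimum)
```

In this toy case there are three minimal explanations, but only one of them is minimum under the assumed costs, which mirrors the observation that few minimal explanations are also minimum.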
Interpretable Ensembles of Hyper-Rectangles as Base Models
Konstantinov, Andrei V., Utkin, Lev V.
A new extremely simple ensemble-based model with uniformly generated axis-parallel hyper-rectangles as base models (HRBM) is proposed. Two types of HRBMs are studied: closed rectangles and corners. The main idea behind HRBM is to consider and count training examples inside and outside each rectangle. It is proposed to incorporate HRBMs into the gradient boosting machine (GBM). Despite the simplicity of HRBMs, it turns out that these simple base models allow us to construct effective ensemble-based models and avoid overfitting. A simple method for calculating optimal regularization parameters of the ensemble-based model, which can be modified explicitly at each iteration of GBM, is considered. Moreover, a new regularization called the "step height penalty" is studied in addition to the standard L1 and L2 regularizations. An extremely simple approach to interpreting the predictions of the proposed ensemble-based model by using the well-known method SHAP is proposed. It is shown that GBM with HRBM can be regarded as a model extending a set of interpretable models for explaining black-box models. Numerical experiments with real datasets illustrate the proposed GBM with HRBMs for regression and classification problems. Experiments also illustrate the computational efficiency of the proposed SHAP modifications. The code of the proposed algorithms implementing GBM with HRBM is publicly available.
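The idea of a rectangle base model inside a boosting loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each base model predicts the mean residual inside and outside a randomly drawn closed rectangle, squared loss is used, and the regularization terms, the "corner" variant, and the SHAP modifications are omitted. The function names and the synthetic data are hypothetical.

```python
import random

def sample_rectangle(X, rng):
    """Draw an axis-parallel closed rectangle uniformly inside the data's bounding box."""
    rect = []
    for d in range(len(X[0])):
        lo, hi = min(x[d] for x in X), max(x[d] for x in X)
        a, b = sorted((rng.uniform(lo, hi), rng.uniform(lo, hi)))
        rect.append((a, b))
    return rect

def inside(rect, x):
    return all(a <= v <= b for (a, b), v in zip(rect, x))

def mean(values):
    return sum(values) / len(values) if values else 0.0

def fit_gbm(X, y, n_rounds=200, lr=0.1, seed=0):
    """Boost rectangle base models on squared-loss residuals."""
    rng = random.Random(seed)
    base = mean(y)
    pred = [base] * len(y)
    models = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        rect = sample_rectangle(X, rng)
        # Base model: one constant for points inside the rectangle, one for points outside.
        v_in = mean([r for x, r in zip(X, residuals) if inside(rect, x)])
        v_out = mean([r for x, r in zip(X, residuals) if not inside(rect, x)])
        models.append((rect, v_in, v_out))
        pred = [p + lr * (v_in if inside(rect, x) else v_out) for x, p in zip(X, pred)]
    return base, lr, models

def predict(model, x):
    base, lr, models = model
    return base + lr * sum(v_in if inside(rect, x) else v_out for rect, v_in, v_out in models)

# Synthetic regression target (assumed for illustration): 1 inside a central square, 0 elsewhere.
data_rng = random.Random(1)
X = [(data_rng.random(), data_rng.random()) for _ in range(200)]
y = [1.0 if 0.3 < a < 0.7 and 0.3 < b < 0.7 else 0.0 for a, b in X]

model = fit_gbm(X, y)
base_mse = mean([(t - mean(y)) ** 2 for t in y])
fit_mse = mean([(t - predict(model, x)) ** 2 for x, t in zip(X, y)])
print(base_mse, fit_mse)
```

Because every base model is a two-valued step function over a rectangle, each term of the final sum can be inspected directly, which is the property that makes this family attractive for interpretation.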
EGFR mutation prediction using F18-FDG PET-CT based radiomics features in non-small cell lung cancer
Henriquez, Hector, Fuentes, Diana, Suarez, Francisco, Gonzalez, Patricio
Lung cancer is the leading cause of cancer death in the world. Accurate determination of the EGFR (epidermal growth factor receptor) mutation status is highly relevant for the proper treatment of these patients. Purpose: The aim of this study was to predict the mutational status of the EGFR in non-small cell lung cancer patients using radiomics features extracted from PET-CT images. Methods: This retrospective study involved 34 patients with lung cancer confirmed by histology and with EGFR mutation status assessment. A total of 2,205 radiomics features were extracted from manual segmentation of the PET-CT images using the pyradiomics library. Both computed tomography and positron emission tomography images were used. All images were acquired with intravenous iodinated contrast and F18-FDG. Preprocessing included resampling, normalization, and discretization of the pixel intensity. Three methods were used for the feature selection process: backward selection (set 1), forward selection (set 2), and feature importance analysis of a random forest model (set 3). Nine machine learning methods were used for radiomics model building. Results: 35.2% of patients had an EGFR mutation, without significant differences in age, gender, tumor size and SUVmax. After the feature selection process, 6, 7, and 17 radiomics features were selected in sets 1, 2, and 3, respectively. The best performances were obtained by Ridge Regression in set 1: AUC of 0.826 (95% CI, 0.811 - 0.839), Random Forest in set 2: AUC of 0.823 (95% CI, 0.808 - 0.838) and Neural Network in set 3: AUC of 0.821 (95% CI, 0.808 - 0.835). Conclusion: Radiomics feature analysis has the potential to predict clinically relevant mutations in lung cancer patients through a non-invasive methodology.
Gradient Boosting Performs Gaussian Process Inference
Ustimenko, Aleksei, Beliakov, Artem, Prokhorenkova, Liudmila
This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain the convergence to a Gaussian Process' posterior mean, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior to provide better knowledge uncertainty estimates through Monte-Carlo estimation of the posterior variance. We show that the proposed sampler allows for better knowledge uncertainty estimates leading to improved out-of-domain detection. Gradient boosting (Friedman, 2001) is a classic machine learning algorithm successfully used for web search, recommendation systems, weather forecasting, and other problems (Roe et al., 2005; Caruana & Niculescu-Mizil, 2006; Richardson et al., 2007; Wu et al., 2010; Burges, 2010; Zhang & Haghani, 2015). In a nutshell, gradient boosting methods iteratively combine simple models (usually decision trees), minimizing a given loss function. Despite the recent success of neural approaches in various areas, gradient-boosted decision trees (GBDT) are still state-of-the-art algorithms for tabular datasets containing heterogeneous features (Gorishniy et al., 2021; Katzir et al., 2021). This paper aims at a better theoretical understanding of GBDT methods for regression problems assuming the widely used RMSE loss function. First, we show that the gradient boosting with regularization can be reformulated as an optimization problem in some Reproducing Kernel Hilbert Space (RKHS) with implicitly defined kernel structure.
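For reference, the textbook correspondence between Kernel Ridge Regression and the Gaussian Process posterior can be written as below. This is a sketch under stated assumptions: a generic kernel $k$ with RKHS $\mathcal{H}$, training inputs $X = (x_1, \dots, x_n)$ with targets $y$, and a regularization weight $\lambda$ identified with the observation-noise variance $\sigma^2$. The specific kernel induced by symmetric trees and the paper's exact scaling are not reproduced here.

```latex
\begin{align}
\hat f &= \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \|f\|_{\mathcal{H}}^2, \\
\hat f(x) &= k(x, X)\,(K + \lambda I)^{-1} y, \qquad K_{ij} = k(x_i, x_j), \\
\mu_{\mathrm{post}}(x) &= k(x, X)\,(K + \sigma^2 I)^{-1} y, \\
\Sigma_{\mathrm{post}}(x, x') &= k(x, x') - k(x, X)\,(K + \sigma^2 I)^{-1} k(X, x').
\end{align}
```

With $\lambda = \sigma^2$ the Kernel Ridge Regression solution coincides with the posterior mean, so a procedure that converges to $\hat f$ also converges to $\mu_{\mathrm{post}}$, and samples from the posterior give Monte-Carlo access to $\Sigma_{\mathrm{post}}$.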
Credit Card Fraud Detection Using Enhanced Random Forest Classifier for Imbalanced Data
Aburbeian, AlsharifHasan Mohamad, Ashqar, Huthaifa I.
The credit card has become the most popular payment method for both online and offline transactions. The need for a fraud detection algorithm that precisely identifies and stops fraudulent activity arises from both the development of technology and the rise in fraud cases. This paper implements the random forest (RF) algorithm to solve the issue at hand. A dataset of credit card transactions was used in this study. The main problem when dealing with credit card fraud detection is the imbalanced dataset, in which most of the transactions are non-fraudulent. To overcome the problem of the imbalanced dataset, the synthetic minority over-sampling technique (SMOTE) was used. Hyperparameter tuning was applied to enhance the performance of the random forest classifier. The results showed that the RF classifier achieved an accuracy of 98% and an F1-score of about 98%, which is promising. We also believe that our model is relatively easy to apply and can overcome the issue of imbalanced data for fraud detection applications.
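The oversampling step can be sketched in a few lines. This is a minimal, dependency-free illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the authors' pipeline; the function name, the toy points, and the choice of `k` are assumptions. In practice one would more likely use the `SMOTE` class from imbalanced-learn together with scikit-learn's `RandomForestClassifier`.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Toy imbalanced data (assumed): 95 non-fraud points versus 5 fraud points.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(95)]
minority = [(3.0, 3.0), (3.5, 2.8), (2.9, 3.6), (3.2, 3.1), (3.8, 3.4)]

# Oversample only the training split; the test split should keep its original imbalance.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Each synthetic point lies on a segment between two real minority points, so the minority region is filled in rather than merely duplicated; a classifier is then trained on the balanced set.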
Monitoring Efficiency of IoT Wireless Charging
Yang, Pengwei, Abusafia, Amani, Lakhdari, Abdallah, Bouguettaya, Athman
Crowdsourcing wireless energy is a novel and convenient solution to charge nearby IoT devices. Several applications have been proposed to enable peer-to-peer wireless energy charging. However, none of them considered the energy efficiency of the wireless transfer of energy. In this paper, we propose an energy estimation framework that predicts the actual received energy. Our framework uses two machine learning algorithms, namely XGBoost and Neural Network, to estimate the received energy. The results show that the Neural Network model is better than XGBoost at predicting the received energy. We train and evaluate our models on a real wireless energy dataset that we collected.
Analysis and Evaluation of Explainable Artificial Intelligence on Suicide Risk Assessment
Tang, Hao, Rekavandi, Aref Miri, Rooprai, Dharjinder, Dwivedi, Girish, Sanfilippo, Frank, Boussaid, Farid, Bennamoun, Mohammed
This study investigates the effectiveness of Explainable Artificial Intelligence (XAI) techniques in predicting suicide risks and identifying the dominant causes for such behaviours. Data augmentation techniques and ML models are utilized to predict the associated risk. Furthermore, SHapley Additive exPlanations (SHAP) and correlation analysis are used to rank the importance of variables in predictions. Experimental results indicate that Decision Tree (DT), Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) models achieve the best results while DT has the best performance with an accuracy of 95.23% and an Area Under Curve (AUC) of 0.95. As per SHAP results, anger problems, depression, and social isolation are the leading variables in predicting the risk of suicide, and patients with good incomes, respected occupations, and university education have the least risk. Results demonstrate the effectiveness of machine learning and XAI framework for suicide risk prediction, and they can assist psychiatrists in understanding complex human behaviours and can also assist in reliable clinical decision-making.
Lexical Complexity Prediction: An Overview
North, Kai, Zampieri, Marcos, Shardlow, Matthew
Understanding the meaning of words in context is fundamental for reading comprehension. The perceived difficulty, hereafter referred to as complexity, of a target word within a given text varies widely among readers. With an increased demand for distance learning and educational technologies [107], automatically predicting which words are likely to cause comprehension problems is becoming a popular area of research [115, 147, 185]. Systems have been created to identify complex words that are difficult to acquire, reproduce, or understand for children [79], second-language learners [89], people with a reading disability, such as dyslexia [131] or aphasia [35, 53], or more generally, individuals with low literacy [59, 175]. In Computational Linguistics and Natural Language Processing (NLP), the task of automatically recognizing complex words is most often achieved by training machine learning (ML) models. These ML models assign a complexity value to each target word within an input extract, sentence, or text, which allows for the identification of complex words. This information can then be used to improve downstream lexical and text simplification systems that provide simpler alternatives to aid reading comprehension. Take the extract shown in Table 1 for example.