Ensemble Learning
SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems
This paper introduces a new addition to the SPINEX (Similarity-based Predictions with Explainable Neighbors Exploration) family, tailored specifically for time series and forecasting analysis. This new algorithm leverages the concept of similarity and higher-order temporal interactions across multiple time scales to enhance predictive accuracy and interpretability in forecasting. To evaluate the effectiveness of SPINEX, we present comprehensive benchmarking experiments comparing it against 18 algorithms and across 49 synthetic and real datasets characterized by varying trends, seasonality, and noise levels. Our performance assessment focused on forecasting accuracy and computational efficiency. Our findings reveal that SPINEX consistently ranks among the top 5 performers in forecasting precision and has a superior ability to handle complex temporal dynamics compared to commonly adopted algorithms. Moreover, the algorithm's explainability features, Pareto efficiency, and medium complexity (on the order of O(log n)) are demonstrated through detailed visualizations to enhance the prediction and decision-making process. We note that integrating similarity-based concepts opens new avenues for research in predictive analytics, promising more accurate and transparent decision making.
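The similarity idea described above can be illustrated with a generic nearest-neighbor analogue forecast. This is only a minimal sketch, not the SPINEX algorithm: the function name `knn_forecast`, the Euclidean distance, and the `window` and `k` parameters are all assumptions chosen for illustration. Returning the matched windows alongside the forecast is what makes the prediction inspectable.

```python
from math import sqrt


def knn_forecast(series, window=3, k=2):
    """Forecast the next value from the k most similar past windows.

    Hypothetical helper: a generic analogue forecast, not SPINEX itself.
    Returns (forecast, neighbors), where each neighbor is a tuple
    (distance, start_index, value_that_followed_the_window).
    """
    if len(series) <= window:
        raise ValueError("series must be longer than the window")
    query = series[-window:]
    candidates = []
    # Every earlier window with a known successor is a candidate neighbor;
    # the final window (the query itself) is excluded by the range bound.
    for start in range(len(series) - window):
        past = series[start:start + window]
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(past, query)))
        candidates.append((dist, start, series[start + window]))
    neighbors = sorted(candidates)[:k]
    forecast = sum(nxt for _, _, nxt in neighbors) / len(neighbors)
    return forecast, neighbors


if __name__ == "__main__":
    history = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2]
    value, used = knn_forecast(history, window=2, k=2)
    print(value, used)
```

The returned neighbor list shows which historical windows drove the forecast, which is the kind of explanation a similarity-based method can offer.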
Enhanced Prediction of Ventilator-Associated Pneumonia in Patients with Traumatic Brain Injury Using Advanced Machine Learning Techniques
Ashrafi, Negin, Abdollahi, Armin, Pishgar, Maryam
Background: Ventilator-associated pneumonia (VAP) in traumatic brain injury (TBI) patients poses a significant mortality risk and imposes a considerable financial burden on patients and healthcare systems. Timely detection and prognostication of VAP in TBI patients are crucial to improve patient outcomes and alleviate the strain on healthcare resources. Methods: We implemented six machine learning models using the MIMIC-III database. Our methodology included preprocessing steps, such as feature selection with CatBoost and expert opinion, addressing class imbalance with the Synthetic Minority Oversampling Technique (SMOTE), and rigorous model tuning through 5-fold cross-validation to optimize hyperparameters. The models evaluated were SVM, Logistic Regression, Random Forest, XGBoost, ANN, and AdaBoost. Additionally, we conducted SHAP analysis to determine feature importance and performed an ablation study to assess feature impacts on model performance. Results: XGBoost outperformed the baseline models and the best existing literature. We evaluated the models using AUC, Accuracy, Specificity, Sensitivity, F1 Score, PPV, and NPV. XGBoost demonstrated the highest performance with an AUC of 0.940 and an Accuracy of 0.875, which are 0.234 and 0.235 higher in absolute terms (23.4 and 23.5 percentage points) than the best results in the existing literature, an AUC of 0.706 and an Accuracy of 0.640, respectively. This enhanced performance underscores the model's effectiveness in clinical settings. Conclusions: This study enhances the predictive modeling of VAP in TBI patients, improving early detection and intervention potential. Refined feature selection and advanced ensemble techniques significantly boosted model accuracy and reliability, offering promising directions for future clinical applications and medical diagnostics research.
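The core step of SMOTE mentioned in the Methods, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration in plain Python, not the study's pipeline and not the imbalanced-learn implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by neighbor interpolation.

    Hypothetical helper. `minority` is a list of equal-length numeric tuples.
    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


if __name__ == "__main__":
    minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    for point in smote_like_samples(minority_points, n_new=3):
        print(point)
```

When oversampling is combined with cross-validation, it is usually applied to the training folds only, so that synthetic points do not leak into the validation data.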
Data-Driven Machine Learning Approaches for Predicting In-Hospital Sepsis Mortality
Shumilov, Arseniy, Zhu, Yueting, Ashrafi, Negin, Lian, Gaojie, Ren, Shilong, Pishgar, Maryam
Background: Sepsis is a severe condition responsible for many deaths worldwide. Accurate prediction of sepsis outcomes is crucial for timely and effective treatment. Although previous studies have used ML to forecast outcomes, they faced limitations in feature selection and model comprehensibility, resulting in less effective predictions. Thus, this research aims to develop an interpretable and accurate ML model to help clinical professionals predict in-hospital mortality. Methods: We analyzed ICU patient records from the MIMIC-III database based on specific criteria and extracted relevant data. Our feature selection process included a literature review, clinical input refinement, and using Random Forest to select the top 35 features. We performed data preprocessing, including cleaning, imputation, and standardization, and applied SMOTE oversampling to address class imbalance, resulting in 4,683 patients, with admission counts of 17,429. We compared the performance of Random Forest, Gradient Boosting, Logistic Regression, SVM, and KNN models. Results: The Random Forest model was the most effective in predicting sepsis-related in-hospital mortality. It outperformed other models, achieving an accuracy of 0.90 and an AUROC of 0.97, significantly better than the existing literature. Our meticulous feature selection contributed to the model's precision and identified critical determinants of sepsis mortality. These results underscore the pivotal role of data-driven ML in healthcare, especially for predicting in-hospital mortality due to sepsis. Conclusion: This study represents a significant advancement in predicting in-hospital sepsis mortality, highlighting the potential of ML in healthcare. The implications are profound, offering a data-driven approach that enhances decision-making in patient care and reduces in-hospital mortality.
Open Set Recognition for Random Forest
Feng, Guanchao, Desai, Dhruv, Pasquali, Stefano, Mehta, Dhagash
In many real-world classification or recognition tasks, it is often difficult to collect training examples that exhaust all possible classes due to, for example, incomplete knowledge during training or ever changing regimes. Therefore, samples from unknown/novel classes may be encountered in testing/deployment. In such scenarios, the classifiers should be able to i) perform classification on known classes, and at the same time, ii) identify samples from unknown classes. This is known as open-set recognition. Although random forest has been an extremely successful framework as a general-purpose classification (and regression) method, in practice, it usually operates under the closed-set assumption and is not able to identify samples from new classes when run out of the box. In this work, we ...
In the open-set settings, classifiers are required to not only accurately classify new instances of known classes (whose samples are observed during training) but also effectively recognize the samples from unknown classes. In a nutshell, open-set classifiers are capable of making the "none of the above" decision with respect to known classes. This is known as open-set recognition (OSR) [38] and has received significant attention in recent years [11, 47]. Since many learning tasks in finance are naturally classification tasks, for instance, company classifications using Global Industry Classification Standard (GICS), fund categorization, risk profiling, economic scenario classifications, etc., where often a new company, fund or economic scenario may not belong to any of the existing categories, casting these recognition tasks as OSR instead of traditional closed-set classification tasks is more appropriate.
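The "none of the above" decision can be illustrated with a simple rejection rule on the votes of a forest's trees. This is a generic baseline sketched under stated assumptions, not the method proposed in the paper: the function name `open_set_predict`, the vote-share criterion, and the threshold value are all hypothetical.

```python
from collections import Counter


def open_set_predict(tree_votes, threshold=0.6, unknown="unknown"):
    """Return (label, vote_share), rejecting low-agreement predictions.

    Hypothetical helper. `tree_votes` holds one predicted class per tree.
    If the winning class receives less than `threshold` of the votes, the
    sample is labeled `unknown` instead of being forced into a known class.
    """
    if not tree_votes:
        raise ValueError("tree_votes must not be empty")
    label, top_count = Counter(tree_votes).most_common(1)[0]
    share = top_count / len(tree_votes)
    return (label if share >= threshold else unknown), share


if __name__ == "__main__":
    print(open_set_predict(["A"] * 8 + ["B"] * 2))             # trees agree
    print(open_set_predict(["A", "B", "C", "A", "B", "C"]))  # trees disagree
```

A closed-set forest would always return the plurality class; the threshold is the only addition, and in practice it would have to be tuned on held-out data.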
Towards Evolutionary-based Automated Machine Learning for Small Molecule Pharmacokinetic Prediction
de Sá, Alex G. C., Ascher, David B.
Machine learning (ML) is revolutionising drug discovery by expediting the prediction of small molecule properties essential for developing new drugs. These properties, including absorption, distribution, metabolism and excretion (ADME), are crucial in the early stages of drug development since they provide an understanding of the course of the drug in the organism, i.e., the drug's pharmacokinetics. However, existing methods lack personalisation and rely on manually crafted ML algorithms or pipelines, which can introduce inefficiencies and biases into the process. To address these challenges, we propose a novel evolutionary-based automated ML method (AutoML) specifically designed for predicting small molecule properties, with a particular focus on pharmacokinetics. Leveraging the advantages of grammar-based genetic programming, our AutoML method streamlines the process by automatically selecting algorithms and designing predictive pipelines tailored to the particular characteristics of input molecular data. Results demonstrate AutoML's effectiveness in selecting diverse ML algorithms, resulting in comparable or even improved predictive performances compared to conventional approaches. By offering personalised ML-driven pipelines, our method promises to enhance small molecule research in drug discovery, providing researchers with a valuable tool for accelerating the development of novel therapeutic drugs.
A collaborative ensemble construction method for federated random forest
Lim, Penjan Antonio Eng, Park, Cheong Hee
Random forests are considered a cornerstone in machine learning for their robustness and versatility. Despite these strengths, their conventional centralized training is ill-suited for the modern landscape of data that is often distributed, sensitive, and subject to privacy concerns. Federated learning (FL) provides a compelling solution to this problem, enabling models to be trained across a group of clients while maintaining the privacy of each client's data. However, adapting tree-based methods like random forests to federated settings introduces significant challenges, particularly when it comes to non-identically distributed (non-IID) data across clients, which is a common scenario in real-world applications. This paper presents a federated random forest approach that employs a novel ensemble construction method aimed at improving performance under non-IID data. Instead of growing trees independently in each client, our approach ensures each decision tree in the ensemble is iteratively and collectively grown across clients. To preserve the privacy of the client's data, we confine the information stored in the leaf nodes to the majority class label identified from the samples of the client's local data that reach each node. This limited disclosure preserves the confidentiality of the underlying data distribution of clients, thereby enhancing the privacy of the federated learning process. Furthermore, our collaborative ensemble construction strategy allows the ensemble to better reflect the data's heterogeneity across different clients, enhancing its performance on non-IID data, as our experimental results confirm.
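The limited-disclosure idea described above, where a client reveals only the majority class label of its local samples at a leaf, might look like the sketch below. The abstract does not state how labels from different clients are combined, so the plurality vote in `aggregate_leaf_label` is purely an assumption, and both function names are hypothetical.

```python
from collections import Counter


def local_majority_label(labels):
    """What one client discloses for a leaf: its majority class only.

    Returns None when none of the client's samples reach the leaf.
    Class counts and the samples themselves never leave the client.
    """
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def aggregate_leaf_label(disclosed_labels):
    """Assumed server-side rule: plurality vote over the disclosed labels."""
    votes = [label for label in disclosed_labels if label is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Labels of the local samples reaching one leaf, per client (toy data).
    per_client = {
        "client_1": ["pos", "pos", "neg"],
        "client_2": ["neg", "neg"],
        "client_3": [],
        "client_4": ["neg", "neg", "pos"],
    }
    disclosed = [local_majority_label(v) for v in per_client.values()]
    print(disclosed, "->", aggregate_leaf_label(disclosed))
```

Because only one label per leaf is shared, the server cannot reconstruct how many samples of each class a client holds.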
Supervised Learning based Method for Condition Monitoring of Overhead Line Insulators using Leakage Current Measurement
Mitrovic, Mile, Titov, Dmitry, Volkhov, Klim, Lukicheva, Irina, Kudryavzev, Andrey, Vorobev, Petr, Li, Qi, Terzija, Vladimir
As a new practical and economical solution to the aging problem of overhead line (OHL) assets, the technical policies of most power grid companies in the world have experienced a gradual transition from scheduled preventive maintenance to a risk-based approach in asset management. Even though the accumulation of contamination is predictable to a certain degree, there are currently no effective ways to identify the risk of insulator flashover in order to plan replacement. This paper presents a novel machine learning (ML) based method for estimating the flashover probability of the cup-and-pin glass insulator string. The proposed method is based on the Extreme Gradient Boosting (XGBoost) supervised ML model, in which the leakage current (LC) features and applied voltage are used as the inputs. The established model can estimate the critical flashover voltage (U50%) for various designs of OHL insulators with different voltage levels. The proposed method is also able to accurately determine the condition of the insulator strings and instruct asset management engineers to take appropriate actions.
Utilising Explainable Techniques for Quality Prediction in a Complex Textiles Manufacturing Use Case
Forsberg, Briony, Williams, Dr Henry, MacDonald, Prof Bruce, Chen, Tracy, Hamzeh, Dr Reza, Hulse, Dr Kirstine
This paper develops an approach to classify instances of product failure in a complex textiles manufacturing dataset using explainable techniques. The dataset used in this study was obtained from a New Zealand manufacturer of woollen carpets and rugs. In investigating the trade-off between accuracy and explainability, three different tree-based classification algorithms were evaluated: a Decision Tree and two ensemble methods, Random Forest and XGBoost. Additionally, three feature selection methods were also evaluated: the SelectKBest method, using chi-squared as the scoring function, the Pearson Correlation Coefficient, and the Boruta algorithm. Not surprisingly, the ensemble methods typically produced better results than the Decision Tree model. The Random Forest model yielded the best results overall when combined with the Boruta feature selection technique. Finally, a tree ensemble explaining technique was used to extract rule lists to capture necessary and sufficient conditions for classification by a trained model that could be easily interpreted by a human. Notably, several features that were in the extracted rule lists were statistical features and calculated features that were added to the original dataset. This demonstrates the influence that bringing in additional information during the data preprocessing stages can have on the ultimate model performance.
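One of the filter-style feature selection methods named above, ranking by the Pearson Correlation Coefficient, can be sketched in a few lines. This is a minimal illustration with made-up column names and data, not the study's code; the helpers `pearson` and `top_k_by_correlation` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has no defined correlation; score it as 0.
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def top_k_by_correlation(features, target, k):
    """Keep the k feature names with the largest |correlation| to target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Illustrative data only: three candidate features and a binary label.
    columns = {
        "tension": [1, 2, 3, 4],
        "constant": [1, 1, 1, 1],
        "humidity": [4, 1, 3, 2],
    }
    failed = [0, 0, 1, 1]
    print(top_k_by_correlation(columns, failed, k=1))
```

A filter like this scores each feature independently of the model, whereas Boruta wraps a Random Forest, which may help explain why the two can select different subsets.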
Flusion: Integrating multiple data sources for accurate influenza predictions
Ray, Evan L., Wang, Yijin, Wolfinger, Russell D., Reich, Nicholas G.
Over the last ten years, the US Centers for Disease Control and Prevention (CDC) has organized an annual influenza forecasting challenge with the motivation that accurate probabilistic forecasts could improve situational awareness and yield more effective public health actions. Starting with the 2021/22 influenza season, the forecasting targets for this challenge have been based on hospital admissions reported in the CDC's National Healthcare Safety Network (NHSN) surveillance system. Reporting of influenza hospital admissions through NHSN began within the last few years, and as such only a limited amount of historical data are available for this signal. To produce forecasts in the presence of limited data for the target surveillance system, we augmented these data with two signals that have a longer historical record: 1) ILI+, which estimates the proportion of outpatient doctor visits where the patient has influenza; and 2) rates of laboratory-confirmed influenza hospitalizations at a selected set of healthcare facilities. Our model, Flusion, is an ensemble that combines gradient boosting quantile regression models with a Bayesian autoregressive model. The gradient boosting models were trained on all three data signals, while the autoregressive model was trained on only the target signal; all models were trained jointly on data for multiple locations. Flusion was the top-performing model in the CDC's influenza prediction challenge for the 2023/24 season. In this article we investigate the factors contributing to Flusion's success, and we find that its strong performance was primarily driven by the use of a gradient boosting model that was trained jointly on data from multiple surveillance signals and locations. These results indicate the value of sharing information across locations and surveillance signals, especially when doing so adds to the pool of available training data.
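Quantile regression models such as those in the ensemble above are typically trained and scored with the pinball (quantile) loss, sketched below. The abstract does not say how the component forecasts are combined, so the equal-weight averaging in `average_quantiles` is an assumption for illustration, and both function names are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss of a single prediction at quantile level q in (0, 1).

    Under-prediction is penalized by q and over-prediction by (1 - q),
    so a high quantile level pushes the prediction upward.
    """
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff


def average_quantiles(forecasts):
    """Assumed combination rule: equal-weight mean at each quantile level.

    `forecasts` is a list of dicts mapping quantile level -> predicted value,
    all sharing the same levels.
    """
    levels = forecasts[0].keys()
    return {q: sum(f[q] for f in forecasts) / len(forecasts) for q in levels}


if __name__ == "__main__":
    boosted = {0.5: 10.0, 0.9: 14.0}         # toy gradient boosting output
    autoregressive = {0.5: 12.0, 0.9: 18.0}  # toy autoregressive output
    combined = average_quantiles([boosted, autoregressive])
    print(combined)
    print(pinball_loss(y_true=15.0, y_pred=combined[0.9], q=0.9))
```

Averaging the pinball loss over a set of quantile levels gives a single score for a probabilistic forecast, which is why it is a common choice for comparing such models.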
Improving GBDT Performance on Imbalanced Datasets: An Empirical Study of Class-Balanced Loss Functions
Luo, Jiaqi, Yuan, Yuan, Xu, Shixin
However, like many machine learning algorithms, GBDT faces challenges when dealing with imbalanced datasets. Class imbalance is a persistent issue in many real-world applications, such as fraud detection [5], medical diagnosis [6], and fault diagnosis [7]. It poses significant challenges to machine learning algorithms, leading to poor performance, particularly in predicting the minority class. Various strategies have been proposed to address this challenge, including sampling techniques and algorithm modifications [8, 9]. While these methods have shown promise, the exploration of class-balanced losses, particularly in multi-label classification, has received comparatively little attention. This paper presents the first comprehensive study on adapting class-balanced loss functions to GBDT algorithms across various tabular classification tasks, including binary, multi-class, and multi-label classification. We conduct extensive experiments on multiple datasets spanning diverse classification tasks, rigorously evaluating the performance of class-balanced losses within different GBDT models. Our thorough results demonstrate the effectiveness of these loss functions in mitigating class imbalance issues in tree-based ensemble methods.
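GBDT libraries generally accept a custom objective as a function returning the gradient and Hessian of the loss with respect to the raw score. As one simple example of a class-balanced loss, the sketch below derives these for a class-weighted binary cross-entropy with inverse-frequency weights. It is not necessarily one of the formulations evaluated in the paper, and the function names and weighting heuristic are assumptions.

```python
from math import exp


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def weighted_logloss_grad_hess(raw_score, label, weights):
    """Gradient and Hessian of weighted binary cross-entropy.

    With p = sigmoid(raw_score) and w = weights[label], the loss is
    -w * (label * log(p) + (1 - label) * log(1 - p)), which gives
    grad = w * (p - label) and hess = w * p * (1 - p).
    """
    p = 1.0 / (1.0 + exp(-raw_score))
    w = weights[label]
    return w * (p - label), w * p * (1.0 - p)


if __name__ == "__main__":
    y = [0] * 9 + [1]  # toy data: nine majority samples, one minority
    w = balanced_class_weights(y)
    print(w)
    print(weighted_logloss_grad_hess(0.0, 1, w))  # minority sample
    print(weighted_logloss_grad_hess(0.0, 0, w))  # majority sample
```

At the same raw score, the minority sample contributes a gradient nine times larger in magnitude than a majority sample here, which is how the weighting shifts the trees toward the rare class.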