Ensemble Learning
Interpretable Ensemble Learning for Materials Property Prediction with Classical Interatomic Potentials: Carbon as an Example
Jiang, Xinyu, Sun, Haofan, Choudhary, Kamal, Zhuang, Houlong, Nian, Qiong
Machine learning (ML) is widely used to explore crystal materials and predict their properties. However, the training is time-consuming for deep-learning models, and the regression process is a black box that is hard to interpret. Also, the preprocess to transfer a crystal structure into the input of ML, called descriptor, needs to be designed carefully. To efficiently predict important properties of materials, we propose an approach based on ensemble learning consisting of regression trees to predict formation energy and elastic constants based on small-size datasets of carbon allotropes as an example. Without using any descriptor, the inputs are the properties calculated by molecular dynamics with 9 different classical interatomic potentials. Overall, the results from ensemble learning are more accurate than those from classical interatomic potentials, and ensemble learning can capture the relatively accurate properties from the 9 classical potentials as criteria for predicting the final properties.
Evaluating the reliability of automatically generated pedestrian and bicycle crash surrogates
Sengupta, Agnimitra, Guler, S. Ilgin, Gayah, Vikash V., Warchol, Shannon
Vulnerable road users (VRUs), such as pedestrians and bicyclists, are at a higher risk of being involved in crashes with motor vehicles, and crashes involving VRUs also are more likely to result in severe injuries or fatalities. Signalized intersections are a major safety concern for VRUs due to their complex and dynamic nature, highlighting the need to understand how these road users interact with motor vehicles and deploy evidence-based countermeasures to improve safety performance. Crashes involving VRUs are relatively infrequent, making it difficult to understand the underlying contributing factors. An alternative is to identify and use conflicts between VRUs and motorized vehicles as a surrogate for safety performance. Automatically detecting these conflicts using a video-based systems is a crucial step in developing smart infrastructure to enhance VRU safety. The Pennsylvania Department of Transportation conducted a study using video-based event monitoring system to assess VRU and motor vehicle interactions at fifteen signalized intersections across Pennsylvania to improve VRU safety performance. This research builds on that study to assess the reliability of automatically generated surrogates in predicting confirmed conflicts using advanced data-driven models. The surrogate data used for analysis include automatically collectable variables such as vehicular and VRU speeds, movements, post-encroachment time, in addition to manually collected variables like signal states, lighting, and weather conditions. The findings highlight the varying importance of specific surrogates in predicting true conflicts, some being more informative than others. The findings can assist transportation agencies to collect the right types of data to help prioritize infrastructure investments, such as bike lanes and crosswalks, and evaluate their effectiveness.
Efficient Multi-stage Inference on Tabular Data
Johnson, Daniel S, Markov, Igor L
Many ML applications and products train on medium amounts of input data but get bottlenecked in real-time inference. When implementing ML systems, conventional wisdom favors segregating ML code into services queried by product code via Remote Procedure Call (RPC) APIs. This approach clarifies the overall software architecture and simplifies product code by abstracting away ML internals. However, the separation adds network latency and entails additional CPU overhead. Hence, we simplify inference algorithms and embed them into the product code to reduce network communication. For public datasets and a high-performance real-time platform that deals with tabular data, we show that over half of the inputs are often amenable to such optimization, while the remainder can be handled by the original model. By applying our optimization with AutoML to both training and inference, we reduce inference latency by 1.3x, CPU resources by 30%, and network communication between application front-end and ML back-end by about 50% for a commercial end-to-end ML platform that serves millions of real-time decisions per second. The crucial role of AutoML is in configuring first-stage inference and balancing the two stages.
The Effect of Epidemiological Cohort Creation on the Machine Learning Prediction of Homelessness and Police Interaction Outcomes Using Administrative Health Care Data
Shahidi, Faezehsadat, MacDonald, M. Ethan, Seitz, Dallas, Messier, Geoffrey
Background: Mental illness can lead to adverse outcomes such as homelessness and police interaction and understanding of the events leading up to these adverse outcomes is important. Predictive models may help identify individuals at risk of such adverse outcomes. Using a fixed observation window cohort with logistic regression (LR) or machine learning (ML) models can result in lower performance when compared with adaptive and parcellated windows. Method: An administrative healthcare dataset was used, comprising of 240,219 individuals in Calgary, Alberta, Canada who were diagnosed with addiction or mental health (AMH) between April 1, 2013, and March 31, 2018. The cohort was followed for 2 years to identify factors associated with homelessness and police interactions. To understand the benefit of flexible windows to predictive models, an alternative cohort was created. Then LR and ML models, including random forests (RF), and extreme gradient boosting (XGBoost) were compared in the two cohorts. Results: Among 237,602 individuals, 0.8% (1,800) experienced first homelessness, while 0.32% (759) reported initial police interaction among 237,141 individuals. Male sex (AORs: H=1.51, P=2.52), substance disorder (AORs: H=3.70, P=2.83), psychiatrist visits (AORs: H=1.44, P=1.49), and drug abuse (AORs: H=2.67, P=1.83) were associated with initial homelessness (H) and police interaction (P). XGBoost showed superior performance using the flexible method (sensitivity =91%, AUC =90% for initial homelessness, and sensitivity =90%, AUC=89% for initial police interaction) Conclusion: This study identified key features associated with initial homelessness and police interaction and demonstrated that flexible windows can improve predictive modeling.
Cross Feature Selection to Eliminate Spurious Interactions and Single Feature Dominance Explainable Boosting Machines
R, Shree Charran, Mahapatra, Sandipan Das
Interpretability is a crucial aspect of machine learning models that enables humans to understand and trust the decision-making process of these models. In many real-world applications, the interpretability of models is essential for legal, ethical, and practical reasons. For instance, in the banking domain, interpretability is critical for lenders and borrowers to understand the reasoning behind the acceptance or rejection of loan applications as per fair lending laws. However, achieving interpretability in machine learning models is challenging, especially for complex high-performance models. Hence Explainable Boosting Machines (EBMs) have been gaining popularity due to their interpretable and high-performance nature in various prediction tasks. However, these models can suffer from issues such as spurious interactions with redundant features and single-feature dominance across all interactions, which can affect the interpretability and reliability of the model's predictions. In this paper, we explore novel approaches to address these issues by utilizing alternate Cross-feature selection, ensemble features and model configuration alteration techniques. Our approach involves a multi-step feature selection procedure that selects a set of candidate features, ensemble features and then benchmark the same using the EBM model. We evaluate our method on three benchmark datasets and show that the alternate techniques outperform vanilla EBM methods, while providing better interpretability and feature selection stability, and improving the model's predictive performance. Moreover, we show that our approach can identify meaningful interactions and reduce the dominance of single features in the model's predictions, leading to more reliable and interpretable models. Index Terms- Interpretability, EBM's, ensemble, feature selection.
Quality Assessment of Photoplethysmography Signals For Cardiovascular Biomarkers Monitoring Using Wearable Devices
Dias, Felipe M., Toledo, Marcelo A. F., Cardenas, Diego A. C., Almeida, Douglas A., Oliveira, Filipe A. C., Ribeiro, Estela, Krieger, Jose E., Gutierrez, Marco A.
Photoplethysmography (PPG) is a non-invasive technology that measures changes in blood volume in the microvascular bed of tissue. It is commonly used in medical devices such as pulse oximeters and wrist worn heart rate monitors to monitor cardiovascular hemodynamics. PPG allows for the assessment of parameters (e.g., heart rate, pulse waveform, and peripheral perfusion) that can indicate conditions such as vasoconstriction or vasodilation, and provides information about microvascular blood flow, making it a valuable tool for monitoring cardiovascular health. However, PPG is subject to a number of sources of variations that can impact its accuracy and reliability, especially when using a wearable device for continuous monitoring, such as motion artifacts, skin pigmentation, and vasomotion. In this study, we extracted 27 statistical features from the PPG signal for training machine-learning models based on gradient boosting (XGBoost and CatBoost) and Random Forest (RF) algorithms to assess quality of PPG signals that were labeled as good or poor quality. We used the PPG time series from a publicly available dataset and evaluated the algorithm s performance using Sensitivity (Se), Positive Predicted Value (PPV), and F1-score (F1) metrics. Our model achieved Se, PPV, and F1-score of 94.4, 95.6, and 95.0 for XGBoost, 94.7, 95.9, and 95.3 for CatBoost, and 93.7, 91.3 and 92.5 for RF, respectively. Our findings are comparable to state-of-the-art reported in the literature but using a much simpler model, indicating that ML models are promising for developing remote, non-invasive, and continuous measurement devices.
Automatic Identification of Alzheimer's Disease using Lexical Features extracted from Language Samples
Objective: this study has a twofold goal. First, it aims to improve the understanding of the impact of Dementia of type Alzheimer's Disease (AD) on different aspects of the lexicon. Second, it aims to demonstrate that such aspects of the lexicon, when used as features of a machine learning classifier, can help achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD. Methods: data is derived from the ADDreSS challenge, which is a part of the DementiaBank corpus. The used dataset consists of transcripts of Cookie Theft picture descriptions, produced by 54 subjects in the training part and 24 subjects in the test part. The number of narrative samples is 108 in the training set and 48 in the test set. First, the impact of AD on 99 selected lexical features is studied using both the training and testing parts of the dataset. Then some machine learning experiments were conducted on the task of classifying transcribed speech samples with text samples that were produced by people with AD from those produced by normal subjects. Several experiments were conducted to compare the different areas of lexical complexity, identify the subset of features that help achieve optimal performance, and study the impact of the size of the input on the classification. To evaluate the generalization of the models built on narrative speech, two generalization tests were conducted using written data from two British authors, Iris Murdoch and Agatha Christie, and the transcription of some speeches by former President Ronald Reagan. Results: using lexical features only, state-of-the-art classification, F1 and accuracies, of over 91% were achieved in categorizing language samples produced by individuals with AD from the ones produced by healthy control subjects. This confirms the substantial impact of AD on lexicon processing.
Real-time Traffic Classification for 5G NSA Encrypted Data Flows With Physical Channel Records
Fei, Xiao, Martins, Philippe, Lu, Jialiang
The classification of fifth-generation New-Radio (5G-NR) mobile network traffic is an emerging topic in the field of telecommunications. It can be utilized for quality of service (QoS) management and dynamic resource allocation. However, traditional approaches such as Deep Packet Inspection (DPI) can not be directly applied to encrypted data flows. Therefore, new real-time encrypted traffic classification algorithms need to be investigated to handle dynamic transmission. In this study, we examine the real-time encrypted 5G Non-Standalone (NSA) application-level traffic classification using physical channel records. Due to the vastness of their features, decision-tree-based gradient boosting algorithms are a viable approach for classification. We generate a noise-limited 5G NSA trace dataset with traffic from multiple applications. We develop a new pipeline to convert sequences of physical channel records into numerical vectors. A set of machine learning models are tested, and we propose our solution based on Light Gradient Boosting Machine (LGBM) due to its advantages in fast parallel training and low computational burden in practical scenarios. Our experiments demonstrate that our algorithm can achieve 95% accuracy on the classification task with a state-of-the-art response time as quick as 10ms.
A large sample theory for infinitesimal gradient boosting
Dombry, Clement, Duchamps, Jean-Jil
Infinitesimal gradient boosting (Dombry and Duchamps, 2021) is defined as the vanishing-learning-rate limit of the popular tree-based gradient boosting algorithm from machine learning. It is characterized as the solution of a nonlinear ordinary differential equation in a infinite-dimensional function space where the infinitesimal boosting operator driving the dynamics depends on the training sample. We consider the asymptotic behavior of the model in the large sample limit and prove its convergence to a deterministic process. This population limit is again characterized by a differential equation that depends on the population distribution. We explore some properties of this population limit: we prove that the dynamics makes the test error decrease and we consider its long time behavior.
Physics-constrained Random Forests for Turbulence Model Uncertainty Estimation
Matha, Marcel, Morsbach, Christian
Data-driven approaches, aided by high-fidelity simulations and Machine Learning (ML), are gaining popularity To achieve virtual certification for industrial design, in RANS turbulence modeling (Heyse et al., 2021; quantifying the uncertainties in simulationdriven Matha & Kucharczyk, 2022). The present study, which processes is crucial. We discuss a physicsconstrained was the foundation of the CFD application results in our approach to account for epistemic previous paper (Matha et al., 2023), focuses on the use of uncertainty of turbulence models. In order to data-driven methods in order to identify flow regions with eliminate user input, we incorporate a data-driven potential turbulence model prediction inaccuracies.