Ensemble Learning
Exploiting Boosting in Hyperdimensional Computing for Enhanced Reliability in Healthcare
Jeong, SungHeon, Barkam, Hamza Errahmouni, Yun, Sanggeon, Kim, Yeseong, Angizi, Shaahin, Imani, Mohsen
Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional space, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems, a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount.
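The idea of splitting one hyperdimensional space among several weak learners and combining their votes can be sketched as follows. This is only an illustration under stated assumptions, not the BoostHD implementation: the function names `partition_dimensions` and `weighted_vote`, the contiguous-slice partitioning, and the example labels and weights are all hypothetical.

```python
from collections import defaultdict


def partition_dimensions(total_dim, n_learners):
    """Split indices 0..total_dim-1 into n_learners contiguous slices.

    Assumption: contiguous, near-equal slices; the paper may partition differently.
    """
    if n_learners <= 0 or n_learners > total_dim:
        raise ValueError("need 1 <= n_learners <= total_dim")
    base, extra = divmod(total_dim, n_learners)
    slices = []
    start = 0
    for k in range(n_learners):
        # The first `extra` learners receive one additional dimension.
        size = base + (1 if k < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


def weighted_vote(predictions, learner_weights):
    """Combine per-learner class predictions using boosting-style weights."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, learner_weights):
        scores[label] += weight
    # Return the label with the largest accumulated weight.
    return max(scores, key=scores.get)


# Hypothetical example: 10 dimensions shared by 3 weak learners.
subspaces = partition_dimensions(10, 3)
decision = weighted_vote(["stress", "baseline", "baseline"], [0.9, 0.3, 0.4])
```

In a boosting setup the weights would typically come from each weak learner's training error; here they are fixed by hand to keep the sketch short.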
Feature Group Tabular Transformer: A Novel Approach to Traffic Crash Modeling and Causality Analysis
Lares, Oscar, Zhen, Hao, Yang, Jidong J.
Reliable and interpretable traffic crash modeling is essential for understanding causality and improving road safety. This study introduces a novel approach to predicting collision types by utilizing a comprehensive dataset fused from multiple sources, including weather data, crash reports, high-resolution traffic information, pavement geometry, and facility characteristics. Central to our approach is the development of a Feature Group Tabular Transformer (FGTT) model, which organizes disparate data into meaningful feature groups, represented as tokens. These group-based tokens serve as rich semantic components, enabling effective identification of collision patterns and interpretation of causal mechanisms. The FGTT model is benchmarked against widely used tree ensemble models, including Random Forest, XGBoost, and CatBoost, demonstrating superior predictive performance. Furthermore, model interpretation reveals key influential factors, providing fresh insights into the underlying causality of distinct crash types.
CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models
Hörst, Fabian, Rempe, Moritz, Becker, Helmut, Heine, Lukas, Keyl, Julius, Kleesiek, Jens
Digital Pathology is a cornerstone in the diagnosis and treatment of diseases. A key task in this field is the identification and segmentation of cells in hematoxylin and eosin-stained images. Existing methods for cell segmentation often require extensive annotated datasets for training and are limited to a predefined cell classification scheme. To overcome these limitations, we propose $\text{CellViT}^{{\scriptscriptstyle ++}}$, a framework for generalized cell segmentation in digital pathology. $\text{CellViT}^{{\scriptscriptstyle ++}}$ utilizes Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. To adapt to unseen cell types, we rely on a computationally efficient approach. It requires minimal data for training and leads to a drastically reduced carbon footprint. We demonstrate excellent performance on seven different datasets, covering a broad spectrum of cell types, organs, and clinical settings. The framework achieves remarkable zero-shot segmentation and data-efficient cell-type classification. Furthermore, we show that $\text{CellViT}^{{\scriptscriptstyle ++}}$ can leverage immunofluorescence stainings to generate training datasets without the need for pathologist annotations. The automated dataset generation approach surpasses the performance of networks trained on manually labeled data, demonstrating its effectiveness in creating high-quality training datasets without expert annotations. To advance digital pathology, $\text{CellViT}^{{\scriptscriptstyle ++}}$ is available as an open-source framework featuring a user-friendly, web-based interface for visualization and annotation. The code is available under https://github.com/TIO-IKIM/CellViT-plus-plus.
Intelligent Gradient Boosting Algorithms for Estimating Strength of Modified Subgrade Soil
Mustapha, Ismail B., Abdulkareem, Muyideen, Hasan, Shafaatunnur, Ganiyu, Abideen, Nabus, Hatem, Lee, Jin Chai
The performance of pavement under loading depends on the strength of the subgrade. However, experimental estimation of subgrade strength properties such as California bearing ratio (CBR), unconfined compressive strength (UCS) and resistance value (R) is often tedious, time-consuming and costly, which has inspired growing interest in machine learning based tools as simple, cheap and fast alternatives. This study therefore explores two boosting techniques, categorical boosting (CatBoost) and extreme gradient boosting (XGBoost), together with support vector regression (SVR), for estimating properties of subgrade soil modified with hydrated lime activated rice husk ash (HARSH). Using 121 experimental data samples with varying proportions of HARSH, plastic limit, liquid limit, plasticity index, clay activity, optimum moisture content, and maximum dry density as inputs for CBR, UCS and R estimation, four evaluation metrics, namely coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), are used to evaluate the models' performance. The results indicate that XGBoost outperformed CatBoost and SVR in estimating these properties, yielding R2 of 0.9994, 0.9995 and 0.9999 in estimating the CBR, UCS and R respectively. SVR outperformed CatBoost in estimating the CBR and R with R2 of 0.9997. On the other hand, CatBoost outperformed SVR in estimating the UCS with R2 of 0.9994. Feature sensitivity analysis shows that the three machine learning techniques agree that the HARSH proportion influences the values of the estimated properties. A comparison with previous results also shows the superiority of XGBoost in estimating subgrade properties.
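The four evaluation metrics named in this abstract have standard definitions, which can be sketched in plain Python as below. The function name `regression_metrics` and the sample values are assumptions made for illustration; they are not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE and MAPE (in percent) for paired observations."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        # MAPE divides by the true value, so y_true must not contain zeros.
        "mape": 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical measurements and predictions, for illustration only.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

With only 121 samples, scores this close to 1.0 are worth checking on held-out data, since R2 computed on training data can overstate generalization.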
RieszBoost: Gradient Boosting for Riesz Regression
Lee, Kaitlyn J., Schuler, Alejandro
Answering causal questions often involves estimating linear functionals of conditional expectations, such as the average treatment effect or the effect of a longitudinal modified treatment policy. By the Riesz representation theorem, these functionals can be expressed as the expected product of the conditional expectation of the outcome and the Riesz representer, a key component in doubly robust estimation methods. Traditionally, the Riesz representer is estimated indirectly by deriving its explicit analytical form, estimating its components, and substituting these estimates into the known form (e.g., the inverse propensity score). However, deriving or estimating the analytical form can be challenging, and substitution methods are often sensitive to practical positivity violations, leading to higher variance and wider confidence intervals. In this paper, we propose a novel gradient boosting algorithm to directly estimate the Riesz representer without requiring its explicit analytical form. This method is particularly suited for tabular data, offering a flexible, nonparametric, and computationally efficient alternative to existing methods for Riesz regression. Through simulation studies, we demonstrate that our algorithm performs on par with or better than indirect estimation techniques across a range of functionals, providing a user-friendly and robust solution for estimating causal quantities.
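As a worked illustration of the Riesz representation mentioned in this abstract, consider the average treatment effect. The notation below (binary treatment $A$, covariates $X$, outcome $Y$, outcome regression $\mu$, propensity score $\pi$) is assumed for this sketch and is not taken from the paper. The last line shows a loss commonly used in the Riesz regression literature, which targets the representer directly without its analytical form; whether the paper uses exactly this objective is not stated in the abstract.

```latex
\begin{align*}
\mu(a, x) &= \mathbb{E}[Y \mid A = a, X = x], \qquad \pi(x) = P(A = 1 \mid X = x), \\
\psi &= \mathbb{E}\big[\mu(1, X) - \mu(0, X)\big] = \mathbb{E}\big[\alpha(A, X)\,\mu(A, X)\big], \\
\alpha(A, X) &= \frac{A}{\pi(X)} - \frac{1 - A}{1 - \pi(X)}, \\
\alpha &= \operatorname*{arg\,min}_{a}\; \mathbb{E}\big[a(A, X)^2 - 2\,\big(a(1, X) - a(0, X)\big)\big].
\end{align*}
```

The third line is the inverse propensity score form referred to in the abstract: when $\pi(X)$ is close to 0 or 1, the plug-in estimate of $\alpha$ becomes very large, which is the sensitivity to positivity violations described above.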
An Interpretable ML-based Model for Predicting p-y Curves of Monopile Foundations in Sand
Li, Biao, Song, Qing-Kai, Qi, Wen-Gang, Gao, Fu-Ping
Predicting the lateral pile response is challenging due to the complexity of pile-soil interactions. Machine learning (ML) techniques have gained considerable attention for their effectiveness in non-linear analysis and prediction. This study develops an interpretable ML-based model for predicting p-y curves of monopile foundations. An XGBoost model was trained using a database compiled from existing research. The results demonstrate that the model achieves superior predictive accuracy. Shapley Additive Explanations (SHAP) was employed to enhance interpretability. The SHAP value distributions for each variable demonstrate strong alignment with established theoretical knowledge on factors affecting the lateral response of pile foundations.
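SHAP attributions are based on Shapley values, which can be computed exactly for a small number of features by enumerating feature subsets. The sketch below is a brute-force illustration under stated assumptions: absent features are replaced by a baseline value, and `toy_model` is a made-up function, not the XGBoost p-y model from the study or the algorithm used inside the SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a subset take their baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2 * z[0] + z[1] * z[2]


phi = shapley_values(toy_model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
# The additive feature gets 2.0; the interaction of 6.0 is split evenly (3.0 each).
```

The attributions sum to the difference between the model output at `x` and at the baseline, which is the property that makes per-feature SHAP distributions comparable with domain knowledge.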
SALT: Sales Autocompletion Linked Business Tables Dataset
Klein, Tassilo, Biehl, Clemens, Costa, Margarida, Sres, Andre, Kolk, Jonas, Hoffart, Johannes
Foundation models, particularly those that incorporate Transformer architectures, have demonstrated exceptional performance in domains such as natural language processing and image processing. Adapting these models to structured data, like tables, however, introduces significant challenges. These difficulties are even more pronounced when addressing multi-table data linked via foreign keys, which is prevalent in the enterprise realm and crucial for empowering business use cases. Despite its substantial impact, research on such linked business tables within enterprise settings remains underexplored. To address this, we introduce a curated dataset sourced from an Enterprise Resource Planning (ERP) system, featuring extensive linked tables. This dataset is specifically designed to support research in table representation learning. By providing access to authentic enterprise data, we aim to enhance the effectiveness and applicability of models for real-world business contexts.
TreeLUT: An Efficient Alternative to Deep Neural Networks for Inference Acceleration Using Gradient Boosted Decision Trees
Khataei, Alireza, Bazargan, Kia
Accelerating machine learning inference has been an active research area in recent years. In this context, field-programmable gate arrays (FPGAs) have demonstrated compelling performance by providing massive parallelism in deep neural networks (DNNs). Neural networks (NNs) are computationally intensive during inference, as they require massive amounts of multiplication and addition, which makes their implementations costly. Numerous studies have recently addressed this challenge to some extent using a combination of sparsity induction, quantization, and transformation of neurons or sub-networks into lookup tables (LUTs) on FPGAs. Gradient boosted decision trees (GBDTs) are a high-accuracy alternative to DNNs in a wide range of regression and classification tasks, particularly for tabular datasets. The basic building block of GBDTs is a decision tree, which resembles the structure of binary decision diagrams. FPGA design flows are heavily optimized to implement such a structure efficiently. In addition to decision trees, GBDTs perform simple operations during inference, including comparison and addition. We present TreeLUT as an open-source tool for implementing GBDTs using an efficient quantization scheme, hardware architecture, and pipelining strategy. It primarily utilizes LUTs with no BRAMs or DSPs on FPGAs, resulting in high efficiency. We show the effectiveness of TreeLUT using multiple classification datasets, commonly used to evaluate ultra-low area and latency architectures. Using these benchmarks, we compare our implementation results with existing DNN and GBDT methods, such as DWN, PolyLUT-Add, NeuraLUT, LogicNets, FINN, hls4ml, and others. Our results show that TreeLUT significantly improves hardware utilization, latency, and throughput at competitive accuracy compared to previous works.
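The statement that GBDT inference needs only comparisons and additions can be illustrated with a small integer-valued ensemble. This is a sketch under stated assumptions, not TreeLUT's quantization scheme or hardware architecture: the tuple-based tree encoding, the integer thresholds and leaf values, and the function names are hypothetical.

```python
def predict_tree(node, features):
    """Walk one tree; a leaf is an int, an internal node is a 4-tuple."""
    while not isinstance(node, int):
        feature_index, threshold, left, right = node
        # Each internal node performs a single comparison.
        node = left if features[feature_index] <= threshold else right
    return node


def predict_gbdt(trees, features, bias=0):
    """Sum the selected leaf values of all trees: additions only."""
    return bias + sum(predict_tree(tree, features) for tree in trees)


# Hypothetical two-tree ensemble with quantized (integer) thresholds and leaves.
tree_a = (0, 5, (1, 2, -3, 1), 4)
tree_b = (1, 7, 2, -2)
trees = [tree_a, tree_b]

score_low = predict_gbdt(trees, [3, 9])   # 1 + (-2) = -1
score_high = predict_gbdt(trees, [8, 1])  # 4 + 2 = 6
label = 1 if score_high > 0 else 0        # binary decision from the sign of the score
```

Because no multiplications appear, each comparison and the final sum map naturally onto lookup tables and adders, which is the property the abstract relies on.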
Predicting Barge Presence and Quantity on Inland Waterways using Vessel Tracking Data: A Machine Learning Approach
Agorkua, Geoffery, Hernandez, Sarah, Falquez, Maria, Poddar, Subhadipto, Pang, Shihao
This study presents a machine learning approach to predict the number of barges transported by vessels on inland waterways using tracking data from the Automatic Identification System (AIS). While AIS tracks the location of tug and tow vessels, it does not monitor the presence or number of barges transported by those vessels. Understanding the number and types of barges conveyed along river segments, between ports, and at ports is crucial for estimating the quantities of freight transported on the nation's waterways. This insight is also valuable for waterway management and infrastructure operations, such as targeted dredging and data-driven resource allocation. Labeled sample data was generated using observations from traffic cameras located along key river segments and matched to AIS data records. A sample of 164 vessels representing up to 42 barge convoys per vessel was used for model development. The methodology involved first predicting barge presence and then predicting barge quantity. Features derived from the AIS data included speed measures, vessel characteristics, turning measures, and interaction terms. For predicting barge presence, the AdaBoost model achieved an F1 score of 0.932. For predicting barge quantity, the Random Forest combined with an AdaBoost ensemble model achieved an F1 score of 0.886. Bayesian optimization was used for hyperparameter tuning. By advancing predictive modeling for inland waterways, this study offers valuable insights for transportation planners and organizations, which require detailed knowledge of traffic volumes, including the flow of commodities, their destinations, and the tonnage moving in and out of ports.
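The two-stage methodology (presence first, then quantity) can be sketched as a simple cascade. Everything specific here is an assumption for illustration: the function `predict_barges`, the feature names, and the rule-based placeholder models stand in for the trained AdaBoost and Random Forest models, whose details are not given in the abstract.

```python
def predict_barges(features, presence_model, quantity_model):
    """Stage 1 decides whether barges are present; stage 2 estimates how many."""
    if not presence_model(features):
        return 0
    # Assumption: a vessel flagged as towing carries at least one barge.
    return max(1, quantity_model(features))


def toy_presence_model(features):
    # Hypothetical rule: slower vessels are assumed to be towing barges.
    return features["mean_speed_knots"] < 8.0


def toy_quantity_model(features):
    # Hypothetical rule: longer vessels are assumed to push more barges.
    return int(features["vessel_length_m"] // 10)


towing = {"mean_speed_knots": 5.5, "vessel_length_m": 42.0}
light = {"mean_speed_knots": 11.0, "vessel_length_m": 42.0}

n_towing = predict_barges(towing, toy_presence_model, toy_quantity_model)  # 4
n_light = predict_barges(light, toy_presence_model, toy_quantity_model)    # 0
```

Splitting the task this way lets the quantity model be trained only on vessels that actually carry barges, so it does not have to learn the zero case.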
Extending XReason: Formal Explanations for Adversarial Detection
Jemaa, Amira, Rashid, Adnan, Tahar, Sofiene
Explainable Artificial Intelligence (XAI) plays an important role in improving the transparency and reliability of complex machine learning models, especially in critical domains such as cybersecurity. Despite the prevalence of heuristic interpretation methods such as SHAP and LIME, these techniques often lack formal guarantees and may produce inconsistent local explanations. To address this gap, a few tools have emerged that use formal methods to provide formal explanations. Among these, XReason uses a SAT solver to generate formal instance-level explanations for XGBoost models. In this paper, we extend the XReason tool to support LightGBM models as well as class-level explanations. Additionally, we implement a mechanism to generate and detect adversarial examples in XReason. We evaluate the efficiency and accuracy of our approach on the CICIDS-2017 dataset, a widely used benchmark for detecting network attacks.