Ensemble Learning

Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection

arXiv.org Artificial Intelligence

Graph-based methods are becoming increasingly popular in machine learning due to their ability to model complex data and relations. Insurance fraud is a prime use case, since false claims are often the result of organised criminals that stage accidents or the same persons filing erroneous claims on multiple policies. One challenge is that graph-based approaches struggle to find meaningful representations of the data because of the high class imbalance present in fraud data. Another is that insurance networks are heterogeneous and dynamic, given the changing relations among people, companies and policies. That is why gradient boosted tree approaches on tabular data still dominate the field. Therefore, we present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. We show that our estimator competes with popular graph neural network approaches in an experiment using a variety of simulated random graphs. We demonstrate the power of G-GBM for insurance fraud detection using an open-source and a real-world, proprietary dataset. Given that the backbone model is a gradient boosting forest, we apply established explainability methods to gain better insights into the predictions made by G-GBM.


DeepBoost-AF: A Novel Unsupervised Feature Learning and Gradient Boosting Fusion for Robust Atrial Fibrillation Detection in Raw ECG Signals

arXiv.org Artificial Intelligence

Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with elevated health risks, where timely detection is pivotal for mitigating stroke-related morbidity. This study introduces a hybrid methodology integrating unsupervised deep learning and gradient boosting models to improve AF detection. A 19-layer deep convolutional autoencoder (DCAE) is coupled with three boosting classifiers (AdaBoost, XGBoost, and LightGBM, abbreviated LGBM) to harness their complementary advantages while addressing individual limitations. The proposed framework combines DCAE with gradient boosting, enabling end-to-end AF identification without manual feature extraction. The DCAE-LGBM model attains an F1-score of 95.20%, sensitivity of 99.99%, and inference latency of four seconds, outperforming existing methods and aligning with clinical deployment requirements. The DCAE integration significantly enhances boosting models, positioning this hybrid system as a reliable tool for automated AF detection in clinical settings.


Bond-Centered Molecular Fingerprint Derivatives: A BBBP Dataset Study

arXiv.org Artificial Intelligence

A strong and fast baseline in molecular property prediction is a Random Forest (RF) trained on ECFP4/ECFP6 descriptors. In practice, the count-based variant of ECFP generally outperforms the binary variant, especially for classification. Recent deep-learning approaches can match or exceed these baselines, including pretrained transformer-CNN models (5) and graph neural networks such as Chemprop or AttentiveFP (6). Chemprop's key architectural choice is directed, bond-centered message passing, in contrast to the more common atom-centered formulations used by many MPNNs. Because much of the remaining architecture is comparable across message-passing GNNs, this raises a focused question: what concrete advantage does the bond-centered formulation confer over atom-centered approaches? To isolate this representational factor, we introduce a static Bond-Centered Fingerprint (BCFP) that mirrors Chemprop's bond-centric view, and we compare it directly against ECFP using a lightweight Random Forest or XGBoost pipeline on the Blood-Brain Barrier Penetration (BBBP) classification task. To our knowledge, this is the first study to propose BCFP and analyze its complementarity with ECFP (7). Our results indicate that concatenating atom- and bond-centered fingerprints yields efficient and effective models for BBBP prediction, clarifying why bond-centric message passing often appears among top-k performers while offering a simple, fast alternative to full neural architectures.


IMLP: An Energy-Efficient Continual Learning Method for Tabular Data Streams

arXiv.org Artificial Intelligence

Tabular data streams are rapidly emerging as a dominant modality for real-time decision-making in healthcare, finance, and the Internet of Things (IoT). These applications commonly run on edge and mobile devices, where energy budgets, memory, and compute are strictly limited. Continual learning (CL) addresses such dynamics by training models sequentially on task streams while preserving prior knowledge and consolidating new knowledge. While recent CL work has advanced in mitigating catastrophic forgetting and improving knowledge transfer, the practical requirements of energy and memory efficiency for tabular data streams remain underexplored. In particular, existing CL solutions mostly depend on replay mechanisms whose buffers grow over time and exacerbate resource costs. We propose a context-aware incremental Multi-Layer Perceptron (IMLP), a compact continual learner for tabular data streams. IMLP incorporates a windowed scaled dot-product attention over a sliding latent feature buffer, enabling constant-size memory and avoiding storing raw data. The attended context is concatenated with current features and processed by shared feed-forward layers, yielding lightweight per-segment updates. To assess practical deployability, we introduce NetScore-T, a tunable metric coupling balanced accuracy with energy for Pareto-aware comparison across models and datasets. IMLP achieves up to $27.6\times$ higher energy efficiency than TabNet and $85.5\times$ higher than TabPFN, while maintaining competitive average accuracy. Overall, IMLP provides an easy-to-deploy, energy-efficient alternative to full retraining for tabular data streams.
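
The windowed attention described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `LatentWindow`, the use of the buffered latents as both keys and values (no learned projections), and the zero context returned for an empty buffer are all assumptions made for illustration. The sketch only shows how a fixed-length buffer keeps memory constant while the attended context is concatenated with the current features.

```python
import math
from collections import deque


class LatentWindow:
    """Hypothetical sliding buffer of latent vectors with dot-product attention."""

    def __init__(self, window_size):
        # deque(maxlen=...) drops the oldest latent automatically,
        # so memory stays constant and no raw data is stored.
        self.buffer = deque(maxlen=window_size)

    def attend(self, query):
        """Scaled dot-product attention of `query` over the buffered latents."""
        if not self.buffer:
            # Assumption: with no history, the context is a zero vector.
            return [0.0] * len(query), []
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in self.buffer
        ]
        # Numerically stable softmax over the window.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of buffered latents (keys double as values here).
        context = [
            sum(w * v[i] for w, v in zip(weights, self.buffer))
            for i in range(len(query))
        ]
        return context, weights

    def step(self, latent):
        """Attend over past latents, then push the current one into the window."""
        context, weights = self.attend(latent)
        self.buffer.append(list(latent))
        # Concatenate current features with the attended context.
        return list(latent) + context, weights


window = LatentWindow(window_size=3)
stream = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
outputs = [window.step(latent) for latent in stream]
```

In a full model the concatenated vector would be passed to shared feed-forward layers; here it is simply returned so the shapes can be inspected.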


Towards Carbon-Aware Container Orchestration: Predicting Workload Energy Consumption with Federated Learning

arXiv.org Artificial Intelligence

The growing reliance on large-scale data centers to run resource-intensive workloads has significantly increased the global carbon footprint, underscoring the need for sustainable computing solutions. While container orchestration platforms like Kubernetes help optimize workload scheduling to reduce carbon emissions, existing methods often depend on centralized machine learning models that raise privacy concerns and struggle to generalize across diverse environments. In this paper, we propose a federated learning approach for energy consumption prediction that preserves data privacy by keeping sensitive operational data within individual enterprises. By extending the Kubernetes Efficient Power Level Exporter (Kepler), our framework trains XGBoost models collaboratively across distributed clients using Flower's FedXgbBagging aggregation strategy, eliminating the need for centralized data sharing. Experimental results on the SPECPower benchmark dataset show that our FL-based approach achieves 11.7 percent lower Mean Absolute Error compared to a centralized baseline. This work addresses the unresolved trade-off between data privacy and energy prediction efficiency in prior systems such as Kepler and CASPER and offers enterprises a viable pathway toward sustainable cloud computing without compromising operational privacy.
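
The idea of bagging-style aggregation for federated boosting can be sketched in plain Python. This is a toy illustration and does not use the Flower or XGBoost APIs: every function name below is hypothetical, the base learner is a one-feature regression stump rather than an XGBoost tree, and dividing the learning rate by the number of clients is this sketch's own choice to keep the step size stable. In each round, clients fit a stump to their local residuals and the server appends all client stumps to the shared ensemble, so only models leave the clients, never raw data.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Least-squares regression stump on a single feature."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict(model, x):
    """Sum the weighted outputs of every stump in the shared ensemble."""
    return sum(
        weight * (left if x < threshold else right)
        for weight, (threshold, left, right) in model
    )


def client_update(model, xs, ys):
    """Runs on a client: fit one stump to residuals on local data only."""
    residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
    return fit_stump(xs, residuals)


def server_aggregate(model, client_stumps, lr):
    """Runs on the server: append every client's stump to the ensemble."""
    weight = lr / len(client_stumps)  # assumption: average the client steps
    return model + [(weight, stump) for stump in client_stumps]


def mae(model, xs, ys):
    return mean([abs(y - predict(model, x)) for x, y in zip(xs, ys)])


# Toy data: y = 2x, split across two clients by interleaving samples.
xs = [float(i) for i in range(20)]
ys = [2.0 * x for x in xs]
clients = [(xs[k::2], ys[k::2]) for k in range(2)]

model = []
mae_before = mae(model, xs, ys)
for _ in range(10):
    stumps = [client_update(model, cx, cy) for cx, cy in clients]
    model = server_aggregate(model, stumps, lr=0.5)
mae_after = mae(model, xs, ys)
```

The ensemble grows by one stump per client per round, which is the defining trait of the bagging-style aggregation sketched here.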


SnapBoost: A Heterogeneous Boosting Machine

Thomas Parnell

Neural Information Processing Systems

We note that while the subclasses used in practice (e.g., trees) may well be infinite beyond a simple [...] Our proposed method for solving this optimization problem is presented in full in Algorithm 1. The supplemental material contains exemplary code for Algorithm 1 that uses generic scikit-learn regressors.