Ensemble Learning
Centrum: Model-based Database Auto-tuning with Minimal Distributional Assumptions
Lai, Yuanhao, Zheng, Pengfei, Ji, Chenpeng, Li, Yan, Zhang, Songhan, Zhang, Rutao, Wang, Zhengang, Du, Yunfei
Gaussian-Process-based Bayesian optimization (GP-BO), is a prevailing model-based framework for DBMS auto-tuning. However, recent work shows GP-BO-based DBMS auto-tuners significantly outperformed auto-tuners based on SMAC, which features random forest surrogate models; such results motivate us to rethink and investigate the limitations of GP-BO in auto-tuner design. We find the fundamental assumptions of GP-BO are widely violated when modeling and optimizing DBMS performance, while tree-ensemble-BOs (e.g., SMAC) can avoid the assumption pitfalls and deliver improved tuning efficiency and effectiveness. Moreover, we argue that existing tree-ensemble-BOs restrict further advancement in DBMS auto-tuning. First, existing tree-ensemble-BOs can only achieve distribution-free point estimates, but still impose unrealistic distributional assumptions on uncertainty estimates, compromising surrogate modeling and distort the acquisition function. Second, recent advances in gradient boosting, which can further enhance surrogate modeling against vanilla GP and random forest counterparts, have rarely been applied in optimizing DBMS auto-tuners. To address these issues, we propose a novel model-based DBMS auto-tuner, Centrum. Centrum improves distribution-free point and interval estimation in surrogate modeling with a two-phase learning procedure of stochastic gradient boosting ensembles. Moreover, Centrum adopts a generalized SGBE-estimated locally-adaptive conformal prediction to facilitate a distribution-free uncertainty estimation and acquisition function. To our knowledge, Centrum is the first auto-tuner to realize distribution-freeness, enhancing BO's practicality in DBMS auto-tuning, and the first to seamlessly fuse gradient boosting ensembles and conformal inference in BO. Extensive physical and simulation experiments on two DBMSs and three workloads show Centrum outperforms 21 SOTA methods.
Machine Learning Enabled Early Warning System For Financial Distress Using Real-Time Digital Signals
pant, Laxmi, Reza, Syed Ali, Rahman, Md Khalilor, Rahman, MD Saifur, Sharmin, Shamima, Mithu, Md Fazlul Huq, Hasnain, Kazi Nehal, Farabi, Adnan, khanom, Mahamuda, Kabir, Raisul
International Journal of Applied Mathematics Volume 38 No. 5 s, 2025 ISSN: 1311 - 1728 (printed version); ISSN: 1314 - 8060 (on - line version) Received: August 0 7, 2025 550 Abstract The growing instability of both global and domestic economic environments has increased the risk of financial distress at the household level. However, traditional econometric models often rely on delayed and aggregated data, limiting their effectiveness. This study introduces a machine learning - based early warning system that utilizes real - time digital and macroeconomic signals to identify financial distress in near real - time. Using a panel dataset of 750 households tracked over three monitoring rounds spa nning 13 months, the framework combines socioeconomic attributes, macroeconomic indicators (such as GDP growth, inflation, and foreign exchange fluctuations), and digital economy measures (including ICT demand and market volatility). Through data preproces sing and feature engineering, we introduce lagged variables, volatility measures, and interaction terms to capture both gradual and sudden changes in financial stability. We benchmark baseline classifiers, such as logistic regression and decision trees, ag ainst advanced ensemble models including random forests, XGBoost, and LightGBM. Our results indicate that the engineered features from the digital economy significantly enhance predictive accuracy. The system performs reliably for both binary distress dete ction and multi - class severity classification, with SHAP - based explanations identifying inflation volatility and ICT demand as key predictors. Crucially, the framework is International Journal of Applied Mathematics Volume 38 No. 5 s, 2025 ISSN: 1311 - 1728 (printed version); ISSN: 1314 - 8060 (on - line version) Received: August 0 7, 2025 551 By implementing machine learning in a transparent and interpretable manner, this study demonstrates the feasibility and impact of providing near - real - time early warnings of financial distress. This offers actionable insights that can strengthen household resilience and guide preemptive intervention strategies. Keywords: Financial Distress, Early Warning Systems, Machine Learning, Digital Economy, Temporal Classification, Explainable AI 1. Introduction 1.1 Background and Motivation The prediction of financial distress has long been recognized as a critical element for ensuring economic resilience and mitigating systemic risk across households, firms, and national economies.
Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models
Zhang, Xiyuan, Maddix, Danielle C., Yin, Junming, Erickson, Nick, Ansari, Abdul Fatir, Han, Boran, Zhang, Shuai, Akoglu, Leman, Faloutsos, Christos, Mahoney, Michael W., Hu, Cuixiong, Rangwala, Huzefa, Karypis, George, Wang, Bernie
Since the seminal work of TabPFN, research on tabular foundation models (TFMs) based on in-context learning (ICL) has challenged long-standing paradigms in machine learning. Without seeing any real-world data, models pretrained on purely synthetic datasets generalize remarkably well across diverse datasets, often using only a moderate number of in-context examples. This shifts the focus in tabular machine learning from model architecture design to the design of synthetic datasets, or, more precisely, to the prior distributions that generate them. Yet the guiding principles for prior design remain poorly understood. This work marks the first attempt to address the gap. We systematically investigate and identify key properties of synthetic priors that allow pretrained TFMs to generalize well. Based on these insights, we introduce Mitra, a TFM trained on a curated mixture of synthetic priors selected for their diversity, distinctiveness, and performance on real-world tabular data. Mitra consistently outperforms state-of-the-art TFMs, such as TabPFNv2 and TabICL, across both classification and regression benchmarks, with better sample efficiency.
A Multi-Layer Machine Learning and Econometric Pipeline for Forecasting Market Risk: Evidence from Cryptoasset Liquidity Spillovers
We study whether liquidity and volatility proxies of a core set of cryptoassets generate spillovers that forecast market-wide risk. Our empirical framework integrates three statistical layers: (A) interactions between core liquidity and returns, (B) principal-component relations linking liquidity and returns, and (C) volatility-factor projections that capture cross-sectional volatility crowding. The analysis is complemented by vector autoregression impulse responses and forecast error variance decompositions (see Granger 1969; Sims 1980), heterogeneous autoregressive models with exogenous regressors (HAR-X, Corsi 2009), and a leakage-safe machine learning protocol using temporal splits, early stopping, validation-only thresholding, and SHAP-based interpretation. Using daily data from 2021 to 2025 (1462 observations across 74 assets), we document statistically significant Granger-causal relationships across layers and moderate out-of-sample predictive accuracy. We report the most informative figures, including the pipeline overview, Layer A heatmap, Layer C robustness analysis, vector autoregression variance decompositions, and the test-set precision-recall curve. Full data and figure outputs are provided in the artifact repository.
No Intelligence Without Statistics: The Invisible Backbone of Artificial Intelligence
The rapid ascent of artificial intelligence (AI) is often portrayed as a revolution born from computer science and engineering. This narrative, however, obscures a fundamental truth: the theoretical and methodological core of AI is, and has always been, statistical. This paper systematically argues that the field of statistics provides the indispensable foundation for machine learning and modern AI. We deconstruct AI into nine foundational pillars-Inference, Density Estimation, Sequential Learning, Generalization, Representation Learning, Interpretability, Causality, Optimization, and Unification-demonstrating that each is built upon century-old statistical principles. From the inferential frameworks of hypothesis testing and estimation that underpin model evaluation, to the density estimation roots of clustering and generative AI; from the time-series analysis inspiring recurrent networks to the causal models that promise true understanding, we trace an unbroken statistical lineage. While celebrating the computational engines that power modern AI, we contend that statistics provides the brain-the theoretical frameworks, uncertainty quantification, and inferential goals-while computer science provides the brawn-the scalable algorithms and hardware. Recognizing this statistical backbone is not merely an academic exercise, but a necessary step for developing more robust, interpretable, and trustworthy intelligent systems. We issue a call to action for education, research, and practice to re-embrace this statistical foundation. Ignoring these roots risks building a fragile future; embracing them is the path to truly intelligent machines. There is no machine learning without statistical learning; no artificial intelligence without statistical thought.
The Sherpa.ai Blind Vertical Federated Learning Paradigm to Minimize the Number of Communications
Acero, Alex, Jimenez-Gutierrez, Daniel M., Pighin, Dario, Zuazua, Enrique, Del Rio, Joaquin, Uribe-Etxebarria, Xabi
Blind V ertical Federated Learning Paradigm to Minimize the Number of Communications Sherpa.ai Abstract--Federated Learning (FL) enables collaborative decentralized training across multiple parties (nodes) while keeping raw data private. There are two main paradigms in FL: Horizontal FL (HFL), where all participant nodes share the same feature space but hold different samples, and V ertical FL (VFL), where participants hold complementary features for the same samples. While HFL is widely adopted, VFL is employed in domains where nodes hold complementary features about the same samples. Still, VFL presents a significant limitation: the vast number of communications required during training. This compromises privacy and security, and can lead to high energy consumption, and in some cases, make model training unfeasible due to the high number of communications. In this paper, we introduce Sherpa.ai Blind V ertical Federated Learning (SBVFL), a novel paradigm that leverages a distributed training mechanism enhanced for privacy and security. De-coupling the vast majority of node updates from the server dramatically reduces node-server communication. Experiments show that SBVFL reduces communication by 99% compared to standard VFL while maintaining accuracy and robustness. Therefore, SBVFL enables practical, privacy-preserving VFL across sensitive domains, including healthcare, finance, manufacturing, aerospace, cybersecurity, and the defense industry. Federated Learning (FL) [1] enables collaborative training across multiple nodes (parties, clients, devices) while keeping raw data decentralized, sharing only model updates instead of centralizing data as in traditional Machine Learning (ML). FL is typically categorized into Horizontal FL (HFL), where nodes share the same feature space but hold different samples, and V ertical FL (VFL), where nodes hold data with different feature spaces for the same set of samples [2].
Combining ECG Foundation Model and XGBoost to Predict In-Hospital Malignant Ventricular Arrhythmias in AMI Patients
Huang, Shun, Xing, Wenlu, Geng, Shijia, Wang, Hailong, Nie, Guangkun, Tang, Gongzheng, He, Chenyang, Hong, Shenda
Malignant ventricular arrhythmias (VT/VF) following acute myocardial infarction (AMI) are a major cause of in-hospital death, yet early identification remains a clinical challenge. While traditional risk scores have limited performance, end-to-end deep learning models often lack the interpretability needed for clinical trust. This study aimed to develop a hybrid predictive framework that integrates a large-scale electrocardiogram (ECG) foundation model (ECGFounder) with an interpretable XGBoost classifier to improve both accuracy and interpretability. We analyzed 6,634 ECG recordings from AMI patients, among whom 175 experienced in-hospital VT/VF. The ECGFounder model was used to extract 150-dimensional diagnostic probability features , which were then refined through feature selection to train the XGBoost classifier. Model performance was evaluated using AUC and F1-score , and the SHAP method was used for interpretability. The ECGFounder + XGBoost hybrid model achieved an AUC of 0.801 , outperforming KNN (AUC 0.677), RNN (AUC 0.676), and an end-to-end 1D-CNN (AUC 0.720). SHAP analysis revealed that model-identified key features, such as "premature ventricular complexes" (risk predictor) and "normal sinus rhythm" (protective factor), were highly consistent with clinical knowledge. We conclude that this hybrid framework provides a novel paradigm for VT/VF risk prediction by validating the use of foundation model outputs as effective, automated feature engineering for building trustworthy, explainable AI-based clinical decision support systems.
CLIMB: Class-imbalanced Learning Benchmark on Tabular Data
Liu, Zhining, Li, Zihao, Yang, Ze, Wei, Tianxin, Kang, Jian, Zhu, Yada, Hamann, Hendrik, He, Jingrui, Tong, Hanghang
Class-imbalanced learning (CIL) on tabular data is important in many real-world applications where the minority class holds the critical but rare outcomes. In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data. CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms. Built on a high-quality open-source Python package with unified API designs, detailed documentation, and rigorous code quality controls, CLIMB supports easy implementation and comparison between different CIL algorithms. Through extensive experiments, we provide practical insights on method accuracy and efficiency, highlighting the limitations of naive rebalancing, the effectiveness of ensembles, and the importance of data quality. Our code, documentation, and examples are available at https://github.com/ZhiningLiu1998/imbalanced-ensemble.
Interaction Concordance Index: Performance Evaluation for Interaction Prediction Methods
Pahikkala, Tapio, Numminen, Riikka, Movahedi, Parisa, Karmitsa, Napsu, Airola, Antti
Consider two sets of entities and their members' mutual affinity values, say drug-target affinities (DTA). Drugs and targets are said to interact in their effects on DTAs if drug's effect on it depends on the target. Presence of interaction implies that assigning a drug to a target and another drug to another target does not provide the same aggregate DTA as the reversed assignment would provide. Accordingly, correctly capturing interactions enables better decision-making, for example, in allocation of limited numbers of drug doses to their best matching targets. Learning to predict DTAs is popularly done from either solely from known DTAs or together with side information on the entities, such as chemical structures of drugs and targets. In this paper, we introduce interaction directions' prediction performance estimator we call interaction concordance index (IC-index), for both fixed predictors and machine learning algorithms aimed for inferring them. IC-index complements the popularly used DTA prediction performance estimators by evaluating the ratio of correctly predicted directions of interaction effects in data. First, we show the invariance of IC-index on predictors unable to capture interactions. Secondly, we show that learning algorithm's permutation equivariance regarding drug and target identities implies its inability to capture interactions when either drug, target or both are unseen during training. In practical applications, this equivariance is remedied via incorporation of appropriate side information on drugs and targets. We make a comprehensive empirical evaluation over several biomedical interaction data sets with various state-of-the-art machine learning algorithms. The experiments demonstrate how different types of affinity strength prediction methods perform in terms of IC-index complementing existing prediction performance estimators.
Superior Molecular Representations from Intermediate Encoder Layers
Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we first analyze the information flow in five diverse molecular encoders and find that intermediate layers retain more general-purpose features, whereas the final-layer specializes and compresses information. We then perform an empirical layer-wise evaluation across 22 property prediction tasks. We find that using frozen embeddings from optimal intermediate layers improves downstream performance by an average of 5.4%, up to 28.6%, compared to the final-layer. Furthermore, finetuning encoders truncated at intermediate depths achieves even greater average improvements of 8.5%, with increases as high as 40.8%, obtaining new state-of-the-art results on several benchmarks. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code will be made publicly available.