Statistical Learning
Semi-Parametric Bayesian Additive Regression Trees for Risk Prediction with High-Dimensional Epigenetic Signatures and Low-Dimensional Covariates
Bhandari, Saurabh, Bhatti, Parveen, Chiu, Brian C. -H., Ji, Yuan
In the era of precision medicine, genome-wide epigenetic modifications offer rich data that could inform risk prediction. However, these data are high-dimensional and exhibit complex dependence structures, which makes it difficult to jointly model them with low-dimensional covariates when the goal is to obtain interpretable effect estimates for covariate adjustment. Standard Bayesian additive regression trees (BART) provide strong predictive performance but treat all predictors uniformly within the tree ensemble, obscuring the contributions of significant covariates and complicating variable selection in high-dimensional settings. We propose a semi-parametric BART model (spBART) that addresses this limitation by modeling low-dimensional covariates through a parametric component with interpretable coefficients, while capturing complex nonlinear associations among high-dimensional predictors through the tree ensemble. To perform stable variable selection, we develop a cross-validation-based procedure that aggregates posterior inclusion probabilities across folds and applies Bayesian false discovery rate control. We apply the proposed method to a pooled case--control analysis of high-dimensional genome-wide 5-hydroxymethylcytosine profiles derived from circulating cell-free DNA in two multiple myeloma studies ($N = 869$). The approach identifies a parsimonious set of candidate loci and achieves strong out-of-sample discrimination (AUC $= 0.96$) in a held-out validation set. Overall, spBART provides a unified framework for combining interpretable covariate inference with flexible modeling and variable selection in high-dimensional biomedical studies.
Symbolic Density Estimation for Discrete Distributions
Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.
Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
Morelli, Fabian, Eckstein, Stephan
Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron-level similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined. Finally, we show that generalized pruning applied to a single network yields similar benefits as partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at https://github.com/Fabian-Mor/partial_fusion_nn.
Proxy-Based Approximation of Shapley and Banzhaf Interactions
Thies, Santo M. A. R., Baniecki, Hubert, Witter, R. Teal, Hรผllermeier, Eyke, Muschalik, Maximilian, Fumagalli, Fabian
Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning
Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.
Diffusion-based Denoising Beats Vanilla Score Matching in Parameter Estimation: A Theoretical Explanation
Schwienhorst, Benedikt Lรผtke, Klein, Nadja, Lederer, Johannes
Score matching is an alternative to maximum likelihood estimation when the normalizing constant is unknown or too costly to evaluate. However, vanilla score matching has shown to be inefficient relative to maximum likelihood estimation for multimodal distributions with well-separated modes, which are commonly encountered in practical applications. We compare a novel diffusion-based denoising score matching estimator (DDSME) to the vanilla score matching estimator (SME) in this scenario. In particular, we prove statistical guarantees for both estimators, showing that the error bound for the vanilla SME worsens when the separation between the modes increases, which can be avoided in case of the DDSME with suitable hyperparameter tuning. This provides a novel theoretical explanation for the superior behavior of diffusion-based score matching over the vanilla version.
Uncertainty-aware classification and triage of structural heart disease using electrocardiography and echocardiography metrics
Machine learning methods provide a methodological innovation that can help screen for cardiovascular disease through noninvasive and readily available measurement modalities. Recent investments in using electrocardiogram (ECG) data to screen for structural heart disease (SHD) are one example, where ECGs provide a low-cost, available modality for screening. This has led to the EchoNext dataset, a paired ECG-echocardiogram data repository for testing new methods of SHD detection. However, relatively few studies have investigated how more probabilistic classification through Bayesian inference may improve uncertainty quantification in this setting. Moreover, few studies have considered how triage systems can be developed to alleviate healthcare bottlenecks, such as the review of data from underserved, rural clinics by expert sonographers for SHD assessment. In this study, we leverage existing ECG-echocardiogram data to compare frequentist and Bayesian neural network classifiers. We show that the Bayesian approach is comparable or better than frequentist methods in SHD classification, and that they have a more robust uncertainty quantification attached to them. We provide an example of how this uncertainty-aware classification scheme can be used for screening SHD, providing a proof-of-concept for how machine learning can help with triage in getting individuals expert sonographer input when SHD is highly likely or measurements are highly uncertain.
Robust OT-Guided Generative Residual Domain Adaptation for Bike-Sharing Demand Prediction under Temporal Domain Shift
Bike-sharing models trained on historical station-hour data may degrade when deployed in later years because travel patterns change over time. This paper studies March Citi Bike demand prediction from 2021 to 2026 as a temporal domain adaptation problem and proposes Gen-ROTDA, a robust optimal transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor with a small labeled target subset, transfers residual rather than raw demand, applies a deterministic label-preserving residual feature generator, and trims high-cost transport matches before training the final residual predictor. Experiments compare Gen-ROTDA with anchor-only, source-only, target-only, fine-tuning, MMD adaptation, Sinkhorn OTDA, ROTDA, and Gen-OTDA. Gen-ROTDA achieves the lowest MAE on the main 2025 to 2026 task and is the best OT-family method on average across multi-year tasks, although fine-tuning and MMD adaptation remain strong overall baselines. Under abnormal target-unlabeled records, Gen-ROTDA is much more stable than non-robust OT variants, suggesting that robust transport is useful for noisy temporal transfer in bike-sharing demand prediction.
Operationalizing Individual Fairness via Gradient Descent and Bradley-Terry Models
Olson, Conlan, Zhang, Linjun, Deng, Zhun, Sur, Pragya
Individual fairness, the notion that "similar individuals should be treated similarly," provides a strong and flexible fairness guarantee for algorithmic decision makers. However, a barrier to implementing individual fairness in practice is the difficulty of learning the similarity metric over individuals. In this work, we present an algorithm for learning a Mahalanobis similarity metric from triplet queries of the form "is individual $i$ more similar to individual $j$ or $k$?" We work in the standard Bradley-Terry model for pairwise comparisons. Our algorithm consists of a spectral initialization step followed by gradient descent. We provide extensive theoretical guarantees on our algorithm, showing that it converges quickly to the ground truth metric despite the non-convexity of the loss in our model. Because our focus is on fairness, we also show that individual fairness with respect to an estimated metric is sufficient to achieve similar fairness with respect to the true metric. We also discuss potential applications of our work to AI model tuning. Finally, we present experimental results that demonstrate the convergence of our algorithm and the fairness performance of downstream fair predictors trained on our estimated metric.
Coupled Training with Privileged Information and Unlabeled Data
Shi, Jiahao, Hagrass, Omar, Klusowski, Jason M.
In many prediction problems, we have extra information during training (for example, measurements that are expensive or slow to collect) that will not be available when the model is deployed. A common strategy is to first train a model that uses all training information, then use its predictions on unlabeled examples to train a second model that only uses the inputs available at test time. However, when the extra training-only information is weak or noisy, this Two-Stage approach can mislead the deployment model and even hurt accuracy. We propose a joint training method that learns the two models together, so the deployment model can benefit from the extra information only when it actually helps, instead of inheriting its mistakes. We provide guarantees that describe when joint training improves prediction accuracy and analyze a simple alternating training algorithm for large, high-dimensional models. Experiments on synthetic data and real-world prediction tasks show that our approach avoids these failures and robustly outperforms standard Two-Stage baselines.