Genre
Donor-Aware scRNA-seq Benchmarks for IBD Classification
Donor-level disease classification from single-cell RNA sequencing (scRNA-seq) requires strict donor-aware cross-validation: naive pipelines that split cells randomly conflate training and test donors, inflating reported performance through pseudoreplication. We present a donor-aware benchmark evaluating three feature representations across two independent IBD cohorts: centered log-ratio (CLR) transformed cell-type composition, GatedStructuralCFN dependency embeddings, and scVI variational autoencoder latent embeddings. The cohorts are the SCP259 ulcerative colitis atlas (UC vs. Healthy, n=30 donors, 51 cell types) and the Kong 2023 Crohn's disease atlas (CD vs. Healthy, n=71 donors, 55-68 cell types across three intestinal regions). Compartment-stratified CLR composition achieves AUROC 0.956 +/- 0.061 on SCP259; GatedStructuralCFN on the same features achieves 0.978 +/- 0.050. In the Kong cohort, CFN achieves its best performance in the colon region (0.960 +/- 0.055 after feature filtering), exceeding linear CLR (0.900 +/- 0.100), while terminal ileum classification is dominated by linear models (CatBoost CLR 0.967 +/- 0.075 vs. CFN 0.811 +/- 0.164). Cross-dataset transfer (CD->UC, four shared cell types) achieves AUC 0.833 with XGBoost CLR; the reverse direction performs at chance. CFN edge stability analysis shows that compartment-wise composition eliminates spurious unit-sum-induced instability present in global composition (Jaccard 0.026 vs. top-20 recurrence 1.0). CFN shows a consistent numerical advantage over linear models in the colon region of CD (AUROC 0.960 vs. 0.900), though no inter-method comparison reached statistical significance at n<=34 donors per region. Compartment-aware feature construction is critical for both classification performance and structural interpretability. Code: https://github.com/Jonathan-321/sfn-scrna-study
On the Spectral Structure and Objective Equivalence of Orthogonal Multilabel Fisher Discriminants
Keith-Norambuena, Brian, Bekios-Calfa, Juan
We provide a unified theoretical analysis of Linear Discriminant Analysis with simultaneous multilabel scatter matrix formulations and Stiefel orthogonality constraints. Our contributions span both algebraic structure and statistical guarantees. On the algebraic side, we characterize the rank of the multilabel between-class scatter matrix, showing that the effective discriminant dimensionality can strictly exceed the classical single-label bound of $C-1$; we establish a multilabel partition of variance and prove that all four Fisher objectives are equivalent under the $W^\top S_t^{ML} W = I_r$ constraint while characterizing their divergence under the Stiefel constraint; and we prove a two-sided label-distance preservation bound relating projected distances to Hamming distances in label space. On the statistical side, we establish a finite-sample $O(k_{\max}\sqrt{d\log d/n}/gap_r)$ bound on the subspace estimation error under sub-Gaussian noise with a matching $ฮฉ(ฯ^2 d/(n\,gap_r))$ minimax lower bound, establishing a near-minimax-optimal rate (matching up to logarithmic and $k_{\max}$ factors) for multilabel discriminant subspace estimation. We further provide high-probability distance concentration, robustness guarantees under label interactions, and a regularization analysis preserving the spectral structure when $d \gg n$. All results are verified numerically on synthetic data generated from the linear label-effect model, covering both the algebraic identities and the multilabel-specific quantities ($k_{\max}$, $ฮบ(S_t^{ML})$, $\|ฮ/n\|_2$, $ฮ_r$) that govern the statistical bounds. The numerical experiments are designed as a sanity check for the theorems rather than as an empirical benchmark; evaluation on real multilabel datasets is left to future work targeting application-oriented venues.
Imbalanced Classification under Capacity Constraints
Fraiman, Daniel, Fraiman, Ricardo
In many classification settings, the class of primary interest is underrepresented, leading to imbalanced data problems that arise in applications such as rare disease detection and fraud identification. In these contexts, identifying a potential positive instance typically triggers costly follow-up actions, such as medical imaging or detailed transaction inspection, which are subject to limited operational capacity. Motivated by this setting, we consider classification problems where data may arrive sequentially and decisions must be made under constraints on the number of instances that can be selected for further analysis. We propose a classification framework that explicitly controls the rate of positive predictions, enforcing a user-defined bound on the proportion of observations classified as belonging to the minority class while maximizing detection performance. The approach can be implemented using standard learning methods and naturally extends to online settings, where decisions are taken in real time. We show that incorporating capacity constraints leads to substantial improvements over classical approaches, including resampling techniques such as SMOTE, which do not directly control the selection rate.
Adaptive Estimation and Optimal Control in Offline Contextual MDPs without Stationarity
Bhattacharyya, Riddhiman, Chakrabarty, Sayak, Banerjee, Imon
Contextual MDPs are powerful tools with wide applicability in areas from biostatistics to machine learning. However, specializing them to offline datasets has been challenging due to a lack of robust, theoretically backed methods. Our work tackles this problem by introducing a new approach towards adaptive estimation and cost optimization of contextual MDPs. This estimator, to the best of our knowledge, is the first of its kind, and is endowed with strong optimality guarantees. We achieve this by overcoming the key technical challenges evolving from the endogenous properties of contextual MDPs; such as non-stationarity, or model irregularity. Our guarantees are established under complete generality by utilizing the relatively recent and powerful statistical technique of $T$-estimation (Baraud, 2011). We first provide a procedure for selecting an estimator given a sample from a contextual MDP and use it to derive oracle risk bounds under two distinct, but nevertheless meaningful, loss functions. We then consider the problem of determining the optimal control with the aid of the aforementioned density estimate and provide finite sample guarantees for the cost function.
Bandits on graphs and structures
The goal of this thesis is to investigate the structural properties of certain sequential problems in order to bring the solutions closer to a practical use. In the first part, we put a special emphasis on structures that can be represented as graphs on actions. In the second part, we study the large action spaces that can be of exponential size in the number of base actions or even infinite. For graph bandits, we consider the settings of smoothness of rewards (spectral bandits), side observations, and influence maximization. For large structured domains, we cover kernel bandits, polymatroid bandits, bandits for function optimization (including unknown smoothness), and infinitely many-arms bandits. The thesis aspires to be a survey of the author's contributions on graph and structured bandits.
Adaptive graph-based algorithms for conditional anomaly detection and semi-supervised learning
We develop graph-based methods for semi-supervised learning based on label propagation on a data similarity graph. When data is abundant or arrive in a stream, the problems of computation and data storage arise for any graph-based method. We propose a fast approximate online algorithm that solves for the harmonic solution on an approximate graph. We show, both empirically and theoretically, that good behavior can be achieved by collapsing nearby points into a set of local representative points that minimize distortion. Moreover, we regularize the harmonic solution to achieve better stability properties. We also present graph-based methods for detecting conditional anomalies and apply them to the identification of unusual clinical actions in hospitals. Our hypothesis is that patient-management actions that are unusual with respect to the past patients may be due to errors and that it is worthwhile to raise an alert if such a condition is encountered. Conditional anomaly detection extends standard unconditional anomaly framework but also faces new problems known as fringe and isolated points. We devise novel nonparametric graph-based methods to tackle these problems. Our methods rely on graph connectivity analysis and soft harmonic solution. Finally, we conduct an extensive human evaluation study of our conditional anomaly methods by 15 experts in critical care.
Bandits attack function optimization
Preux, Philippe, Munos, Rรฉmi, Valko, Michal
We consider function optimization as a sequential decision making problem under budget constraint. This constraint limits the number of objective function evaluations allowed during the optimization. We consider an algorithm inspired by a continuous version of a multi-armed bandit problem which attacks this optimization problem by solving the tradeoff between exploration (initial quasi-uniform search of the domain) and exploitation (local optimization around the potentially global maxima). We introduce the so-called Simultaneous Optimistic Optimization (SOO), a deterministic algorithm that works by domain partitioning. The benefit of such approach are the guarantees on the returned solution and the numerical efficiency of the algorithm. We present this machine learning approach to optimization, and provide the empirical assessment of SOO on the CEC'2014 competition on single objective real-parameter numerical optimization test-suite.
A Hierarchical Sampling Framework for bounding the Generalization Error of Federated Learning
Filatrella, Dario, Thobaben, Ragnar, Skoglund, Mikael
We study expected generalization bounds for the Hierarchical Federated Learning (HFL) setup using Wasserstein distance. We introduce a generalized framework in which data is sampled hierarchically, and we model it with a multi-layered tree structure that induces dependencies among the clients' datasets. We derive generalization bounds in terms of Wasserstein distance under the Lipschitz assumption on the loss function, by applying a supersample construction that allows us to measure the sensitivity of the algorithm to the change of a single node in the sampling tree. By leveraging the FL structure, we recover and strictly imply existing state-of-the-art conditional mutual information (CMI) bounds in the case of bounded losses. We also show that our bound can be applied together with Differential Privacy assumptions, to recover generalization bounds based on algorithmic privacy. To assess the tightness of our bounds, we study the Gaussian Location Model (GLM) and show that we recover the actual asymptotic rate of the generalization error.
Free Decompression with Algebraic Spectral Curves
Ameli, Siavash, van der Heide, Chris, Hodgkinson, Liam, Mahoney, Michael W.
At the core of scientific computing and much of modern machine learning (ML) lies the challenge of estimating the eigenvalues of high-dimensional Hermitian matrices. Such matrices, including kernels, Hessians, and graph representations, encode the intrinsic geometry and connectivity of the data and models built on them, rendering the pursuit of efficient spectral techniques a primary concern for both theory and practice. Studying eigenspectra has become a prominent approach to understanding performance and guiding training in deep learning [10, 20, 36, 53]. In many cases, the spectra of such matrices have non-trivial structure, often containing spikes, multiple multi-modal bulks, and heavy-tails [14, 25]. Conventional algorithms to extract eigenvalue information from these matrices have required that the data are able to be stored in memory, scratch space, or can at least be accessed as an implicit operator (via matrix-vector products). More recently, a new class of algorithms has emerged that is able to provide highly-accurate estimates of the eigenvalues (or summary functionals thereof [2]) of matrices, even without implicit or explicit access to the full matrix, i.e., of so-called impalpable matrices [1]. One such method, termed Free Decompression (FD), shows great promise as a tool for gaining access to the spectral distributions of such impalpable matrices. The central premise is that by appropriately sampling a small sub-matrix from the large impalpable matrix of interest, one can evolve a partial differential equation (PDE) in the Stieltjes transform of a spectral density in the decompression ratio to the desired matrix dimension.
Amortized Variational Inference for Joint Posterior and Predictive Distributions in Bayesian Uncertainty Quantification
Bayesian predictive inference propagates parameter uncertainty to quantities of interest through the posterior-predictive distribution. In practice, this is typically performed using a two-stage procedure: first approximating the posterior distribution of model parameters, and then propagating posterior samples through the predictive model via Monte Carlo simulation. This sequential workflow can be computationally demanding, particularly for high-fidelity models such as those governed by partial differential equations. We propose a variational Bayesian framework that directly targets the posterior-predictive distribution and jointly learns variational approximations of both the posterior and the corresponding predictive distribution. The formulation introduces a variational upper bound on the Kullback--Leibler divergence together with moment-based regularization terms. The variational distributions are trained in an amortized manner, shifting computational effort to an offline stage and enabling efficient online inference. Numerical experiments ranging from analytical benchmarks to a finite-element solid mechanics problem demonstrate that the proposed method achieves more accurate predictive distributions than conventional two-stage variational inference, while substantially reducing the cost of online predictive inference.