Goto

Collaborating Authors

 cohort


Multimodality Stacking with Blockwise missing values and application to the PIONeeR biomarkers study for prediction of resistance to immunotherapy

arXiv.org Machine Learning

Integrating multimodal datasets in clinical oncology is frequently hindered by high dimensionality and blockwise missingness, where entire data sources are unavailable for specific patient subsets. Standard survival models often struggle with these gaps, leading to biased results or patient exclusion. We introduce Multimodality Stacking with Blockwise missing values (MSB), a late-fusion framework for survival analysis that independently models modality-specific features before aggregating predictions via a cross-validated stacking meta-learner. MSB was validated on the PIONeeR study (n=443 patients, 378 biomarkers across eight heterogeneous sources) to predict progression-free survival in advanced non-small cell lung cancer patients receiving immunotherapy. MSB yielded higher predictive performance (C-index) than baseline algorithms. Improvements varied by baseline strength: linear models showed a 15.9% increase (p<0.001 for the Wilcoxon signed-rank test), random survival forests gained 5.4% (p=0.002), and gradient boosting methods improved by 2.1% (p=0.030). Beyond discrimination, MSB reduced the generalization gap (train-test difference in 5 folds cross-validation repeated 3 times: 0.055 vs 0.380 for linear models). Permutation importance analysis identified routine laboratory markers, clinical features, and PD-L1 expression as primary predictive drivers. Missing block indicators showed negligible importance, suggesting the model learned from biomarker values rather than data availability patterns. MSB provides a statistically validated framework for multimodal survival prediction with blockwise missingness. By enabling systematic biomarker evaluation without requiring complete data, MSB offers a practical tool for predictive modeling in biomedical research, pending external validation. Implementation is available at https://github.com/MohamedBoussena/MSB under Inria license.


SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

arXiv.org Machine Learning

Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.


Your SaaS Is an Insurance Product: A Modeling Framework

arXiv.org Machine Learning

Capped-usage SaaS products -- LLM subscriptions such as Claude Code and ChatGPT, cloud platforms such as Vercel and Cloudflare Workers, corporate benefit platforms, identity-verification services with liability transfer -- share a structural signature with insurance products: a fixed premium decoupled from realized consumption, stochastic per-user demand with heavy-tailed severity, a non-fungible cap that resets on a fixed schedule, and a portfolio-level exposure that requires reserve adequacy under tail risk. We argue that this is not an analogy. It is the same operational problem actuarial science has been tooled for decades to address, restated with new dependent variables (tokens, bandwidth bytes, function-invocations, gym check-ins) in place of medical claims. This paper proposes a modeling framework for capped-usage SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy. We map the framework to publicly observable subscription tiers in two domains (LLM services and cloud platforms), ground it in canonical health-insurance economics (Arrow 1963; Pauly 1968; Manning et al. 1987; Brot-Goldberg et al. 2017), and demonstrate divergence from traditional unit economics through a worked example. The contribution is operational rather than theoretical: not a new theorem, but vocabulary and tools currently absent from cs.LG/stat.ML practice.


Unsupervised Domain Shift Detection with Interpretable Subspace Attribution

arXiv.org Machine Learning

We developed a tool for detecting domain shifts, namely subtle differences in the probability distributions of datasets. We identify these shifts using an algorithm designed to detect localised density anomalies in high-dimensional feature spaces. If an anomaly is present, we then identify the feature subspace in which the anomaly is most pronounced. This allows us to trace the domain shift to a small set of features, making the shift interpretable. Moreover, we provide a protocol for compensating domain shifts by extracting, from two unlabelled datasets, subsets of samples with no detectable residual distributional difference. We validate the framework on controlled 20-dimensional benchmarks with known ground truth, recovering both broad and localized shifts together with their supporting feature subspaces. We then apply it to healthy electrocardiogram (ECG) recordings represented by 782 features. In age- and sex-matched cohort comparisons differing in measurement-device composition, the method detects device-induced shifts, extracts representative subsets enriched in the imbalanced device components, and identifies ECG features associated with the acquisition contrast. These results suggest that density-shift detection and subspace attribution provide a practical framework for uncovering hidden cohort biases before downstream modelling.


FRESH: Information-Geometric Calibration of Patient-Level Models to Aggregate Evidence

arXiv.org Machine Learning

Many decision in clinical science and epidemiology -- estimating probability of technical success for a clinical trial, assessing comparative effectiveness of two therapies, imputing a placebo effect onto natural history data -- rely on combining sources of information about a clinical cohort that comes from different kinds of studies. Specifically we contrast patient-level sources that provide granular pictures of individual disease course (clinical trial, registries, or electronic health records) with aggregate sources such as published clinical trial results and the TFLs (tables figures and listings). One strategy for combining aggregate with patient-level data sources is to bring each into a common format for a unified analysis. If one wants to maintain the analytic flexibility of patient-level data, then a natural solution is to convert the aggregate data information into a simulated patient-level dataset that recapitulate those aggregate statistics. This is an under-determined inverse problem in that there are many such datasets, and it cannot be well specified without further constraints. FRESH(Fusion of Recent Evidence with Subject Histories) provides a well-defined method for solving this problem, and therefore providing maximal analytic flexibility.


Disease Is a Spectral Perturbation

arXiv.org Machine Learning

We propose a novel method of understanding disease transformation from a healthy baseline with biomarker-level explainability. By modeling the biomarker covariance matrices of healthy controls and disease states, the perturbation can be individually characterized to accomplish mechanistic explanations of disease trajectories, both at a molecular level and for individual patients. Given a cohort of n patients each measured on p biomarkers, we define the biomarker "Hamiltonian" H = X^T X / n \in R^{p \times p}, where X \in R^{n \times p} is the covariant biomarker matrix. The eigenvectors of H define a set of normal modes of biomarker coordination, and the eigenvalues quantify the energy carried by each mode. In the healthy state, the reference Hamiltonian H_0 governs this structure where disease perturbs H_0 by an additive operator ΔH, thus shifting eigenvalues and rotating eigenvectors in proportion to the severity of pathological disruption. We formalize this framework, derive the spectral change given a disease perturbation, and demonstrate that the projection of a newly diagnosed patient's cumulative biomarker covariance structure onto disease-discriminant eigenmodes constitutes an optimal prognostic statistic for greater precision in disease prognosis. This work serves as a veritable white paper with application across a panoply of disease frameworks from cancer to neurodegenerative disorders.


Donor-Aware scRNA-seq Benchmarks for IBD Classification

arXiv.org Machine Learning

Donor-level disease classification from single-cell RNA sequencing (scRNA-seq) requires strict donor-aware cross-validation: naive pipelines that split cells randomly conflate training and test donors, inflating reported performance through pseudoreplication. We present a donor-aware benchmark evaluating three feature representations across two independent IBD cohorts: centered log-ratio (CLR) transformed cell-type composition, GatedStructuralCFN dependency embeddings, and scVI variational autoencoder latent embeddings. The cohorts are the SCP259 ulcerative colitis atlas (UC vs. Healthy, n=30 donors, 51 cell types) and the Kong 2023 Crohn's disease atlas (CD vs. Healthy, n=71 donors, 55-68 cell types across three intestinal regions). Compartment-stratified CLR composition achieves AUROC 0.956 +/- 0.061 on SCP259; GatedStructuralCFN on the same features achieves 0.978 +/- 0.050. In the Kong cohort, CFN achieves its best performance in the colon region (0.960 +/- 0.055 after feature filtering), exceeding linear CLR (0.900 +/- 0.100), while terminal ileum classification is dominated by linear models (CatBoost CLR 0.967 +/- 0.075 vs. CFN 0.811 +/- 0.164). Cross-dataset transfer (CD->UC, four shared cell types) achieves AUC 0.833 with XGBoost CLR; the reverse direction performs at chance. CFN edge stability analysis shows that compartment-wise composition eliminates spurious unit-sum-induced instability present in global composition (Jaccard 0.026 vs. top-20 recurrence 1.0). CFN shows a consistent numerical advantage over linear models in the colon region of CD (AUROC 0.960 vs. 0.900), though no inter-method comparison reached statistical significance at n<=34 donors per region. Compartment-aware feature construction is critical for both classification performance and structural interpretability. Code: https://github.com/Jonathan-321/sfn-scrna-study


CLVAE: A Variational Autoencoder for Long-Term Customer Revenue Forecasting

arXiv.org Machine Learning

Predicting customers' long-term revenue from sparse and irregular transaction data is central to marketing resource allocation in non-contractual settings, yet existing approaches face a trade-off. Traditional probabilistic customer base models deliver robust long-horizon forecasts by imposing strong structural assumptions, while flexible machine-learning models often require substantial training data and careful tuning. We propose a variational-autoencoder-based model that preserves the process-based likelihood of established attrition-transaction-spend models conditional on customer heterogeneity, but replaces the restrictive parametric mixing distribution with a flexible latent representation learned by encoder-decoder networks. The resulting approach (i) provides a single model for customer attrition, transactions and spending, (ii) remains reliable when contextual covariates are unavailable, and (iii) flexibly incorporates rich covariates and nonlinear effects when they are available. This design balances structural stability with the flexibility needed to capture complex purchase dynamics. Across multiple real-world datasets and prediction horizons, the proposed model improves upon the latest benchmarks. Businesses benefit directly, as a better assessment of customers' future revenues improves the efficiency of campaign targeting. For research, this work provides guidance on how to embed domain-specific models into the variational autoencoder framework, enabling flexible representation learning while retaining an econometrically meaningful process structure.



root

Neural Information Processing Systems

The high-level architecture of our simulator is illustrated in Figure 1 of Section 4. Additional details (with references to objects in the source code) are provided below. Simulations were run in Python 3.8 on an Intel(R) Xeon(R) CPU E5-2667 One direction is to extend the feature description of the ads (beyond topic) to include features that reflect ad quality and location. Baseline parameters are: µ = 0 . The resulting cohort errors are consistent with Figure 1 of Section 5.1. For the fully informative prior, the agent is completely certain of users' cohorts for Lastly, for the uninformative prior, revelation of a user's cookie does not inform the The agent's ability to distinguish users based on their responses depends on the similarities of affinities across users in different cohorts.