Statistical Learning
Partial Multi-Label Learning with Probabilistic Graphical Disambiguation
In partial multi-label learning (PML), each training example is associated with a set of candidate labels, among which only some labels are valid. As a common strategy to tackle the PML problem, disambiguation aims to recover the ground-truth labeling information from such inaccurate annotations. However, existing approaches mainly rely on heuristics or ad-hoc rules to disambiguate candidate labels, which may not be universal enough in complicated real-world scenarios. To provide a principled way for disambiguation, we make a first attempt to explore the probabilistic graphical model for the PML problem, where a directed graph is tailored to infer latent ground-truth labeling information from the generative process of partial multi-label data. Under the framework of stochastic gradient variational Bayes, a unified variational lower bound is derived for this graphical model, which is further relaxed probabilistically so that the desired prediction model can be induced with simultaneously identified ground-truth labeling information. Comprehensive experiments on multiple synthetic and real-world data sets show that our approach outperforms the state-of-the-art counterparts.
Value-Aware Product Recommendation by Customer Segmentation using a suitable High-Dimensional Similarity Measure
Acosta, María Florencia, Arancibia, Rodrigo García, Llop, Pamela, Lovatto, Mariel, Mansilla, Lucas
This paper presents a novel value-aware approach to product recommendation that simultaneously addresses the high dimensionality and sparsity of user-item data while explicitly incorporating the contribution of each product and user to overall sales revenue. The proposed framework encodes revenue contributions in the user-item matrix and computes customer similarity directly on this basis using suitable distance measures. This enables the segmentation of users according to the revenue-based similarity of their purchase baskets and supports recommendations aligned with profitability objectives. We compare conventional similarity metrics with a novel alternative tailored to high-dimensional contexts and propose three recommendation strategies based on revenue share, product popularity, and expected profit generation. The effectiveness of the proposed method is validated through simulation experiments and a real-world application using the UCI Online Retail dataset.
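As a rough illustration of what encoding revenue in the user-item matrix can look like, the sketch below builds a sparse revenue matrix from baskets and prices, compares customers on it, and computes each product's revenue share. The function names and the toy data are hypothetical, and plain cosine similarity stands in for the conventional metrics the abstract mentions; the paper's novel high-dimensional measure is not reproduced here.

```python
import math


def revenue_matrix(purchases, prices):
    """Turn {user: {item: quantity}} into {user: {item: revenue}}."""
    return {
        user: {item: qty * prices[item] for item, qty in basket.items()}
        for user, basket in purchases.items()
    }


def cosine_similarity(row_a, row_b):
    """Cosine similarity between two sparse revenue rows (a conventional stand-in)."""
    dot = sum(value * row_b.get(item, 0.0) for item, value in row_a.items())
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def revenue_share(matrix):
    """Fraction of total revenue contributed by each item."""
    total = sum(sum(row.values()) for row in matrix.values())
    share = {}
    for row in matrix.values():
        for item, revenue in row.items():
            share[item] = share.get(item, 0.0) + revenue / total
    return share


# Hypothetical toy data: three customers, three products.
purchases = {"u1": {"a": 2, "b": 1}, "u2": {"a": 4, "b": 2}, "u3": {"c": 1}}
prices = {"a": 10.0, "b": 5.0, "c": 100.0}
matrix = revenue_matrix(purchases, prices)
shares = revenue_share(matrix)
```

In this toy case `u1` and `u2` buy the same products in the same revenue proportions, so they would fall into the same segment, while `u3` shares no revenue with either.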
SCOPE-FE: Structured Control of Operator and Pairwise Exploration for Feature Engineering
Park, Minhee, Son, Seongyeon, Lee, Yonghyun, Kim, Eunchan
Automatic feature engineering is an effective approach for improving predictive performance in tabular learning. However, expand-and-reduce methods, such as OpenFE, become increasingly computationally expensive as the input dimensionality grows. This limitation arises primarily from the combinatorial explosion of candidate features generated through operator-feature combinations. To address this issue, we propose SCOPE-FE, a structured search space control framework that improves efficiency by reducing the candidate space prior to feature generation. SCOPE-FE jointly regulates two major sources of combinatorial growth: the operator space and feature-pair space. First, OperatorProbing estimates the dataset-specific utility of candidate operators and eliminates low-contribution operators in advance. Second, FeatureClustering employs spectral embedding and fuzzy c-means clustering to group structurally related features, thereby restricting candidate generation to relevant within-cluster combinations. In addition, we introduce ReliabilityScoring, which incorporates variance across subsamples to stabilize pruning decisions. Experiments on ten benchmark datasets demonstrate that SCOPE-FE substantially reduces feature engineering time while maintaining competitive predictive performance relative to existing baselines. The efficiency gains are particularly pronounced for high-dimensional datasets. These results indicate that structured control of the search space is an effective strategy for scalable automatic feature engineering. The code will be made publicly available upon acceptance.
Linear Models, Variable Selection, Artificial Intelligence
Alrawkan, Riyadh, Boone, Edward, Ghanam, Ryad, Westveld, Anton
Variable selection in linear regression models has been a problem since hypothesis testing began. Deciding which variables to include in or exclude from a model is not an easy task. Techniques such as Forward, Backward, and Stepwise Regression sequentially add or delete variables from a model. Penalized likelihood methods such as AIC and BIC seek to choose variables that have a significant contribution to the likelihood. Penalized sum-of-squares methods such as LASSO and Elastic Net penalize small coefficients so that only variables with large coefficients remain in the model. This work introduces an Artificial Intelligence approach to model selection in which an ANN is trained to determine the significance of the variables based on OLS estimates. A simulation study shows the accuracy across various sample sizes and variances. A further simulation study compares the performance of the approach against Forward, Backward, AIC, BIC and LASSO. The approach is illustrated using a dataset from the World Health Organization regarding Life Expectancy. A GitHub link is provided to the pretrained ANN, which can handle up to 100 predictor variables, the original WHO dataset, and the subset used in this work.
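For readers unfamiliar with the classical baselines named above, the following is a minimal sketch of forward selection scored by AIC, using the Gaussian form AIC = n ln(RSS/n) + 2k. It is not the authors' ANN approach; the helper names and the small orthogonal toy design are assumptions made for illustration, and the OLS fit solves the normal equations directly, which is adequate only for small, well-conditioned problems.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def rss(X, y, cols):
    """Residual sum of squares of OLS on an intercept plus the chosen columns."""
    Z = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, z))) ** 2 for z, yi in zip(Z, y))


def aic(X, y, cols):
    """Gaussian AIC up to an additive constant; k counts the intercept."""
    n = len(y)
    k = len(cols) + 1
    return n * math.log(rss(X, y, cols) / n) + 2 * k


def forward_select(X, y):
    """Greedily add the column that lowers AIC most; stop when none does."""
    selected, remaining = [], list(range(len(X[0])))
    best = aic(X, y, selected)
    while remaining:
        score, cand = min((aic(X, y, selected + [c]), c) for c in remaining)
        if score >= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = score
    return selected


# Hypothetical toy design: y depends on column 0 only; column 1 is pure noise.
x0 = [-1, -1, -1, -1, 1, 1, 1, 1]
x1 = [-1, -1, 1, 1, -1, -1, 1, 1]
noise = [-0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1]
X = [[float(a), float(b)] for a, b in zip(x0, x1)]
y = [2.0 * a + e for a, e in zip(x0, noise)]
chosen = forward_select(X, y)
```

Here the irrelevant column is orthogonal to the residual, so adding it leaves the RSS unchanged and only incurs the AIC penalty of 2, which stops the search after the first column.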
Bayesian X-Learner: Calibrated Posterior Inference for Heterogeneous Treatment Effects under Heavy-Tailed Outcomes
Conditional Average Treatment Effect (CATE) estimation in practice demands three properties simultaneously: heterogeneous effects τ(x), calibrated uncertainty over them, and robustness to the heavy tails that contaminate real outcome data. Meta-learners (Künzel et al., 2019) give (i); causal forests and BART give (i)-(ii) with Gaussian-tail assumptions; no widely used tool gives all three. We present Bayesian X-Learner, an X-Learner built on cross-fitted doubly robust pseudo-outcomes (Kennedy, 2020) with a full MCMC posterior over τ(x) via a Welsch redescending pseudo-likelihood. On Hill's IHDP benchmark the default configuration attains mean ε_PEHE = 0.56 on 5 replications (lowest mean; differences from S-/T-/X-learners, full-config Causal BART, and a causal forest baseline are not significant at α = 0.05, and rank ordering is unstable at 10 replications -- IHDP comparisons are competitive rather than dominant). On contaminated "whale" DGPs with up to 20-25% tail density, a one-flag extension (contamination_severity) that selects a Huber-δ nuisance loss per Huber's minimax-δ relation recovers RMSE 0.13 with tight credible intervals (single-cross-fit 30-seed coverage 83% [Wilson 66%, 93%] at 20% density; modularBayes pooling with Bayesian-bootstrap nuisance draws restores nominal 95% coverage). We validate on the Hillstrom email-marketing RCT (N = 42,613), demonstrating consistent behaviour on real heavy-tailed outcome data, and report covariate-stratified τ(x) coverage across covariate quintiles to substantiate calibration for heterogeneous effects beyond scalar summaries. We draw a clean distinction between tails-as-contamination (handled by Welsch + Huber nuisance) and tails-as-signal (handled by a tail-aware CATE basis); an empirical probe confirms a tail-aware basis recovers τ_tail with full subgroup coverage, while the library's Hill-estimator path is contamination-directed and should not be used for heterogeneous τ.
We map six empirical boundaries (contamination ceiling, clean-data efficiency cost, basis sensitivity, sample size, treatment type, compute) and show where other tools are preferable. Code and reproducible benchmarks are released.
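The building blocks named in this abstract have standard textbook forms, sketched below under stated assumptions: the doubly robust (AIPW) pseudo-outcome for a single unit, the Welsch loss, and the Huber loss. The function names and the tuning constants `c` and `delta` are illustrative choices, not the library's API, and the paper's cross-fitting, MCMC posterior, and minimax-δ selection rule are not reproduced.

```python
import math


def dr_pseudo_outcome(y, a, e, mu0, mu1):
    """Doubly robust (AIPW) pseudo-outcome for one unit.

    y: observed outcome, a: treatment in {0, 1}, e: estimated propensity,
    mu0 / mu1: outcome-model predictions under control / treatment.
    """
    return mu1 - mu0 + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e)


def welsch_loss(r, c):
    """Welsch loss: bounded above by c^2 / 2, so extreme residuals stop mattering."""
    return (c * c / 2.0) * (1.0 - math.exp(-((r / c) ** 2)))


def huber_loss(r, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)


# Hypothetical numbers for a treated unit with propensity 0.5.
phi = dr_pseudo_outcome(y=3.0, a=1, e=0.5, mu0=1.0, mu1=2.0)
small = welsch_loss(0.1, c=2.0)
huge = welsch_loss(1e6, c=2.0)
```

The contrast between the two robust losses is that the Huber loss keeps growing linearly in the residual, whereas the Welsch loss saturates, which is what "redescending" refers to: its derivative returns to zero for large residuals.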
The Bernstein-von Mises theorem for Bayesian one-pass online learning
Lee, Jeyong, Choi, Junhyeok, Kim, Dongguen, Chae, Minwoo
Bayesian online learning provides a coherent framework for sequential inference. However, its theoretical understanding remains limited, particularly in the one-pass setting. Existing theoretical guarantees typically require the mini-batch sample size to diverge, a condition that fails in the one-pass regime. In this paper, we propose a new Bayesian online learning algorithm tailored to the one-pass setting, which incorporates a warm-start phase to ensure stable sequential updates. For this algorithm, we show that the sequentially updated posterior attains the optimal convergence rate. Building on this, we establish an online analogue of the Bernstein-von Mises theorem, which guarantees valid uncertainty quantification without diverging mini-batch sample sizes. Our analysis is based on a novel theoretical framework that differs fundamentally from existing approaches in the online learning literature. Numerical experiments on generalized linear models show that the proposed method matches the performance of the batch estimator while outperforming existing online procedures.
Mind the Gap: Structure-Aware Consistency in Preference Learning
Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate [...] However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization [...] To resolve this, we formulate LLM alignment within a margin-shifted ranking framework [...] bounds that depend on enforcing a separation margin γ. Crucially, we extend this to Structure-Aware H-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and [...] the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.
1. Introduction
The alignment of Large Language Models (LLMs) has shifted from explicit Reward Modeling (Stiennon et al., [...] surrogate loss (e.g., the logistic loss) as a proxy for the true objective: the non-convex, discontinuous 0-1 ranking loss. This reliance raises a fundamental theoretical question that remains largely unanswered for deep networks: Does minimizing these surrogate losses actually guarantee the minimization of the true ranking error? In this work, we investigate this question through the lens of H-consistency (Mao, Mohri, and Zhong, 2023e). We formulate LLM preference learning as a pairwise ranking problem and derive a series of results that bridge the gap between learning theory and practical fine-tuning. [...] we identify a fundamental theoretical deficiency in standard [...] We demonstrate that for equicontinuous hypothesis sets, a property satisfied by neural networks, standard surrogate minimization yields vacuous consistency guarantees. Specifically, without explicit constraints, a model can achieve arbitrarily low surrogate risk while maintaining a high ranking error, effectively "cheating" the objective by shrinking score differences rather than learning the correct ordering. We prove that enforcing a confidence gap γ is not merely a heuristic, but a strict requirement for H-consistency in the deep learning regime. However, while a uniform margin restores consistency, it is a blunt instrument. We show that demanding a large, fixed margin on semantically identical pairs (synonyms) forces the model to hallucinate differences where none exist, introducing bias and instability. To address this, we propose Structure-Aware H-consistency and a corresponding objective, Structure-Aware DPO (SA-DPO).
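To make the idea of a margin on the score difference concrete, here is a small sketch of a pairwise logistic loss, a version shifted by a fixed margin γ, and a variant whose margin scales with a semantic distance in [0, 1]. This is an assumed, simplified form chosen for illustration only: the function names are hypothetical, and the exact SA-DPO objective and its distance measure are not given in the text above.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic_pair_loss(score_diff):
    """Standard pairwise logistic loss on s(preferred) - s(rejected)."""
    return softplus(-score_diff)


def margin_shifted_loss(score_diff, gamma):
    """Logistic loss shifted so that the pair must be separated by gamma."""
    return softplus(gamma - score_diff)


def structure_aware_loss(score_diff, gamma, distance):
    """Assumed form: margin scaled by a semantic distance in [0, 1].

    distance = 0 (e.g. synonyms) demands no margin; distance = 1 demands gamma.
    """
    return margin_shifted_loss(score_diff, gamma * distance)


# Hypothetical pair whose scores differ by 0.5, with gamma = 1.
plain = logistic_pair_loss(0.5)
shifted = margin_shifted_loss(0.5, gamma=1.0)
synonym = structure_aware_loss(0.5, gamma=1.0, distance=0.0)
```

With a fixed margin, a pair separated by less than γ is penalized more than under the plain logistic loss; scaling the margin by distance removes that extra penalty for near-identical responses.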
Linear-Core Surrogates: Smooth Loss Functions with Linear Rates for Classification and Structured Prediction
The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear $H$-consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity $O(|\mathscr{Y}|^2)$ of exact inference (e.g., Viterbi). Empirically, our method achieves a 23$\times$ speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.
Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing
In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic ``generalization gap'', where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and $100\%$ data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.