xij
Empirical Bayes Rebiasing
Ling, Wanyi, Li, Sida, Guan, Junming, Ignatiadis, Nikolaos
We study methods for simultaneous analysis of many noisy and biased estimates, each paired with an even noisier estimate of its own bias. The analyst's goal is to construct short calibrated intervals for each parameter. The standard debiasing approach, which subtracts the bias estimate from each biased estimate, inflates variance and yields long intervals. In this paper, we propose an empirical Bayes rebiasing strategy that starts from the fully debiased estimates and learns from data how much bias to reintroduce by estimating the unknown bias distribution. We provide convergence rates for the coverage of our intervals when the bias distribution is estimated using nonparametric maximum likelihood. Furthermore, we demonstrate substantial precision gains in prediction-powered inference, including pairwise LLM win-rate evaluations, as well as for inference of direct genetic effects in family-based GWAS.
Stable Blanket with Hidden Variables and Cycles
Stabilized regression aims to identify a set of predictors whose conditional relationship with a response variable remains invariant across different environments. Existing graphical characterizations of the stable blanket are mainly developed for structural causal models (SCMs) without hidden variables or causal cycles. However, latent variables and feedback relationships naturally arise in many applications, and they can change both the Markov blanket and the set of predictors that remain stable under interventions. This paper studies stable blankets in graphical causal models with hidden variables, causal cycles, and both features simultaneously. For models with hidden variables, we use acyclic directed mixed graphs (ADMGs) and $m$-separation to characterize the Markov blanket and to construct intervention-stable predictor sets. We introduce the notion of an intervened sub-district and use it to describe how interventions may affect districts connected to the response. For models with cycles, we work with directed graphs (DGs) and directed mixed graphs (DMGs) together with $σ$-separation, treating strongly connected components (SCCs) as the basic graphical units. We then combine these ideas to analyze models with both hidden variables and cycles. The main results give graphical characterizations of Markov blankets, stable frontiers, and stable blankets in these generalized settings. In particular, we identify conditions under which the response is conditionally independent of intervention variables given a suitable predictor set, and we describe when such sets are minimal or unique. These results extend the graphical interpretation of stabilized regression beyond acyclic fully observed models.
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Agazzi, Andrea, Bruno, Giuseppe, García, Eloy Mosig, Saviozzi, Samuele, Romito, Marco
The transformer architecture [52], which underlies present-day Large Language Models, has been one of the main drivers of recent advances in machine learning and artificial intelligence. At each layer, the hidden state of the network is updated by sequentially applying two distinct operations: attention modules [3], which capture long-range interactions in the input sequence, and classical MultiLayer Perceptrons (MLPs), acting separately on each element of that sequence. Despite their empirical success, the mechanisms governing information propagation through depth, and the way attention and MLP blocks jointly shape internal representations, remain only partially understood from a theoretical viewpoint. Recent progress has come from viewing transformers in suitable scaling limits as deterministic mean-field interacting particle systems modeling the evolution of N tokens1 through the layers of the neural network architecture (the so-called residual stream dynamics), see, among others, [46, 26, 27, 45]. In these descriptions, depth plays the role of a continuous time variable, and, in the large-context regime (N), the evolution of token representations is encoded by a PDE for their empirical distribution. This viewpoint is closely connected to the literature on scaling laws, where the effect of various scaling exponents controlling the relative size of the network's hyperparameters (e.g., depth, width, context length) on the effective dynamics of the model
Supplementary Material MixACM: Mixup-Based Robustness Transfer via Distillation of Activated Channel Maps
Specifically, robustness with only ACM loss is 48.38%, the addition of soft-labels improves it to 49.53%, the addition of mixup improves it to 52.29%, and the addition of both of these components make final robustness to 56.65%. Also, note that only soft labels are not enough to transfer robustness in this case, as shown by KDOnly column. This is in line with the observations of Goldblum et al. [4]. A.4.2 Role of Intermediate Features To understand the role of low, mid, and high-level features, we performed experiments on CIFAR-10 by progressively changing blocks used for distillation. For this ablation study, we kept all the standard settings reported in the Section A.1. Our correspondence of blocks and features is as follows: block 2: low-level features; block 3: mid-level features; block 4: high-level features. Please note that block 1 corresponds to the output of the first layer only. Therefore, we do not call it low-level features.
Optimistic Rates for Multi-Task Representation Learning
We study the problem of transfer learning via Multi-Task Representation Learning (MTRL), wherein multiple source tasks are used to learn a good common representation, and a predictor is trained on top of it for the target task. Under standard regularity assumptions on the loss function and task diversity, we provide new statistical rates on the excess risk of the target task, which demonstrate the benefit of representation learning. Importantly, our rates are optimistic, i.e., they interpolate between the standard O(m 1/2)rate and the fast O(m 1)rate, depending on the difficulty of the learning task, where m is the number of samples for the target task. Besides the main result, we make several new contributions, including giving optimistic rates for excess risk of source tasks (Multi-Task Learning (MTL)), a local Rademacher complexity theorem for MTRL and MTL, as well as a chain rule for local Rademacher complexity for composite predictor classes.
Optimistic Rates for Multi-Task Representation Learning
We study the problem of transfer learning via Multi-Task Representation Learning (MTRL), wherein multiple source tasks are used to learn a good common representation, and a predictor is trained on top of it for the target task. Under standard regularity assumptions on the loss function and task diversity, we provide new statistical rates on the excess risk of the target task, which demonstrate the benefit of representation learning. Importantly, our rates are optimistic, i.e., they interpolate between the standard O(m 1/2)rate and the fast O(m 1)rate, depending on the difficulty of the learning task, where m is the number of samples for the target task. Besides the main result, we make several new contributions, including giving optimistic rates for excess risk of source tasks (Multi-Task Learning (MTL)), a local Rademacher complexity theorem for MTRL and MTL, as well as a chain rule for local Rademacher complexity for composite predictor classes.
Neural Generalized Mixed-Effects Models
Slavutsky, Yuli, Salazar, Sebastian, Blei, David M.
Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear function of observed covariates and a latent group-specific random effect. Since exact marginalization over the random effects is typically intractable, model parameters are estimated by maximizing an approximate marginal likelihood. In this paper, we replace the linear function with neural networks. The result is a more flexible model, the neural generalized mixed-effects model (NGMM), which captures complex relationships between covariates and responses. To fit NGMM to data, we introduce an efficient optimization procedure that maximizes the approximate marginal likelihood and is differentiable with respect to network parameters. We show that the approximation error of our objective decays at a Gaussian-tail rate in a user-chosen parameter. On synthetic data, NGMM improves over GLMMs when covariate-response relationships are nonlinear, and on real-world datasets it outperforms prior methods. Finally, we analyze a large dataset of student proficiency to demonstrate how NGMM can be extended to more complex latent-variable models.