AITopics

The Fisher Information Matrix (FIM) is the canonical local measure of the curvature of a statistical model's log-likelihood surface, and its dominant eigenvalue $λ_{\max}$ quantifies the worst-case sensitivity of the model's output distribution to infinitesimal parameter perturbations [1, 2]. The spectral properties of the FIM of neural networks have been studied directly in the random matrix theory literature. Pennington and Worah [4] derive the limiting spectral density of the FIM of a single-hidden-layer network in the high-dimensional asymptotic regime, building on the broader programme of analysing neural network Hessian and kernel spectra via random matrix methods [5, 6], with subsequent work extending these techniques to deeper architectures and non-asymptotic regimes [7, 8]. These results characterize the typical (bulk and edge) spectral behaviour of the FIM for a fixed network and a random or structured input ensemble. This paper studies a complementary question, posed as a perturbation problem rather than an asymptotic-spectrum problem: how does the dominant eigenvalue of a fixed, evaluated empirical FIM change under two specific structured perturbations of the underlying distribution? The first perturbation is a change in the conditioning input away from a reference (in-distribution) ensemble. The second is a structured additive perturbation of the model's own parameters by finite-precision quantization noise -- a perturbation of independent mathematical interest, since it falls outside the i.i.d.-input asymptotic regime treated in the random matrix literature cited above, and instead concerns a fixed network whose parameters, not its input distribution, are perturbed by a noise process with a specific, analytically tractable structure (Definition 4.1). To our knowledge, this parameter-perturbation question for the FIM's dominant eigenvalue, under either source of departure, has not been previously formalized.
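The quantity discussed in this abstract can be illustrated with a small sketch. It assumes the empirical FIM is the average outer product of per-sample score vectors (gradients of the log-likelihood), and it estimates the dominant eigenvalue by power iteration. The function names `empirical_fim` and `dominant_eigenvalue` and the toy score vectors are hypothetical and are not taken from the paper.

```python
import math


def empirical_fim(scores):
    """Average outer product of per-sample score vectors (assumed FIM estimator)."""
    n, d = len(scores), len(scores[0])
    fim = [[0.0] * d for _ in range(d)]
    for g in scores:
        for i in range(d):
            for j in range(d):
                fim[i][j] += g[i] * g[j] / n
    return fim


def dominant_eigenvalue(matrix, iters=500, tol=1e-12):
    """Power iteration on a symmetric PSD matrix; returns the Rayleigh quotient.

    The uniform start vector must not be orthogonal to the top eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # zero matrix (or start vector in the null space)
        v = [x / norm for x in w]
        mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        new_lam = sum(v[i] * mv[i] for i in range(d))
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


if __name__ == "__main__":
    # Hypothetical score vectors for two samples of a two-parameter model.
    toy_scores = [[2.0, 0.0], [0.0, 1.0]]
    print(dominant_eigenvalue(empirical_fim(toy_scores)))
```

To probe a perturbation in this setting, one would recompute the score vectors at the perturbed input or at the quantized parameters and compare the two dominant eigenvalues; how the paper bounds that change is not shown here.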

artificial intelligence, machine learning, perturbation, (18 more...)

2606.28432

Country: Asia > Azerbaijan (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

On Local Population-Risk Certificates

Song, Mingzhi

We develop finite-sample certificates for local population-risk increments $Pδ_v=R(θ_0+v)-R(θ_0)$, $v\in\mathcal D$. The primitive object is an expected-valid upper endpoint $\widehat{\mathsf U}_{\mathcal D}$ satisfying $\mathbb E\sup_{v\in\mathcal D} \{Pδ_v-\widehat{\mathsf U}_{\mathcal D}(v)\}\le0$. This uniform criterion certifies any measurable update selected from the same sample and allows penalties to depend on empirical geometry. The main construction is a cross-fitted ridge calibration for linear feature classes. A pilot fold learns the ridge metric, the complementary fold calibrates the squared mean error in that metric, and complete split averaging recovers the full empirical covariance in the directional quadratic form $\widehat q_{X,λ}$. The optimized diagnostic scale is $\{\widehat q_{X,λ}(h) \widehat r_{X,n_{\rm p},λ}^{\rm cf}/n\}^{1/2}$, and the calibrated trace factor $\widehat r_{X,n_{\rm p},λ}^{\rm cf}$ is compared with the ordinary ridge effective dimension $\widehat r_{X,λ}$. For nonsmooth losses, an exact fixed-mask decomposition $δ_v=J_v^0+R_v^\circ+C_v$ separates frozen Taylor fluctuations, good-path remainders, and interface crossings. Applying the linear and composite certificates componentwise yields endpoints for same-sample expected local search and concentrated release rules.

artificial intelligence, certificate, machine learning, (18 more...)

2606.19147

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)

Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

Su, Ziwei, Ren, Junyu, Veitch, Victor

Contrastive embedding models trained with scale-invariant losses are typically paired with distance metrics like cosine similarity, effectively ignoring embedding magnitudes. Surprisingly, empirical studies reveal that these "discarded" norms nonetheless correlate with semantic properties such as concept specificity, token frequency, and human uncertainty. In this work, we provide a formal theoretical framework explaining this phenomenon. By analyzing the optimization dynamics, we derive an analytic formula demonstrating that embedding length naturally encodes this information as a byproduct of the training process. We also show how this gives rise to signals that can serve as "free" calibration tools in specific models and retrieval tasks, providing a grounded explanation for a previously heuristic observation.

large language model, machine learning, natural language, (18 more...)

2606.30625

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

Full Conformal Prediction under Stochastic Non-Conformity Measure

Sornwanee, Thanawat

The theory of full conformal prediction assumes a deterministic non-conformity measure, but modern uses of full conformal prediction often rely on machine learning training, which makes stochasticity inevitable. A simple sufficient condition, almost sure permutation invariance of the non-conformity measure, can be too restrictive, so many have suggested relaxing it to permutation invariance in distribution as a condition for the validity of full conformal prediction. We show, however, that this commonly cited condition is actually insufficient. We then provide a correct sufficient condition, Conditional Independence & Permutation Invariance in Distribution, which encompasses several stochastic settings that may be used in machine learning.
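For orientation, the deterministic baseline that this abstract starts from can be sketched as follows. The example uses a deliberately simple, permutation-invariant non-conformity score (absolute deviation from the mean of the augmented sample). That score and the function names are illustrative assumptions; the stochastic setting and the corrected sufficient condition studied in the paper are not reproduced here.

```python
def full_conformal_pvalue(ys, candidate):
    """Conformal p-value of `candidate` given observed values `ys`.

    The score is computed on the augmented sample, so it treats the
    candidate symmetrically with the observed points.
    """
    augmented = list(ys) + [candidate]
    mean = sum(augmented) / len(augmented)
    scores = [abs(y - mean) for y in augmented]
    test_score = scores[-1]
    # Fraction of points at least as non-conforming as the candidate.
    return sum(1 for s in scores if s >= test_score) / len(scores)


def full_conformal_set(ys, grid, alpha):
    """Keep every candidate on `grid` whose p-value exceeds `alpha`."""
    return [y for y in grid if full_conformal_pvalue(ys, y) > alpha]


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
    grid = [k * 0.5 for k in range(-10, 21)]
    print(full_conformal_set(observed, grid, alpha=0.2))
```

Because the score is recomputed for every candidate, a non-conformity measure obtained by retraining a randomized learner would be redrawn at each call, which is the situation the abstract is concerned with.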

artificial intelligence, conformal prediction, machine learning, (13 more...)

2606.2873

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models

Sahoo, Subramanyam, Chadha, Aman, Jain, Vinija, Chaudhary, Divya

Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy under Direct Preference Optimisation (DPO) with three levels of conservatism ($β\in \{β_{\mathrm{lo}}, β_{\mathrm{mid}}, β_{\mathrm{hi}}\}$ derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble (3 $\times$ Qwen3-1.7B) while measuring true performance on GSM8K exact-answer accuracy. We find that higher offline conservatism monotonically increases reward-hacking damage, measured by the Goodhart gap and its area under the curve (AUGC), with Spearman $ρ = 1.0$ across all three conditions. Mechanistic analysis reveals a three-link causal chain: (i) high-$β$ DPO compresses policy entropy, (ii) low-entropy policies generate responses with reduced diversity, concentrating in a narrow region of the reward model's training distribution (lower pairwise cosine distance), and (iii) despite this proximity, ensemble disagreement (epistemic uncertainty) increases with $β$ and is exploited faster during online optimisation. We further fit a power-law curve to the $(β, \mathrm{AUGC})$ data and identify a practical optimal conservatism level $β^{\star}$ that balances alignment fidelity against hacking vulnerability. Our results suggest that the field needs calibrated, not maximal, conservatism.

artificial intelligence, augc, machine learning, (15 more...)

2606.30627

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Learning heterogeneous treatment effects under principal stratification

Tong, Jiaqi, Li, Fan

Principal stratification provides a foundational framework for causal inference with intermediate outcomes by defining causal effects within subpopulations, yet existing work has largely focused on average effects across strata rather than treatment effect heterogeneity within strata. Such within-stratum heterogeneity informs individualized treatment decisions but the associated methods are sparse. We address this gap by studying the identification and estimation of the conditional principal causal effects under principal ignorability combined with an odds ratio sensitivity parameterization, which relaxes the monotonicity assumption. To efficiently learn these estimands, we propose a novel doubly cross-fit doubly robust machine learner that resolves the nested nuisance structure inherent to principal stratification. Leveraging sequential orthogonal learning with regularized least-squares sieves, we derive $\mathcal{L}^2$ and uniform limit theory, establish oracle efficiency, and construct uniform confidence bands for the proposed estimator. We use simulations to demonstrate the finite-sample performance of our estimator, and provide an empirical analysis of a randomized trial in acute lung injury, revealing informative patterns of treatment effect heterogeneity within the always-survivor subpopulation.

artificial intelligence, dd0d1, machine learning, (19 more...)

2606.29076

Genre: Research Report (0.90)

Industry: Health & Medicine (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Variance Reduction for Stochastic Gradient Generalized Non-reversible Langevin Monte Carlo Algorithms

Ni, Bingye, Wang, Xiaoyu, Wang, Yingli, Zhu, Lingjiong

We study the leading-order fluctuation of stochastic gradient Euler-Maruyama estimators for generalized non-reversible Langevin dynamics. Under structural assumptions tailored to the small-stepsize central limit theorem and under an unbiased stochastic gradient oracle, we prove that the empirical average over a horizon of order the inverse squared stepsize satisfies a central limit theorem in the vanishing-stepsize regime. The limiting variance is characterized through the Poisson equation of the limiting full-gradient diffusion. We then rewrite this constant in an operator form that links it to the continuous-time asymptotic variance and, under standard operator-theoretic assumptions, derive a sufficient condition under which an anti-symmetric perturbation strictly reduces the leading-order fluctuation constant relative to the reversible baseline. We also identify bounded smooth predictive observables that are directly covered by the main theorem. As a separate Gaussian calculation beyond the bounded-test-function regime, we obtain closed-form formulas for quadratic Hamiltonians and linear observables. The framework covers non-reversible Langevin dynamics and augmented-state examples, including Hessian-free high-resolution dynamics and a positive-definite subclass of gradient-adjusted underdamped Langevin dynamics that allow stochastic gradients. Numerical experiments on basic examples, on Bayesian linear regression using synthetic data, and on Bayesian logistic regression using real data support the predicted Gaussian fluctuations and show that the non-reversible schemes consistently reduce the root mean squared error (RMSE) relative to their reversible baselines.
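A minimal sketch of the kind of estimator described above, under stated assumptions: it uses the standard non-reversible overdamped Langevin form $dX_t = -(I + J)\nabla U(X_t)\,dt + \sqrt{2}\,dW_t$ with an anti-symmetric matrix $J$, restricted to two dimensions, and an Euler-Maruyama discretization. The paper's generalized dynamics may be broader than this form. The names `nonreversible_drift` and `ergodic_average`, and the quadratic toy potential, are hypothetical.

```python
import math
import random


def nonreversible_drift(x, grad, a):
    """Drift -(I + J) grad U(x) in 2-D, with J = [[0, a], [-a, 0]]."""
    g = grad(x)
    return [-(g[0] + a * g[1]), -(g[1] - a * g[0])]


def ergodic_average(grad, a, step, n_steps, x0, rng):
    """Euler-Maruyama time average of the state (a linear observable).

    `grad` may be a full gradient or an unbiased noisy gradient oracle.
    Setting a = 0 recovers the reversible baseline.
    """
    x = list(x0)
    total = [0.0, 0.0]
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        b = nonreversible_drift(x, grad, a)
        x = [x[i] + step * b[i] + noise_scale * rng.gauss(0.0, 1.0)
             for i in range(2)]
        total = [total[i] + x[i] for i in range(2)]
    return [t / n_steps for t in total]


if __name__ == "__main__":
    def quadratic_grad(x):
        # Gradient of U(x) = |x|^2 / 2, whose target is a standard Gaussian.
        return [x[0], x[1]]

    rng = random.Random(0)
    print(ergodic_average(quadratic_grad, a=1.0, step=0.01,
                          n_steps=20000, x0=[0.0, 0.0], rng=rng))
```

Running the same call with `a=0.0` over many seeds and comparing the spread of the averages is one rough way to observe the variance comparison the abstract describes; the sketch itself makes no claim about its size.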

artificial intelligence, assumption 2, machine learning, (16 more...)

2606.28808

Country:

Asia > China (0.46)
North America > United States (0.45)

Genre:

Research Report > New Finding (0.34)
Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Generalization Analysis of Transformers in Distribution Regression

Liu, Peilin, Zhou, Ding-Xuan

In recent years, models based on the Transformer architecture have seen widespread application and have become one of the core tools in the field of deep learning. Numerous successful techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed around these applications to further enhance performance. However, the success of these strategies has lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind Transformers and related techniques, we first propose a Transformer learning framework motivated by distribution regression, in which distributions are the inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called the attention operator. We demonstrate that, through the attention operator, Transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, Transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Building on these theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.

attention operator, large language model, machine learning, (18 more...)

doi: 10.1162/neco_a_01726

2606.29256

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

TimeLAVA: Learning-Agnostic Valuation for Time Series Data

Liu, Wenqin, Quan, Weizhi, Zuo, Aoqi, Gao, Erdun, Nguyen, Vu, Sejdinovic, Dino, Bondell, Howard, Gong, Mingming

Data valuation quantifies the intrinsic quality of individual samples to enable principled data curation, quality control, and robust learning. For time series in critical domains such as healthcare, finance, and industrial monitoring, effective valuation methods are essential yet fundamentally lacking. Existing approaches are either model-dependent, limiting their generalizability, or designed for i.i.d. data and thus fail to capture temporal dependencies, multi-scale patterns, and non-stationary dynamics inherent to sequential data. We introduce TimeLAVA, a learning-agnostic framework that values temporal segments by their marginal contribution to minimizing distributional discrepancy between evaluated and reference data. At its core is a novel Selective Wavelet-based Wasserstein discrepancy combining multi-scale wavelet transforms for temporal localization with unbalanced optimal transport for robustness to distributional shifts. Segment values are efficiently computed via sensitivity analysis without requiring model training and aggregated into point-wise scores. We provide theoretical guarantees linking valuation to model-agnostic generalization and prove bounded sensitivity to outlier contamination. Extensive experiments across anomaly detection, data pruning, and label noise detection demonstrate that TimeLAVA produces significantly more informative value scores than existing methods on diverse real-world datasets.

data mining, learning-agnostic valuation, machine learning, (17 more...)

2606.18729

Country:

North America (0.28)
Asia (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.70)
Information Technology > Data Science > Data Quality > Data Transformation (0.67)

Extrapolating from Regularised Solutions for Solving Ill-Conditioned Linear Systems in Machine Learning

Hegde, Disha, Cockayne, Jon, Oates, Chris J.

Rapid prototyping of algorithms is a critical step in modern machine learning. Most algorithms exploit linear algebra, creating a need for lightweight numerical routines which -- while potentially sub-optimal for the task at hand -- can be rapidly implemented. For the numerical solution of ill-conditioned linear systems of equations, the standard solution for prototyping is Tikhonov-regularised inversion using a nugget. However, selecting the size of the nugget is often difficult, and the use of data-adaptive procedures precludes automatic differentiation, introducing instabilities into end-to-end training. Further, while data-adaptive procedures perform multiple linear solves to select the size of the nugget, only the result of one such solve is returned, which we argue is wasteful. This paper aims to circumvent the above difficulties, presenting autonugget, a Python package for automatic and stable numerical solution of linear systems, suitable for rapid prototyping and fully compatible with automatic differentiation using JAX. autonugget combines multiple linear solves using Richardson extrapolation to determine the solution of the ill-conditioned system, improving in accuracy over approximations based on a single nugget.
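The idea of combining several nugget-regularised solves can be illustrated with a two-point Richardson step. This is a sketch under stated assumptions and is not the autonugget API, whose function names and extrapolation order are not given in the abstract. For an invertible matrix $A$ and a nugget $ε$ small relative to its smallest eigenvalue, $x(ε) = (A + εI)^{-1} b = x - εA^{-1}x + O(ε^2)$, so $2x(ε/2) - x(ε)$ cancels the first-order error. All names below are hypothetical, and the 2x2 system is a well-conditioned toy chosen so the arithmetic is easy to follow.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def tikhonov_solve(a, b, nugget):
    """Solve (A + nugget * I) x = b."""
    shifted = [[a[0][0] + nugget, a[0][1]],
               [a[1][0], a[1][1] + nugget]]
    return solve_2x2(shifted, b)


def richardson_solve(a, b, nugget):
    """Two-point Richardson extrapolation of the nugget towards zero."""
    coarse = tikhonov_solve(a, b, nugget)
    fine = tikhonov_solve(a, b, nugget / 2.0)
    return [2.0 * f - c for f, c in zip(fine, coarse)]


def max_abs_error(x, exact):
    return max(abs(xi - ei) for xi, ei in zip(x, exact))


if __name__ == "__main__":
    a = [[2.0, 1.0], [1.0, 2.0]]
    b = [1.0, 2.0]
    exact = [0.0, 1.0]  # A^{-1} b for this toy system
    print(max_abs_error(tikhonov_solve(a, b, 0.05), exact))
    print(max_abs_error(richardson_solve(a, b, 0.1), exact))
```

Both solves contribute to the returned value, which reflects the abstract's point that discarding all but one regularised solve is wasteful. For a genuinely ill-conditioned system the expansion above holds only for very small nuggets, so this toy should not be read as evidence about the package's behaviour.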

artificial intelligence, learning research, machine learning, (14 more...)

2606.30328

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.34)