AITopics

Neural networks are known to be susceptible to over-reliance on spurious correlations. However, the precise mechanism by which models exploit shortcut features is not fully understood, and algorithms to mitigate this behavior rely on as yet unjustified assumptions about the learned representations. In this work, we provide the first end-to-end theoretical characterization of spurious feature learning for two-layer ReLU neural networks trained by online minibatch SGD on the logistic loss. We consider data drawn from the high-dimensional Boolean hypercube with a quadratic signal function (namely XOR) and a linear spurious correlation. We show that SGD learns the spurious feature first, and exponentially fast. Moreover, the optimization dynamics couple the spurious and signal features, with a stronger spurious component inhibiting signal feature learning. Our analysis reveals precise phase transitions in the learning dynamics. In the first phase, alignment between the signs of the spurious feature and second-layer weight drives rapid growth of the spurious feature. In the second phase, large majority group margin slows learning and the signal feature remains suppressed. When the spurious correlation is maximally strong, we show theoretically that the spurious feature dominates even at the sample complexity threshold where XOR would be learned in isolation (i.e., if the spurious feature were absent). In contrast, when the correlation strength is constant, we provide preliminary empirical evidence that the model can eventually learn the XOR signal, although the spurious feature is not forgotten.
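As a rough illustration of the data model described in this abstract, the following Python sketch draws points from the Boolean hypercube, labels them by the XOR (parity) of two coordinates, and plants one linearly correlated spurious coordinate. The coordinate indices, the `p_agree` parameter, and the function names are assumptions made for illustration; the paper's exact construction may differ.

```python
import random

def sample_xor_with_spurious(n, d, p_agree, seed=0):
    """Draw n points from {-1, +1}^d with an XOR label and one spurious coordinate.

    Hypothetical setup: the label is the product of coordinates 0 and 1 (XOR in
    the +/-1 encoding), and coordinate 2 is overwritten so that it agrees with
    the label with probability p_agree.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.choice((-1, 1)) for _ in range(d)]
        y = x[0] * x[1]
        x[2] = y if rng.random() < p_agree else -y
        data.append((x, y))
    return data

def linear_correlation(data, coord):
    """Empirical mean of x[coord] * y, i.e. the linear correlation with the label."""
    return sum(x[coord] * y for x, y in data) / len(data)

samples = sample_xor_with_spurious(n=5000, d=10, p_agree=0.9)
for coord in (0, 1, 2):
    print(coord, round(linear_correlation(samples, coord), 3))
```

In this toy setup the two signal coordinates each have close to zero linear correlation with the label, while the spurious coordinate has correlation near `2 * p_agree - 1`, which is one way to see why a linear shortcut is easier to pick up than the quadratic signal.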

artificial intelligence, deep learning, machine learning

2606.30444

Genre: Research Report (0.50)

Industry: Health & Medicine (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Multi-Source Transfer Learning of Sparse Single-Index Models

Tian, Ye

Transfer learning leverages knowledge from related source domains to improve learning in a target domain. Recent theoretical advances cover a broad range of regression settings within (generalized) linear models. Despite their diversity, these methods share two common constraints: they assume a known link function or linear structure and require direct access to raw source data. To move beyond these constraints, we propose a source-data-free transfer learning framework based on the single-index model (SIM). Instead of requiring raw source data, our method transfers only summary statistics derived from a generalized Stein's lemma in a one-time communication. This design preserves privacy and avoids side effects caused by dissimilarities of unknown nonlinear link functions across domains. To capture flexible, unknown nonlinearity, we employ a multilayer perceptron guided by the pre-estimated index from the transferred statistics, which significantly mitigates overfitting. Extensive experiments on synthetic data and a real-world application demonstrate consistent improvements over existing (generalized) linear model-based approaches. The proposed framework thus offers a practical, privacy-preserving, and nonlinear-adaptive solution for transfer learning.
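To make the idea of transferring summary statistics more concrete, here is a minimal Python sketch of index estimation in a single-index model. It uses the classical first-order Stein identity, $E[y\,x] = E[f'(\langle β, x\rangle)]\,β$ for standard Gaussian $x$, rather than the generalized lemma the abstract refers to, and the index `beta`, the `tanh` link, and all function names are assumptions made for illustration.

```python
import math
import random

def estimate_index(xs, ys):
    """Average y * x over the sample; under Stein's identity this is
    proportional to the index vector when x is standard Gaussian."""
    d = len(xs[0])
    n = len(xs)
    return [sum(x[j] * y for x, y in zip(xs, ys)) / n for j in range(d)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)
beta = [0.6, -0.8, 0.0, 0.0, 0.0]  # hypothetical sparse, unit-norm index
xs = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(20000)]
# The link (tanh here) is never used by the estimator, only by the simulator.
ys = [math.tanh(sum(b * xi for b, xi in zip(beta, x))) for x in xs]

beta_hat = estimate_index(xs, ys)
print(round(cosine(beta_hat, beta), 3))
```

The averaged vector `beta_hat` is a fixed-size summary of the sample, so in a sketch like this only that vector, and not the raw data, would need to be communicated.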

artificial intelligence, estimator, machine learning

2606.29658

Genre: Research Report (0.64)

Industry: Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)

Ziliaskopoulos, Konstantinos, Vinel, Alexander, Smith, Alice E.

Decision-Value Attribution in Predict-then-Optimize Systems

Predictive models are increasingly embedded in operational decision-making, yet standard explanation methods typically explain forecasts rather than the decisions those forecasts induce. This distinction is important in predict-then-optimize systems: large forecast changes may leave the optimizer's action unchanged, while small changes can alter the selected decision and its realized value. We propose Decision Value Attribution (DVA), a Shapley-based framework for attributing the value of a fixed prediction--optimization pipeline. The framework defines cooperative games whose payoff is the downstream decision value, allowing the players to be information sources, optimization or design parameters, or both. We present three variants: InfoDVA attributes value to features, DesignDVA attributes value to operational configurations, and Decision-Value Interactions (DVI) quantifies how information and design jointly create value. We further distinguish post-DVA, which evaluates decisions using realized outcomes, from pre-DVA, which evaluates decisions under the model's full prediction. This separation turns attribution into a decision-level diagnostic of whether the model's operational beliefs align with realized performance. The resulting attributions are expressed in the units of the operational objective and decompose the gain or loss relative to a baseline. Case studies in electricity storage arbitrage and emergency medical service coverage show that predictive explanations can be poor proxies for operational value, that DVA can guide targeted information-control interventions, and that optimization configurations determine when predictive information is decision-relevant.
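The gap between explaining a forecast and explaining a decision can be sketched with an exact Shapley computation on a toy pipeline. The code below is only an illustration under stated assumptions: the feature names, forecast shifts, threshold, and realized spread are hypothetical, and the payoff is a simple threshold decision rather than any of the paper's case studies.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous
            previous = current
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical toy pipeline: each available feature shifts a forecast price
# spread, and the optimizer acts only if the forecast exceeds a threshold.
FORECAST_SHIFT = {"weather": 8.0, "load": 3.0, "calendar": 1.0}
THRESHOLD = 10.0
REALIZED_SPREAD = 5.0

def decision_value(coalition):
    """Downstream value of the decision taken with only these features available."""
    forecast = sum(FORECAST_SHIFT[f] for f in coalition)
    act = forecast > THRESHOLD
    return REALIZED_SPREAD if act else 0.0

phi = shapley_values(list(FORECAST_SHIFT), decision_value)
print(phi)
```

In this toy game `weather` moves the forecast far more than `load`, yet both receive the same decision-value attribution because the action only flips when both are present, and `calendar` receives zero because it never changes the action. The attributions are in the units of the payoff and sum to the value of the full pipeline minus the empty baseline.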

artificial intelligence, modeling & simulation, total interaction

2606.29878

Country: North America > United States > Alabama (0.28)

Genre: Research Report (0.40)

Industry: Energy > Energy Storage (0.34)

Technology:

Information Technology > Artificial Intelligence (0.69)
Information Technology > Modeling & Simulation (0.54)
Information Technology > Data Science (0.54)
Information Technology > Information Management (0.54)

Varam, Dara, Alhajri, Mohamed I.

Not All Objectives Are Born Equal: Priority-Constrained Descent for Hierarchical Multi-Objective Optimization

Deep learning problems rarely involve objectives that are equal in importance. A primary objective defines the goal, whilst secondary objectives, such as sparsity, compression, or robustness, constrain the solution. While existing multi-objective methods have proven effective in practice, they have a clear symmetry problem and neglect the inherent objective hierarchy built into these objective spaces. We introduce Priority-Constrained Descent (PCD), a gradient-based optimization framework designed to explicitly exploit hierarchical objective structures. PCD preserves the direction of primary descent whilst allowing for the minimal distortion necessary to guarantee progress on secondary objectives, controlled by a single $τ\in [0, 1]$ that dictates the strength of the distortion. The resulting formulation is invariant to objective scaling and admits exact closed-form solutions for problems with two and three objectives. We evaluate PCD within structured network compression settings, unstructured sparsity and low-rankness, and across a variety of synthetic experiments, showing Pareto dominance and better per-objective performance with secondary progress guarantees over existing methods, further exhibiting the interpretable trade-off that $τ$ provides.

artificial intelligence, machine learning, objective

2606.29521

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

Shirodkar, Tejas Pradeep

A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a $G$-equivariant one: it conditions the optimizer's state in the orbit decomposition of a $G$-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient $\bar{Θ} = Θ/G$. The construction carries four architectural gauges (cross-entropy shift, ReLU and SwiGLU rescaling, LayerNorm and RMSNorm scale, and a per-head $O(d_{\rm head})$ attention rotation matched to RoPE), is proved exactly equivariant on an Adam base, and composes with a Muon base through a gauge-equivariant orthogonaliser. Respecting the symmetry changes both the minimum the optimizer reaches and what it leaves measurable there. On a language model trained past the point of fit, DDCAdam resists the over-training collapse AdamW falls into, holding a validation-train loss gap of 0.67 against 5.88, and reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7. A vision transformer trained from scratch reaches lower validation loss (1.71 against 2.12) while compressing spare feed-forward capacity a matched AdamW leaves intact. On a Muon base, where the rotation gauge composes exactly, DDCMuon groks ten of eleven seeds at depth 24 that a plain Muon never reaches. Built into the optimizer, a network's gauge symmetry sharpens the minimum it finds and turns that minimum's geometry into something the trajectory can measure.
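The ReLU rescaling symmetry mentioned in this abstract can be checked directly on a small network. The sketch below is not the DDC optimizer; it only demonstrates the invariance itself for a bias-free two-layer ReLU network, with illustrative weights and hypothetical function names.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def two_layer(x, hidden_weights, output_weights):
    """f(x) = sum_j a_j * relu(w_j . x) for a bias-free two-layer ReLU network."""
    return sum(
        a * relu(sum(w_i * x_i for w_i, x_i in zip(w, x)))
        for w, a in zip(hidden_weights, output_weights)
    )

def rescale_unit(hidden_weights, output_weights, unit, c):
    """Multiply one unit's incoming weights by c > 0 and its outgoing weight by 1/c."""
    assert c > 0.0
    new_hidden = [list(w) for w in hidden_weights]
    new_output = list(output_weights)
    new_hidden[unit] = [c * w_i for w_i in new_hidden[unit]]
    new_output[unit] = new_output[unit] / c
    return new_hidden, new_output

W = [[0.5, -1.0], [1.5, 2.0]]  # illustrative hidden-layer weights
a = [2.0, -0.5]                # illustrative output weights
x = [1.0, 0.25]

W_scaled, a_scaled = rescale_unit(W, a, unit=1, c=4.0)
print(two_layer(x, W, a), two_layer(x, W_scaled, a_scaled))
```

The two parameter settings differ but compute the same function, because `relu(c * z) = c * relu(z)` for any `c > 0`. Every point along such a rescaling orbit therefore has the same loss, which is the kind of direction the abstract describes a per-coordinate preconditioner as drifting along.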

artificial intelligence, gauge, machine learning

2606.29176

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)

Valsecchi, Davide, Donegà, Mauro, Wallny, Rainer

Factorizable Normalizing Flows for parameter-dependent density morphing

Normalizing Flows excel at modeling a single fixed density, yet many problems across the sciences, such as high energy physics, instead require modeling how that density deforms as a function of continuous parameters: the strength of a physical effect, a calibration constant, or a source of systematic uncertainty. Learning a separate flow for every parameter configuration quickly becomes intractable, since the number of joint settings grows exponentially with the number of parameters. We introduce Factorizable Normalizing Flows (FNFs), which represent the parameter-dependent density as a fixed, high-fidelity flow for a reference configuration composed with a learnable transformation that is polynomial in the parameters and factorized over them. This structure has a practical consequence: each parameter's effect is learned in isolation, from samples in which that parameter alone is varied. The combined response of many parameters is then recovered by summation at inference, without ever sampling their combinatorially large joint space. On a controlled problem with two interpretable deformations applied jointly to the data, the learned transformation reproduces the true deformations and matches the optimal likelihood, while optional interaction terms capture residual correlations when several parameters vary strongly at once. The resulting model is interpretable, scales linearly with the number of parameters, and keeps the likelihood tractable. This provides a general tool for any inference workflow requiring continuous density morphing, and directly enables the next generation of unbinned likelihood fits in high energy physics.

artificial intelligence, machine learning, normalizing flow

2606.30489

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

All you need is log

Balsubramani, Akshay

Comparing two probability distributions is a basic building block of statistics and machine learning, and the right family is well understood: the Rényi divergences of order $α\in[0,\infty]$ are the unique family monotone under data processing and additive on independent products. Many problems instead compare more than two distributions at once -- multi-population fairness, multi-prior PAC-Bayes bounds, multi-hypothesis testing -- and the right multi-distribution generalization of the Rényi family has been an open question. We characterize it. Every functional of $W$-tuples of distributions that is monotone under data processing and additive on independent products is a positive integral of multi-way coincidence divergences $C_α(π_1,\dots,π_W) := -\log\int π_1^{α_1}\cdotsπ_W^{α_W}$ (with $\sum_k α_k = 1$) over a parameter space with four strata: the simplex interior; mixed-sign exponent cones (the analogue of Rényi orders $>1$); a tropical boundary at infinity carrying max-divergences; and pairwise Kullback-Leibler edges at the simplex vertices. Each stratum is necessary -- the destination of an explicit data-processing-monotone, product-additive divergence the others cannot reproduce -- and each is a clean limit of simplex-interior atoms. The same family arises from several independent routes -- the structural axioms, Kolmogorov-Nagumo means with Rényi's entropy axiomatics, classical entropy characterizations, multi-hypothesis testing error exponents, and a multi-lottery betting interpretation -- structural evidence that this is the canonical multi-distribution Rényi calculus rather than an artefact of any one axiomatic input. The two-prior case recovers the standard Rényi result; a worked $W=3$ instance, numerical verification, and a conditional extension round out the treatment.
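For discrete distributions the coincidence divergence $C_α$ defined in this abstract reduces to a finite sum, which makes the two-distribution case easy to check numerically. The sketch below assumes strictly positive probability vectors on a common finite support; the function names are hypothetical, and the comparison uses the standard Rényi divergence $D_α(p\|q) = \frac{1}{α-1}\log\sum_i p_i^{α} q_i^{1-α}$, for which $C_{(α,\,1-α)}(p, q) = (1-α)\,D_α(p\|q)$.

```python
import math

def coincidence_divergence(dists, alphas):
    """C_alpha(pi_1, ..., pi_W) = -log sum_i prod_k pi_k[i] ** alpha_k.

    Assumes strictly positive probabilities on a shared finite support and
    exponents that sum to one.
    """
    assert len(dists) == len(alphas)
    assert abs(sum(alphas) - 1.0) < 1e-12
    total = 0.0
    for probs in zip(*dists):
        term = 1.0
        for p, a in zip(probs, alphas):
            term *= p ** a
        total += term
    return -math.log(total)

def renyi_divergence(p, q, alpha):
    """Standard Renyi divergence of order alpha (alpha != 1) between p and q."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
r = [0.1, 0.1, 0.8]

alpha = 0.3
print(coincidence_divergence([p, q], [alpha, 1.0 - alpha]))
print((1.0 - alpha) * renyi_divergence(p, q, alpha))
print(coincidence_divergence([p, q, r], [0.2, 0.3, 0.5]))
```

The first two printed values agree up to floating-point error, which is the $W=2$ reduction, and the third is a three-way value with exponents in the simplex interior, where the divergence is nonnegative by Hölder's inequality.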

artificial intelligence, divergence, machine learning

2606.27349

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Davis, Geoffrey, Renganathan, Ashwin

A Bayesian latent Gaussian process framework for aerodynamic uncertainty quantification

Predicting the aerodynamic performance (e.g. lift, drag, and moment coefficients) of an aircraft is challenging -- computational models are biased and direct simulations are prohibitive. A pragmatic way to overcome this limitation is by calibrating low-fidelity computational predictions with experimental measurements. This, however, requires calibrating against \emph{sparse} measurements contaminated with \emph{uncertainty} in both the control inputs and the measured aerodynamic response. We develop a methodology to address this problem based on Gaussian process surrogates and the classical Kennedy-O'Hagan calibration. A surrogate model learned on abundant-but-cheap low-fidelity data is calibrated with a sparse set of measurement data. Crucially, we develop a Bayesian latent Gaussian-process-based approach that marginalizes the calibrated surrogate model over the input uncertainty, while also matching the marginal mean and variance of the measured output uncertainty. Once calibrated, our surrogate model predicts the uncertainty in aerodynamic coefficients with very high accuracy, including at extrapolative input settings. We validate our calibrated surrogate model predictions against measurement data with \emph{true} uncertainty intervals to demonstrate that the model places $94.2$--$95.8\%$ of its predictive samples inside the released $95\%$ truth intervals, with endpoint cumulative probabilities very close to the nominal 0.025 and 0.975 levels.

artificial intelligence, calibration, machine learning

2606.28871

Country: North America > United States > Pennsylvania (0.50)

Genre: Research Report (0.82)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Berthier, Louis, Shokry, Ahmed, Moreaud, Maxime, Ramelet, Guillaume, Dieuleveut, Aymeric

Self-Organized Conformal Prediction: Reducing Regional Coverage Gaps with Unsupervised Group Discovery

Conformal prediction guarantees marginal coverage, but pooled calibration averages over heterogeneous regions and can mask regional undercoverage in safety-critical subgroups. We introduce Self-Organized Conformal Prediction (SOCP), a calibration scheme that discovers input-space groups with a Self-Organizing Map (SOM) and, at test time, draws a local calibration buffer from the query's best-matching unit (BMU) cell or a fixed grid neighborhood. The same retrieval rule applies to regression and classification tasks across tabular features and image embeddings, leaving the predictor and nonconformity score untouched. SOCP gives exact validity for BMU-cell retrieval and fixed retrieved-set validity for neighborhood buffers; central-cell validity for neighborhood retrieval holds up to a Kolmogorov-Smirnov (KS) bias term. A split-routed extension recovers fixed retrieved-set validity conditional on the routing split. On eight regression and classification benchmarks, SO-SCP reduces the weighted regional coverage gap on $7/8$ datasets (mean paired change $-7.1\%$) for a mean prediction-set size increase of $6.2\%$, with negligible overhead on the largest six datasets; SO-CQR yields smaller gains, since quantile regression already absorbs much of the heterogeneity. By learning groups directly from the input geometry, SOCP provides group-local calibration with exact fixed-group guarantees and approximate central-cell guarantees, without supervised partitions or predictor retraining.
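The group-local calibration idea can be sketched with ordinary split conformal prediction restricted to the calibration scores that fall in the query's cell. This is a simplified stand-in, not SOCP itself: a fixed list of one-dimensional centroids replaces the trained Self-Organizing Map, the nearest centroid plays the role of the best-matching unit, and all names and numbers are assumptions made for illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def nearest_cell(x, centroids):
    """Index of the closest centroid (stand-in for a SOM best-matching unit)."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def local_interval(x, prediction, calibration, centroids, alpha):
    """Prediction interval calibrated only on scores from the query's cell."""
    cell = nearest_cell(x, centroids)
    local_scores = [s for xc, s in calibration if nearest_cell(xc, centroids) == cell]
    q = conformal_quantile(local_scores, alpha)
    return prediction - q, prediction + q

# Hypothetical calibration set of (input, absolute residual) pairs: one region
# with small residuals and one with residuals ten times larger.
centroids = [0.0, 10.0]
low = [(0.5, s) for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)]
high = [(9.5, s) for s in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0)]
calibration = low + high

print(local_interval(0.3, 0.0, calibration, centroids, alpha=0.1))
print(local_interval(9.0, 0.0, calibration, centroids, alpha=0.1))
```

Because each query is calibrated against its own region, the interval is narrow where residuals are small and wide where they are large, whereas a single pooled threshold would be shared by both regions. The predictor and the nonconformity score are left unchanged; only the choice of calibration buffer differs.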

artificial intelligence, machine learning, threshold

2606.29403

Country: North America > United States (0.29)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Bidirectional Autoregressive Latent Diffusion for Forward and Inverse Magnetohydrodynamics

Scheinker, Alexander

artificial intelligence, bidirectional autoregressive latent diffusion, machine learning

This work presents a new bidirectional autoregressive latent diffusion approach for predicting the evolution of multiple fields (mass density, pressure, velocity, and magnetic field components) in magnetohydrodynamics. We show that this bidirectional flow can be used as a self-supervised consistency metric for uncertainty and error estimation, which enables the model to estimate test-time uncertainty and error without access to ground truth by comparing how closely flowing forwards and backwards in time returns to the same predicted fields. We also demonstrate this method's potential to serve as a non-invasive plasma diagnostic, and show how adaptive feedback can be used to make the model more robust based on sparse diagnostics or limited views/measurements.

2606.2962

Genre: Research Report (0.50)

Industry: Energy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)