AITopics | proposition 6

Homogenization of $\ell_2$-Adversarial Training in High-Dimensions: Exact Dynamics under Stochastic Gradient Descent

Sabelli, Fabrizzio

arXiv.org Machine Learning, Jul-2-2026

We develop a framework for analyzing the learning dynamics of $\ell_2$-adversarial training of single-index models on Gaussian mixtures in the high-dimensional limit under streaming stochastic gradient descent (SGD). We derive deterministic equivalents for a broad class of statistics of the SGD iterates, including the adversarial risk and distance to adversarial optimality, in terms of the solution to a system of ODEs. We use them to study two idealized learning rate schedules: the Polyak stepsize and exact line search. In the case of $\ell_2$-adversarial least squares with a single class, we show that, unlike noiseless standard least squares, no constant learning rate guarantees monotone descent of SGD towards a minimizer of the adversarial risk. We identify anisotropic covariance and a mismatch in ridge parameters as the main sources of suboptimality of exact line search relative to the Polyak stepsize. We also introduce a stochastic differential equation (SDE), called adversarial homogenized SGD, that captures the evolution of statistics of the iterates of SGD. For $\ell_2$-adversarial least squares, using this SDE, we show the evolution of the risk is equivalent, up to dimension-free constants, to that of SGD on standard least squares with an adaptive learning rate and adaptive $\ell_2$-regularization. When the dynamics converge, the limiting adversarial risk and SGD iterate are determined by a fixed-point equation, with the limiting iterate being equivalent to the solution of a ridge regression problem whose regularization parameter is the limiting effective regularization of SGD.
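As a rough illustration of the single-class $\ell_2$-adversarial least squares setting mentioned in the abstract, the sketch below uses the standard closed form for a linear predictor: the worst case of $(\langle \theta, x+\delta\rangle - y)^2$ over $\|\delta\|_2 \le \varepsilon$ equals $(|\langle \theta, x\rangle - y| + \varepsilon\|\theta\|_2)^2$. The function names, the isotropic Gaussian inputs, the noiseless labels and the constant learning rate are all assumptions made for this example; they are not the paper's setup (which uses Gaussian mixtures and single-index models) or its algorithm.

```python
import math
import random

def adv_loss(theta, x, y, eps):
    """Worst-case squared error over perturbations with ||delta||_2 <= eps."""
    r = sum(t * xi for t, xi in zip(theta, x)) - y
    norm = math.sqrt(sum(t * t for t in theta))
    return (abs(r) + eps * norm) ** 2

def adv_grad(theta, x, y, eps):
    """(Sub)gradient of adv_loss with respect to theta."""
    r = sum(t * xi for t, xi in zip(theta, x)) - y
    norm = math.sqrt(sum(t * t for t in theta))
    s = 1.0 if r >= 0 else -1.0
    scale = 2.0 * (abs(r) + eps * norm)
    # d|r|/dtheta = s * x ; d||theta||/dtheta = theta / ||theta|| (taken as 0 at the origin)
    return [scale * (s * xi + (eps * t / norm if norm > 0 else 0.0))
            for xi, t in zip(x, theta)]

def streaming_sgd(theta_star, eps, lr, steps, seed=0):
    """One-pass SGD: every step draws a fresh sample (hypothetical data model)."""
    rng = random.Random(seed)
    d = len(theta_star)
    theta = [0.0] * d
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
        y = sum(t * xi for t, xi in zip(theta_star, x))  # noiseless label (assumption)
        g = adv_grad(theta, x, y, eps)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta
```

Tracking `adv_loss` along the iterates returned by a loop like `streaming_sgd` gives a finite-dimensional counterpart of the risk curves that the abstract describes through ODEs.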

artificial intelligence, def, machine learning, (17 more...)

arXiv.org Machine Learning

2607.00207

Country:

North America > United States (0.45)
North America > Canada (0.27)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


Testing hypotheses via orthogonalization

Dharamshi, Ameer, Zou, Runjia, Witten, Daniela

arXiv.org Machine Learning, Jun-30-2026

Classical hypothesis testing frameworks break down in contemporary settings in which null hypotheses are increasingly abstract, the same data are used to both generate and test hypotheses, and minimal assumptions about the underlying data are made. In this work, we propose a new framework for conducting valid hypothesis tests in broad contexts. We propose to add and subtract external noise generated from a symmetric shift-family to our data, $X$, to partition it into two pieces, $X^{(1)}$ and $X^{(2)}$. We provide a generic strategy for orthogonalizing $X^{(2)}$ against $X^{(1)}$ under the null hypothesis $H_0$, then show that testing whether the orthogonalization was successful provides a valid test of $H_0$ under mild assumptions. Remarkably, this framework extends naturally to the post-selection inference setting: we simply select a hypothesis on $X^{(1)}$, then perform orthogonalization under the selected null. As our approach neither requires pre-specification of the selection mechanism, nor is restricted to a small class of data-generating distributions, it dramatically expands the settings for which valid post-selection inference can be conducted. We showcase the flexibility of our proposal in several case studies involving challenging pre-specified null hypotheses and post-selection inference scenarios.
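The splitting step described in the abstract (adding and subtracting external noise to obtain $X^{(1)}$ and $X^{(2)}$) can be sketched as follows. Gaussian noise is used here as one assumed example of a symmetric noise family, and the helper names are hypothetical. In the Gaussian case, if the noise variance matches the data variance, then $X + W$ and $X - W$ are uncorrelated and hence independent; the orthogonalization step from the paper is not reproduced here.

```python
import math
import random

def split_by_noise(xs, noise_sd, seed=0):
    """Return (x1, x2) with x1 = x + w and x2 = x - w for fresh Gaussian noise w."""
    rng = random.Random(seed)
    x1, x2 = [], []
    for x in xs:
        w = rng.gauss(0.0, noise_sd)  # external noise, drawn independently of x
        x1.append(x + w)
        x2.append(x - w)
    return x1, x2

def correlation(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)
```

A hypothesis could then be selected using only `x1` and tested using `x2`, which is the post-selection use the abstract points to.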

artificial intelligence, hypothesis, machine learning, (18 more...)

arXiv.org Machine Learning

2606.29732

Country: North America > United States (0.67)

Genre:

Workflow (1.00)
Research Report (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)


Estimation of the sub-Gaussian parameter

Liu, Jason, Xu, Min, Xing, Jinchuan

arXiv.org Machine Learning, Jun-5-2026

The sub-Gaussian parameter (also called the variance proxy) of a mean-zero random variable $X$ is defined as $\xi^2_* = \sup_{\lambda\in \mathbb{R}} L(\lambda)$, where $L(\lambda) = \frac{2}{\lambda^2} \log \mathbb{E} e^{\lambda X}$ is a weighted cumulant generating function. Despite the ubiquity of sub-Gaussian random variables, the estimation of $\xi^2_*$ has received little attention and is not yet well understood. In this work, we study a natural estimator of $\xi^2_*$ based on constrained maximization of the empirical analogue of $L$. We prove that the estimator is consistent and bound the rates of convergence under assumptions on $L$: if $L$ has a maximizer, then our bound is $O_p(n^{-1/2 + \varepsilon})$ for any $\varepsilon > 0$; if the argmax of $L$ is also bounded, then the bound improves to $O_p(n^{-1/2})$. We show that our assumptions on $L$ are necessary by proving that the minimax risk over all sub-Gaussian distributions is $\Omega(1)$; imposing increasingly strong assumptions on the tail growth of $L$ yields a continuum of classes whose minimax lower bound interpolates between $\Omega(1/\log n)$ and $\Omega(1)$. A root-$n$ rate is possible if we restrict to a subclass of distributions where $L$ attains its supremum in a bounded region, in which case our estimator is minimax optimal. If the underlying distribution is not sub-Gaussian, we show that our estimator goes to infinity with a divergence rate controlled by the tail of the distribution. Finally, we apply our estimator in a Gene Ontology (GO) enrichment study to construct p-values for a large-scale permutation test, showing that it can serve as a reliable alternative to the peaks-over-threshold approach, particularly in regimes where the peaks-over-threshold method is of uncertain validity.
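To make the definition concrete, here is a minimal sketch of an empirical analogue of $L$ and its maximization. The abstract does not specify the constraint set, so the box $[-\lambda_{\max}, \lambda_{\max}]$, the grid search and the function names below are assumptions made for illustration, not the authors' estimator. The sample is assumed to be mean-zero.

```python
import math

def empirical_L(xs, lam):
    """L_n(lam) = (2 / lam^2) * log(mean(exp(lam * x))), using a stable log-mean-exp."""
    zs = [lam * x for x in xs]
    m = max(zs)
    log_mean_exp = m + math.log(sum(math.exp(z - m) for z in zs) / len(zs))
    return 2.0 * log_mean_exp / (lam * lam)

def estimate_variance_proxy(xs, lam_max, num=400):
    """Maximise L_n over a symmetric grid in [-lam_max, lam_max], excluding lam = 0."""
    grid = [lam_max * k / num for k in range(1, num + 1)]
    return max(empirical_L(xs, sign * lam) for lam in grid for sign in (1.0, -1.0))
```

For a balanced sample of $\pm 1$ values the empirical $L$ is $2\log\cosh(\lambda)/\lambda^2$, which never exceeds 1 and approaches 1 as $\lambda \to 0$, so the grid maximum sits just below the Rademacher variance proxy of 1.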

artificial intelligence, inequality, logn, (17 more...)

arXiv.org Machine Learning

2606.06384

Country:

Europe > United Kingdom > England (0.28)
North America > United States (0.28)

Genre: Research Report > Experimental Study (0.35)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.54)


Deterministic Envelopes for Tamed SGLD: Decoupling Stochastic-Gradient Noise and Localizing Taming

Zhou, Yiwei, Chen, Ziheng

arXiv.org Machine Learning, Jun-5-2026

Stochastic-gradient Langevin algorithms often use tamed denominators to stabilize non-globally Lipschitz drifts. This paper shows that when the denominator depends on the same stochastic-gradient realization as the numerator, the taming step changes the stochastic oracle itself and can create a stationary bias even if the original stochastic gradient is unbiased. We propose a structure-preserving framework for designing tamed denominators. It fixes the denominator before the oracle noise is sampled and uses localized deterministic envelopes to avoid unnecessary taming in typical regions. These kernels keep the stabilizing effect of taming while avoiding the bias introduced by a gradient-dependent denominator. Our theory explains how the stationary error splits into the bias caused by oracle-dependent taming and the remaining error introduced by deterministic stabilization. Within this deterministic-envelope family, the analysis identifies a far-tail condition that explains the limitation of local soft envelopes and motivates a hybrid member: soft in the typical region, but protected by hard-tail control on rare excursions. Experiments confirm the predicted stationary distortions of random denominators, the bias reduction of deterministic-envelope designs, and the stabilizing effect of the hybrid construction.
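The bias mechanism described in the abstract can be seen in a one-dimensional toy calculation. The sketch below assumes the common taming form $g / (1 + \eta|g|)$ and a two-point noise model; the fixed denominator built from the mean gradient is a hypothetical stand-in for a deterministic envelope, chosen only because it does not depend on the sampled noise. None of these choices is taken from the paper.

```python
def tamed_drift_random(g, eta):
    """Taming whose denominator uses the same noisy gradient g as the numerator."""
    return g / (1.0 + eta * abs(g))

def tamed_drift_fixed(g, denom):
    """Taming with a denominator that was fixed before the gradient noise was drawn."""
    return g / denom

def expected_drifts(mean_grad, noise, eta):
    """Exact expectations when g = mean_grad +/- noise, each with probability 1/2."""
    samples = [mean_grad + noise, mean_grad - noise]
    random_den = sum(tamed_drift_random(g, eta) for g in samples) / 2.0
    denom = 1.0 + eta * abs(mean_grad)  # hypothetical noise-independent denominator
    fixed_den = sum(tamed_drift_fixed(g, denom) for g in samples) / 2.0
    return random_den, fixed_den
```

With mean gradient 1, noise of size 3 and $\eta = 1$, the gradient-dependent denominator gives an expected drift of $1/15$, while the fixed denominator gives $1/2$, which is exactly the mean gradient divided by the denominator. The fixed version therefore keeps the stochastic gradient unbiased up to a deterministic rescaling.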

artificial intelligence, envelope, machine learning, (15 more...)

arXiv.org Machine Learning

2606.05242

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


Fast Spawn&Prune (FS&P): Global convergence of stochastic conic particle gradient descent via birth/death process

De Castro, Yohann, Gadat, Sébastien, Marteau, Clément

arXiv.org Machine Learning, May-20-2026

We investigate the global optimization of the objective function arising in continuous sparse regression, specifically the Beurling LASSO (BLASSO), over the space of measures. While Conic Particle Gradient Descent (CPGD) methods are computationally efficient, they may become trapped in local minima due to the non-convexity of the parameterization. To overcome this limitation, we introduce Fast Spawn&Prune (FS&P), a stochastic algorithm that extends FastPart, introduced in De Castro et al. (2025), and combines CPGD with a birth-death process. The birth mechanism ensures asymptotic global exploration by introducing particles in regions where first-order optimality conditions are violated, while the death process preserves computational efficiency by pruning non-informative particles. We provide the first theoretical guarantee of global convergence for this class of discrete-time stochastic algorithms, without requiring exponentially large initializations. Furthermore, we derive explicit convergence rates for the excess risk, which scale as $\mathcal{O}\big(\left(\log K / K\right)^{\frac{1}{2(2+d)}}\big)$, where $K$ denotes the number of iterations and $d$ the dimension of the domain, thereby quantifying the trade-off between global exploration and local refinement. Moreover, the sample complexity is $\mathcal{O}\big(N^{-\frac{1}{4(2+d)}}\big)$ (up to logarithmic factors). We also propose a horizon-free variant that does not require prior knowledge of the iteration budget.

artificial intelligence, assumption, machine learning, (18 more...)

arXiv.org Machine Learning

2605.19784

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.34)


Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Wang, Yikai, Liu, Shang, Blanchet, Jose

arXiv.org Machine Learning, May-19-2026

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$-ground-cost Wasserstein ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.

machine learning, reinforcement learning, reward model, (19 more...)

arXiv.org Machine Learning

2605.00155

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


PRCD-MAP: Learning How Much to Trust Imperfect Priors in Causal Discovery

Shan, Xihang, Zhou, Da

arXiv.org Machine Learning, May-8-2026

External priors of unknown reliability create a brittle trade-off in causal discovery: blind trust amplifies errors, blind rejection wastes signal. Real priors are also heterogeneously reliable -- physical laws are trustworthy, LLM-suggested edges are speculative -- yet existing methods either ignore priors or impose them through globally uniform trust. We propose PRCD-MAP, a soft prior-consumption layer that assigns per-edge trust to an imperfect prior and uses it to modulate a prior-aware $\ell_1$ and prior-weighted $\ell_2$ regularizer in a MAP objective. Trust is calibrated by empirical Bayes on a Laplace-approximated marginal likelihood and propagated along the prior graph by an MLP, so data-confirmed neighborhoods boost trust and contradictions suppress it. PRCD-MAP enjoys a population-level safety guarantee: it is $\varepsilon$-safe in expectation over the prior-generation distribution, with $\varepsilon\leq C\cdot\mathrm{acc}(1{-}\mathrm{acc})\cdot d^2/T$ at the parametric $T^{-1}$ rate and vanishing at the prior-quality endpoints. When the prior is uninformative, learned trust provably collapses to its floor and the method recovers a no-prior baseline. Empirically, on real CausalTime data PRCD-MAP exploits informative LLM priors (LLM-prior gain $+0.067/+0.089$ AUROC on AQI/Medical over a no-prior PRCD-MAP backbone; combined backbone+prior lead $+0.123/+0.043$ over PCMCI+), auto-attenuates on the anonymous-variable Traffic stress test, and retains a lead at $d{=}300$; against BayesDAG, the closest soft-Bayesian baseline, PRCD-MAP wins on every CausalTime dataset under a matched $W_0$-only protocol. A four-way ablation isolates each component: EB calibration and MLP trust propagation jointly carry the plurality of the gain, with positive sign on every dataset. Extensions to nonlinear (NAM) and cross-sectional settings show the calibrated-trust principle is setting-agnostic.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2605.01669

Country: Asia > China (0.28)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Industry: Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)


Mind the Gap: Structure-Aware Consistency in Preference Learning

Mohri, Mehryar, Zhong, Yutao

arXiv.org Machine Learning, May-1-2026

Abstract: Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surro[...] However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous general[...] To resolve this, we formulate LLM alignment within a margin-shifted ranking framework. [...] bounds that depend on enforcing a separation margin γ. Crucially, we extend this to Structure-Aware H-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and [...] the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.

1. Introduction: The alignment of Large Language Models (LLMs) has shifted from explicit Reward Modeling (Stiennon et al., [...] surrogate loss (e.g., the logistic loss) as a proxy for the true objective: the non-convex, discontinuous 0-1 ranking loss. This reliance raises a fundamental theoretical question that remains largely unanswered for deep networks: Does minimizing these surrogate losses actually guarantee the minimization of the true ranking error? In this work, we investigate this question through the lens of H-consistency (Mao, Mohri, and Zhong, 2023e). We formulate LLM preference learning as a pairwise ranking problem and derive a series of results that bridge the gap between learning theory and practical fine-tuning. [...] we identify a fundamental theoretical deficiency in standard [...] We demonstrate that for equicontinuous hypothesis sets, a property satisfied by neural networks, standard surrogate minimization yields vacuous consistency guarantees. Specifically, without explicit constraints, a model can achieve arbitrarily low surrogate risk while maintaining a high ranking error, effectively "cheating" the objective by shrinking score differences rather than learning the correct ordering. We prove that enforcing a confidence gap γ is not merely a heuristic, but a strict requirement for H-consistency in the deep learning regime. However, while a uniform margin restores consistency, it is a blunt instrument. We show that demanding a large, fixed margin on semantically identical pairs (synonyms) forces the model to hallucinate differences where none exist, introducing bias and instability. To address this, we propose Structure-Aware H-consistency and a corresponding objective, Structure-Aware DPO (SA-DPO).

large language model, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2604.27733

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)


Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

Cirrincione, Giansalvo

arXiv.org Machine Learning, Apr-28-2026

A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2604.23681

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Parameter Tuning

Neural Information Processing Systems, Apr-27-2026, 09:50:56 GMT

If observations from the joint distribution of (A,Y,Z,W) are available in both stages, we can tune the regularization parameters λ1,λ2 using the approach proposed in Singh et al. [30], Xu et al. [35]. Let the complete data of stage 1 and stage 2 be denoted as (ai,yi,zi,wi) and ( ai, yi, zi, wi). Then, we can use the data not used in each stage to evaluate the out-of-sample performance of the other stage. A(2), ˆV(T),u(T) are the learned parameters by Algorithm 1. In this appendix, we prove propositions given in the main text. In the following, we assume that the spaces U, A, Z, W are separable and completely metrizable topological spaces equipped with Borel σ-algebras. In this section, we use the notation PA|Z=z to express the distribution of a random variable A given another variable Z = z.

artificial intelligence, log 2, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
