proposition 6
Fast Spawn\&Prune (FS\&P): Global convergence of stochastic conic particle gradient descent via birth/death process
De Castro, Yohann, Gadat, Sébastien, Marteau, Clément
We investigate the global optimization of the objective function arising in continuous sparse regression, specifically the Beurling LASSO (BLASSO), over the space of measures. While Conic Particle Gradient Descent (CPGD) methods are computationally efficient, they may become trapped in local minima due to the non-convexity of the parameterization. To overcome this limitation, we introduce Fast Spawn\&Prune (FS\&P), a stochastic algorithm that extends FastPart introduced in De Castro et al. (2025) and combines CPGD with a birth-death process. The birth mechanism ensures asymptotic global exploration by introducing particles in regions where first-order optimality conditions are violated, while the death process preserves computational efficiency by pruning non-informative particles. We provide the first theoretical guarantee of global convergence for this class of discrete-time stochastic algorithms, without requiring exponentially large initializations. Furthermore, we derive explicit convergence rates for the excess risk, which scale as $\mathcal{O}\big(\left(\log K / K\right)^{\frac{1}{2(2+d)}}\big)$, where $K$ denotes the number of iterations and d the dimension of the domain, thereby quantifying the trade-off between global exploration and local refinement. Moreover, the sample complexity is $\mathcal{O}\big(N^{-\frac{1}{4(2+d)}}\big)$ (up to logarithmic factors). We also propose a horizon-free variant that does not require prior knowledge of the iteration budget.
Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback
Wang, Yikai, Liu, Shang, Blanchet, Jose
Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$-ground-cost Wasserstein ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
PRCD-MAP: Learning How Much to Trust Imperfect Priors in Causal Discovery
External priors of unknown reliability create a brittle trade-off in causal discovery: blind trust amplifies errors, blind rejection wastes signal. Real priors are also heterogeneously reliable -- physical laws are trustworthy, LLM-suggested edges are speculative -- yet existing methods either ignore priors or impose them through globally uniform trust. We propose PRCD-MAP, a soft prior-consumption layer that assigns per-edge trust to an imperfect prior and uses it to modulate a prior-aware $\ell_1$ and prior-weighted $\ell_2$ regularizer in a MAP objective. Trust is calibrated by empirical Bayes on a Laplace-approximated marginal likelihood and propagated along the prior graph by an MLP, so data-confirmed neighborhoods boost trust and contradictions suppress it. PRCD-MAP enjoys a population-level safety guarantee: it is $\varepsilon$-safe in expectation over the prior-generation distribution, with $\varepsilon\leq C\cdot\mathrm{acc}(1{-}\mathrm{acc})\cdot d^2/T$ at the parametric $T^{-1}$ rate and vanishing at the prior-quality endpoints. When the prior is uninformative, learned trust provably collapses to its floor and the method recovers a no-prior baseline. Empirically, on real CausalTime data PRCD-MAP exploits informative LLM priors (LLM-prior gain $+0.067/+0.089$ AUROC on AQI/Medical over a no-prior PRCD-MAP backbone; combined backbone+prior lead $+0.123/+0.043$ over PCMCI+), auto-attenuates on the anonymous-variable Traffic stress test, and retains a lead at $d{=}300$; against BayesDAG, the closest soft-Bayesian baseline, PRCD-MAP wins on every CausalTime dataset under a matched $W_0$-only protocol. A four-way ablation isolates each component: EB calibration and MLP trust propagation jointly carry the plurality of the gain, with positive sign on every dataset. Extensions to nonlinear (NAM) and cross-sectional settings show the calibrated-trust principle is setting-agnostic.
Mind the Gap: Structure-Aware Consistency in Preference Learning
Abstractsurrogate loss (e.g., the logistic loss) as a proxy for the true objective: the non-convex, discontinuous 0-1 ranking Preference learning has become the foundationloss. This reliance raises a fundamental theoretical question of aligning Large Language Models (LLMs) withthat remains largely unanswered for deep networks: Does human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrominimizing these surrogate losses actually guarantee the minimization of the true ranking error? However, we demonstrate that for In this work, we investigate this question through the lens the equicontinuous hypothesis sets typical of neu-of H-consistency (Mao, Mohri, and Zhong, 2023e). We ral networks, these standard surrogates are theo-formulate LLM preference learning as a pairwise ranking retically inconsistent, yielding vacuous general-problem and derive a series of results that bridge the gap between learning theory and practical fine-tuning. To resolve this, we formulate LLM alignment within a margin-shifted rankingwe identify a fundamental theoretical deficiency in standard framework. We demonstrate that for equicontinuous hypothbounds that depend on enforcing a separationesis sets, a property satisfied by neural networks, standard margin γ. Crucially, we extend this to Structure-surrogate minimization yields vacuous consistency guaranAware H-consistency, introducing a novel ob-tees. Specifically, without explicit constraints, a model can achieve arbitrarily low surrogate risk while maintaining ajective (SA-DPO) that adapts the margin based on the semantic distance between responses tohigh ranking error, effectively "cheating" the objective by handle synonyms and hard pairs. Finally, weshrinking score differences rather than learning the correct analyze the trade-off between consistency andordering. We prove that enforcing a confidence the Polynomial Hinge family) offer superior con-gap γ is not merely a heuristic, but a strict requirement for sistency guarantees for capacity-bounded models H-consistency in the deep learning regime. However, while compared to the standard logistic loss used in DPO. a uniform margin restores consistency, it is a blunt instrument. We show that demanding a large, fixed margin on semantically identical pairs (synonyms) forces the model to hallucinate differences where none exist, introducing bias 1. Introductionand instability. To address this, we propose Structure-Aware H-consistency and a corresponding objective, StructureThe alignment of Large Language Models (LLMs) has shifted from explicit Reward Modeling (Stiennon et al., Aware DPO (SA-DPO).
Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers
A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.
Parameter Tuning
If observations from the joint distribution of (A,Y,Z,W) are available in both stages, we can tune the regularization parameters λ1,λ2 using the approach proposed in Singh et al. [30], Xu et al. [35]. Let the complete data of stage 1 and stage 2 be denoted as (ai,yi,zi,wi) and ( ai, yi, zi, wi). Then, we can use the data not used in each stage to evaluate the out-of-sample performance of the other stage. A(2), ˆV(T),u(T) are the learned parameters by Algorithm 1. In this appendix, we prove propositions given in the main text. In the following, we assume that the spaces U, A, Z,W are separable and completely metrizable topological spaces and equipped with Borel σ-algebras. In this section, we use the notation PA|Z=z to express the distribution of a random variable Agiven another variable Z = z.
040ca38cefb1d9226d79c05dd25469cb-Supplemental.pdf
If there is a bingo on mode-k, the m-th row of the mode-k expansion of P is a constant multiple of the (m 1)-th row, where mis a number determined by the bingo position. When a row is a constant multiple of another row, the rank of the matrix is reduced by a maximum of one, which means Rank(P(k)) Ik 1. In the same way, if there are bk bingos, then bk rows are constant multiple of the other rows, which means Rank(P(k)) Ik bk. For any positive tensor P, rank(P) = 1 if and only if its all many-body θparameters are 0. Proof. First, we show that rank(P) = 1 implies all many-body θ-parameters are 0. From the assumption of rank(P) = 1, the m-th row of the mode-k expansion of P have to be a constant multiple of the (m 1)-th row for all m= {2,...,Ik}and k [d].
General Machine Learning: Theory for Learning Under Variable Regimes
We study learning under regime variation, where the learner, its memory state, and the evaluative conditions may evolve over time. This paper is a foundational and structural contribution: its goal is to define the core learning-theoretic objects required for such settings and to establish their first theorem-supporting consequences. The paper develops a regime-varying framework centered on admissible transport, protected-core preservation, and evaluator-aware learning evolution. It records the immediate closure consequences of admissibility, develops a structural obstruction argument for faithful fixed-ontology reduction in genuinely multi-regime settings, and introduces a protected-stability template together with explicit numerical and symbolic witnesses on controlled subclasses, including convex and deductive settings. It also establishes theorem-layer results on evaluator factorization, morphisms, composition, and partial kernel-level alignment across semantically commensurable layers. A worked two-regime example makes the admissibility certificate, protected evaluative core, and regime-variation cost explicit on a controlled subclass. The symbolic component is deliberately restricted in scope: the paper establishes a first kernel-level compatibility result together with a controlled monotonic deductive witness. The manuscript should therefore be read as introducing a structured learning-theoretic framework for regime-varying learning together with its first theorem-supporting layer, not as a complete quantitative theory of all learning systems.