Learning Generative Dynamics with Soft Law Constraints: A McKean-Vlasov FBSDE Approach
Boustany, Samer El, Mekkaoui, Samy, Hafsi, Yadh, Alouadi, Alexandre, Pham, Huyên
We propose a generative framework for learning stochastic dynamics from endpoint and intermediate distributional observations. The method formulates generation as a McKean-Vlasov control problem in which terminal and time-marginal laws are enforced through soft energy constraints. The associated optimality system is a forward-backward stochastic differential equation (FBSDE) whose backward component receives a continuous drift induced by the marginal law penalties. This provides a principled alternative to hard interpolation or optimal transport maps between observed distributions: the model learns a stochastic path law whose dynamics remain globally coupled through the mean-field objective. We derive the reduced FBSDE system for quadratic control cost and constant diffusion, connecting terminal and marginal law flat derivatives to score-like training signals. The resulting neural solver is evaluated on low-dimensional distributional benchmarks, where it recovers smooth stochastic paths matching prescribed marginal laws. In a higher-dimensional ALAE latent space, endpoint supervision is used as a qualitative stress test for transporting non-smiling faces toward smiling ones in a pretrained representation. We then use articulated human motion as a structured high-dimensional case study on a curated AMASS low-to-high position dataset, using SMPL-H pose sequences and reduced pose representations. The experiments show that soft marginal law constraints can produce coherent stochastic trajectories whose intermediate distributions follow the observed evolution of human motion. The code is available at https://github.com/murex/deep-mkv-gen/tree/main.
Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review
Oh, YongKyung, Zheng, Henry W., Feng, Jeffrey, Bui, Alex A. T.
Machine Learning (ML) models trained on complex health surveys such as the National Health and Nutrition Examination Survey (NHANES) often ignore primary sampling units, stratification variables, and sampling weights. This practice violates the independence assumptions of standard evaluation methods. As a result, estimates become biased, uncertainty is underestimated, and fairness assessments fail to reflect population-level disparities. We propose Survey-aware Machine Learning (SaML), a nine-step guideline that incorporates survey design metadata across the ML lifecycle. Through a scoping review of 16 methodological papers, we summarize existing work on weighted model training, design-based cross-validation, and survey-adjusted performance evaluation. We also identify gaps in hyperparameter tuning and deployment. We provide task-specific guidance that clarifies which steps are required for different analytical objectives. SaML provides a checklist for valid population inference from survey data.
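As a rough illustration of what survey-adjusted performance evaluation can mean in practice, the sketch below computes an accuracy in which each respondent counts in proportion to a sampling weight. The function name `weighted_accuracy` and the toy data are assumptions made for this example and are not taken from the SaML guideline. The sketch also ignores strata and primary sampling units, which a design-based variance estimate would need.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each observation counts in proportion to its sampling weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    w = np.asarray(weights, dtype=float)
    if w.sum() <= 0:
        raise ValueError("weights must have a positive sum")
    correct = (y_true == y_pred).astype(float)
    return float(np.sum(w * correct) / np.sum(w))

# Toy data (hypothetical): the single misclassified respondent represents
# a large share of the population, so the two estimates differ sharply.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
weights = [100.0, 100.0, 700.0, 100.0]

unweighted = weighted_accuracy(y_true, y_pred, np.ones(4))
weighted = weighted_accuracy(y_true, y_pred, weights)
print(unweighted, weighted)
```

In this toy case the unweighted estimate treats all four respondents equally, while the weighted estimate is dominated by the heavily weighted respondent the model gets wrong.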
Muon Does Not Converge on Convex Lipschitz Functions
Parshakova, Tetiana, Khaled, Ahmed, Crawshaw, Michael, Garrigos, Guillaume, Gower, Robert M.
Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and of all non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the best-suited model for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.
Optimality of Sub-network Laplace Approximations: New Results and Methods
Raha, Swarnali, Khare, Kshitij, Patra, Rohit K
Although the Laplace approximation offers a simple route to uncertainty quantification in deep neural networks, its reliance on inverting large Hessian matrices has motivated a range of computationally feasible low-dimensional or sparse approximations. A prominent class of such methods, sub-network Laplace approximations, constructs surrogates by restricting attention to a small subset of parameters. Existing approaches in this family typically rely on diagonal, layer-wise, or other architectural heuristics for subset selection, which ignore cross-parameter interactions and lack formal optimality guarantees. In this paper, we provide a rigorous theoretical analysis of the sub-network Laplace paradigm. We prove that all sub-network Laplace methods systematically underestimate the predictive variance of the full Laplace posterior, and that this bias decreases monotonically as the retained sub-matrix expands. Leveraging this insight, we propose two principled, analytically grounded sub-network Hessian approximations: \textit{Gradient-Laplace} selects the parameters with the largest average squared gradients of the model output with respect to the parameters over a reference dataset, while \textit{Greedy-Laplace} iteratively refines this selection by accounting for off-diagonal interactions in the precision matrix. We establish theoretical guarantees characterizing their optimality properties and show that Gradient-Laplace provably outperforms existing heuristic approaches. Extensive numerical studies across diverse settings indicate that these methods perform strongly relative to existing benchmarks.
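The selection rule described for Gradient-Laplace can be sketched as follows, under simplifying assumptions: the model has a scalar output, and its per-example gradients with respect to the parameters are already available as a Jacobian matrix of shape (examples, parameters). The function name `select_by_avg_squared_gradient` is hypothetical, and the abstract does not specify the authors' actual procedure beyond the ranking criterion, so this is only an illustration of that criterion.

```python
import numpy as np

def select_by_avg_squared_gradient(jacobian, k):
    """Return the indices of the k parameters with the largest mean squared gradient.

    jacobian: array of shape (n_examples, n_params); row i holds the gradient of
    the scalar model output at reference example i with respect to the parameters.
    """
    jacobian = np.asarray(jacobian, dtype=float)
    scores = np.mean(jacobian ** 2, axis=0)   # one score per parameter
    top = np.argsort(scores)[::-1][:k]        # indices of the k largest scores
    return np.sort(top), scores

# Toy example (assumed): for a linear model f(x) = w . x, the gradient of the
# output with respect to w is x itself, so the Jacobian equals the data matrix.
X = np.array([[1.0, 0.1, 3.0],
              [2.0, 0.2, -3.0],
              [-1.0, 0.0, 3.0]])
selected, scores = select_by_avg_squared_gradient(X, k=2)
print(selected, scores)
```

A sub-network Laplace approximation would then restrict the Hessian to the rows and columns indexed by `selected`. That step is omitted here.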
Fit CATE Once: Model-Assisted Randomization Tests Without Sample Splitting
Randomization tests and flexible treatment-effect models offer complementary strengths for analyzing data from randomized panel experiments: the former provide valid inference under the known assignment mechanism, while the latter can capture complex patterns of effect heterogeneity. We develop model-assisted randomization tests that combine these strengths without sample splitting. The key idea is to estimate an unsigned version of the conditional average treatment effect (CATE) from the covariance structure of residualized outcomes, while leaving the realized assignments for randomization inference. The remaining sign can be chosen to best fit the observed outcomes. We establish identification and consistency for the proposed unsigned CATE estimators, as well as validity for the CATE-assisted randomization tests. Across synthetic and semi-synthetic experiments, the CATE-assisted randomization tests control Type I error and achieve higher power than covariate-adjusted and sample-split alternatives. Finally, we show that the assignment-free CATE estimates can be used to discover heterogeneous subgroups and test subgroup-specific treatment effects.
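For orientation, the sketch below shows a plain randomization test with a difference-in-means statistic. It is not the CATE-assisted test proposed in the abstract. It assumes complete randomization with a fixed number of treated units, and the function name `randomization_test`, the toy outcomes, and the number of draws are all illustrative choices.

```python
import numpy as np

def randomization_test(outcomes, assignments, n_draws=2000, seed=0):
    """Two-sided Monte Carlo randomization p-value for a difference in means."""
    y = np.asarray(outcomes, dtype=float)
    z = np.asarray(assignments, dtype=int)
    rng = np.random.default_rng(seed)

    def diff_in_means(z_vec):
        return y[z_vec == 1].mean() - y[z_vec == 0].mean()

    observed = abs(diff_in_means(z))
    exceed = 0
    for _ in range(n_draws):
        # Re-draw the assignment from the (assumed) known mechanism:
        # shuffling keeps the number of treated units fixed.
        z_perm = rng.permutation(z)
        if abs(diff_in_means(z_perm)) >= observed - 1e-12:
            exceed += 1
    # Adding 1 to numerator and denominator counts the observed assignment itself.
    return (1 + exceed) / (1 + n_draws)

# Toy data (hypothetical): treated units have clearly larger outcomes.
outcomes = [10, 11, 12, 13, 14, 0, 1, 2, 3, 4]
assignments = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_value = randomization_test(outcomes, assignments)
print(p_value)
```

A model-assisted variant would replace the difference in means with a statistic built from a fitted effect model. The validity argument still rests on re-drawing assignments from the known mechanism.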
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
Zhao, Qingyue, Ji, Kaixuan, Zhao, Heyang, Gu, Quanquan
\emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $ε^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $ε^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(ε^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline contextual bandits (CBs). We give the first $\tilde{O}(ε^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, establishing the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, as in the reverse-KL case.
On Variance Reduction in Learning Mean Flows
One-step generative modeling has emerged as a leading approach to amortize the inference cost of diffusion and flow-matching models. Among distillation-free methods, MeanFlow training is notoriously unstable, with non-decreasing loss and unbounded gradient variance. In this work, we establish a theory that attributes this pathology to a misuse of the conditional velocity field: it plays two distinct statistical roles in the loss, both as an unbiased regression target and as a Monte Carlo control variate inside a Jacobian-vector product, with the original loss assigning the wrong coefficient to the latter. We derive the optimal coefficient in closed form, and show that a family of fixes in concurrent works corresponds to different practical realizations of the same optimum. A controlled sweep of this coefficient on two-dimensional benchmarks and on a latent Diffusion Transformer recovers the predicted bias-variance ordering. The optimal coefficient yields up to a 54% improvement in sample quality on two-dimensional benchmarks and a monotone FID trend at every matched-step DiT checkpoint. Crucially, the same DiT measurement also reveals a quantitative FID-MSE landscape mismatch: although gradient variance is minimized at an interior coefficient value, the coefficient that minimizes FID prefers the direct use of conditional velocity.
Improving Generalization by Permutation Routing Across Model Copies
Kashiwamura, Shuhei, Leleu, Timothee
We introduce a use of the \(M\)-cover (or \(M\)-layer) transform for machine learning. The method replicates a model \(M\) times, but instead of coupling the copies through parameter averaging or an explicit attractive force, as in replicated SGD or Elastic SGD, it rewires the contexts in which local learning messages are computed. Each local loss is evaluated on a routed model whose parameters are drawn from different copies according to permutations sampled from a structured mixing kernel \(Q\). Training then uses the original local update rule, while the resulting learning messages are redistributed across the copies through these routed computational paths. Thus \(Q\) defines a topology for message transport and controls the long-loop structure of the lifted factor graph. We formulate this construction for perceptrons, committee machines, and multilayer perceptrons, showing that the same principle applies from discrete models to differentiable neural networks. The resulting framework provides a mechanism for improving generalization through structured message sharing rather than replica collapse or parameter-space coupling.
Reinforcement Learning Measurement Model
Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM; LaMar, 2018), link action choices to state-action values, but their reliance on person-specific tabular value functions makes them difficult to scale beyond small, fully enumerated tasks. We propose the Reinforcement Learning Measurement Model (RLMM), a measurement framework that decouples person-level choice sensitivity from task-level value representation through a shared parametric action-value function, making estimation more computationally efficient for larger process-data settings. The model combines a Boltzmann choice rule with normalized advantages, a soft Bellman consistency penalty, and a block-coordinate MAP procedure for joint estimation, while also yielding step-level influence diagnostics for identifying behaviorally critical decisions. In peg-solitaire simulations, the RLMM achieved higher estimation accuracy and substantially lower runtime than the original MDP-MM, with advantages increasing as task complexity grew. In AQUALAB gameplay logs, the estimated person parameter was positively associated with cumulative reward, task completion, and behavioral efficiency. These results show that the RLMM extends decision-process-based psychometric models to larger and more behaviorally realistic environments while preserving an interpretable latent trait tied to decision-making steps.
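A Boltzmann choice rule over normalized advantages might look like the sketch below. The abstract does not state how the RLMM normalizes advantages, so the normalization used here (advantage relative to the best action, divided by the range of action values in the state) is an assumption, as are the function name `boltzmann_choice_probs` and the toy action values.

```python
import numpy as np

def boltzmann_choice_probs(q_values, sensitivity):
    """Choice probabilities proportional to exp(sensitivity * normalized advantage)."""
    q = np.asarray(q_values, dtype=float)
    adv = q - q.max()                 # advantage relative to the best action (<= 0)
    spread = q.max() - q.min()
    if spread > 0:
        adv = adv / spread            # assumed normalization: rescale to [-1, 0]
    logits = sensitivity * adv
    logits = logits - logits.max()    # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy action values for a single state (hypothetical).
q_state = [1.0, 2.0, 4.0]
p_random = boltzmann_choice_probs(q_state, sensitivity=0.0)  # insensitive to value
p_expert = boltzmann_choice_probs(q_state, sensitivity=5.0)  # strongly value-driven
print(p_random, p_expert)
```

With zero sensitivity every action is equally likely, and as the person-level sensitivity grows the probability mass concentrates on the highest-valued action. This is the sense in which such a parameter can be read as a latent trait.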
Universal Feature Selection with Noisy Observations and Weak Symmetry Conditions
This paper relaxes the restrictive symmetry conditions adopted in [4], [5] and extends their universal feature selection framework to accommodate noisy observations as well as attribute structures that may exhibit directional preferences. We introduce the notion of weak spherical symmetry, quantified by second-moment distances, which allows controlled deviations from rotational invariance. Under this relaxed condition, we develop a universal feature selection framework based on the singular value decomposition of the canonical dependence matrix computed from noisy data. Our main result shows that the selected features achieve asymptotically optimal error exponents up to a residual term that depends on the symmetry deviation $δ$ and the noise levels $η_1, η_2$. When $δ, η_1, η_2$ are relatively small, our result recovers that of [5], thereby demonstrating that exact spherical symmetry is unnecessary. Overall, our findings highlight the robustness of the selection framework against second-moment deviations and observation noise, thereby broadening its applicability across diverse inference tasks and providing a theoretically grounded tool for universal feature selection in practical scenarios.