Optimization
Learning by Steering the Neural Dynamics: A Statistical Mechanics Perspective
Despite the striking successes of deep neural networks trained with gradient-based optimization, these methods differ fundamentally from their biological counterparts. This gap raises key questions about how nature achieves robust, sample-efficient learning at minimal energy costs and solves the credit-assignment problem without backpropagation. We take a step toward bridging contemporary AI and computational neuroscience by studying how neural dynamics can support fully local, distributed learning that scales to simple machine-learning benchmarks. Using tools from statistical mechanics, we identify conditions for the emergence of robust dynamical attractors in random asymmetric recurrent networks. We derive a closed-form expression for the number of fixed points as a function of self-coupling strength, and we reveal a phase transition in their structure: below a critical self-coupling, isolated fixed points coexist with exponentially many narrow clusters showing the overlap-gap property; above it, subdominant yet dense and extensive clusters appear. These fixed points become accessible, including to a simple asynchronous dynamical rule, after an algorithm-dependent self-coupling threshold. Building on this analysis, we propose a biologically plausible algorithm for supervised learning with any binary recurrent network. Inputs are mapped to fixed points of the dynamics, by relaxing under transient external stimuli and stabilizing the resulting configurations via local plasticity. We show that our algorithm can learn an entangled version of MNIST, leverages depth to develop hierarchical representations and increase hetero-association capacity, and is applicable to several architectures. Finally, we highlight the strong connection between algorithm performance and the unveiled phase transition, and we suggest a cortex-inspired alternative to self-couplings for its emergence.
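As a rough illustration of the asynchronous dynamics mentioned in the abstract above, the following sketch relaxes a binary recurrent network with random asymmetric couplings and a uniform self-coupling until no unit changes sign. The Gaussian coupling scale, the tie-breaking rule, and all function names (`make_couplings`, `relax`, `is_fixed_point`) are assumptions made for this example; it is not the paper's learning algorithm and omits the external stimuli and the plasticity step.

```python
import random


def sign(h):
    # Tie-breaking convention (an assumption): a zero field maps to +1.
    return 1 if h >= 0 else -1


def make_couplings(n, self_coupling, rng):
    # Asymmetric couplings: J[i][j] and J[j][i] are drawn independently.
    # The 1/sqrt(n) scale is a common convention, assumed here.
    scale = 1.0 / n ** 0.5
    J = [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        J[i][i] = self_coupling  # same self-coupling on every unit
    return J


def local_field(J, s, i):
    return sum(J[i][j] * s[j] for j in range(len(s)))


def is_fixed_point(J, s):
    # A configuration is a fixed point if every unit agrees with its field.
    return all(s[i] == sign(local_field(J, s, i)) for i in range(len(s)))


def relax(J, s, rng, max_sweeps=1000):
    # Asynchronous rule: visit units one at a time in random order and
    # align each with the sign of its current local field.
    s = list(s)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.sample(range(n), n):
            new = sign(local_field(J, s, i))
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            return s, True  # a full sweep with no flips: fixed point reached
    return s, False  # gave up; the dynamics may be cycling or chaotic
```

In this toy setting a very large self-coupling freezes any starting configuration, while a self-coupling of zero may never settle within the sweep budget, which is why `relax` returns a convergence flag.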
Learning Dynamics of VLM Finetuning
Zhang, Jusheng, Cai, Kaitong, Yang, Jing, Wang, Keze
The finetuning of vision-language models (VLMs) involves intricate learning dynamics that pose significant challenges for stable optimization (Liu et al., 2023; Huang & Zhang, 2024; Zhang et al., 2025a;c). VLMs process multimodal inputs, encoding textual and visual components as high-dimensional sequences, where the visual stream introduces complex state dependencies--such as pixel embeddings and spatial metadata--that tightly couple gradient updates across tokens (Radford et al., 2021; Li et al., 2023; Zhang et al., 2025d;b). Prominent finetuning methods, including supervised finetuning (SFT) (Ouyang et al., 2022) and direct preference optimization (DPO) (Rafailov et al., 2023), employ diverse loss geometries and supervision signals, necessitating a unified analytical framework to unravel their behavioral foundations, especially in preference-based alignment aimed at prioritizing human-preferred outputs (Ren & Sutherland, 2025). Preference-based finetuning is essential for aligning VLMs with human intent (Liu et al., 2024a; Radford et al., 2021; Chen et al., 2023; Zhang et al., 2024), yet it suffers from notorious instability in practice. Alignment datasets often contain static or mis-specified negative examples--trivially incorrect or off-distribution--that inject uninformative gradients (Casper et al., 2023; Kaufmann et al., 2024; Song et al., 2025). These gradients disrupt optimization, degrade calibration, and produce overconfident, peaky posteriors.
Mean-Field Games with Constraints
This paper introduces a framework of Constrained Mean-Field Games (CMFGs), where each agent solves a constrained Markov decision process (CMDP). This formulation captures scenarios in which agents' strategies are subject to feasibility, safety, or regulatory restrictions, thereby extending the scope of classical mean field game (MFG) models. We first establish the existence of CMFG equilibria under a strict feasibility assumption, and we further show uniqueness under a classical monotonicity condition. To compute equilibria, we develop Constrained Mean-Field Occupation Measure Optimization (CMFOMO), an optimization-based scheme that parameterizes occupation measures and shows that finding CMFG equilibria is equivalent to solving a single optimization problem with convex constraints and bounded variables. CMFOMO does not rely on uniqueness of the equilibria and can approximate all equilibria with arbitrary accuracy. We further prove that CMFG equilibria induce $O(1 / \sqrt{N})$-Nash equilibria in the associated constrained $N$-player games, thereby extending the classical justification of MFGs as approximations for large but finite systems. Numerical experiments on a modified Susceptible-Infected-Susceptible (SIS) epidemic model with various constraints illustrate the effectiveness and flexibility of the framework.
Concise Reasoning in the Lens of Lagrangian Optimization
Gao, Chengqian, Li, Haonan, Killian, Taylor W., She, Jianshu, Wang, Renxi, Ma, Liqun, Cheng, Zhoujun, Hao, Shibo, Xu, Zhiqiang
Concise reasoning in large language models seeks to generate only essential intermediate steps needed to arrive at a final answer, thereby alleviating issues of "over-thinking". Most proposed approaches hinge on carefully hand-crafted heuristics, struggling to balance concision with performance, often failing to adapt across domains and model scales. In this work, we address these challenges by introducing a principled and pragmatic strategy, performance-aware length updating (PALU). As a principled algorithm, PALU formulates concise reasoning as a constrained optimization problem, minimizing response length subject to a performance constraint, and then applies Lagrangian optimization to convert it into a tractable unconstrained problem. As a pragmatic solution, PALU streamlines complicated update rules through three approximations: (i) estimating performance with off-policy rollouts, (ii) truncating the Lagrange multiplier to two extremes, and (iii) replacing gradient-based updates with quantile-driven length adjustments. Furthermore, PALU is demonstrated to adapt across both domain (logic, STEM and math) and model scale (1.5B, 7B, 14B), entrenching the algorithm as a practical and effective concise reasoning approach. Reasoning, requiring large language models (LLMs) to work through intermediate steps before producing a final answer, substantially improves performance on complex tasks such as mathematics (Jaech et al., 2024; Shao et al., 2024), programming (Lambert et al., 2024), and value alignment (Guo et al., 2025). Yet this benefit is often accompanied by overthinking: redundant self-reflection, backtracking, and validation (Chen et al., 2024; Zhang et al., 2024; Fatemi et al., 2025). These limitations inflate inference costs and hamper user experience, motivating the need for concise reasoning--the production of only the essential steps required to reach a correct answer.
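The constrained formulation described in the abstract above can be written out generically as follows. The notation is assumed for illustration and is not taken from the paper: $\pi_\theta$ is the model's policy, $|y|$ the response length, $r(x, y)$ a task-performance score, and $\tau$ the required performance level.

```latex
% Assumed notation: \pi_\theta (policy), |y| (response length),
% r(x,y) (performance score), \tau (performance target), \lambda (multiplier).
\[
\min_{\theta}\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, |y| \,\big]
\quad \text{s.t.} \quad
\mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;\ge\; \tau
\]
\[
\max_{\lambda \ge 0}\; \min_{\theta}\;
\mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, |y| \,\big]
\;+\; \lambda \Big( \tau - \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \Big)
\]
```

In this standard Lagrangian relaxation, the multiplier $\lambda$ grows while the performance constraint is violated and shrinks toward zero once it is satisfied, so the length objective dominates only when performance is adequate.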
OptiFLIDS: Optimized Federated Learning for Energy-Efficient Intrusion Detection in IoT
Elouardi, Saida, Jouhari, Mohammed, Motii, Anas
In critical IoT environments, such as smart homes and industrial systems, effective Intrusion Detection Systems (IDS) are essential for ensuring security. However, developing robust IDS solutions remains a significant challenge. Traditional machine learning-based IDS models typically require large datasets, but data sharing is often limited due to privacy and security concerns. Federated Learning (FL) presents a promising alternative by enabling collaborative model training without sharing raw data. Despite its advantages, FL still faces key challenges, such as data heterogeneity (non-IID data) and high energy and computation costs, particularly for resource-constrained IoT devices. To address these issues, this paper proposes OptiFLIDS, a novel approach that applies pruning techniques during local training to reduce model complexity and energy consumption. It also incorporates a customized aggregation method to better handle pruned models that differ due to non-IID data distributions. Experiments conducted on three recent IoT IDS datasets, TON_IoT, X-IIoTID, and IDSIoT2024, demonstrate that OptiFLIDS maintains strong detection performance while improving energy efficiency, making it well-suited for deployment in real-world IoT environments.
Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games
Zhang, Brian Hu, Anagnostides, Ioannis, Sandholm, Tuomas
A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established since Nemirovski's mirror-prox algorithm and Nesterov's excessive gap technique in the early 2000s, the most effective paradigm in practice is *counterfactual regret minimization*, which is based on *regret matching* and its modern variants. In particular, the state of the art across most benchmarks is *predictive* regret matching$^+$ (PRM$^+$), in conjunction with non-uniform averaging. Yet, such algorithms can exhibit slower $\Omega(T^{-1/2})$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also being on par with PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between IREG-PRM$^+$ and optimistic gradient descent with *adaptive* learning rate. The basic flaw of PRM$^+$ is that the ($\ell_2$-)norm of the regret vector -- which can be thought of as the inverse of the learning rate -- can decrease. By contrast, we design IREG-PRM$^+$ so as to maintain the invariance that the norm of the regret vector is nondecreasing. This enables us to derive an RVU-type bound for IREG-PRM$^+$, the first such property that does not rely on introducing additional hyperparameters to enforce smoothness. Furthermore, we find that IREG-PRM$^+$ performs on par with an adaptive version of optimistic gradient descent that we introduce whose learning rate depends on the misprediction error, demystifying the effectiveness of the regret matching family *vis-a-vis* more standard optimization techniques.
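For readers unfamiliar with the regret matching family referred to above, here is a minimal sketch of plain regret matching$^+$ (RM$^+$) in self-play on a small zero-sum matrix game, with uniform averaging. It is the non-predictive baseline, not PRM$^+$ or IREG-PRM$^+$; the payoff matrix and the function names are assumptions chosen for the example.

```python
def normalize(q):
    # Play proportionally to the (nonnegative) regret vector; uniform if it is zero.
    total = sum(q)
    n = len(q)
    return [v / total for v in q] if total > 0 else [1.0 / n] * n


def rm_plus_self_play(A, T):
    # Row player maximizes x^T A y, column player minimizes it.
    n, m = len(A), len(A[0])
    qx, qy = [0.0] * n, [0.0] * m          # thresholded regret vectors
    sx, sy = [0.0] * n, [0.0] * m          # running sums of strategies
    for _ in range(T):
        x, y = normalize(qx), normalize(qy)
        ux = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        uy = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        vx = sum(x[i] * ux[i] for i in range(n))   # utility actually obtained
        vy = sum(y[j] * uy[j] for j in range(m))
        # RM+ update: accumulate instantaneous regret, then clip at zero.
        qx = [max(0.0, qx[i] + ux[i] - vx) for i in range(n)]
        qy = [max(0.0, qy[j] + uy[j] - vy) for j in range(m)]
        sx = [sx[i] + x[i] for i in range(n)]
        sy = [sy[j] + y[j] for j in range(m)]
    return [v / T for v in sx], [v / T for v in sy]


def duality_gap(A, x, y):
    # Best-response value for the row player minus that for the column player.
    n, m = len(A), len(A[0])
    best_row = max(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    best_col = min(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    return best_row - best_col


# Hypothetical 3x3 zero-sum game (row player's payoffs).
A = [[0, -1, 2], [1, 0, -1], [-2, 1, 0]]
x_avg, y_avg = rm_plus_self_play(A, 10000)
gap = duality_gap(A, x_avg, y_avg)
```

The duality gap of the averaged strategies is what the $T^{-1/2}$ and $T^{-1}$ rates in the abstract refer to; for this baseline, only the $T^{-1/2}$ guarantee is standard.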
MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
Zibakhsh, Soheil, Samragh, Mohammad, Nishu, Kumari, Hannah, Lauren, Kundu, Arnav, Cho, Minsik
The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters. Extensive data and substantial computational resources have fueled recent advancements in language models.
The Robustness of Differentiable Causal Discovery in Misspecified Scenarios
Yi, Huiyang, He, Yanyan, Chen, Duxin, Kang, Mingyu, Wang, He, Yu, Wenwu
Causal discovery aims to learn causal relationships between variables from targeted data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in commonly used challenging scenarios, except for scale variation. We also provide theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, and provide the standard for reasonable evaluation of causal discovery, as well as to further promote its application in real-world scenarios.
Simplifying Optimal Transport through Schatten-$p$ Regularization
Optimal transport (OT) has emerged as a fundamental computational tool across many areas, including machine learning, computer vision, statistics, and biology [Arjovsky et al., 2017, Peyré and Cuturi, 2019, Schiebinger et al., 2019, Bonneel and Digne, 2023]. It provides a principled framework for comparing probability distributions, and it has a rich mathematical history [Villani et al., 2008]. While the combination of practical utility and deep mathematical theory has led to the broad adoption of OT ideas in mathematics, science, and engineering, finding ways to scale OT solutions and make them interpretable remains a fundamental research question [Cuturi et al., 2023, Khamis et al., 2024]. In particular, OT typically suffers from the curse of dimensionality [Chewi et al., 2025], and regularized estimators may lack sparsity [Genevay et al., 2019]. A long line of work has focused on making OT scalable and interpretable through regularization. The most classical of these is entropic regularization, which yields a strictly convex program that can be solved via Sinkhorn scaling [Sinkhorn, 1967, Cuturi, 2013]. More recent work has sought to increase efficiency and interpretability through quadratic regularization [Blondel et al., 2018, Lorenz et al., 2021], as well as low-rank factorizations [Forrow et al., 2019, Scetbon et al., 2021]. These methods show promise in biological applications, particularly in single-cell RNA sequencing analysis [Klein et al., 2025]. Another closely related set of recent works attempts to include sparsity in the OT map using elastic costs [Cuturi et al., 2023, Klein et al., 2024, Chen et al., 2025].
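To make the entropic baseline mentioned above concrete, the following is a minimal sketch of Sinkhorn scaling for discrete entropic OT. It illustrates the classical method only, not the Schatten-$p$ regularization of the paper; the marginals `a` and `b`, the squared-distance cost, and the regularization strength `eps` are hypothetical values chosen for the example.

```python
import math


def sinkhorn(a, b, C, eps, iters=500):
    # Gibbs kernel K = exp(-C / eps); the plan has the form diag(u) K diag(v).
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical example: two histograms on three points with squared-distance cost.
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
C = [[float((i - j) ** 2) for j in range(3)] for i in range(3)]
P = sinkhorn(a, b, C, eps=1.0)
```

Every entry of the resulting plan `P` is strictly positive, which is the lack of sparsity in entropic estimators that the passage above points out.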
ADARL: Adaptive Low-Rank Structures for Robust Policy Learning under Uncertainty
Li, Chenliang, Leng, Junyu, Li, Jiaxiang, Sun, Youbang, Chen, Shixiang, Shahrampour, Shahin, Garcia, Alfredo
Robust reinforcement learning (Robust RL) seeks to handle epistemic uncertainty in environment dynamics, but existing approaches often rely on nested min-max optimization, which is computationally expensive and yields overly conservative policies. We propose Adaptive Rank Representation (AdaRL), a bi-level optimization framework that improves robustness by aligning policy complexity with the intrinsic dimension of the task. At the lower level, AdaRL performs policy optimization under fixed-rank constraints with dynamics sampled from a Wasserstein ball around a centroid model. At the upper level, it adaptively adjusts the rank to balance the bias-variance trade-off, projecting policy parameters onto a low-rank manifold. This design avoids solving adversarial worst-case dynamics while ensuring robustness without over-parameterization. Empirical results on MuJoCo continuous control benchmarks demonstrate that AdaRL not only consistently outperforms fixed-rank baselines (e.g., SAC) and state-of-the-art robust RL methods (e.g., RNAC, Parseval), but also converges toward the intrinsic rank of the underlying tasks. These results highlight that adaptive low-rank policy representations provide an efficient and principled alternative for robust RL under model uncertainty.