 Optimization


Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach suffers from catastrophic forgetting: second-stage RL gradually loses SFT-acquired behaviors and inefficiently explores new patterns. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, that is, the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.


Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

arXiv.org Artificial Intelligence

Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose JAGUAR SignSGD, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose JAGUAR Muon, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
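To make the zero-order idea above concrete, here is a minimal sketch of a generic two-point ZO sign step in NumPy. It is not the JAGUAR algorithm from the paper: it has no momentum, and the function name `zo_sign_step`, the Gaussian probe direction, the toy quadratic loss, and all hyperparameter values are assumptions chosen for illustration. It only shows how a descent direction can be estimated from two loss evaluations without backpropagation.

```python
import numpy as np


def zo_sign_step(loss_fn, params, lr, smoothing, rng):
    """One generic zero-order sign step (illustrative, not JAGUAR SignSGD)."""
    # Assumption: probe along a single random Gaussian direction.
    direction = rng.standard_normal(params.shape)
    # Two loss evaluations give a finite-difference directional slope;
    # no gradients or activations need to be stored.
    loss_plus = loss_fn(params + smoothing * direction)
    loss_minus = loss_fn(params - smoothing * direction)
    slope = (loss_plus - loss_minus) / (2.0 * smoothing)
    # Gradient estimate along the probe direction, then a sign update.
    grad_estimate = slope * direction
    return params - lr * np.sign(grad_estimate)


def quadratic(x):
    """Toy loss standing in for a fine-tuning objective (hypothetical)."""
    return float(np.sum(x ** 2))


rng = np.random.default_rng(0)
x = np.full(4, 3.0)
start_loss = quadratic(x)
for _ in range(200):
    x = zo_sign_step(quadratic, x, lr=0.01, smoothing=1e-3, rng=rng)
end_loss = quadratic(x)
```

Each iteration costs two forward evaluations regardless of dimension, which is the sense in which such methods need only a constant number of function evaluations per step.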


Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation

arXiv.org Machine Learning

Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We study this problem through the lens of robust Markov decision processes (RMDPs), which optimize performance against adversarial transition dynamics. Our focus is the online setting, where the agent has only limited interaction with the environment, making sample efficiency and exploration especially critical. Policy optimization, despite its success in standard RL, remains theoretically and empirically underexplored in robust RL. To bridge this gap, we propose the Distributionally Robust Regularized Policy Optimization algorithm (DR-RPO), a model-free online policy optimization method that learns robust policies with sublinear regret. To enable tractable optimization within the softmax policy class, DR-RPO incorporates reference-policy regularization, yielding RMDP variants that are doubly constrained in both transitions and policies. To scale to large state-action spaces, we adopt the $d$-rectangular linear MDP formulation and combine linear function approximation with an upper confidence bonus for optimistic exploration. We provide theoretical guarantees showing that policy optimization can achieve polynomial suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches. Finally, empirical results across diverse domains corroborate our theory and demonstrate the robustness of DR-RPO.


Inland-LOAM: Voxel-Based Structural Semantic LiDAR Odometry and Mapping for Inland Waterway Navigation

arXiv.org Artificial Intelligence

Accurate and up-to-date geospatial information is crucial for enhancing the safety and autonomy of Inland Waterway Transport (IWT). These challenges lead to significant localization drift and produce point cloud maps lacking the semantic richness required for autonomous decision-making. This paper introduces a comprehensive LiDAR Odometry and Mapping framework for inland waterway navigation (Inland-LOAM). We present an improved feature extraction method adapted to unique waterway geometries, combined with a joint optimization that incorporates the water surface as a global planar constraint to mitigate drift. We also propose an innovative pipeline that transforms dense 3D point cloud outputs into structured 2D semantic maps. By constructing semantic voxel grids and performing geometric analyses (roughness, planarity, and slope), our system classifies the environment into meaningful structural categories and supports real-time computation of critical parameters like vertical bridge clearances. An automated module then efficiently extracts shoreline boundaries, exporting them into a lightweight, IENC-compatible format. Extensive evaluations on a diverse, real-world dataset demonstrate that Inland-LOAM achieves superior localization accuracy over state-of-the-art methods. The generated maps and shorelines align with real-world conditions, providing reliable information to enhance navigational situational awareness. Both the dataset and the algorithm are publicly available to support future research. IWT constitutes an essential component of Europe's freight infrastructure, spanning a network exceeding 41,000 km, interlinking major cities and industrial hubs across 13 interconnected Member States [1]. As efforts increase to shift freight from congested road and rail networks, the importance of accurate geospatial information and detailed environmental models for managing and navigating these waterways grows [2].
Zhongbi Luo, Peter Slaets, Jan Swevers and Herman Bruyninckx are with the Division of Robotics, Automation and Mechatronics in the Department of Mechanical Engineering, KU Leuven, 3001 Leuven, Belgium (e-mail: zhongbi.luo@kuleuven.be;


Non-convex entropic mean-field optimization via Best Response flow

arXiv.org Artificial Intelligence

We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the $L^1$-Wasserstein distance, which ensures the existence of its unique fixed point that is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.


Kinematic Kitbashing for Modeling Functional Articulated Objects

arXiv.org Artificial Intelligence

We introduce Kinematic Kitbashing, an automatic framework that synthesizes functionality-aware articulated objects by reusing parts from existing models. Given a kinematic graph with a small collection of articulated parts, our optimizer jointly solves for the spatial placement of every part so that (i) attachments remain geometrically sound over the entire range of motion and (ii) the assembled object satisfies user-specified functional goals such as collision-free actuation, reachability, or trajectory following. At its core is a kinematics-aware attachment energy that aligns vector distance function features sampled across multiple articulation snapshots. We embed this attachment term within an annealed Riemannian Langevin dynamics sampler that treats functionality objectives as additional energies, enabling robust global exploration while accommodating non-differentiable functionality objectives and constraints. Our framework produces a wide spectrum of assembled articulated shapes, from trash-can wheels grafted onto car bodies to multi-segment lamps, gear-driven paddlers, and reconfigurable furniture, and delivers strong quantitative improvements over state-of-the-art baselines across geometric, kinematic, and functional metrics. By tightly coupling articulation-aware geometry matching with functionality-driven optimization, Kinematic Kitbashing bridges part-based shape modeling and functional assembly design, empowering rapid creation of interactive articulated assets.


Robust Minimax Boosting with Performance Guarantees

arXiv.org Machine Learning

Boosting methods often achieve excellent classification accuracy, but can experience notable performance degradation in the presence of label noise. Existing robust methods for boosting provide theoretical robustness guarantees for certain types of label noise, and can exhibit only moderate performance degradation. However, previous theoretical results do not account for realistic types of noise and finite training sizes, and existing robust methods can provide unsatisfactory accuracies, even without noise. This paper presents methods for robust minimax boosting (RMBoost) that minimize worst-case error probabilities and are robust to general types of label noise. In addition, we provide finite-sample performance guarantees for RMBoost with respect to the error obtained without noise and with respect to the best possible error (Bayes risk). The experimental results corroborate that RMBoost is not only resilient to label noise but can also provide strong classification accuracy.


Cautious Weight Decay

arXiv.org Machine Learning

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
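The sign-masked decay described above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' reference code: it assumes `update` is the quantity the optimizer subtracts from the parameters, that "signs align" means the elementwise product of update and parameter is non-negative, and that the step has the form shown in the docstring. The function name `cautious_decay_step` and the example values are hypothetical.

```python
import numpy as np


def cautious_decay_step(params, update, lr, weight_decay):
    """params <- params - lr * (update + weight_decay * mask * params).

    Assumption: `update` is the direction subtracted from `params`, and
    decay is applied only where update and parameter share a sign.
    """
    # 1.0 where the optimizer step already moves the coordinate toward zero.
    mask = (update * params >= 0).astype(params.dtype)
    return params - lr * (update + weight_decay * mask * params)


params = np.array([1.0, -2.0, 3.0, -4.0])
update = np.array([0.5, 0.5, -0.5, -0.5])
new_params = cautious_decay_step(params, update, lr=0.1, weight_decay=0.1)
# Coordinates 0 and 3 are decayed (signs agree); 1 and 2 get the plain update.
```

Under these assumptions, decay only ever pushes in the same direction as the optimizer step, so it never opposes the update on any coordinate.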


Nash Equilibria in Games with Playerwise Concave Coupling Constraints: Existence and Computation

arXiv.org Artificial Intelligence

We study the existence and computation of Nash equilibria in continuous static games where the players' admissible strategies are subject to shared coupling constraints, i.e., constraints that depend on their joint strategies. Specifically, we focus on a class of games characterized by playerwise concave utilities and playerwise concave constraints. Prior results on the existence of Nash equilibria are not applicable to this class, as they rely on strong assumptions such as joint convexity of the feasible set. By leveraging topological fixed point theory and novel structural insights into the contractibility of feasible sets under playerwise concave constraints, we give an existence proof for Nash equilibria under weaker conditions. Having established existence, we then focus on the computation of Nash equilibria via independent gradient methods under the additional assumption that the utilities admit a potential function. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an $ε$-approximate constrained Nash equilibrium within $\mathcal{O}(ε^{-3})$ iterations.


Finding Dori: Memorization in Text-to-Image Diffusion Models Is Not Local

arXiv.org Artificial Intelligence

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such defenses. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the nature of memorization in text-to-image DMs and inform the development of more reliable mitigations against DM memorization.