Goto

Collaborating Authors

 optimization


Entropy-Regularized Probabilistic Gates for Sparse Model Discovery in Scarce-Data Federated Learning

arXiv.org Machine Learning

Federated Learning (FL) is a distributed machine learning (ML) paradigm with collaboration among multiple clients without sharing data. FL is challenging under data heterogeneity and partial client participation. Learning sparse models is useful for communication and computational efficiency in FL, but it is especially difficult in the small-sample high-dimensional regime (d >> N) where optimization can yield parameter configurations that fail to generalize to unseen test data. While magnitude-based pruning doesn't account for uncertainty exploration in the parameter space, a formulation with probabilistic gates and an L0 constraint allows sampling from competing sparse configurations during training. In this work, we study entropy regularization of gate distributions as a mechanism to maintain uncertainty in sparse federated optimization by preventing early commitment to sparse support. We examine its impact under data heterogeneity, client participation heterogeneity, and sparsity. Experiments on synthetic and real-world benchmarks show consistent improvements over federated iterative hard thresholding (Fed-IHT) and pruning after dense federated averaging (FedAvg) training, both in statistical performance on test data and in sparsity recovery accuracy.


Distributionally Robust Linear Regression With Block Lewis Weights

arXiv.org Machine Learning

Machine learning algorithms and their training datasets have grown substantially in both size and complexity over the past decade. This increased model complexity has made it challenging to interpret and predict their behavior in unobserved scenarios. Hence, many applications that involve societal decisions still rely on simple, interpretable models like linear regression, often after feature engineering. Examples of such applications include predicting national housing prices, estimating wages across industries, forecasting loan amounts across banks, predicting life insurance premiums across groups, and projecting energy consumption across communities [CGKMN24]. A shared safety and sometimes legal concern across the above applications is the potential for wildly different model qualities for different distributions, i.e., outputting a notably worse model for some source data distributions [Dat14; BS16; HPS16; VVB18; SBFVV19; BHJKR21; CGNSG23; Cho16; KLMR18; ADW19; CGKMN24; SVWZ24].


Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

arXiv.org Machine Learning

Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization (RePO) mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning when the reference still outperforms the policy's own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative--rather than restrictive--throughout training. Across TOMG-Bench MOLOPT, Active-GRPO improves average SR Sim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.


On the Convergence of Self-Improving Online LLM Alignment

arXiv.org Machine Learning

Abstractitations, recent work explores online RLHF that iterates between generating on-policy responses and collecting preferences [Lee et al., 2024, Park et al., 2022]. Among online The Self-Improving Alignment (SAIL) algorithmapproaches, SAIL reduces a bilevel alignment formulation addresses distribution shift by reducing a bilevelto a computationally efficient single-level surrogate and formulation of the problem to an efficient, single-reports strong empirical gains [Ding et al., 2024]. Empirically, SAIL has demonstratedisting online pipelines are largely heuristic and do not anastrong performance on this task. However, a for-lytically control the distributional shift induced by iterative mal analysis of its convergence properties has beendata collection [Chakraborty et al., 2024, Shen et al., 2024], lacking. We identify a key theoretical challenge: which has been linked to suboptimal performance in practice the standard SAIL objective function is not guar- [Sharma et al., 2024]. To address this limita-A growing line of work argues that the coupling between tion, we propose a regularized objective, SAILreward learning and policy updates is fundamentally bilevel and should be modeled as such [Chakraborty et al., 2024].RevKL, which incorporates a reverse KullbackAs a follow-up, Ding et al. [2024] reduces the bilevel align-Leibler (KL) divergence penalty to improve the optimization landscape. Our central theoretical con-ment objective to a tractable single-level surrogate and retribution is to prove that this regularized objectiveports strong empirical gains, yet it lacks formal convergence satisfies the Polyak-Lojasiewicz (PL) conditionguarantees. Related theoretical analyses in bilevel/RLHFstyle problems exist [e.g., Yang et al., 2025, Chakrabortywithin a bounded parameter space. We establish et al., 2024, Gaur et al., 2025], yet they either focus onglobal convergence guarantees, achieving a nearlinear sample complexity.


Curvature-Weighted Gradient Diversity: A Noise Measure for Geometry-Adaptive SGD Schedules

arXiv.org Machine Learning

The standard convergence analysis of mini-batch stochastic gradient descent (SGD) models gradient noise using a single variance term that treats all parameter directions equally, ignoring the fact that noise in high-curvature directions has less impact because learning rates are already constrained there. We introduce Curvature-Weighted Gradient Diversity (CWGD), a geometry-aware measure that weights per-sample gradient diversity by the inverse square root of the Hessian, providing a tighter proxy for the effective optimization noise. For strongly convex quadratic objectives with diagonal Hessians and isotropic noise, we prove that a CWGD-modulated cosine learning-rate schedule can reduce the asymptotic optimization error floor by up to a factor of two compared with standard cosine annealing. We implement this idea as CWGD-Cosine using a Hutchinson-based diagonal Hessian estimator that is exact for quadratic objectives. Across a range of condition numbers, batch sizes, and noise structures, CWGD-Cosine consistently achieves approximately 20% lower final optimization error than standard cosine annealing while incurring negligible overhead in the quadratic setting. We also identify and correct a degenerate curvature estimator, analyze the robustness of the proposed estimator, and explicitly discuss the limitations of the method, including Hessian staleness in non-convex optimization. These results establish CWGD as a principled geometry-aware measure of optimization noise and motivate future extensions to more general learning problems.


On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Neural Information Processing Systems

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.


Dangerous Liaisons of Convex Learning and Non-Affine Aggregation

arXiv.org Machine Learning

Last-iterate convergence and generalization guarantees in first-order convex learning hinge on the monotonicity of the update operator. While linear averaging preserves the monotonicity of gradient updates, this property is often violated when gradients are aggregated non-affinely, as in modern pipelines enforcing constraints like adaptivity, privacy, robustness or fairness. Whether it is possible to design non-affine aggregation rules that maintain monotonicity has remained an open question. We answer this question negatively: we prove that the monotonicity of aggregated gradients is preserved if and only if the aggregation rule is positively affine. Consequently, non-affine aggregation prevents steady convergence and substantially degrade algorithmic stability. We quantify these drawbacks and propose a path forward by identifying sufficient conditions under which monotonicity can be restored. Our results provide a unified theoretical framework explaining the disparate failure modes observed in modern learning systems.


Adaptive Gradient Masking for Balancing ID and MLLM-based Representations in Recommendation

Neural Information Processing Systems

In large-scale recommendation systems, multimodal (MM) content is increasingly introduced to enhance the generalization of ID features. The rise of Multimodal Large Language Models (MLLMs) enables the construction of unified user and item representations. However, the semantic distribution gap between MM and ID representations leads to \textit{convergence inconsistency} during joint training: the ID branch converges quickly, while the MM branch requires more epochs, thus limiting overall performance. To address this, we propose a two-stage framework including MM representation learning and joint training optimization. First, we fine-tune the MLLM to generate unified user and item representations, and introduce collaborative signals by post-aligning user ID representations to alleviate semantic differences. Then, we propose an Adaptive Gradient Masking (AGM) training strategy to dynamically regulate parameter updates between ID and MLLM branches. AGM estimates the contribution of each representation with mutual information, and applies non-uniform gradient masking at the sub-network level to balance optimization. We provide theoretical analysis of AGM's effectiveness and further introduce an unbiased variant, AGM*, to enhance training stability.


QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Neural Information Processing Systems

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. The recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. Firstly, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Secondly, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge.


Set Smoothness Unlocks Clarke Hyper-stationarity in Bilevel Optimization

Neural Information Processing Systems

Solving bilevel optimization (BLO) problems to global optimality is generally intractable. A common surrogate is to compute a hyper-stationary point--a stationary point of the hyper-objective function obtained by minimizing or maximizing the upper-level objective over the lower-level solution set. Existing methods, however, either provide weak notions of stationarity or require restrictive assumptions to guarantee the smoothness of hyper-objective functions. In this paper, we eliminate these impractical assumptions and show that strong (Clarke) hyper-stationarity remains computable even when the hyper-objective is nonsmooth. Our key ingredient is a new structural property, called set smoothness, which captures the variational dependence of the lower-level solution set on the upper-level variable. We prove that this property holds for a broad class of BLO problems and ensures weak convexity (resp.