 subdifferential


Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds

Neural Information Processing Systems

This work addresses the finite-time analysis of nonsmooth nonconvex stochastic optimization under Riemannian manifold constraints. We adapt the notion of Goldstein stationarity to the Riemannian setting as a performance metric for nonsmooth optimization on manifolds. We then propose a Riemannian Online to NonConvex (RO2NC) algorithm, for which we establish a sample complexity of $O(\epsilon^{-3}\delta^{-1})$ in finding $(\delta,\epsilon)$-stationary points. This result is the first finite-time guarantee for fully nonsmooth, nonconvex optimization on manifolds and matches the optimal complexity in the Euclidean setting. When gradient information is unavailable, we develop a zeroth-order version of the RO2NC algorithm (ZO-RO2NC), for which we establish the same sample complexity. The numerical results support the theory and demonstrate the practical effectiveness of the algorithms.
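For orientation, the standard Euclidean notion of Goldstein stationarity that the abstract refers to can be written as below. This is the Euclidean definition only; the Riemannian adaptation proposed in the paper is not reproduced here, and the symbols $\partial_\delta f$ and $g$ are notation chosen for this note.

```latex
% Goldstein \delta-subdifferential of a locally Lipschitz f at x (Euclidean case):
\partial_\delta f(x) \;=\; \operatorname{conv}\Big( \bigcup_{y \,:\, \|y - x\| \le \delta} \partial f(y) \Big)

% x is a (\delta,\epsilon)-stationary point if
\min_{g \in \partial_\delta f(x)} \|g\| \;\le\; \epsilon
```

Here $\partial f(y)$ denotes the Clarke subdifferential at $y$. The sample complexity quoted in the abstract counts the stochastic oracle calls needed to reach such a point.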


Gradient Multi-Normalization for Efficient LLM Training

Neural Information Processing Systems

Training large language models (LLMs) commonly relies on adaptive optimizers such as Adam (Kingma & Ba, 2015), which accelerate convergence through moment estimates but incur substantial memory overhead. Recent stateless approaches such as SWAN (Ma et al., 2024) have shown that appropriate preprocessing of instantaneous gradient matrices can match the performance of adaptive methods without storing optimizer states. Building on this insight, we introduce gradient multi-normalization, a principled framework for designing stateless optimizers that normalize gradients with respect to multiple norms simultaneously. Whereas standard first-order methods can be viewed as gradient normalization under a single norm (Bernstein & Newhouse, 2024), our formulation generalizes this perspective to a multi-norm setting. We derive an efficient alternating scheme that enforces these normalization constraints and show that our procedure can produce, up to arbitrary precision, a fixed point of the problem. This unifies and extends prior stateless optimizers, showing that SWAN arises as a specific instance with particular norm choices. Leveraging this principle, we develop SinkGD, a lightweight matrix optimizer that retains the memory footprint of SGD (without momentum) while substantially reducing computation relative to whitening-based methods. On the memory-efficient LLaMA training benchmark (Zhao et al., 2024a), SinkGD achieves state-of-the-art performance, reaching the same evaluation perplexity as Adam using only 40% of the training tokens.
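The abstract does not say which norms are used or how the alternating scheme is implemented, so the following is only a minimal sketch of the general idea, not the authors' SinkGD. It assumes the two constraints are row-wise and column-wise l2 norms with targets sqrt(n) and sqrt(m) for an m-by-n gradient; the function name `alternating_normalize` and its parameters are hypothetical.

```python
import numpy as np


def alternating_normalize(grad, n_iters=5, eps=1e-8):
    """Alternately rescale rows and columns of a gradient matrix.

    Assumption (not from the source): rows are rescaled to l2 norm sqrt(n)
    and columns to l2 norm sqrt(m), so both constraints imply the same
    squared Frobenius norm m * n and can hold at the same time.
    """
    m, n = grad.shape
    x = np.asarray(grad, dtype=np.float64).copy()
    for _ in range(n_iters):
        # Row step: give every row an l2 norm of (approximately) sqrt(n).
        row_norms = np.linalg.norm(x, axis=1, keepdims=True)
        x = np.sqrt(n) * x / (row_norms + eps)
        # Column step: give every column an l2 norm of (approximately) sqrt(m).
        col_norms = np.linalg.norm(x, axis=0, keepdims=True)
        x = np.sqrt(m) * x / (col_norms + eps)
    return x


if __name__ == "__main__":
    g = np.arange(1, 25, dtype=float).reshape(4, 6)
    g[::2] *= -1  # mixed signs; only positive rescaling is applied, so signs are kept
    update = alternating_normalize(g, n_iters=50)
    print(np.linalg.norm(update, axis=0))  # column norms after the last step
    print(np.linalg.norm(update, axis=1))  # row norms, satisfied only approximately
```

No state is carried between calls, which is the sense in which such a preprocessing step is stateless: the update direction depends only on the current gradient matrix. On the squared entries this iteration acts like Sinkhorn-style matrix scaling, which is one reason a fixed point is approached only up to a chosen precision.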





aac933717a429f57c6ca58f32975c597-AuthorFeedback.pdf

Neural Information Processing Systems

In our paper, the Grassmannian structure is utilized together with the RRC to analyze the convergence of the projected Riemannian subgradient method. Since both the robust subspace learning and dictionary learning problems are regular, their Riemannian subdifferentials computed in Section 4 are correct.