- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)
- Health & Medicine > Therapeutic Area (0.67)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
- Health & Medicine > Diagnostic Medicine (0.45)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Data Science (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
- Information Technology > Artificial Intelligence > Vision (0.67)
Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters, such as the Lipschitz smoothness constant, which are often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{\mathcal{O}}(T^{-1/4})$ in terms of the gradient norm for minimizing smooth objectives. Unfortunately, this comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods, namely Normalized SGD (NSGD), AMSGrad, and AdaGrad, and show that they avoid this exponential dependence without requiring any knowledge of the smoothness parameter or boundedness of the stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in mitigating the issue of large gradients.
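The contrast between the two schemes can be made concrete with a small numerical sketch. The snippet below is illustrative only and not from the paper: the quadratic objective, noise scale, and value of $\eta$ are arbitrary assumptions. It runs untuned SGD with stepsize $\eta/\sqrt{t}$ alongside Normalized SGD, whose step length never scales with the gradient magnitude.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): untuned SGD
# with stepsize eta/sqrt(t) versus Normalized SGD (NSGD) on a smooth toy objective.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10)) / 5.0            # f(x) = 0.5 * ||A x||^2

def stochastic_grad(x):
    # True gradient of f plus additive Gaussian noise (assumed noise model).
    return A.T @ (A @ x) + 0.1 * rng.normal(size=x.shape)

eta, T = 1.0, 1000                              # eta is arbitrary, not tuned to the smoothness constant
x0 = rng.normal(size=10)
x_sgd, x_nsgd = x0.copy(), x0.copy()

for t in range(1, T + 1):
    g = stochastic_grad(x_sgd)
    # Untuned SGD: x_{t+1} = x_t - (eta / sqrt(t)) * g_t. If eta is large relative
    # to the (unknown) smoothness constant, early iterates can grow rapidly before
    # the decaying stepsize tames them.
    x_sgd = x_sgd - (eta / np.sqrt(t)) * g

    g = stochastic_grad(x_nsgd)
    # NSGD: the step uses only the gradient direction, so its length never scales
    # with the gradient magnitude and that blow-up mechanism is absent.
    x_nsgd = x_nsgd - (eta / np.sqrt(t)) * g / (np.linalg.norm(g) + 1e-12)

print("||grad|| untuned SGD:", np.linalg.norm(A.T @ (A @ x_sgd)))
print("||grad|| NSGD:       ", np.linalg.norm(A.T @ (A @ x_nsgd)))
```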
Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance
Yadav, Robin, Xie, Shuo, Wang, Tianhao, Li, Zhiyuan
Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantage over GD of adaptive algorithms with structured preconditioners (e.g., coordinate-wise or layer-wise) and of steepest descent methods w.r.t. non-Euclidean norms (e.g., the $\ell_\infty$ norm or the spectral norm). However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction in which we provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. the $\ell_2$ norm) in the presence of heavy-tailed class imbalance.
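As a rough illustration of the two steepest-descent updates compared above, the sketch below fits a plain softmax model on synthetic data whose class frequencies follow a Zipf (heavy-tailed) distribution. The model, data, and stepsizes are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch: sign descent (steepest descent w.r.t. l_inf) vs. normalized GD
# (steepest descent w.r.t. l_2) on softmax regression with Zipf class imbalance.
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 50, 16, 2000                          # "vocabulary" size, feature dim, samples

# Heavy-tailed (Zipf-like) distribution over the V next-token classes.
freqs = 1.0 / np.arange(1, V + 1)
freqs /= freqs.sum()
y = rng.choice(V, size=n, p=freqs)
X = rng.normal(size=(n, d))

def loss_and_grad(W):
    # Softmax cross-entropy loss and its gradient w.r.t. W (shape d x V).
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(p[np.arange(n), y] + 1e-12))
    p[np.arange(n), y] -= 1.0                   # softmax probabilities minus one-hot labels
    return loss, X.T @ p / n

W_sign = np.zeros((d, V))
W_ngd = np.zeros((d, V))
for _ in range(200):
    _, g = loss_and_grad(W_sign)
    W_sign -= 0.01 * np.sign(g)                 # sign descent update
    _, g = loss_and_grad(W_ngd)
    W_ngd -= 0.01 * g / (np.linalg.norm(g) + 1e-12)   # normalized GD update

print("sign descent loss :", loss_and_grad(W_sign)[0])
print("normalized GD loss:", loss_and_grad(W_ngd)[0])
```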
- Europe > Austria > Vienna (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > British Columbia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)
- Asia > Middle East > Saudi Arabia (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Russia (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings
Markou, Evan, Ajanthan, Thalaiyasingam, Gould, Stephen
Many high-dimensional optimisation problems exhibit rich geometric structures in their set of minimisers, often forming smooth manifolds due to over-parametrisation or symmetries. When this structure is known, at least locally, it can be exploited through reduction mappings that reparametrise part of the parameter space to lie on the solution manifold. These reductions naturally arise from inner optimisation problems and effectively remove redundant directions, yielding a lower-dimensional objective. In this work, we introduce a general framework to understand how such reductions influence the optimisation landscape. We show that well-designed reduction mappings improve curvature properties of the objective, leading to better-conditioned problems and theoretically faster convergence for gradient-based methods. Our analysis unifies a range of scenarios where structural information at optimality is leveraged to accelerate convergence, offering a principled explanation for the empirical gains observed in such optimisation algorithms.
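A minimal sketch of the idea, under an assumed quadratic setup that is not the paper's general framework: the inner problem over $v$ is solved in closed form, giving a reduction mapping $v^*(u)$ and a lower-dimensional reduced objective $g(u) = f(u, v^*(u))$ on which gradient descent runs.

```python
# Minimal sketch (illustrative quadratic setup): a reduction mapping that solves
# the inner problem over v in closed form, leaving a reduced objective in u alone.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))
B = rng.normal(size=(30, 8))
c = rng.normal(size=30)

# Full objective f(u, v) = 0.5 * ||A u + B v - c||^2; for each u, the minimisers
# over v are the redundant directions that the reduction removes.
def inner_solution(u):
    # Reduction mapping: v*(u) = argmin_v f(u, v), here a linear least-squares solve.
    return np.linalg.lstsq(B, c - A @ u, rcond=None)[0]

P_perp = np.eye(30) - B @ np.linalg.pinv(B)     # projector onto range(B)^perp

def reduced_objective(u):
    # g(u) = f(u, v*(u)) = 0.5 * ||P_perp (A u - c)||^2
    r = P_perp @ (A @ u - c)
    return 0.5 * r @ r

def reduced_grad(u):
    # By the envelope theorem, grad g(u) = A^T P_perp (A u - c);
    # no derivative of v*(u) is needed.
    return A.T @ (P_perp @ (A @ u - c))

u = rng.normal(size=5)
for _ in range(300):
    u -= 0.01 * reduced_grad(u)                 # gradient descent on the reduced objective
print("reduced objective at solution:", reduced_objective(u))
print("corresponding inner variable :", inner_solution(u))
```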
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Data Science (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)