A Proof of Theorem 1
Let ŵ be this arg min, which is unique since the objective is strongly convex. Substituting the definition of p and rearranging completes the proof. Lemma 2. Let ℓ(·; z) be H-smooth, convex, and non-negative for each z, and let the stochastic gradient … For the first term on the right-hand side, we note that, due to the algorithm's projections, all of the … Lemma 3. Let ℓ(·; z) be H-smooth and non-negative for all z, and let L … This follows almost immediately from [Theorem 2.1.5, …]. This proof is based on similar ideas as the proofs of Lemma 5 and Theorem 2 due to Lan [17]. The key difference is that Lan considers a setting in which the variance of the stochastic gradients is uniformly bounded, whereas in our setting we do not directly assume any bound on this quantity.
Why Do We Need Warm-up? A Theoretical Perspective
Alimisis, Foivos, Islamov, Rustem, Lucchi, Aurelien
Training modern machine learning models requires a careful choice of hyperparameters. A common practice for setting the learning rate (LR) is to linearly increase the LR at the beginning of training (warm-up stage) [Goyal et al., 2017, Vaswani et al., 2017] and gradually decrease it toward the end of training (decay stage) [Loshchilov and Hutter, 2016, Vaswani et al., 2017, Hoffmann et al., 2022b, Zhang et al., 2023, Dremov et al., 2025]. Decaying the LR is a classical requirement in the theoretical analysis of SGD, ensuring convergence under broad conditions [Defazio et al., 2023, Gower et al., 2021], and it has been consistently observed to improve empirical performance [Loshchilov and Hutter, 2016, Hu et al., 2024, Hägele et al., 2024]. Recent work further demonstrates that decaying step sizes can improve theoretical guarantees by yielding tighter bounds [Schaipp et al., 2025]. By contrast, the practice of linearly increasing the LR at the start of training (warm-up phase) has become nearly ubiquitous in modern deep learning [He et al., 2016, Hu et al., 2024, Hägele et al., 2024], yet a clear theoretical understanding of why it helps optimization remains elusive. This raises the central question we address in this paper: Why does LR warm-up improve training, and under what conditions can its benefits be theoretically justified?
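The warm-up-then-decay schedule this abstract describes can be sketched as a small step-size function. This is an illustrative sketch only: the peak LR, warm-up length, and linear decay-to-zero below are assumed values, not settings from any of the cited papers.

```python
def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=100):
    """Linear warm-up to peak_lr, then linear decay to zero.

    All hyperparameter values here are illustrative placeholders.
    """
    if step < warmup_steps:
        # warm-up stage: LR grows linearly up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # decay stage: LR shrinks linearly to zero over the remaining steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - (step - warmup_steps) / remaining)
```

In practice the decay stage is often cosine rather than linear; only the shape (increase, then decrease) matters for the question the paper poses.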
12151_differentially_private_general.pdf
A.3 Low Dimension. Before presenting the proof of Theorem 1, we provide formal statements of its corollaries. We then bound average argument stability in terms of average regret (Lemma 5). Substituting these in the above equation gives the claimed bound. We now fill in the details. Thus, substituting the above in Eqn. (3) and substituting the bound from (6), we have E[L(ŵ; D) − L(w…)] … Substituting the value of G completes the proof.
The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication
Patel, Kumar Kshitij, Glasgow, Margalit, Zindari, Ali, Wang, Lingxiao, Stich, Sebastian U., Cheng, Ziheng, Joshi, Nirmit, Srebro, Nathan
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.
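As a concrete (hypothetical) illustration of the intermittent-communication setting discussed above, the following sketch runs Local SGD over machine-specific objectives: each machine takes several local update steps from the shared iterate, and iterates are averaged only at communication rounds. The toy quadratic objectives, step size, and round counts are made up for illustration and are not from the paper.

```python
import numpy as np

def local_sgd(grad_fns, w0, rounds, local_steps, lr):
    """Local SGD with intermittent communication.

    grad_fns: one gradient oracle per machine (heterogeneous objectives).
    Each machine runs `local_steps` updates from the shared iterate, then
    the resulting iterates are averaged (one communication per round).
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        local_iterates = []
        for grad in grad_fns:
            wm = w.copy()
            for _ in range(local_steps):
                wm -= lr * grad(wm)          # local update step
            local_iterates.append(wm)
        w = np.mean(local_iterates, axis=0)  # communication: average iterates
    return w

# Toy heterogeneous quadratics f_m(w) = 0.5 * ||w - c_m||^2; the average
# objective is minimized at the mean of the c_m.
centers = [np.array([0.0]), np.array([2.0])]
grads = [lambda w, c=c: w - c for c in centers]
w_final = local_sgd(grads, np.array([5.0]), rounds=50, local_steps=5, lr=0.1)
```

On these quadratics the averaged iterate contracts toward the minimizer of the average objective (here 1.0) every round, since each local pass is a linear contraction toward the machine's own center.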
Minibatch vs Local SGD for Heterogeneous Distributed Learning
Woodworth, Blake, Patel, Kumar Kshitij, Srebro, Nathan
We analyze Local SGD (a.k.a. parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that (i) Minibatch SGD (even without acceleration) dominates all existing analyses of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) we present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
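For contrast with local update steps, here is a minimal sketch of the Minibatch SGD baseline under the same communication pattern: each round, every machine evaluates its gradient at the shared iterate, the gradients are averaged, and a single step is taken. The objectives and step size are illustrative assumptions, not from the paper.

```python
import numpy as np

def minibatch_sgd(grad_fns, w0, rounds, lr):
    """One step per communication round: gradients from all machines
    are averaged at the shared iterate before each update."""
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        avg_grad = np.mean([grad(w) for grad in grad_fns], axis=0)
        w -= lr * avg_grad
    return w

# Toy heterogeneous quadratics f_m(w) = 0.5 * ||w - c_m||^2; the average
# objective is minimized at the mean of the c_m.
centers = [np.array([0.0]), np.array([2.0])]
grads = [lambda w, c=c: w - c for c in centers]
w_mb = minibatch_sgd(grads, np.array([5.0]), rounds=50, lr=0.5)
```

Note the trade-off the abstract is about: Minibatch SGD uses each communication round for exactly one (unbiased) step on the average objective, while local update steps take more progress per round at the cost of machine-specific drift.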
Smoothness, Low Noise and Fast Rates
Srebro, Nathan, Sridharan, Karthik, Tewari, Ambuj
We establish an excess risk bound of O(H R_n^2 + sqrt{H L*} R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = sqrt{R/n}, this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and O(RH/n + sqrt{L* RH/n}) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
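For readability, the bounds stated in this abstract can be typeset as follows (a restatement of the abstract's formulas, not an additional result):

```latex
\text{excess risk} \;\le\; O\!\left(H R_n^2 + \sqrt{H L^*}\, R_n\right),
\qquad
R_n = \sqrt{R/n} \;\Rightarrow\;
\begin{cases}
\tilde{O}\!\left(RH/n\right), & L^* = 0 \text{ (separable case)},\\[2pt]
O\!\left(RH/n + \sqrt{L^* R H / n}\right), & \text{in general,}
\end{cases}
```

where H is the smoothness constant and L* the best risk achievable by the hypothesis class.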