A Proof of Theorem 1
Let ŵ be this arg min, which is unique since the objective is strongly convex. Substituting the definition of p and rearranging completes the proof. Lemma 2. Let ℓ(·; z) be H-smooth, convex, and non-negative for each z, and let the stochastic gradient … For the first term on the right-hand side, we note that, due to the algorithm's projections, all of the … Lemma 3. Let ℓ(·; z) be H-smooth and non-negative for all z, and let L … This follows almost immediately from [Theorem 2.1.5, …]. This proof is based on similar ideas as the proofs of Lemma 5 and Theorem 2 due to Lan [17]. The key difference is that Lan considers a setting in which the variance of the stochastic gradients is uniformly bounded, whereas in our setting we do not directly assume any bound on this quantity.
Why Do We Need Warm-up? A Theoretical Perspective
Alimisis, Foivos, Islamov, Rustem, Lucchi, Aurelien
Training modern machine learning models requires a careful choice of hyperparameters. A common practice for setting the learning rate (LR) is to linearly increase the LR at the beginning of training (warm-up stage) [Goyal et al., 2017, Vaswani et al., 2017] and gradually decrease it toward the end of training (decay stage) [Loshchilov and Hutter, 2016, Vaswani et al., 2017, Hoffmann et al., 2022b, Zhang et al., 2023, Dremov et al., 2025]. Decaying the LR is a classical requirement in the theoretical analysis of SGD, ensuring convergence under broad conditions [Defazio et al., 2023, Gower et al., 2021], and it has been consistently observed to improve empirical performance [Loshchilov and Hutter, 2016, Hu et al., 2024, Hägele et al., 2024]. Recent work further demonstrates that decaying step sizes can improve theoretical guarantees by yielding tighter bounds [Schaipp et al., 2025]. By contrast, the practice of linearly increasing the LR at the start of training (warm-up phase) has become nearly ubiquitous in modern deep learning [He et al., 2016, Hu et al., 2024, Hägele et al., 2024], yet a clear theoretical understanding of why it helps optimization remains elusive. This raises the central question we address in this paper: Why does LR warm-up improve training, and under what conditions can its benefits be theoretically justified?
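The warm-up-then-decay schedule this abstract describes can be sketched as a small step-size function. This is an illustrative sketch only: the peak LR, warm-up length, and linear decay-to-zero below are assumed values, not settings from any of the cited papers.

```python
def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=100):
    """Linear warm-up to peak_lr, then linear decay to zero.

    All hyperparameter values here are illustrative placeholders.
    """
    if step < warmup_steps:
        # warm-up stage: LR grows linearly up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # decay stage: LR shrinks linearly to zero over the remaining steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - (step - warmup_steps) / remaining)
```

In practice the decay stage is often cosine rather than linear; only the shape (increase, then decrease) matters for the question the paper poses.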
12151_differentially_private_general.pdf
A.3 Low Dimension. Before presenting the proof of Theorem 1, we provide formal statements of its corollaries. We then bound average argument stability in terms of average regret (Lemma 5). Substituting these in the above equation gives the claimed bound. We now fill in the details. Thus, substituting the above in Eqn. (3) and substituting the bound from (6), we have E[L(ŵ; D) − L(w…)] … Substituting the value of G completes the proof.
The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication
Patel, Kumar Kshitij, Glasgow, Margalit, Zindari, Ali, Wang, Lingxiao, Stich, Sebastian U., Cheng, Ziheng, Joshi, Nirmit, Srebro, Nathan
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.
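As a concrete (hypothetical) illustration of the intermittent-communication setting discussed above, the following sketch runs Local SGD over machine-specific objectives: each machine takes several local update steps from the shared iterate, and iterates are averaged only at communication rounds. The toy quadratic objectives, step size, and round counts are made up for illustration and are not from the paper.

```python
import numpy as np

def local_sgd(grad_fns, w0, rounds, local_steps, lr):
    """Local SGD with intermittent communication.

    grad_fns: one gradient oracle per machine (heterogeneous objectives).
    Each machine runs `local_steps` updates from the shared iterate, then
    the resulting iterates are averaged (one communication per round).
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        local_iterates = []
        for grad in grad_fns:
            wm = w.copy()
            for _ in range(local_steps):
                wm -= lr * grad(wm)          # local update step
            local_iterates.append(wm)
        w = np.mean(local_iterates, axis=0)  # communication: average iterates
    return w

# Toy heterogeneous quadratics f_m(w) = 0.5 * ||w - c_m||^2; the average
# objective is minimized at the mean of the c_m.
centers = [np.array([0.0]), np.array([2.0])]
grads = [lambda w, c=c: w - c for c in centers]
w_final = local_sgd(grads, np.array([5.0]), rounds=50, local_steps=5, lr=0.1)
```

On these quadratics the averaged iterate contracts toward the minimizer of the average objective (here 1.0) every round, since each local pass is a linear contraction toward the machine's own center.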
Minibatch vs Local SGD for Heterogeneous Distributed Learning
Woodworth, Blake, Patel, Kumar Kshitij, Srebro, Nathan
We analyze Local SGD (a.k.a. parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that (i) Minibatch SGD (even without acceleration) dominates all existing analyses of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) we present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
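For contrast with local update steps, here is a minimal sketch of the Minibatch SGD baseline under the same communication pattern: each round, every machine evaluates its gradient at the shared iterate, the gradients are averaged, and a single step is taken. The objectives and step size are illustrative assumptions, not from the paper.

```python
import numpy as np

def minibatch_sgd(grad_fns, w0, rounds, lr):
    """One step per communication round: gradients from all machines
    are averaged at the shared iterate before each update."""
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        avg_grad = np.mean([grad(w) for grad in grad_fns], axis=0)
        w -= lr * avg_grad
    return w

# Toy heterogeneous quadratics f_m(w) = 0.5 * ||w - c_m||^2; the average
# objective is minimized at the mean of the c_m.
centers = [np.array([0.0]), np.array([2.0])]
grads = [lambda w, c=c: w - c for c in centers]
w_mb = minibatch_sgd(grads, np.array([5.0]), rounds=50, lr=0.5)
```

Note the trade-off the abstract is about: Minibatch SGD uses each communication round for exactly one (unbiased) step on the average objective, while local update steps take more progress per round at the cost of machine-specific drift.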
Smoothness, Low Noise and Fast Rates
Srebro, Nathan, Sridharan, Karthik, Tewari, Ambuj
We establish an excess risk bound of O(H R_n^2 + sqrt{H L*} R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = sqrt{R/n}, this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and O(RH/n + sqrt{L* RH/n}) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
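For readability, the bounds stated in this abstract can be typeset as follows (a restatement of the abstract's formulas, not an additional result):

```latex
\text{excess risk} \;\le\; O\!\left(H R_n^2 + \sqrt{H L^*}\, R_n\right),
\qquad
R_n = \sqrt{R/n} \;\Rightarrow\;
\begin{cases}
\tilde{O}\!\left(RH/n\right), & L^* = 0 \text{ (separable case)},\\[2pt]
O\!\left(RH/n + \sqrt{L^* R H / n}\right), & \text{in general,}
\end{cases}
```

where H is the smoothness constant and L* the best risk achievable by the hypothesis class.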