AITopics | final iterate

2f4059ce1227f021edc5d9c6f0f17dc1-Paper.pdf

Neural Information Processing SystemsFeb-11-2026, 19:56:24 GMT

final iterate, iterate, polynomially, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

LastiterateconvergenceofSGDforLeast-Squares intheInterpolationregime

Neural Information Processing SystemsFeb-10-2026, 19:51:35 GMT

Motivated by the recent successes of neural networks that have the ability to fit the data perfectlyand generalize well, we study the noiseless model in the fundamental least-squares setup.

artificial intelligence, iterate, machine learning, (15 more...)

Neural Information Processing Systems

Country: North America > United States > Colorado > Boulder County > Boulder (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)

Add feedback

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Neural Information Processing SystemsDec-25-2025, 04:42:15 GMT

Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGD's final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the step size scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.

final iterate, geometrically decaying learning rate procedure, step decay schedule, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.80)

Add feedback

2f4059ce1227f021edc5d9c6f0f17dc1-Paper.pdf

Neural Information Processing SystemsOct-2-2025, 11:32:24 GMT

artificial intelligence, iterate, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America (0.46)
Europe (0.28)

Genre: Research Report (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

b4a0e0fbaa9f16d8947c49f4e610b549-Paper.pdf

Neural Information Processing SystemsAug-17-2025, 00:25:54 GMT

artificial intelligence, iterate, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Colorado > Boulder County > Boulder (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Neural Information Processing SystemsOct-9-2024, 18:31:07 GMT

Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGD's final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of \sqrt{T} in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule.

convex case, final iterate, geometrically decaying learning rate procedure, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

It's Our Loss: No Privacy Amplification for Hidden State DP-SGD With Non-Convex Loss

Annamalai, Meenatchi Sundaram Muthu Selva

arXiv.org Artificial IntelligenceJul-8-2024

Differentially Private Stochastic Gradient Descent (DP-SGD) is a popular iterative algorithm used to train machine learning models while formally guaranteeing the privacy of users. However the privacy analysis of DP-SGD makes the unrealistic assumption that all intermediate iterates (aka internal state) of the algorithm are released since in practice, only the final trained model, i.e., the final iterate of the algorithm is released. In this hidden state setting, prior work has provided tighter analyses, albeit only when the loss function is constrained, e.g., strongly convex and smooth or linear. On the other hand, the privacy leakage observed empirically from hidden state DP-SGD, even when using non-convex loss functions suggest that there is in fact a gap between the theoretical privacy analysis and the privacy guarantees achieved in practice. Therefore, it remains an open question whether privacy amplification for DP-SGD is possible in the hidden state setting for general loss functions. Unfortunately, this work answers the aforementioned research question negatively. By carefully constructing a loss function for DP-SGD, we show that for specific loss functions, the final iterate of DP-SGD alone leaks as much information as the sequence of all iterates combined. Furthermore, we empirically verify this result by evaluating the privacy leakage from the final iterate of DP-SGD with our loss function and show that this matches the theoretical upper bound guaranteed by DP exactly. Therefore, we show that the current privacy analysis fo DP-SGD is tight for general loss functions and conclude that no privacy amplification is possible for DP-SGD in general for all (possibly non-convex) loss functions.

dp-sgd, iterate, loss function, (13 more...)

arXiv.org Artificial Intelligence

2407.06496

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Singapore (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Improved Stability and Generalization Analysis of the Decentralized SGD Algorithm

Bars, Batiste Le, Bellet, Aurélien, Tommasi, Marc

arXiv.org Artificial IntelligenceJun-5-2023

This paper presents a new generalization error analysis for the Decentralized Stochastic Gradient Descent (D-SGD) algorithm based on algorithmic stability. The obtained results largely improve upon state-of-the-art results, and even invalidate their claims that the communication graph has a detrimental effect on generalization. For instance, we show that in convex settings, D-SGD has the same generalization bounds as the classical SGD algorithm, no matter the choice of graph. We exhibit that this counter-intuitive result comes from considering the average of local parameters, which hides a final global averaging step incompatible with the decentralized scenario. In light of this observation, we advocate to analyze the supremum over local parameters and show that in this case, the graph does have an impact on the generalization. Unlike prior results, our analysis yields non-vacuous bounds even for non-connected graphs.

artificial intelligence, generalization error, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2306.02939

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Add feedback

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

Lee, Donghwan, Kim, Do Wan

arXiv.org Artificial IntelligenceJun-2-2023

TD-learning is a fundamental algorithm in the field of reinforcement learning (RL), that is employed to evaluate a given policy by estimating the corresponding value function for a Markov decision process. While significant progress has been made in the theoretical analysis of TD-learning, recent research has uncovered guarantees concerning its statistical efficiency by developing finite-time error bounds. This paper aims to contribute to the existing body of knowledge by presenting a novel finite-time analysis of tabular temporal difference (TD) learning, which makes direct and effective use of discrete-time stochastic linear system models and leverages Schur matrix properties. The proposed analysis can cover both on-policy and off-policy settings in a unified manner. By adopting this approach, we hope to offer new and straightforward templates that not only shed further light on the analysis of TD-learning and related RL algorithms but also provide valuable insights for future research in this domain.

machine learning, reinforcement learning, td-learning, (18 more...)

arXiv.org Artificial Intelligence

2204.10479

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
Asia > South Korea > Daejeon > Daejeon (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence

Liu, Daogao, Lu, Zhou

arXiv.org Machine LearningJun-28-2021

Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. The convergence rate for SGD has been extensively studied and tight analyses have been established for the running average scheme, but the sub-optimality of the final iterate is still not well-understood. shamir2013stochastic gave the best known upper bound for the final iterate of SGD minimizing non-smooth convex functions, which is $O(\log T/\sqrt{T})$ for Lipschitz convex functions and $O(\log T/ T)$ with additional assumption on strongly convexity. The best known lower bounds, however, are worse than the upper bounds by a factor of $\log T$. harvey2019tight gave matching lower bounds but their construction requires dimension $d= T$. It was then asked by koren2020open how to characterize the final-iterate convergence of SGD in the constant dimension setting. In this paper, we answer this question in the more general setting for any $d\leq T$, proving $\Omega(\log d/\sqrt{T})$ and $\Omega(\log d/T)$ lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex and strongly convex functions respectively with standard step size schedules. Our results provide the first general dimension dependent lower bound on the convergence of SGD's final iterate, partially resolving a COLT open question raised by koren2020open. We also present further evidence to show the correct rate in one dimension should be $\Theta(1/\sqrt{T})$, such as a proof of a tight $O(1/\sqrt{T})$ upper bound for one-dimensional special cases in settings more general than koren2020open.

convex function, final iterate, sgd, (13 more...)

arXiv.org Machine Learning

2106.14588

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Add feedback

Filters

Collaborating Authors

final iterate

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

2f4059ce1227f021edc5d9c6f0f17dc1-Paper.pdf

LastiterateconvergenceofSGDforLeast-Squares intheInterpolationregime

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

2f4059ce1227f021edc5d9c6f0f17dc1-Paper.pdf

b4a0e0fbaa9f16d8947c49f4e610b549-Paper.pdf

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

It's Our Loss: No Privacy Amplification for Hidden State DP-SGD With Non-Convex Loss

Improved Stability and Generalization Analysis of the Decentralized SGD Algorithm

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence