stale gradient
Stochastic Gradient MCMC with Stale Gradients
Chen, Changyou, Ding, Nan, Li, Chunyuan, Zhang, Yizhe, Carin, Lawrence
Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.
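As a concrete (and purely illustrative) companion to the abstract, the sketch below runs stochastic gradient Langevin dynamics on a toy Gaussian target while forcing every update to use a parameter copy that is a fixed number of steps old, mimicking stale gradients produced by asynchronous workers. The target distribution, the staleness buffer, and the step size `eps` are assumptions made for illustration, not the paper's experimental setup.

```python
# Minimal sketch (not the authors' code): SGLD where each update uses a gradient
# computed at a parameter copy that is `staleness` steps old, mimicking
# asynchronous workers behind a parameter server. Toy target: N(mu_true, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 1.0, 2.0

def grad_log_post(theta):
    """Gradient of the log-density of the toy Gaussian target."""
    return -(theta - mu_true) / sigma**2

def sgld_stale(num_steps=20000, staleness=3, eps=1e-2):
    theta = 0.0
    history = [theta] * (staleness + 1)      # buffer of past parameter values
    samples = []
    for _ in range(num_steps):
        theta_old = history[0]               # gradient evaluated at a stale copy
        g = grad_log_post(theta_old)
        noise = rng.normal(0.0, np.sqrt(2 * eps))
        theta = theta + eps * g + noise      # Langevin update with injected noise
        history = history[1:] + [theta]
        samples.append(theta)
    return np.array(samples)

samples = sgld_stale()
# With bounded staleness the posterior-mean estimate should stay close to mu_true,
# in line with the claim that staleness affects bias/MSE but not the rate at which
# the sample-based estimation variance decreases.
print(samples[5000:].mean(), samples[5000:].var())
```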
Online Convex Optimization with Switching Cost with Only One Single Gradient Evaluation
Shah, Harsh, Chandrasekhar, Purna, Vaze, Rahul
Online convex optimization with switching cost is considered under the frugal information setting where at time $t$, before action $x_t$ is taken, only a single function evaluation and a single gradient are available at the previously chosen action $x_{t-1}$, for either the current cost function $f_t$ or the most recent cost function $f_{t-1}$. When the switching cost is linear, online algorithms with optimal order-wise competitive ratios are derived for the frugal setting. When the gradient information is noisy, an online algorithm whose competitive ratio grows quadratically with the noise magnitude is derived.
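The abstract specifies the information model but not the algorithm itself, so the snippet below is only a hedged illustration of the frugal setting, not the paper's method: at each round the learner sees a single gradient at the previous action $x_{t-1}$ (here of $f_{t-1}$), takes one step, and pays a hitting cost plus a linear switching cost. The quadratic losses, step size `eta`, and horizon `T` are invented for illustration.

```python
# Hedged illustration (not the paper's algorithm) of the frugal information model:
# one gradient at the old point per round, plus a linear switching cost.
import numpy as np

rng = np.random.default_rng(1)
T, d, eta = 100, 5, 0.1
targets = rng.normal(size=(T, d))          # f_t(x) = 0.5 * ||x - targets[t]||^2

def f(t, x):
    return 0.5 * np.sum((x - targets[t]) ** 2)

def grad_f(t, x):
    return x - targets[t]

x_prev = np.zeros(d)
total_cost = 0.0
for t in range(1, T):
    g = grad_f(t - 1, x_prev)              # single gradient, at the previous action
    x_t = x_prev - eta * g                 # one frugal gradient step
    total_cost += f(t, x_t) + np.linalg.norm(x_t - x_prev)   # hitting + switching
    x_prev = x_t

print("cumulative cost:", total_cost)
```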
Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models
The paper introduces a new method for model-parallel training, where layers of a model are distributed across multiple accelerators. The method avoids locking in the backward pass by using stale gradients during back-propagation. I'm not aware of any prior work that took such an approach. Furthermore, the authors provide theoretical claims and empirical results to demonstrate that their method has convergence properties similar to conventional SGD, despite using stale gradients. The lack of effective model-parallel training is a major roadblock for scaling up model sizes, and the proposed approach promises to overcome this issue.
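The review describes the core idea only at a high level; the following toy sketch (my own construction, not the Ouroboros implementation) shows one way a layer can avoid waiting on the current step's backward signal, by updating with the upstream gradient buffered from the previous step. The two-layer regression task, the one-step delay, and the learning rate `lr` are all illustrative assumptions.

```python
# Sketch (assumptions, not Ouroboros): two "devices" each own one layer; the first
# layer updates with the upstream gradient from the *previous* step, so its backward
# pass does not lock on the current step of the downstream device.
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, lr, steps = 8, 16, 1e-2, 2000
W1 = rng.normal(scale=0.1, size=(d_hid, d_in))   # "device 0"
W2 = rng.normal(scale=0.1, size=(1, d_hid))      # "device 1"
W_true = rng.normal(size=(1, d_in))              # target linear map

stale_delta = None                                # buffered upstream gradient (one step old)
for t in range(steps):
    x = rng.normal(size=(d_in, 1))
    y = W_true @ x
    h = np.maximum(W1 @ x, 0.0)                   # forward on device 0
    y_hat = W2 @ h                                # forward on device 1
    err = y_hat - y                               # dL/dy_hat for 0.5*||y_hat - y||^2

    # Device 1 backward: exact gradient for W2 plus the upstream signal for device 0.
    grad_W2 = err @ h.T
    delta_h = (W2.T @ err) * (h > 0)
    W2 -= lr * grad_W2

    # Device 0 backward: consumes the stale upstream signal from the previous step
    # (paired with the previous input) instead of blocking on device 1.
    if stale_delta is not None:
        W1 -= lr * (stale_delta @ x_prev.T)
    stale_delta, x_prev = delta_h, x

print("final loss:", float(0.5 * err.T @ err))
```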
Reviews: Stochastic Gradient MCMC with Stale Gradients
Technical quality: I think that the theory is very complete (bounds are given for pretty much everything relevant to the problem), and the experiments show that this method performs better on large/complicated models (the small/simple models have too little variance for extra servers to help, and the staleness prevents much benefit). I think the biggest limitation of the paper is the lack of comparison against the method in [14] (the paper mostly compares against the non-distributed -- 1 worker -- case, instead of a more standard distributed case).
Novelty/originality: My impression is that the theoretical results are mostly a combination of proof techniques used in other SG-MCMC and asynchronous SGD papers (however, I'm not too sure that this claim is correct). Assuming this is true, I think the results are well executed, but not too unique.
Potential impact or usefulness: I think the theoretical analysis will be useful for people interested in how asynchrony affects SG-MCMC. However, I'm not too clear on how much this will help for running SG-MCMC in practice.
Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
Cao, Ray, Luo, Sherry, Gan, Steve, Jinesh, Sujeeth
In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure using various parameter server configurations. Our failure recovery strategies include traditional checkpointing, chain replication (which ensures a backup server takes over in case of failure), and a novel stateless parameter server approach. In the stateless approach, workers continue generating gradient updates even if the parameter server is down, applying these updates once the server is back online. We compare these techniques to a standard checkpointing approach, where the training job is resumed from the latest checkpoint. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training in each experiment. Our experimental results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy by as much as 10\% in the face of a failure, despite using stale weights and gradients. The chain replication and checkpointing techniques demonstrate convergence but suffer setbacks in accuracy due to restarting from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying these updates later can effectively improve hardware utilization. Furthermore, despite higher resource usage, the stateless parameter server method incurs similar monetary costs in terms of hardware usage compared to standard checkpointing methods, due to the pricing structure of common cloud providers.
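To make the stateless parameter server idea concrete, here is a small simulation sketch (hypothetical, not the study's code): during a simulated outage the worker keeps computing gradient updates against its last known weights and buffers them, and the server applies the buffered, now stale, updates once it recovers. The toy quadratic objective, the outage window, and the learning rate are arbitrary illustrative choices.

```python
# Illustrative sketch (hypothetical names): workers buffer updates while the
# parameter server is down and the server drains the stale buffer on recovery,
# rather than rolling back to an old checkpoint.
import numpy as np

rng = np.random.default_rng(3)
dim, lr, steps = 4, 0.1, 200
w_true = rng.normal(size=dim)

def stochastic_grad(w):
    """Noisy gradient of 0.5*||w - w_true||^2."""
    return (w - w_true) + 0.1 * rng.normal(size=dim)

server_weights = np.zeros(dim)
buffer = []                                   # worker-side buffer of pending updates

for t in range(steps):
    server_up = not (80 <= t < 120)           # simulate a mid-training outage
    worker_view = server_weights.copy()       # last weights the worker has seen
    g = stochastic_grad(worker_view)
    if server_up:
        for stale_g in buffer:                # drain updates computed on stale weights
            server_weights -= lr * stale_g
        buffer.clear()
        server_weights -= lr * g              # then apply the fresh update
    else:
        buffer.append(g)                      # server down: keep working, buffer it

print("distance to optimum:", np.linalg.norm(server_weights - w_true))
```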