Gradient Variance



A Statistical Framework for Low-bitwidth Training of Deep Neural Networks

Neural Information Processing Systems

For training ResNet-50 on ImageNet, our 5-bit block Householder quantizer achieves only 0.5% validation accuracy loss relative to QAT, comparable to the existing INT8 baseline.


Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

Neural Information Processing Systems

ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics. However, recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic, non-smooth optimization landscapes with exploding gradient variance, which leads to slow convergence. This contrasts with the conventional belief that reparameterization methods have low gradient estimation variance in problems such as training deep generative models. To understand this phenomenon, we conduct a theoretical examination of model-based RP PGMs and search for solutions to the optimization difficulties. Specifically, we analyze the convergence of model-based RP PGMs and pinpoint the smoothness of function approximators as a major factor affecting the quality of gradient estimation. Based on our analysis, we propose a spectral normalization method to mitigate the exploding-variance issue caused by long model unrolls. Our experimental results demonstrate that proper normalization significantly reduces the gradient variance of model-based RP PGMs. As a result, the performance of the proposed method is comparable to, or better than, that of other gradient estimators, such as the Likelihood Ratio (LR) gradient estimator.
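To make the proposed mitigation concrete, here is a minimal PyTorch sketch, assuming an illustrative residual MLP dynamics model (the class name, dimensions, and architecture are ours, not the paper's): wrapping each linear layer in spectral normalization bounds its Lipschitz constant, which is what keeps reparameterization gradients from exploding over long differentiable unrolls.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class SpectralDynamicsModel(nn.Module):
    """Illustrative learned dynamics model with spectrally normalized layers.

    Bounding each layer's spectral norm keeps the unrolled model smooth,
    which limits how state perturbations (and hence pathwise-gradient
    variance) compound over long horizons.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(state_dim + action_dim, hidden)),
            nn.Tanh(),
            spectral_norm(nn.Linear(hidden, hidden)),
            nn.Tanh(),
            spectral_norm(nn.Linear(hidden, state_dim)),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Residual update; in model-based RP PGMs the return is
        # differentiated end to end through many such steps.
        return state + self.net(torch.cat([state, action], dim=-1))
```

During training, rollouts are unrolled through this model and the reparameterized return is differentiated through every step, so the spectral bound directly controls how perturbations compound across time.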


Markov Chain Score Ascent: A Unifying Framework of Variational Inference with Markovian Gradients

Neural Information Processing Systems

Minimizing the inclusive Kullback-Leibler (KL) divergence with stochastic gradient descent (SGD) is challenging since its gradient is defined as an integral over the posterior. Recently, multiple methods have been proposed to run SGD with biased gradient estimates obtained from a Markov chain. This paper provides the first non-asymptotic convergence analysis of these methods by establishing their mixing rate and gradient variance. To do this, we demonstrate that these methods--which we collectively refer to as Markov chain score ascent (MCSA) methods--can be cast as special cases of the Markov chain gradient descent framework. Furthermore, by leveraging this new understanding, we develop a novel MCSA scheme, parallel MCSA (pMCSA), that achieves a tighter bound on the gradient variance. We demonstrate that this improved theoretical result translates to superior empirical performance.
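A hedged sketch of the general MCSA recipe described above (the names and signatures are illustrative, not the paper's API): each SGD step advances a set of Markov chains targeting the posterior p and uses their current states to estimate the score-ascent gradient of the inclusive KL. Averaging over many parallel chains, in the spirit of pMCSA, reduces the gradient variance.

```python
import torch

def mcsa_step(chains, lam, logq, mcmc_kernel, opt):
    """One Markov chain score ascent step (sketch, not the paper's code).

    Minimizing the inclusive KL(p || q_lam) amounts to ascending
    E_p[grad_lam log q_lam(z)]. The expectation over the posterior p is
    approximated with the states of Markov chains targeting p, so the
    gradient estimate is biased until the chains mix. `logq` and
    `mcmc_kernel` are assumed user-supplied; `lam` is a parameter tensor
    registered with the optimizer `opt`.
    """
    # Advance every chain one MCMC step toward p; detach so gradients
    # flow only through log q_lam, not through the kernel.
    chains = torch.stack([mcmc_kernel(z) for z in chains]).detach()

    opt.zero_grad()
    loss = -logq(chains, lam).mean()  # Monte Carlo -E_p[log q_lam(z)]
    loss.backward()                   # negated score-ascent direction
    opt.step()
    return chains
```

Running more chains in parallel tightens the Monte Carlo average at the cost of memory, which is the trade-off behind pMCSA's tighter variance bound.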


Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes

Neural Information Processing Systems

Training a large-scale deep neural network on a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although advanced algorithms such as LARS and LAMB succeed on classification models, the complicated pipelines of dense visual prediction tasks such as object detection and segmentation still suffer a heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch sizes and offers several benefits over prior art. Firstly, AGVM can align the gradient variances between different modules in a dense visual predictor, such as the backbone, feature pyramid network (FPN), and detection and segmentation heads.
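A simplified sketch of the variance-alignment idea (this is our reading of the abstract, not the published AGVM algorithm): after backpropagation, measure each module's empirical gradient variance and rescale its gradients so the variance matches an anchor module such as the backbone.

```python
import torch

@torch.no_grad()
def align_gradient_variances(modules, anchor="backbone", eps=1e-8):
    """Rescale per-module gradients to match the anchor's gradient variance.

    `modules` maps module names (e.g. "backbone", "fpn", "head") to lists
    of parameters whose .grad fields are already populated. Hypothetical
    helper in the spirit of AGVM; the real algorithm is more elaborate.
    """
    def grad_var(params):
        g = torch.cat([p.grad.flatten() for p in params if p.grad is not None])
        return g.var()

    target = grad_var(modules[anchor])
    for name, params in modules.items():
        if name == anchor:
            continue
        scale = (target / (grad_var(params) + eps)).sqrt()
        for p in params:
            if p.grad is not None:
                p.grad.mul_(scale)  # equalize variance before the update
```

Called between `loss.backward()` and `optimizer.step()`, this keeps modules with mismatched gradient statistics from destabilizing each other at very large batch sizes.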



Radial Compensation: Stable and Semantically Decoupled Generative Models on Riemannian Manifolds

Papamichals, Marios; Ruane, Regina

arXiv.org Machine Learning

Generative models on curved spaces rely on charts to map Euclidean spaces to manifolds. Exponential maps preserve geodesics but have stiff, radius-dependent Jacobians, while volume-preserving charts maintain densities but distort geodesic distances. Both approaches entangle curvature with model parameters, inflating gradient variance. In high-dimensional latent normalizing flows, the wrapped exponential prior can stretch radii far beyond the curvature scale, leading to poor test likelihoods and stiff solvers. We introduce Radial Compensation (RC), an information-geometric method that selects the base density in the tangent space so that the likelihood depends only on geodesic distance from a pole, decoupling parameter semantics from curvature. RC lets radial parameters retain their usual meaning in geodesic units, while the chart can be tuned as a numerical preconditioner. We extend RC to manifolds with known geodesic polar volume and show that RC is the only construction for geodesic-radial likelihoods with curvature-invariant Fisher information. We derive the Balanced-Exponential (bExp) chart family, balancing volume distortion and geodesic error. Under RC, all bExp settings preserve the same manifold density and Fisher information, with smaller dial values reducing gradient variance and flow cost. Empirically, RC yields stable generative models across densities, VAEs, flows on images and graphs, and protein models. RC improves likelihoods, restores clean geodesic radii, and prevents radius blow-ups in high-dimensional flows, making RC-bExp a robust default for likelihood-trained generative models on manifolds.
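To illustrate the core construction on a concrete manifold: on hyperbolic space H^n (constant curvature -1), the exponential map has Jacobian determinant (sinh r / r)^(n-1) at geodesic radius r, so a tangent-space base density can be compensated by exactly that factor to make the manifold likelihood a pure function of geodesic distance. The sketch below assumes this standard volume factor; the function names and the radial profile are illustrative, and the paper's bExp chart family is not reproduced.

```python
import torch

def rc_tangent_logpdf(r: torch.Tensor, log_f, dim: int) -> torch.Tensor:
    """Radial compensation on hyperbolic space H^dim (hedged sketch).

    The exponential map of H^dim has Jacobian determinant
    (sinh r / r)^(dim-1) at geodesic radius r = ||v||. For the pushed-
    forward *manifold* density to equal a pure radial profile exp(log_f(r))
    in geodesic units, the tangent-space base density must absorb that
    factor:  p_T(v) = f(r) * (sinh r / r)^(dim-1).
    `log_f` is an illustrative user-supplied radial log-profile.
    """
    compensation = (dim - 1) * (torch.log(torch.sinh(r)) - torch.log(r))
    return log_f(r) + compensation
```

Sampling then draws a direction uniformly and a radius from the compensated radial law, so the resulting likelihood depends only on geodesic distance from the pole, which is the decoupling of parameter semantics from curvature that the abstract describes.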



Appendix A.1 Stochastic Rounding

Neural Information Processing Systems

A realization of stochastic rounding is shown in Figure 4, here applied to a 24-bit single-precision floating-point mantissa (a generic sketch follows below).

A.2 Representation mapping increases the gradient variance: linear layer example

A linear layer is essentially a matrix multiplication. Inequality (18) supports our Assumption 2 (iii,b). The proof follows that of Bottou et al. The experiments in this paper were run with the following numbers of GPUs: ResNet18 on CIFAR10 runs on 1 V100 GPU with a batch size of 128.
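For reference, a generic NumPy sketch of stochastic rounding (the grid spacing and interface are illustrative; the paper's quantizer is more elaborate): each value rounds up with probability equal to its fractional offset from the lower grid point, which makes the rounding unbiased at the cost of added variance.

```python
import numpy as np

def stochastic_round(x, step, rng=None):
    """Unbiased stochastic rounding to a grid of spacing `step`.

    Each value rounds up with probability equal to its fractional distance
    to the lower grid point, so E[stochastic_round(x)] == x: quantization
    adds variance to the gradients but no bias.
    """
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # fractional part in [0, 1)
    round_up = rng.random(size=scaled.shape) < prob_up
    return (lower + round_up) * step
```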


Main remarks regarding baseline, scalability, complexity, and the full-batch setting

Neural Information Processing Systems

We thank the reviewers for their valuable comments and suggestions. The reviewers' main concern is the lack of an RQVI baseline (the RQVI procedure led to computational instability). We evaluate two model families (GLM, BNN) and five datasets (Boston, Fires, Life Expect., Frisk, and Metro) with a learning-rate analysis. We do not claim that this method is suitable for high-dimensional posteriors. It is accurate that the method will not be viable without this property.