Gradient Descent
Zeroth-Order Alternating Gradient Descent Ascent Algorithms for a Class of Nonconvex-Nonconcave Minimax Problems
Xu, Zi, Wang, Zi-Qi, Wang, Jun-Lin, Dai, Yu-Hong
In this paper, we consider a class of nonconvex-nonconcave minimax problems, i.e., NC-PL minimax problems, whose objective functions satisfy the Polyak-\L ojasiewicz (PL) condition with respect to the inner variable. We propose a zeroth-order alternating gradient descent ascent (ZO-AGDA) algorithm and a zeroth-order variance reduced alternating gradient descent ascent (ZO-VRAGDA) algorithm for solving NC-PL minimax problem under the deterministic and the stochastic setting, respectively. The total number of function value queries to obtain an $\epsilon$-stationary point of ZO-AGDA and ZO-VRAGDA algorithm for solving NC-PL minimax problem is upper bounded by $\mathcal{O}(\varepsilon^{-2})$ and $\mathcal{O}(\varepsilon^{-3})$, respectively. To the best of our knowledge, they are the first two zeroth-order algorithms with the iteration complexity gurantee for solving NC-PL minimax problems.
A Guide Through the Zoo of Biased SGD
Demidovich, Yury, Malinovsky, Grigory, Sokolov, Igor, Richtárik, Peter
Stochastic Gradient Descent (SGD) is arguably the most important single algorithm in modern machine learning. Although SGD with unbiased gradient estimators has been studied extensively over at least half a century, SGD variants relying on biased estimators are rare. Nevertheless, there has been an increased interest in this topic in recent years. However, existing literature on SGD with biased estimators (BiasedSGD) lacks coherence since each new paper relies on a different set of assumptions, without any clear understanding of how they are connected, which may lead to confusion. We address this gap by establishing connections among the existing assumptions, and presenting a comprehensive map of the underlying relationships. Additionally, we introduce a new set of assumptions that is provably weaker than all previous assumptions, and use it to present a thorough analysis of BiasedSGD in both convex and non-convex settings, offering advantages over previous results. We also provide examples where biased estimators outperform their unbiased counterparts or where unbiased versions are simply not available. Finally, we demonstrate the effectiveness of our framework through experimental results that validate our theoretical findings.
Diffusion-Based Adversarial Sample Generation for Improved Stealthiness and Controllability
Xue, Haotian, Araujo, Alexandre, Hu, Bin, Chen, Yongxin
Neural networks are known to be susceptible to adversarial samples: small variations of natural examples crafted to deliberately mislead the models. While they can be easily generated using gradient-based techniques in digital and physical scenarios, they often differ greatly from the actual data distribution of natural images, resulting in a trade-off between strength and stealthiness. In this paper, we propose a novel framework dubbed Diffusion-Based Projected Gradient Descent (Diff-PGD) for generating realistic adversarial samples. By exploiting a gradient guided by a diffusion model, Diff-PGD ensures that adversarial samples remain close to the original data distribution while maintaining their effectiveness. Moreover, our framework can be easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks. Compared with existing methods for generating natural-style adversarial samples, our framework enables the separation of optimizing adversarial loss from other surrogate losses (e.g., content/smoothness/style loss), making it more stable and controllable. Finally, we demonstrate that the samples generated using Diff-PGD have better transferability and anti-purification power than traditional gradient-based methods.
First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities
Beznosikov, Aleksandr, Samsonov, Sergey, Sheshukova, Marina, Gasnikov, Alexander, Naumov, Alexey, Moulines, Eric
Stochastic gradient methods are an essential ingredient for solving various optimization problems, with a wide range of applications in various fields such as machine learning [Goodfellow et al., 2014, 2016], empirical risk minimization problems [Van der Vaart, 2000], and reinforcement learning [Sutton and Barto, 2018, Schulman et al., 2015, Mnih et al., 2015]. Various stochastic gradient descent methods (SGD) and their accelerated versions [Nesterov, 1983, Ghadimi and Lan, 2013] have been extensively studied under different statistical frameworks [Dieuleveut et al., 2017, Vaswani et al., 2019a]. The standard assumption for stochastic optimization algorithms is to consider independent and identically distributed noise variables. However, the growing usage of stochastic optimization methods in reinforcement learning [Bhandari et al., 2018, Srikant and Ying, 2019, Durmus et al., 2021] and distributed optimization [Lopes and Sayed, 2007, Dimakis et al., 2010, Mao et al., 2020] has led to increased interest in problems with Markovian noise. Despite this, existing theoretical works that consider Markov noise have significant limitations, and their analysis often results in suboptimal finite-time error bounds. Our research aims to fill the gap in the existing literature on the first-order Markovian setting. By focusing on uniformly geometrically ergodic Markov chains, we obtain finite-time complexity bounds for achieving ε-accurate solutions that scale linearly with the mixing time of the underlying Markov chain. Our approach is based on careful applications of randomized batch size schemes and provides a unified view on both non-convex and strongly convex minimization problems, as well as variational inequalities.
SketchOGD: Memory-Efficient Continual Learning
Wright, Benjamin, Min, Youngjae, Bernstein, Jeremy, Azizan, Navid
When machine learning models are trained continually on a sequence of tasks, they are liable to forget what they learned on previous tasks -- a phenomenon known as catastrophic forgetting. Proposed solutions to catastrophic forgetting tend to involve storing information about past tasks, meaning that memory usage is a chief consideration in determining their practicality. This paper proposes a memory-efficient solution to catastrophic forgetting, improving upon an established algorithm known as orthogonal gradient descent (OGD). OGD utilizes prior model gradients to find weight updates that preserve performance on prior datapoints. However, since the memory cost of storing prior model gradients grows with the runtime of the algorithm, OGD is ill-suited to continual learning over arbitrarily long time horizons. To address this problem, this paper proposes SketchOGD. SketchOGD employs an online sketching algorithm to compress model gradients as they are encountered into a matrix of a fixed, user-determined size. In contrast to existing memory-efficient variants of OGD, SketchOGD runs online without the need for advance knowledge of the total number of tasks, is simple to implement, and is more amenable to analysis. We provide theoretical guarantees on the approximation error of the relevant sketches under a novel metric suited to the downstream task of OGD. Experimentally, we find that SketchOGD tends to outperform current state-of-the-art variants of OGD given a fixed memory budget.
Achieving acceleration despite very noisy gradients
Gupta, Kanan, Siegel, Jonathan, Wojtowytsch, Stephan
We present a generalization of Nesterov's accelerated gradient descent algorithm. Our algorithm (AGNES) provably achieves acceleration for smooth convex minimization tasks with noisy gradient estimates if the noise intensity is proportional to the magnitude of the gradient. Nesterov's accelerated gradient descent does not converge under this noise model if the constant of proportionality exceeds one. AGNES fixes this deficiency and provably achieves an accelerated convergence rate no matter how small the signal to noise ratio in the gradient estimate. Empirically, we demonstrate that this is an appropriate model for mini-batch gradients in overparameterized deep learning. Finally, we show that AGNES outperforms stochastic gradient descent with momentum and Nesterov's method in the training of CNNs.
Gradient Correction beyond Gradient Descent
Li, Zefan, Ni, Bingbing, Li, Teng, Zhang, WenJun, Gao, Wen
The great success neural networks have achieved is inseparable from the application of gradient-descent (GD) algorithms. Based on GD, many variant algorithms have emerged to improve the GD optimization process. The gradient for back-propagation is apparently the most crucial aspect for the training of a neural network. The quality of the calculated gradient can be affected by multiple aspects, e.g., noisy data, calculation error, algorithm limitation, and so on. To reveal gradient information beyond gradient descent, we introduce a framework (\textbf{GCGD}) to perform gradient correction. GCGD consists of two plug-in modules: 1) inspired by the idea of gradient prediction, we propose a \textbf{GC-W} module for weight gradient correction; 2) based on Neural ODE, we propose a \textbf{GC-ODE} module for hidden states gradient correction. Experiment results show that our gradient correction framework can effectively improve the gradient quality to reduce training epochs by $\sim$ 20\% and also improve the network performance.
Stochastic Unrolled Federated Learning
Hadou, Samar, NaderiAlizadeh, Navid, Ribeiro, Alejandro
Federated learning is a distributed learning paradigm in which a set of agents aim to collaboratively train a global statistical model. Due to privacy considerations, rather than sharing local training data, agents are incentivized to communicate either their local models or gradient information to their network. Yet, communication efficiency is a crucial factor to consider, as the network could potentially incur latency, congestion, and failures. A growing body of work, e.g., [Lian et al., 2015, McMahan et al., 2016, Li et al., 2020] has deployed a server in the network to facilitate reaching consensus among the agents, which despite its efficiency, creates a communication bottleneck at the server. On the other hand, another line of work that traces back to decentralized optimization [Nedic and Ozdaglar, 2009, Wei and Ozdaglar, 2012, Wu et al., 2017] has been investigated to design federated learning frameworks without a central server, compromising communication efficiency and convergence rates [Vanhaesebrouck et al., 2017, Liu et al., 2022a,b]. Indeed, the two categories have been enriched by the advances in iterative optimization algorithms and, in particular, stochastic gradient descent (SGD) and its variants. While learning frameworks have significantly benefited from well-crafted optimization algorithms, the converse has also been made possible due to algorithm unrolling [Chen et al., 2021b, Monga et al., 2021].
TAMUNA: Doubly Accelerated Federated Learning with Local Training, Compression, and Partial Participation
Condat, Laurent, Agarský, Ivan, Malinovsky, Grigory, Richtárik, Peter
Federated Learning (FL) is a novel paradigm for training supervised machine learning models. Initiated a few years ago (Konečný et al., 2016a,b; McMahan et al., 2017; Bonawitz et al., 2017), it has become a rapidly growing interdisciplinary field. The key idea is to exploit the wealth of information stored on edge devices, such as mobile phones, sensors and hospital workstations, to train global models, in a collaborative way, while handling a multitude of challenges, like data privacy (Kairouz et al., 2021; Li et al., 2020; Wang et al., 2021). In contrast to centralized learning in a datacenter, in FL, the parallel computing units have private data stored on each of them and communicate with a distant orchestrating server, which aggregates the information and synchronizes the computations, so that the process reaches a consensus and converges to a globally optimal model. In this framework, communication between the parallel workers and the server, which can take place over the internet or cell phone network, can be slow, costly, and unreliable. Thus, communication dominates the overall duration and cost of the process and is the main bottleneck to be addressed by the community, before FL can be widely adopted and applied in our daily lives. The baseline algorithm of distributed Gradient Descent (GD) alternates between two steps: one round of parallel computation of the local function gradients at the current model estimate, and one round of communication of these gradient vectors to the server, which averages them to form the new estimate for the next iteration. To decrease the communication load, two strategies can be used: 1) communicate less frequently, or equivalently do more local computations between successive communication rounds; or 2) compress the communicated vectors.
Variational Gradient Descent using Local Linear Models
Liu, Song, Simons, Jack, Yi, Mingxuan, Beaumont, Mark
Stein Variational Gradient Descent (SVGD) can transport particles along trajectories that reduce the KL divergence between the target and particle distribution but requires the target score function to compute the update. We introduce a new perspective on SVGD that views it as a local estimator of the reversed KL gradient flow. This perspective inspires us to propose new estimators that use local linear models to achieve the same purpose. The proposed estimators can be computed using only samples from the target and particle distribution without needing the target score function. Our proposed variational gradient estimators utilize local linear models, resulting in computational simplicity while maintaining effectiveness comparable to SVGD in terms of estimation biases. Additionally, we demonstrate that under a mild assumption, the estimation of high-dimensional gradient flow can be translated into a lower-dimensional estimation problem, leading to improved estimation accuracy. We validate our claims with experiments on both simulated and real-world datasets.