Gradient Descent
Stochastic Gradient Descent with Adaptive Data
Che, Ethan, Dong, Jing, Tong, Xin T.
Stochastic gradient descent (SGD) is a powerful optimization technique that is particularly useful in online learning scenarios. Its convergence analysis is relatively well understood under the assumption that the data samples are independent and identically distributed (iid). However, applying SGD to policy optimization problems in operations research involves a distinct challenge: the policy changes the environment and thereby affects the data used to update the policy. The adaptively generated data stream involves samples that are non-stationary, no longer independent from each other, and affected by previous decisions. The influence of previous decisions on the data generated introduces bias in the gradient estimate, which presents a potential source of instability for online learning not present in the iid case. In this paper, we introduce simple criteria for the adaptively generated data stream to guarantee the convergence of SGD. We show that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in. Our Lyapunov-function analysis allows one to translate existing stability analysis of stochastic systems studied in operations research into convergence rates for SGD, and we demonstrate this for queueing and inventory management problems. We also showcase how our result can be applied to study the sample complexity of an actor-critic policy gradient algorithm.
Quantized and Asynchronous Federated Learning
Ortega, Tomas, Jafarkhani, Hamid
Recent advances in federated learning have shown that asynchronous variants can be faster and more scalable than their synchronous counterparts. However, their design does not include quantization, which is necessary in practice to deal with the communication bottleneck. To bridge this gap, we develop a novel algorithm, Quantized Asynchronous Federated Learning (QAFeL), which introduces a hidden-state quantization scheme to avoid the error propagation caused by direct quantization. QAFeL also includes a buffer to aggregate client updates, ensuring scalability and compatibility with techniques such as secure aggregation. Furthermore, we prove that QAFeL achieves an $\mathcal{O}(1/\sqrt{T})$ ergodic convergence rate for stochastic gradient descent on non-convex objectives, which is the optimal order of complexity, without requiring bounded gradients or uniform client arrivals. We also prove that the cross-term error between staleness and quantization only affects the higher-order error terms. We validate our theoretical findings on standard benchmarks.
Preconditioning for Accelerated Gradient Descent Optimization and Regularization
Accelerated training algorithms, such as adaptive learning rates and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers like adaptive learning rates may not perform effectively. This raises the need for alternative regularization approaches and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how preconditioning with AdaGrad, RMSProp, and Adam accelerates training; (2) We explore the interaction between regularization and preconditioning, outlining different options for selecting the variables for regularization, and in particular we discuss how to implement that for the gradient regularization; and (3) We demonstrate how normalization methods accelerate training by improving Hessian conditioning, and discuss how this perspective can lead to new preconditioning training algorithms. Our findings offer a unified mathematical framework for understanding various acceleration techniques and deriving appropriate regularization schemes.
Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth
Davis, Damek, Drusvyatskiy, Dmitriy, Jiang, Liwei
A prevalent belief among optimization specialists is that linear convergence of gradient descent is contingent on the function growing quadratically away from its minimizers. In this work, we argue that this belief is inaccurate. We show that gradient descent with an adaptive stepsize converges at a local (nearly) linear rate on any smooth function that merely exhibits fourth-order growth away from its minimizer. The adaptive stepsize we propose arises from an intriguing decomposition theorem: any such function admits a smooth manifold around the optimal solution -- which we call the ravine -- so that the function grows at least quadratically away from the ravine and has constant order growth along it. The ravine allows one to interlace many short gradient steps with a single long Polyak gradient step, which together ensure rapid convergence to the minimizer. We illustrate the theory and algorithm on the problems of matrix sensing and factorization and learning a single neuron in the overparameterized regime.
Gradient-free Decoder Inversion in Latent Diffusion Models
Hong, Seongmin, Jeon, Suh Yoon, Lee, Kyeonghyun, Ryu, Ernest K., Chun, Se Young
In latent diffusion models (LDMs), denoising diffusion process efficiently takes place on latent space whose dimension is lower than that of pixel space. Decoder is typically used to transform the representation in latent space to that in pixel space. While a decoder is assumed to have an encoder as an accurate inverse, exact encoder-decoder pair rarely exists in practice even though applications often require precise inversion of decoder. Prior works for decoder inversion in LDMs employed gradient descent inspired by inversions of generative adversarial networks. However, gradient-based methods require larger GPU memory and longer computation time for larger latent space. For example, recent video LDMs can generate more than 16 frames, but GPUs with 24 GB memory can only perform gradient-based decoder inversion for 4 frames. Here, we propose an efficient gradient-free decoder inversion for LDMs, which can be applied to diverse latent models. Theoretical convergence property of our proposed inversion has been investigated not only for the forward step method, but also for the inertial Krasnoselskii-Mann (KM) iterations under mild assumption on cocoercivity that is satisfied by recent LDMs. Our proposed gradient-free method with Adam optimizer and learning rate scheduling significantly reduced computation time and memory usage over prior gradient-based methods and enabled efficient computation in applications such as noise-space watermarking while achieving comparable error levels.
Accelerated gradient descent for high frequency Model Predictive Control
Zhang, Jianghan, Jordana, Armand, Righetti, Ludovic
The recent promises of Model Predictive Control in robotics have motivated the development of tailored second-order methods to solve optimal control problems efficiently. While those methods benefit from strong convergence properties, tailored efficient implementations are challenging to derive. In this work, we study the potential effectiveness of first-order methods and show on a torque controlled manipulator that they can equal the performances of second-order methods.
Stable Object Placement Under Geometric Uncertainty via Differentiable Contact Dynamics
Li, Linfeng, Yang, Gang, Shao, Lin, Hsu, David
From serving a cup of coffee to carefully rearranging delicate items, stable object placement is a crucial skill for future robots. This skill is challenging due to the required accuracy, which is difficult to achieve under geometric uncertainty. We leverage differentiable contact dynamics to develop a principled method for stable object placement under geometric uncertainty. We estimate the geometric uncertainty by minimizing the discrepancy between the force-torque sensor readings and the model predictions through gradient descent. We further keep track of a belief over multiple possible geometric parameters to mitigate the gradient-based method's sensitivity to the initialization. We verify our approach in the real world on various geometric uncertainties, including the in-hand pose uncertainty of the grasped object, the object's shape uncertainty, and the environment's shape uncertainty.
Decentralized Federated Learning with Gradient Tracking over Time-Varying Directed Networks
Nguyen, Duong Thuy Anh, Wang, Su, Nguyen, Duong Tung, Nedich, Angelia, Poor, H. Vincent
We investigate the problem of agent-to-agent interaction in decentralized (federated) learning over time-varying directed graphs, and, in doing so, propose a consensus-based algorithm called DSGTm-TV. The proposed algorithm incorporates gradient tracking and heavy-ball momentum to distributively optimize a global objective function, while preserving local data privacy. Under DSGTm-TV, agents will update local model parameters and gradient estimates using information exchange with neighboring agents enabled through row- and column-stochastic mixing matrices, which we show guarantee both consensus and optimality. Our analysis establishes that DSGTm-TV exhibits linear convergence to the exact global optimum when exact gradient information is available, and converges in expectation to a neighborhood of the global optimum when employing stochastic gradients. Moreover, in contrast to existing methods, DSGTm-TV preserves convergence for networks with uncoordinated stepsizes and momentum parameters, for which we provide explicit bounds. These results enable agents to operate in a fully decentralized manner, independently optimizing their local hyper-parameters. We demonstrate the efficacy of our approach via comparisons with state-of-the-art baselines on real-world image classification and natural language processing tasks.
Differential Privacy Regularization: Protecting Training Data Through Loss Function Regularization
Aguilera-Martรญnez, Francisco, Berzal, Fernando
Training machine learning models based on neural networks requires large datasets, which may contain sensitive information. The models, however, should not expose private information from these datasets. Differentially private SGD [DP-SGD] requires the modification of the standard stochastic gradient descent [SGD] algorithm for training new models. In this short paper, a novel regularization strategy is proposed to achieve the same goal in a more efficient manner.
Super Level Sets and Exponential Decay: A Synergistic Approach to Stable Neural Network Training
Chaudhary, Jatin, Nidhi, Dipak, Heikkonen, Jukka, Merisaari, Haari, Kanth, Rajiv
The objective of this paper is to enhance the optimization process for neural networks by developing a dynamic learning rate algorithm that effectively integrates exponential decay and advanced anti-overfitting strategies. Our primary contribution is the establishment of a theoretical framework where we demonstrate that the optimization landscape, under the influence of our algorithm, exhibits unique stability characteristics defined by Lyapunov stability principles. Specifically, we prove that the superlevel sets of the loss function, as influenced by our adaptive learning rate, are always connected, ensuring consistent training dynamics. Furthermore, we establish the "equiconnectedness" property of these superlevel sets, which maintains uniform stability across varying training conditions and epochs. This paper contributes to the theoretical understanding of dynamic learning rate mechanisms in neural networks and also pave the way for the development of more efficient and reliable neural optimization techniques. This study intends to formalize and validate the equiconnectedness of loss function as superlevel sets in the context of neural network training, opening newer avenues for future research in adaptive machine learning algorithms. We leverage previous theoretical discoveries to propose training mechanisms that can effectively handle complex and high-dimensional data landscapes, particularly in applications requiring high precision and reliability.