Goto

Collaborating Authors

 Gradient Descent


Reviews: Accelerated Stochastic Greedy Coordinate Descent by Soft Thresholding Projection onto Simplex

Neural Information Processing Systems

Paper Summary: The main idea is that Nesterov's acceleration method's and Stochastic Gradient Descent's (SGD) advantages are used to solve sparse and dense optimization problems with high-dimensions by using an improved GCD (Greedy Coordinate Descent) algorithm. First, by using a greedy rule, an l_1 -square-regularized approximate optimization problem (find a solution close to x * within a neighborhood \epsilon) can be reformulated as a convex but non-trivial to solve problem. Then, the same problem is solved as an exact problem by using the SOTOPO algorithm. Finally, the solution is improved by using both the convergence rate advantage of Nesterov's method and the "reduced-by-one-sample" complexity of SGD. The resulted algorithm is an improved GCD (ASGCD Accelerated Stochastic Greedy Coordinate Descent) with a convergence rate of O(\sqrt{1/\epsilon}) and complexity reduced-by-one-sample compared to the vanilla GCD.


Reviews: Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Neural Information Processing Systems

The paper proposes a method for optimization problems often found in machine learning tasks. The general loss function to minimize is of the form of a sum of smooth-convex functions associated with a convex regularization potential. The method is designed for the case of perturbation introduced in the data. Since the data sampling introduces a stochastic component Stochastic Gradient Descent (SGD) need of modifications for reducing the gradient variance [14,28]. In the case of perturbed data, such variance is magnified.


Reviews: A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent

Neural Information Processing Systems

The paper opens the way to a new use of PAC-Bayesian theory, by combining PAC-Bayes with algorithmic stability to study stochastic optimization algorithms. The obtained probabilistic bounds are then used to inspire adaptive sampling strategies, studied empirically in a deep learning scenario. The paper is well written, and the proofs are non-trivial. It contains several clever ideas, namely the use of algorithmic stability to bound the complexity term inside PAC-Bayesian bounds. It's also fruitful to express the prior and posterior distributions over the sequences of indexes used by a stochastic gradient descent algorithm.


Reviews: Gradient descent GAN optimization is locally stable

Neural Information Processing Systems

The authors present a dynamical system based analysis of simultaneous gradient descent updates for GANs, by considering the limit dynamical system that corresponds to the discrete updates. They show that under a series of assumptions, an equilibrium point of the dynamical system is locally asymptotically stable, implying convergence to the equilibrium if the system is initialized in a close neighborhood of it. Then they show how some types of GANs fail to satisfy some of their conditions and propose a fix to the gradient updates that re-instate local stability. They give experimental evidence that the local-stability inspired fix yields improvements in practice on MNIST digit generation and simple multi-modal distributions. However, I do think that these drawbacks are remedied by the fact that their modification, based on local asymptotic theory, did give noticeable improvements.


Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

arXiv.org Artificial Intelligence

Fine-tuning language models (LMs) with the Adam optimizer often demands excessive memory, limiting accessibility. The "in-place" version of Stochastic Gradient Descent (IP-SGD) and Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed to address this. However, IP-SGD still requires substantial memory, and MeZO suffers from slow convergence and degraded final performance due to its zeroth-order nature. This paper introduces Addax, a novel method that improves both memory efficiency and performance of IP-SGD by integrating it with MeZO. Specifically, Addax computes zeroth- or first-order gradients of data points in the minibatch based on their memory consumption, combining these gradient estimates to update directions. By computing zeroth-order gradients for data points that require more memory and first-order gradients for others, Addax overcomes the slow convergence of MeZO and the excessive memory requirement of IP-SGD. Additionally, the zeroth-order gradient acts as a regularizer for the first-order gradient, further enhancing the model's final performance. Theoretically, we establish the convergence of Addax under mild assumptions, demonstrating faster convergence and less restrictive hyper-parameter choices than MeZO. Our experiments with diverse LMs and tasks show that Addax consistently outperforms MeZO regarding accuracy and convergence speed while having a comparable memory footprint. When fine-tuning OPT-13B with one A100 GPU, on average, Addax outperforms MeZO in accuracy/F1 score by 14% and runs 15x faster while using memory similar to MeZO. In our experiments on the larger OPT-30B model, on average, Addax outperforms MeZO in terms of accuracy/F1 score by >16 and runs 30x faster on a single H100 GPU. Moreover, Addax surpasses the performance of standard fine-tuning approaches, such as IP-SGD and Adam, in most tasks with significantly less memory requirement.


Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

arXiv.org Artificial Intelligence

The increasing size of deep learning models has created the need for more efficient alternatives to the standard error backpropagation algorithm, that make better use of asynchronous, parallel and distributed computing. One major shortcoming of backpropagation is the interlocking between the forward phase of the algorithm, which computes a global loss, and the backward phase where the loss is backpropagated through all layers to compute the gradients, which are used to update the network parameters. To address this problem, we propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. Furthermore, since we observe that the forward pass is often much faster than the backward pass, we use separate threads for the forward and backward pass calculations, which allows us to use a higher ratio of forward to backward threads than the usual 1:1 ratio, reducing the overall staleness of the parameters. Thus, our approach performs asynchronous stochastic gradient descent using separate threads for the loss (forward) and gradient (backward) computations and performs layer-wise partial updates to parameters in a distributed way. We show that this approach yields close to state-of-the-art results while running up to 2.97 faster than Hogwild! We theoretically prove the convergence of the algorithm using a novel theoretical framework based on stochastic differential equations and the drift diffusion process, by modeling the asynchronous parameter updates as a stochastic process. Scaling up modern deep learning models requires massive resources and training time.


Extended convexity and smoothness and their applications in deep learning

arXiv.org Artificial Intelligence

The underlying mechanism by which simple gradient-based iterative algorithms can effectively handle the non-convex problem of deep model training remains incompletely understood within the traditional convex and non-convex analysis frameworks, which often require the Lipschitz smoothness of the gradient and strong convexity. In this paper, we introduce $\mathcal{H}(\phi)$-convexity and $\mathcal{H}(\Phi)$-smoothness, which broaden the existing concepts of smoothness and convexity, and delineate their fundamental properties. Building on these concepts, we introduce the high-order gradient descent and high-order stochastic gradient descent methods, which serve as extensions to the traditional gradient descent and stochastic gradient descent methods, respectively. Furthermore, we establish descent lemmas for the $\mathcal{H}(\phi)$-convex and $\mathcal{H}(\Phi)$-smooth objective functions when utilizing these four methods. On the basis of these findings, we develop the gradient structure control algorithm to address non-convex optimization objectives, encompassing both the functions represented by machine learning models and common loss functions in deep learning. The effectiveness of the proposed methodology is empirically validated through experiments.


CodeCipher: Learning to Obfuscate Source Code Against LLMs

arXiv.org Artificial Intelligence

While large code language models have made significant strides in AI-assisted coding tasks, there are growing concerns about privacy challenges. The user code is transparent to the cloud LLM service provider, inducing risks of unauthorized training, reading, and execution of the user code. In this paper, we propose CodeCipher, a novel method that perturbs privacy from code while preserving the original response from LLMs. CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code. The new embedding matrix is optimized by minimizing the task-specific loss function. To tackle the challenge of the discrete and sparse nature of word vector spaces, CodeCipher adopts a discrete optimization strategy that aligns the updated vector to the nearest valid token in the vocabulary before each gradient update. We demonstrate the effectiveness of our approach on three AI-assisted coding tasks including code completion, summarization, and translation. Results show that our model successfully confuses the privacy in source code while preserving the original LLM's performance.


Reviews: Stochastic Expectation Maximization with Variance Reduction

Neural Information Processing Systems

It builds on the classical stochastic EM of Cappé and Moulines (2009), with an extra variance reduction term in the iteration formula. This variance reduction technique is inspired from the stochastic gradient descent literature (in particular, algorithms developed in Le Roux et al 2012, Johnson and Zhang 2013, Defazio et al 2014). After setting up the background to present previous results in a unified and clear way, the authors present their algorithm and show two theoretical properties of it: a local convergence rate, and a global convergence property. In the last section, they compare their new algorithm to several state-of-the-art methods, on a Gaussian mixture toy example, and on a probabilistic latent semantic analysis problem.


Reviews: On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Neural Information Processing Systems

This paper considers the problem of optimizing over measures instead of parameters directly ( as is standard in ML), for differentiable predictors with convex loss. This is an infinite dimensional convex optimization problem. The paper considers instead optimizing with m particles (dirac deltas). As m tends to infinity this corresponds to optimizing over the measure space. Proposition 2.3 shows existence and uniqueness of the particle gradient flow for a given initialization.