AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction

Zhang, Xinwei, Bu, Zhiqi, Hong, Mingyi, Razaviyayn, Meisam

arXiv.org Machine LearningAug-24-2024

Privacy is a growing concern in modern deep-learning systems and applications. Differentially private (DP) training prevents the leakage of sensitive information in the collected training data from the trained machine learning models. DP optimizers, including DP stochastic gradient descent (DPSGD) and its variants, privatize the training procedure by gradient clipping and DP noise injection. However, in practice, DP models trained using DPSGD and its variants often suffer from significant model performance degradation. Such degradation prevents the application of DP optimization in many key tasks, such as foundation model pretraining. In this paper, we provide a novel signal processing perspective to the design and analysis of DP optimizers. We show that a ``frequency domain'' operation called low-pass filtering can be used to effectively reduce the impact of DP noise. More specifically, by defining the ``frequency domain'' for both the gradient and differential privacy (DP) noise, we have developed a new component, called DOPPLER. This component is designed for DP algorithms and works by effectively amplifying the gradient while suppressing DP noise within this frequency domain. As a result, it maintains privacy guarantees and enhances the quality of the DP-protected model. Our experiments show that the proposed DP optimizers with a low-pass filter outperform their counterparts without the filter by 3%-10% in test accuracy on various models and datasets. Both theoretical and practical evidence suggest that the DOPPLER is effective in closing the gap between DP and non-DP training.

gradient, noise, optimizer, (16 more...)

arXiv.org Machine Learning

2408.1346

Country:

North America > United States (0.04)
Europe > France (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

Optimal Layer Selection for Latent Data Augmentation

Takase, Tomoumi, Karakida, Ryo

arXiv.org Artificial IntelligenceAug-23-2024

While data augmentation (DA) is generally applied to input data, several studies have reported that applying DA to hidden layers in neural networks, i.e., feature augmentation, can improve performance. However, in previous studies, the layers to which DA is applied have not been carefully considered, often being applied randomly and uniformly or only to a specific layer, leaving room for arbitrariness. Thus, in this study, we investigated the trends of suitable layers for applying DA in various experimental configurations, e.g., training from scratch, transfer learning, various dataset settings, and different models. In addition, to adjust the suitable layers for DA automatically, we propose the adaptive layer selection (AdaLASE) method, which updates the ratio to perform DA for each layer based on the gradient descent method during training. The experimental results obtained on several image classification datasets indicate that the proposed AdaLASE method altered the ratio as expected and achieved high overall test accuracy.

acceptance ratio, adalase method, suitable layer, (13 more...)

arXiv.org Artificial Intelligence

2408.13426

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning

Zeng, Sihan, Doan, Thinh T., Romberg, Justin

arXiv.org Artificial IntelligenceAug-23-2024

We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying MDPs controlled by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" than the latter. Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, PL condition, and general non-convexity. We apply our framework to various policy optimization problems. First, we look at the infinite-horizon average-reward MDP with finite state and action spaces and derive a convergence rate of $O(k^{-2/5})$ for the online actor-critic algorithm under function approximation, which recovers the best known rate derived specifically for this problem. Second, we study the linear-quadratic regulator and show that an online actor-critic method converges with rate $O(k^{-2/3})$. Third, we use the actor-critic algorithm to solve the policy optimization problem in an entropy regularized Markov decision process, where we also establish a convergence of $O(k^{-2/3})$. The results we derive for both the second and third problem are novel and previously unknown in the literature. Finally, we briefly present the application of our framework to gradient-based policy evaluation algorithms in reinforcement learning.

actor-critic algorithm, algorithm, convergence rate, (15 more...)

arXiv.org Artificial Intelligence

2109.14756

Country:

North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)

Add feedback

Geometrical structures of digital fluctuations in parameter space of neural networks trained with adaptive momentum optimization

Netay, Igor V.

arXiv.org Artificial IntelligenceAug-22-2024

We present results of numerical experiments for neural networks with stochastic gradient-based optimization with adaptive momentum. This widely applied optimization has proved convergence and practical efficiency, but for long-run training becomes numerically unstable. We show that numerical artifacts are observable not only for large-scale models and finally lead to divergence also for case of shallow narrow networks. We argue this theory by experiments with more than 1600 neural networks trained for 50000 epochs. Local observations show presence of the same behavior of network parameters in both stable and unstable training segments. Geometrical behavior of parameters forms double twisted spirals in the parameter space and is caused by alternating of numerical perturbations with next relaxation oscillations in values for 1st and 2nd momentum.

fluctuation, neural network, optimization, (14 more...)

arXiv.org Artificial Intelligence

2408.12273

Country:

North America > United States (0.29)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.05)
Asia > Russia (0.05)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback

Stochastic Compositional Minimax Optimization with Provable Convergence Guarantees

Deng, Yuyang, Qiao, Fuli, Mahdavi, Mehrdad

arXiv.org Artificial IntelligenceAug-22-2024

Stochastic compositional minimax problems are prevalent in machine learning, yet there are only limited established on the convergence of this class of problems. In this paper, we propose a formal definition of the stochastic compositional minimax problem, which involves optimizing a minimax loss with a compositional structure either in primal , dual, or both primal and dual variables. We introduce a simple yet effective algorithm, stochastically Corrected stOchastic gradient Descent Ascent (CODA), which is a descent ascent type algorithm with compositional correction steps, and establish its convergence rate in aforementioned three settings. In the presence of the compositional structure in primal, the objective function typically becomes nonconvex in primal due to function composition. Thus, we consider the nonconvex-strongly-concave and nonconvex-concave settings and show that CODA can efficiently converge to a stationary point. In the case of composition on the dual, the objective function becomes nonconcave in the dual variable, and we demonstrate convergence in the strongly-convex-nonconcave and convex-nonconcave setting. In the case of composition on both variables, the primal and dual variables may lose convexity and concavity, respectively. Therefore, we anaylze the convergence in weakly-convex-weakly-concave setting. We also give a variance reduction version algorithm, CODA+, which achieves the best known rate on nonconvex-strongly-concave and nonconvex-concave compositional minimax problem. This work initiates the theoretical study of the stochastic compositional minimax problem on various settings and may inform modern machine learning scenarios such as domain adaptation or robust model-agnostic meta-learning.

algorithm, assumption, minimax problem, (13 more...)

arXiv.org Artificial Intelligence

2408.12505

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Regularization for Adversarial Robust Learning

Wang, Jie, Gao, Rui, Xie, Yao

arXiv.org Artificial IntelligenceAug-22-2024

Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk using the $\infty$-Wasserstein metric, such a notion of robustness frequently encounters computation intractability. To tackle the computational challenge, we develop a novel approach to adversarial training that integrates $\phi$-divergence regularization into the distributionally robust risk function. This regularization brings a notable improvement in computation compared with the original formulation. We develop stochastic gradient methods with biased oracles to solve this problem efficiently, achieving the near-optimal sample complexity. Moreover, we establish its regularization effects and demonstrate it is asymptotic equivalence to a regularized empirical risk minimization framework, by considering various scaling regimes of the regularization parameter and robustness level. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.

formulation, optimization, regularization, (15 more...)

arXiv.org Artificial Intelligence

2408.09672

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (0.54)
Government (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization

Lin, Tianyi, Jin, Chi, Jordan, Michael. I.

arXiv.org Artificial IntelligenceAug-21-2024

We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems in the form of $\min_\textbf{x} \max_{\textbf{y} \in Y} f(\textbf{x}, \textbf{y})$, where the objective function $f(\textbf{x}, \textbf{y})$ is nonconvex in $\textbf{x}$ and concave in $\textbf{y}$, and the constraint set $Y \subseteq \mathbb{R}^n$ is convex and bounded. In the convex-concave setting, the single-timescale GDA achieves strong convergence guarantees and has been used for solving application problems arising from operations research and computer science. However, it can fail to converge in more general settings. Our contribution in this paper is to design the simple deterministic and stochastic TTGDA algorithms that efficiently find one stationary point of the function $\Phi(\cdot) := \max_{\textbf{y} \in Y} f(\cdot, \textbf{y})$. Specifically, we prove the theoretical bounds on the complexity of solving both smooth and nonsmooth nonconvex-concave minimax optimization problems. To our knowledge, this is the first systematic analysis of TTGDA for nonconvex minimax optimization, shedding light on its superior performance in training generative adversarial networks (GANs) and in solving other real-world application problems.

algorithm 1, gradient evaluation, inequality, (13 more...)

arXiv.org Artificial Intelligence

2408.11974

Country:

Asia > Middle East > Jordan (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Add feedback

Quantifying Behavioural Distance Between Mathematical Expressions

Mežnar, Sebastian, Džeroski, Sašo, Todorovski, Ljupčo

arXiv.org Artificial IntelligenceAug-21-2024

Existing symbolic regression methods organize the space of candidate mathematical expressions primarily based on their syntactic, structural similarity. However, this approach overlooks crucial equivalences between expressions that arise from mathematical symmetries, such as commutativity, associativity, and distribution laws for arithmetic operations. Consequently, expressions with similar errors on a given data set are apart from each other in the search space. This leads to a rough error landscape in the search space that efficient local, gradient-based methods cannot explore. This paper proposes and implements a measure of a behavioral distance, BED, that clusters together expressions with similar errors. The experimental results show that the stochastic method for calculating BED achieves consistency with a modest number of sampled values for evaluating the expressions. This leads to computational efficiency comparable to the tree-based syntactic distance. Our findings also reveal that BED significantly improves the smoothness of the error landscape in the search space for symbolic regression.

behavioural distance, expression, smoothness, (12 more...)

arXiv.org Artificial Intelligence

2408.11515

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Vanilla Gradient Descent for Oblique Decision Trees

Panda, Subrat Prasad, Genest, Blaise, Easwaran, Arvind, Suganthan, Ponnuthurai Nagaratnam

arXiv.org Artificial IntelligenceAug-21-2024

Decision Trees (DTs) constitute one of the major highly non-linear AI models, valued, e.g., for their efficiency on tabular data. Learning accurate DTs is, however, complicated, especially for oblique DTs, and does take a significant training time. Further, DTs suffer from overfitting, e.g., they proverbially "do not generalize" in regression tasks. Recently, some works proposed ways to make (oblique) DTs differentiable. This enables highly efficient gradient-descent algorithms to be used to learn DTs. It also enables generalizing capabilities by learning regressors at the leaves simultaneously with the decisions in the tree. Prior approaches to making DTs differentiable rely either on probabilistic approximations at the tree's internal nodes (soft DTs) or on approximations in gradient computation at the internal node (quantized gradient descent). In this work, we propose DTSemNet, a novel semantically equivalent and invertible encoding for (hard, oblique) DTs as Neural Networks (NNs), that uses standard vanilla gradient descent. Experiments across various classification and regression benchmarks show that oblique DTs learned using DTSemNet are more accurate than oblique DTs of similar size learned using state-of-the-art techniques. Further, DT training time is significantly reduced. We also experimentally demonstrate that DTSemNet can learn DT policies as efficiently as NN policies in the Reinforcement Learning (RL) setup with physical inputs (dimensions $\leq32$). The code is available at {\color{blue}\textit{\url{https://github.com/CPS-research-group/dtsemnet}}}.

architecture, dataset, dtsemnet, (15 more...)

arXiv.org Artificial Intelligence

2408.09135

Country:

Asia > Singapore (0.04)
Asia > Middle East > Qatar (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Multi-level Monte-Carlo Gradient Methods for Stochastic Optimization with Biased Oracles

Hu, Yifan, Wang, Jie, Chen, Xin, He, Niao

arXiv.org Artificial IntelligenceAug-20-2024

We consider stochastic optimization when one only has access to biased stochastic oracles of the objective and the gradient, and obtaining stochastic gradients with low biases comes at high costs. This setting captures various optimization paradigms, such as conditional stochastic optimization, distributionally robust optimization, shortfall risk optimization, and machine learning paradigms, such as contrastive learning. We examine a family of multi-level Monte Carlo (MLMC) gradient methods that exploit a delicate tradeoff among bias, variance, and oracle cost. We systematically study their total sample and computational complexities for strongly convex, convex, and nonconvex objectives and demonstrate their superiority over the widely used biased stochastic gradient method. When combined with the variance reduction techniques like SPIDER, these MLMC gradient methods can further reduce the complexity in the nonconvex regime. Our results imply that a series of stochastic optimization problems with biased oracles, previously considered to be more challenging, is fundamentally no harder than the classical stochastic optimization with unbiased oracles. We also delineate the boundary conditions under which these problems become more difficult. Moreover, MLMC gradient methods significantly improve the best-known complexities in the literature for conditional stochastic optimization and shortfall risk optimization. Our extensive numerical experiments on distributionally robust optimization, pricing and staffing scheduling problems, and contrastive learning demonstrate the superior performance of MLMC gradient methods.

estimator, optimization, total cost, (15 more...)

arXiv.org Artificial Intelligence

2408.11084

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback